This repository contains the code to build subword embeddings from the arXiv dataset of 1.7M+ scholarly papers.
Download the arXiv dataset, decompress archive.zip, and place the file arxiv-metadata-oai-snapshot.json into the data/ directory.
Install required Python modules:
pip3 install -r requirements.txt
Follow the instructions to build and install the SentencePiece command-line tools from C++ source.
Follow the instructions to build and install GloVe.
We follow the idea of pre-trained subword embeddings from (Heinzerling and Strube, 2018).
# Extract the textual content from the arXiv dataset
# This creates a raw corpus file with one sentence per line
# (12,807,583 lines in total)
python3 src/extract.py data/arxiv-metadata-oai-snapshot.json \
data/arxiv-metadata-oai-snapshot.txt
# Train a SentencePiece model from the corpus file
spm_train --input=data/arxiv-metadata-oai-snapshot.txt \
--model_prefix=data/arxiv-metadata-oai-snapshot \
--vocab_size=10000
# Encode the corpus file using the SentencePiece model
# (spm_train writes <model_prefix>.model, so pass the full file name here)
spm_encode --model=data/arxiv-metadata-oai-snapshot.model \
--output_format=piece \
< data/arxiv-metadata-oai-snapshot.txt \
> data/arxiv-metadata-oai-snapshot.piece
# Train the subword GloVe vectors
# script adapted from https://github.com/stanfordnlp/GloVe/blob/master/demo.sh
./src/train-glove.sh
Pre-trained models are available in the data/ directory.
data/arxiv-metadata-oai-snapshot.model is the SentencePiece model.
data/arxiv-metadata-oai-snapshot.vocab is the SentencePiece vocabulary file.
data/vectors.txt and data/vectors.bin are the learned GloVe vectors (50 dimensions).
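
The pre-trained files can be used from Python roughly as sketched below. This is a minimal sketch under several assumptions, not code shipped in this repository: it assumes data/vectors.txt uses the standard GloVe text layout (one token per line followed by its space-separated components) and that the `sentencepiece` Python package is installed. The helper names `load_glove_vectors`, `average_vector`, and `embed_text` are hypothetical, and averaging the piece vectors is only one simple way to compose them.

```python
from typing import Dict, List, Optional


def load_glove_vectors(path: str) -> Dict[str, List[float]]:
    """Parse a GloVe text file: each line is a token followed by its components."""
    vectors: Dict[str, List[float]] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def average_vector(
    pieces: List[str], vectors: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Average the vectors of all known pieces; return None if none is known."""
    known = [vectors[p] for p in pieces if p in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def embed_text(
    text: str, model_path: str, vectors: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Hypothetical helper: segment text into subword pieces, then average them."""
    # Imported lazily so the two functions above work without the package.
    import sentencepiece as spm  # assumption: `pip3 install sentencepiece`

    sp = spm.SentencePieceProcessor(model_file=model_path)
    pieces = sp.encode(text, out_type=str)  # list of subword strings
    return average_vector(pieces, vectors)
```

A call might look like `embed_text("graph neural networks", "data/arxiv-metadata-oai-snapshot.model", load_glove_vectors("data/vectors.txt"))`. Pieces missing from the vector file are ignored rather than mapped to an unknown-token vector, which is a design choice of this sketch.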