subword embeddings trained on arXiv

This repository contains the code to build subword embeddings from the arXiv dataset of 1.7M+ scholarly papers.

Prerequisites

Download the arXiv dataset, decompress archive.zip, and place the file arxiv-metadata-oai-snapshot.json in the data/ directory.

Install required Python modules:

pip3 install -r requirements.txt

Follow the instructions to build and install SentencePiece command line tools from C++ source.

Follow the instructions to build and install GloVe.

Train subword embeddings from the arXiv dataset

We follow the idea of pre-trained subword embeddings from Heinzerling and Strube (2018).

# Extract the textual content from the arXiv dataset.
# This creates a raw corpus file with one sentence per line
# (12,807,583 lines).
python3 src/extract.py data/arxiv-metadata-oai-snapshot.json \
        data/arxiv-metadata-oai-snapshot.txt

# Train a SentencePiece model from the corpus file
spm_train --input=data/arxiv-metadata-oai-snapshot.txt \
          --model_prefix=data/arxiv-metadata-oai-snapshot \
          --vocab_size=10000

# Encode the corpus file using the SentencePiece model
spm_encode --model=data/arxiv-metadata-oai-snapshot.model \
           --output_format=piece \
           < data/arxiv-metadata-oai-snapshot.txt \
           > data/arxiv-metadata-oai-snapshot.piece

# Train the subword GloVe vectors
# script adapted from https://github.com/stanfordnlp/GloVe/blob/master/demo.sh
./src/train-glove.sh
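
For readers who want to see what the extraction step involves, here is a minimal sketch of a one-sentence-per-line extractor. It is not the code in src/extract.py, which may work differently. It assumes the snapshot is a JSON Lines file whose records carry "title" and "abstract" fields, and it uses a deliberately naive regular expression for sentence splitting, so its line count will not necessarily match the figure quoted above. The names iter_sentences and extract are hypothetical.

```python
import json
import re

# Naive sentence boundary (an assumption, not the repository's rule):
# ., ! or ? followed by whitespace and an uppercase letter.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def iter_sentences(record):
    """Yield whitespace-normalized sentences from one metadata record."""
    for field in ("title", "abstract"):  # assumed field names
        # Collapse newlines and repeated spaces; tolerate missing or null fields.
        text = " ".join((record.get(field) or "").split())
        for sentence in _SENTENCE_END.split(text):
            if sentence:
                yield sentence


def extract(json_path, txt_path):
    """Write one sentence per line and return the number of lines written."""
    count = 0
    with open(json_path, encoding="utf-8") as src, \
            open(txt_path, "w", encoding="utf-8") as dst:
        for line in src:  # one JSON object per input line
            line = line.strip()
            if not line:
                continue
            for sentence in iter_sentences(json.loads(line)):
                dst.write(sentence + "\n")
                count += 1
    return count
```

Reading the snapshot line by line keeps memory use flat, which matters for a file of this size. A proper sentence segmenter would handle abbreviations and formulas better than the regular expression used here.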

Download pre-trained models

Pre-trained models are available in the data/ directory.

data/arxiv-metadata-oai-snapshot.model is the SentencePiece model.
data/arxiv-metadata-oai-snapshot.vocab is the SentencePiece vocabulary file.
data/vectors.txt and data/vectors.bin are the learned GloVe vectors (50 dimensions).
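
As a rough illustration of how the text vectors might be consumed, the sketch below loads data/vectors.txt into a dictionary and compares two subword pieces by cosine similarity. It assumes the usual GloVe text layout, one piece per line followed by its space-separated components. The helper names load_vectors and cosine are hypothetical and not part of this repository.

```python
import math


def load_vectors(path):
    """Read GloVe text vectors into a dict mapping piece -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

A typical use would be vectors = load_vectors("data/vectors.txt") followed by cosine(vectors[p], vectors[q]) for two pieces p and q present in the vocabulary. Which pieces exist depends on the trained SentencePiece model, so look them up in the .vocab file first.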
