Skip to content
This repository has been archived by the owner on Feb 3, 2021. It is now read-only.
/ source2vec Public archive

Source code embeddings for various programming languages

License

Notifications You must be signed in to change notification settings

Jur1cek/source2vec

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Source2vec

This repository contains source code embeddings for various programming languages.

Source code files are preprocessed using standard tokenization (it could not be ideal solution for source code), this is work in progress. Also we are working on enhancing embeddings with ASTs and PDGs.

Java

Created from 1720 million tokens
Window (context) size 5
Minimum number of occurrences 10
Vocabulary http://dizp.fufygen.eu/embeddings/java/java_vocab.txt.zip
Sample visualisation (FastText) java fasttext vis

Word2vec

http://dizp.fufygen.eu/embeddings/java/word2vec/java_word2vec.zip

FastText

http://dizp.fufygen.eu/embeddings/java/fasttext/java_fasttext_model.bin.zip http://dizp.fufygen.eu/embeddings/java/fasttext/java_fasttext_model.vec.zip

GloVe

http://dizp.fufygen.eu/embeddings/java/glove/java_glove_vectors.bin.zip
http://dizp.fufygen.eu/embeddings/java/glove/java_glove_vectors.txt.zip

Python

Created from 838 million tokens
Window (context) size 5
Minimum number of occurrences 10
Vocabulary http://dizp.fufygen.eu/embeddings/python/python_vocab.txt.zip
Sample visualisation (FastText, 128 vector size) python fasttext vis

Word2vec

http://dizp.fufygen.eu/embeddings/python/word2vec/python_word2vec.zip

FastText

http://dizp.fufygen.eu/embeddings/python/fasttext/python_fasttext_model.bin.zip
http://dizp.fufygen.eu/embeddings/python/fasttext/python_fasttext_model.vec.zip
http://dizp.fufygen.eu/embeddings/python/fasttext/python_fasttext_model_128.bin.zip (vector size 128)
http://dizp.fufygen.eu/embeddings/python/fasttext/python_fasttext_model_128.vec.zip (vector size 128)

GloVe

http://dizp.fufygen.eu/embeddings/python/glove/python_glove_vectors.bin.zip
http://dizp.fufygen.eu/embeddings/python/glove/python_glove_vectors.txt.zip

C

Created from 6589 million tokens
Window (context) size 7
Minimum number of occurrences 20
Vocabulary http://dizp.fufygen.eu/embeddings/c/c_vocab.txt.zip
Sample visualisation (FastText) c fasttext vis

Word2vec

http://dizp.fufygen.eu/embeddings/c/word2vec/c_word2vec.zip

FastText

http://dizp.fufygen.eu/embeddings/c/fasttext/c_fasttext_model.bin.zip
http://dizp.fufygen.eu/embeddings/c/fasttext/c_fasttext_model.vec.zip

GloVe

http://dizp.fufygen.eu/embeddings/c/glove/c_glove_vectors.bin.zip
http://dizp.fufygen.eu/embeddings/c/glove/c_glove_vectors.txt.zip

...if you need another languages or different params feel free to open issue

Common parameters

Vector size 64
Word2vec vectors are created using skipgram method
FastText vectors are created using 2 to 6 character ngrams

How to load embeddings (python)

see load_embeddings.ipynb

Dimensionality reduction and visualisation of embeddings

see visualise_embeddings.ipynb

Paper comming out soon.

Releases

No releases published

Packages

No packages published