This repository serves as a hub for the following resources used in “Type- and Token-based Word Embeddings in the Digital Humanities”:
- Scripts and artifacts to fully reproduce the used test dataset,
- Scripts to perform the embedding and evaluation.
To fully reproduce the test dataset used, follow these steps. Note that we restrict our test set to those tokens present in the file evaluation/evaluation_vocabulary; a minimal sketch of this filtering follows the command listing below.
# 0. Switch to testsets directory
cd testsets/
# 1. Install requirements
pip install -r requirements.txt
# 2. Download the Schm280 and TOEFL datasets from IMS Stuttgart,
# download the MSimLex999 dataset from the project's website,
# and download the affective norms dataset from IMS Stuttgart
wget https://www.ims.uni-stuttgart.de/documents/ressourcen/lexika/analogies_ims/analogies.zip
unzip analogies.zip
wget https://leviants.com/wp-content/uploads/2020/01/SimLex_ALL_Langs_TXT_Format.zip
unzip SimLex_ALL_Langs_TXT_Format.zip
wget https://www.ims.uni-stuttgart.de/documents/ressourcen/experiment-daten/affective_norms.txt.gz
gunzip affective_norms.txt.gz
# 3. Acquire a copy of GermaNet v14, place it in folder "GN_V140", and verify its integrity
cat $(find GN_V140/GN_V140_XML/* | sort) | sha256sum
# > 09ca06d178edf193648807cb983181670fd443b458e8c148a012808780962925 -
# 4. Generate all GermaNet relations from the database
python germanet_extraction.py --germanet ./GN_V140/GN_V140_XML
# 5. Acquire a copy of "Das Wörterbuch der Synonyme", verify its integrity, and extract
# its content
sha256sum duden_synonym_woerterbuch.epub
# > 8389728c500fc8653bc5a7804e6c4fa2fe93eb5e8ef81679d4ac02ce00916407 duden_synonym_woerterbuch.epub
unzip duden_synonym_woerterbuch.epub -d duden_sources
# 6. Generate Duden synonymy prompts
python duden_extraction.py --duden duden_sources --ratings affective_norms.txt
# 7. Download the 2021-07-01 German Wiktionary Database dump
wget https://dumps.wikimedia.org/dewiktionary/20210701/dewiktionary-20210701-pages-articles.xml.bz2
# 8. Generate Wiktionary relation pairs
python wiktionary_extraction.py --wiktionary dewiktionary-20210701-pages-articles.xml.bz2
# 9. Merge all datasets and filter with our evaluation vocabulary
python generate_testsets.py --vocab ./evaluation_vocabulary
# Done: The full testset table is stored as testsets.tsv
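For orientation, here is a minimal sketch of the vocabulary filtering performed in step 9. It is not the actual generate_testsets.py (which merges all datasets and handles several relation formats); the two-column pair layout and the example file name are assumptions for illustration only.

# Hypothetical sketch of the vocabulary filtering in step 9; the real
# generate_testsets.py merges all datasets and handles more formats.
from pathlib import Path

def load_vocabulary(path):
    # Read the line-separated evaluation vocabulary into a set
    return {line.strip() for line in Path(path).read_text(encoding="utf-8").splitlines() if line.strip()}

def filter_pairs(tsv_path, vocab):
    # Keep only rows whose first two columns (the word pair) are in the vocabulary
    kept = []
    for line in Path(tsv_path).read_text(encoding="utf-8").splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[0] in vocab and fields[1] in vocab:
            kept.append(fields)
    return kept

vocab = load_vocabulary("evaluation_vocabulary")
print(len(filter_pairs("output_germanet.tsv", vocab)), "pairs retained")  # input file name assumed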
To generate the (full) GermaNet relation dataset, you need to get a copy of the GermaNet database. You can apply for a license for GermaNet at the website of the University of Tübingen. See also the papers Birgit Hamp and Helmut Feldweg, “GermaNet – a Lexical-Semantic Net for German” and Verena Henrich and Erhard Hinrichs, “GernEdiT – The GermaNet Editing Tool”. We use Release 14.
You can then generate the full dataset of all relations by invoking
python evaluation/germanet_extraction.py --germanet GN_V140_Root/GN_V140_XML --output output_germanet.tsv
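As a rough illustration of what such an extraction involves, the following is a minimal sketch, not the actual germanet_extraction.py. It assumes the GN_V140_XML layout with per-word-class files containing <lexUnit> elements (each with an <orthForm>) and a gn_relations.xml listing <lex_rel> entries, and it only emits lexical relations.

# Hypothetical sketch of reading lexical relations from the GermaNet XML release;
# the actual germanet_extraction.py covers more relation types and word classes.
import glob
import xml.etree.ElementTree as ET

def read_lexunit_forms(germanet_dir):
    # Map lexical-unit ids to their (first) orthographic form
    forms = {}
    for path in glob.glob(f"{germanet_dir}/*.xml"):
        if path.endswith("gn_relations.xml"):
            continue
        for lex_unit in ET.parse(path).getroot().iter("lexUnit"):
            orth = lex_unit.find("orthForm")
            if orth is not None and orth.text:
                forms[lex_unit.get("id")] = orth.text
    return forms

def iter_lexical_relations(germanet_dir, forms):
    # Yield (relation_name, word_from, word_to) triples, e.g. antonymy pairs
    root = ET.parse(f"{germanet_dir}/gn_relations.xml").getroot()
    for rel in root.iter("lex_rel"):
        w_from, w_to = forms.get(rel.get("from")), forms.get(rel.get("to"))
        if w_from and w_to:
            yield rel.get("name"), w_from, w_to

forms = read_lexunit_forms("GN_V140/GN_V140_XML")
for name, a, b in iter_lexical_relations("GN_V140/GN_V140_XML", forms):
    print(f"{name}\t{a}\t{b}")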
We have machine-translated the original MEN Test Collection by Bruni, Tran, and Baroni. See the separate README for further details and license information. See also the publication Elia Bruni, Nam Khanh Tran and Marco Baroni, “Multimodal Distributional Semantics”.
Please refer to the paper “Multilingual Reliability and ‘Semantic’ Structure of Continuous Word Spaces” for a detailed description of the datasets. Both Schm280 and TOEFL can be separately downloaded from the website of the University of Stuttgart.
To generate the (full) dataset, you need (a) to buy a digital copy of Das Wörterbuch der Synonyme (EAN: 9783411913169) and (b) to download the (free) affective norms word list by Köper and Schulte im Walde. The word list is described in the paper Maximilian Köper and Sabine Schulte im Walde, “Automatically Generated Affective Norms of Abstractness, Arousal, Imageability and Valence for 350 000 German Lemmas”.
You can then generate the full dataset by invoking
python evaluation/duden_extraction.py --duden extracted_epub_root --ratings affective_norms.txt --output output_duden.tsv
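Since the EPUB is just a ZIP of XHTML files, the extraction boils down to walking those files and collecting headwords with their synonym lists. The sketch below only illustrates that idea; the file extension and the element/class selectors are placeholders, not the dictionary's actual markup (which duden_extraction.py handles), and the ratings-based filtering is omitted.

# Hypothetical sketch of walking the unpacked EPUB; the file extension and the
# CSS selectors below are placeholders -- the dictionary's real markup differs
# and is handled by duden_extraction.py (which also applies the ratings filter).
from pathlib import Path
from bs4 import BeautifulSoup

def iter_entries(epub_root):
    # Yield (headword, [synonyms]) pairs from every XHTML file in the EPUB
    for path in sorted(Path(epub_root).rglob("*.xhtml")):
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        for entry in soup.select("p.entry"):            # placeholder selector
            headword = entry.select_one("b.lemma")       # placeholder selector
            if headword is None:
                continue
            synonyms = [s.get_text(strip=True) for s in entry.select("span.synonym")]
            yield headword.get_text(strip=True), synonyms

for lemma, synonyms in iter_entries("duden_sources"):
    print(lemma, ", ".join(synonyms))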
You can generate a (full) dataset of relations from a specific German Wiktionary database dump. We use the July 2021 dump, which you can download from the Wikimedia dumps server (see the URL in step 7 above; choose “Articles, templates, media/file descriptions, and primary meta-pages”).
After downloading, you can generate the full dataset by invoking
python evaluation/wiktionary_extraction.py --wiktionary wiktionary_dump.xml.bz2 --output output_wiktionary.tsv
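To illustrate what the extraction script has to do, here is a minimal sketch that streams the compressed dump and pulls link targets out of {{Synonyme}} blocks. It is not the actual wiktionary_extraction.py, which parses the wikitext sections (synonyms, antonyms, hypernyms, ...) far more carefully; the regular expressions below are simplifying assumptions about the German Wiktionary markup.

# Hypothetical sketch of streaming synonym candidates from a German Wiktionary
# dump; the real wiktionary_extraction.py parses the wikitext much more carefully.
import bz2
import re
import xml.etree.ElementTree as ET

# Simplifying assumptions about the markup: a {{Synonyme}} block runs until the
# next template at the start of a line, and synonyms appear as [[links]].
SYN_BLOCK = re.compile(r"\{\{Synonyme\}\}(.*?)(?=\n\{\{|\Z)", re.S)
LINK = re.compile(r"\[\[([^\]|#]+)")

def iter_synonym_pairs(dump_path):
    title = None
    with bz2.open(dump_path, "rb") as fh:
        for _, elem in ET.iterparse(fh, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the MediaWiki XML namespace
            if tag == "title":
                title = elem.text
            elif tag == "text" and title and elem.text:
                for block in SYN_BLOCK.findall(elem.text):
                    for target in LINK.findall(block):
                        yield title, target
            elif tag == "page":
                elem.clear()  # keep memory usage flat while streaming

for word, synonym in iter_synonym_pairs("dewiktionary-20210701-pages-articles.xml.bz2"):
    print(f"{word}\t{synonym}")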
See the paper Ira Leviant and Roi Reichart, “Separated by an Un-common Language: Towards Judgment Language Informed Vector Space Modeling” for more information and the corresponding website to download the dataset.
Here, we describe the steps to generate a distilled type-based embedding from BERT's Transformer output for a set of query words by sampling context sentences from a corpus.
- Install requirements:
  pip install -r embedding/requirements.txt
- Tokenize the entire corpus for BERT and generate frequency statistics:
  python ./embedding/corpus_tokenizer.py --input CORPUS_FILE \
    --output processed_corpus.txt --vocab-out corpus_vocab.txt
- Sample (100) context sentences for the query words given in the line-separated list query_words.txt:
  python ./embedding/corpus_tokenizer.py --corpus processed_corpus.txt \
    --vocab corpus_vocab.txt --query-words query_words.txt --count 100 \
    --output contexts.txt
- Perform the BERT forward passes. You can pass a comma-separated list of desired vectorization/pooling/aggregation methods to use (see the sketch after this list):
  python ./embedding/embedder.py --contexts contexts.txt \
    --vectorizations sum --poolings nopooling,mean --aggregations mean,median \
    --output-prefix embeddings_
  This will write one embedding per distillation combination, e.g. embeddings_sum-nopooling-mean.bin, embeddings_sum-nopooling-median.bin, ... Each embedding is saved in binary word2vec format.
- Evaluate embedding(s) on a testsets file generated by generate_testsets.py as described above. This will compute scores (F1, Spearman rank, ...) on the respective datasets of the testsets file:
  python ./embedding/evaluation.py --testsets testsets.tsv --output evaluation_output.tsv embeddings_*.bin
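The sketch below, referenced from the forward-pass step above, shows what one such distillation combination could look like: summing the last four hidden layers, mean-pooling over a word's subword tokens, and mean-aggregating over its sampled contexts. These interpretations of the flag names, the model name, and the contexts file format (one "word<TAB>sentence" pair per line) are assumptions; the actual embedder.py is the authoritative implementation.

# Hypothetical sketch of one distillation combination (sum over the last four
# hidden layers, mean pooling over subword tokens, mean aggregation over the
# sampled contexts); model name and contexts file format are assumptions, and
# the actual embedder.py is the authoritative implementation.
from collections import defaultdict
import numpy as np
import torch
from gensim.models import KeyedVectors
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-german-cased"  # assumption; the model used in the paper may differ
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def token_vector(word, sentence):
    # Contextualized vector for `word` in `sentence`: sum the last four hidden
    # layers, then average over the word's subword tokens (first occurrence only).
    enc = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = torch.stack(model(**enc).hidden_states[-4:]).sum(dim=0)[0]  # (seq_len, dim)
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    sent_ids = enc["input_ids"][0].tolist()
    for start in range(len(sent_ids) - len(word_ids) + 1):
        if sent_ids[start:start + len(word_ids)] == word_ids:
            return hidden[start:start + len(word_ids)].mean(dim=0).numpy()
    return None  # word not found with this (simplistic) subword matching

def distill(contexts_path, out_path):
    # Aggregate one vector per query word (mean over its context vectors) and
    # save the result in binary word2vec format, as in the step above.
    per_word = defaultdict(list)
    with open(contexts_path, encoding="utf-8") as fh:
        for line in fh:
            if "\t" not in line:
                continue
            word, sentence = line.rstrip("\n").split("\t", 1)
            vec = token_vector(word, sentence)
            if vec is not None:
                per_word[word].append(vec)
    kv = KeyedVectors(vector_size=model.config.hidden_size)
    kv.add_vectors(list(per_word), [np.mean(vecs, axis=0) for vecs in per_word.values()])
    kv.save_word2vec_format(out_path, binary=True)

distill("contexts.txt", "embeddings_sum-mean-mean.bin")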
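Likewise, here is a minimal sketch of the similarity part of the evaluation in the last step: the Spearman rank correlation between gold ratings and cosine similarities from one embedding. The column layout assumed for testsets.tsv (dataset, word1, word2, score) is a guess for illustration, and the real evaluation.py also computes the classification scores (F1, ...) for the relation datasets.

# Hypothetical sketch of the Spearman-based part of the evaluation; assumes
# testsets.tsv rows of the form "dataset<TAB>word1<TAB>word2<TAB>score".
# The real evaluation.py also scores relation datasets (F1 etc.).
from collections import defaultdict
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

def evaluate_similarity(embedding_path, testsets_path):
    kv = KeyedVectors.load_word2vec_format(embedding_path, binary=True)
    gold, predicted = defaultdict(list), defaultdict(list)
    with open(testsets_path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue
            dataset, w1, w2, score = fields[:4]
            try:
                value = float(score)
            except ValueError:
                continue  # skip rows of non-similarity (relation) datasets
            if w1 in kv and w2 in kv:
                gold[dataset].append(value)
                predicted[dataset].append(kv.similarity(w1, w2))
    return {name: spearmanr(gold[name], predicted[name]).correlation for name in gold}

for name, rho in evaluate_similarity("embeddings_sum-nopooling-mean.bin", "testsets.tsv").items():
    print(f"{name}\tspearman={rho:.3f}")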
This repository (with the exception of the German translation of the MEN Test Collection) is licensed under the MIT license.
When you use this work in a publication, please cite our paper:
A. Ehrmanntraut, T. Hagen, L. Konle, F. Jannidis, “Type- and Token-based Word Embeddings in the Digital Humanities”, in: Proceedings of the Second Conference on Computational Humanities Research (CHR 2021), Amsterdam, 2021.