Preprocessing steps to extend the multilingual verb lexicon SynSemClass (Czech and English) with German entries. For our experiments, we use the English-German ParaCrawl dataset from https://paracrawl.eu/.
The repository provides scripts for:
- preprocessing
- word alignment with the MGIZA tool
- creating a dictionary with the most common word alignments
Using the ParaCrawl dataset, you can split the tab-separated file into two files on the command line:
cut -f1 -d$'\t' file.txt > output-file1.txt
cut -f2 -d$'\t' file.txt > output-file2.txt
In our case:
cut -f2 -d$'\t' en-de.txt > paracrawl.en
cut -f1 -d$'\t' en-de.txt > paracrawl.de
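If cut is not available, the same split can be sketched in Python. The function name split_parallel is our own; the column order (German first, English second) mirrors the cut commands above, so adjust the indices if your download differs.

```python
def split_parallel(tsv_path, trg_path, src_path):
    """Split a tab-separated parallel corpus into two monolingual files.

    Column 1 is written to trg_path, column 2 to src_path, mirroring
    the cut commands above. Malformed lines are skipped.
    """
    with open(tsv_path, encoding="utf-8") as tsv, \
         open(trg_path, "w", encoding="utf-8") as trg, \
         open(src_path, "w", encoding="utf-8") as src:
        for line in tsv:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 2:
                continue  # skip lines without both columns
            trg.write(parts[0] + "\n")
            src.write(parts[1] + "\n")
```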
Because of potential memory issues, it is recommended to reduce the size of the dataset, for example to 20 million lines. Note that redirecting head output back into the same file truncates it before it is read, so write to a temporary file first:
head -n 20000000 paracrawl.de > paracrawl.de.tmp && mv paracrawl.de.tmp paracrawl.de
head -n 20000000 paracrawl.en > paracrawl.en.tmp && mv paracrawl.en.tmp paracrawl.en
Filter the corpus to remove lines that do not end with a dot (titles, chopped sentences, other website fragments) and lines containing more than one sentence:
python preprocCorpus.py --input_src $PATH_TO_CORPUS_SRC_LANG --input_trg $PATH_TO_CORPUS_TRG_LANG
In our example:
python preprocCorpus.py --input_src paracrawl.en --input_trg paracrawl.de
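The filtering criteria described above can be sketched as follows. This is a minimal reimplementation for illustration only; keep_pair is a hypothetical helper, and the exact heuristics in preprocCorpus.py may differ.

```python
import re

def keep_pair(src_line, trg_line):
    """Return True only if both sides end with a dot and contain a single
    sentence. A sketch of the filtering criteria described above; the
    heuristics in preprocCorpus.py may differ."""
    for line in (src_line, trg_line):
        line = line.strip()
        if not line.endswith("."):
            return False  # titles, chopped sentences, website fragments
        if re.search(r"[.!?]\s+\S", line):
            return False  # sentence-final mark mid-line: more than one sentence
    return True
```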
For further preprocessing of the files (tokenizing and lowercasing), we recommend following this tutorial: https://fabioticconi.wordpress.com/2011/01/17/how-to-do-a-word-alignment-with-giza-or-mgiza-from-parallel-corpus/.
Save the preprocessed files inside the ./dataset folder.
For working with MGIZA, you can continue to follow the tutorial or use our Dockerfile. The Dockerfile takes the preprocessed input files, makes classes, installs and compiles MGIZA and, finally, creates the word alignments.
docker build -f Dockerfile -t mgiza-tool .
docker run -it --rm mgiza-tool
If you use the Dockerfile, you need to adjust:
- in Dockerfile: the local path to the ./dataset directory with input files in the line COPY $local_dir /mgiza/mgizapp/bin. This is only necessary if the folder is not mounted via docker run -v /dataset:/mgiza/mgizapp/bin -it --rm mgiza-tool.
- in Dockerfile: the input file names (for example, we used paracrawl.en for English as source language and paracrawl.de for German as target language).
- in configfile.txt: adjust the file names; in the line "ncpus", set the number of CPUs you want to use for processing.
findWord.py takes the MGIZA output files and creates a table with the most frequent alignments from English to German (threshold 0.2%). You can run it from the command line:
python3 findWord.py -c $PATH_TO_MGIZA_OUTPUT_FILE -w $VERB
In our example, searching for the most common German alignments for the verb "absorb":
python3 findWord.py -c ./output/en_de.dict.A3.final.part000 -w absorb
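The core of this lookup can be sketched as follows, assuming the standard GIZA++/MGIZA *.A3.final format (triplets of a comment line, the target sentence, and the source tokens with their aligned target positions). count_alignments is a hypothetical helper; the exact parsing and thresholding in findWord.py may differ.

```python
import re
from collections import Counter

def count_alignments(a3_path, word):
    """Count the target words aligned to one source word in a *.A3.final
    file. Each sentence pair occupies three lines: a '# Sentence pair' comment,
    the target sentence, and source tokens like 'absorbs ({ 3 })' giving
    1-based target positions."""
    counts = Counter()
    with open(a3_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    for i in range(0, len(lines) - 2, 3):
        target = lines[i + 1].split()
        aligned = lines[i + 2]
        for tok, positions in re.findall(r"(\S+) \(\{([\d ]*)\}\)", aligned):
            if tok != word:
                continue
            for p in positions.split():
                idx = int(p) - 1  # positions are 1-based
                if 0 <= idx < len(target):
                    counts[target[idx]] += 1
    return counts
```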
For SynSemClass, we need two output files per synonym class. createOutputFiles.py takes the MGIZA output file and the filtered corpus files and creates:
- a file with the most common word alignments: candidate_verbs_{VERB_NAME}_{CLASS_NAME}.csv
- a file with verbs and candidate example sentences: candidate_sentences_{EN-VERB_NAME}_{CLASS_NAME}.csv
- a logfile: logfile_{classname}_{class_id}.csv
python3 createOutputFiles.py \
--classname $SYNSEM_CLASS_VERB_NAME \
--classid $SYNSEM_CLASS_ID \
--mgiza_output $PATH_TO_MGIZA_OUTPUTFILE \
--input_src $PATH_TO_CORPUS_SRC_LANG \
--input_trg $PATH_TO_CORPUS_TRG_LANG \
--list $VERBLIST_SRC_LANG # comma-separated verb list without spaces
python3 createOutputFiles.py \
--classname base \
--classid vec00179 \
--mgiza_output ./SynSemClass_Ger_Extension/input_files/en_de_mini.dict.A3.final.part000 \
--input_trg ./SynSemClass_Ger_Extension/input_files/paracrawl.de \
--input_src ./SynSemClass_Ger_Extension/input_files/paracrawl.en \
--list arrange,assemble,base,build,construct,create,develop
- -cn, --classname: verb classname
- -cid, --classid: classid vec00XXX
- -mo, --mgiza_output: path to the MGIZA output dictionary
- -i1, --input_src: path to the input file in the source language (corpus before tokenization or lowercasing)
- -i2, --input_trg: path to the input file in the target language (corpus before tokenization or lowercasing)
- -l, --list: comma-delimited list input with no spaces
For --list, include the English verbs belonging to the class as a comma-separated list (no spaces). To create the list, you can:
- go to the source code of the SynSemClass website and search for the tag <span class="cms_label" title="Classmembers for class"...
- copy the list of English verbs + IDs
- serialize the English verbs as a list without IDs on the command line, using this example as a template:
echo "ask (EngVallex-ID-ev-w141f2), inquire (EngVallex-ID-ev-w1710f1), interview (EngVallex-ID-ev-w1741f1), poll (EngVallex-ID-ev-w2324f1), question (EngVallex-ID-ev-w2465f2)" | sed 's/([^()]*)//g' | tr -d ' '
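The same transformation can be done in Python: strip the parenthesized EngVallex IDs, then remove all spaces. The helper name serialize_verbs is our own.

```python
import re

def serialize_verbs(raw):
    """Turn 'verb (EngVallex-ID-...), verb (...)' into 'verb,verb',
    mirroring the pipeline sed 's/([^()]*)//g' | tr -d ' ' above."""
    return re.sub(r"\([^()]*\)", "", raw).replace(" ", "")
```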
Zdenka Uresova, Karolina Zaczynska, Peter Bourgonje, Eva Fučíková, Georg Rehm, and Jan Hajic. Making a Semantic Event-type Ontology Multilingual. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022), pages 1332-1343, Marseille, France, June 20-25, 2022. European Language Resources Association (ELRA).
Peter Bourgonje, Karolina Zaczynska, Julián Moreno Schneider, Georg Rehm, Zdenka Uresova, and Jan Hajic. SynSemClass for German: Extending a Multilingual Verb Lexicon. In Adrian Paschke, Georg Rehm, Jamal Al Qundus, Clemens Neudecker, and Lydia Pintscher, editors, Proceedings of QURATOR 2021 - Conference on Digital Curation Technologies, Berlin, Germany, 11/12 February 2021. CEUR Workshop Proceedings, Volume 2836.