A Step-by-Step Manual of NiuTrans.Phrase
The NiuTrans system is a "data-driven" MT system which requires "data" for training and tuning. Users need to prepare the following data files before running the system.
a) Training data: bilingual sentence pairs and word alignments.
b) Tuning data: source sentences, each with one or more reference translations.
c) Test data: some new sentences.
d) Evaluation data: reference translations of the test sentences.
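For illustration, the training data is stored one sentence per line, and word alignments are commonly written as "i-j" index pairs linking the i-th source word to the j-th target word. The sentences and the "i-j" convention below are assumptions for illustration only; check "description-of-the-sample-data" for the exact format NiuTrans expects:

```python
# A sketch of the bilingual training data: one sentence per line in each file,
# with the word alignment for each sentence pair written as "i-j" index pairs.
# The "i-j" convention and the toy sentences are assumptions for illustration.

src_line = "wo xihuan yinyue"          # hypothetical tokenized source sentence
tgt_line = "i like music"              # corresponding target sentence
aln_line = "0-0 1-1 2-2"               # word alignment for this sentence pair

def aligned_pairs(src, tgt, aln):
    """Yield (source_word, target_word) pairs given an 'i-j' alignment string."""
    s, t = src.split(), tgt.split()
    for link in aln.split():
        i, j = map(int, link.split("-"))
        yield s[i], t[j]

print(list(aligned_pairs(src_line, tgt_line, aln_line)))
# With the toy data above: [('wo', 'i'), ('xihuan', 'like'), ('yinyue', 'music')]
```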
In the NiuTrans package, some sample files are offered for experimenting with the system and studying the format requirements. They are located in "NiuTrans/sample-data/sample-submission-version".
sample-submission-version/
-- TM-training-set/ # word-aligned bilingual corpus (100,000 sentence-pairs)
-- chinese.txt # source sentences
-- english.txt # target sentences (case-removed)
-- Alignment.txt # word alignments of the sentence-pairs
-- LM-training-set/
-- e.lm.txt # monolingual corpus for training language model (100K target sentences)
-- Dev-set/
-- Niu.dev.txt # development dataset for weight tuning (400 sentences)
-- Test-set/
-- Niu.test.txt # test dataset (1K sentences)
-- Reference-for-evaluation/
-- Niu.test.reference # references of the test sentences (1K sentences)
-- Recaser-training-set/
-- english.keepcase.txt # monolingual corpus for training recasing model (10K sentences)
-- description-of-the-sample-data # a description of the sample data
- Format: please unpack "NiuTrans/sample-data/sample.tar.gz", and check "description-of-the-sample-data" to find more information about the data formats.
In the following, the above data files are used to illustrate how to run the NiuTrans system (e.g., how to train MT models, tune feature weights, and decode test sentences).
- Instructions (perl is required. Also, Cygwin is required for Windows users)
$> cd NiuTrans/sample-data/
$> tar xzf sample.tar.gz
$> cd ../
$> mkdir -p work/model.phrase/
$> cd scripts/
$> perl NiuTrans-phrase-train-model.pl \
-tmdir ../work/model.phrase/ \
-s ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
-t ../sample-data/sample-submission-version/TM-training-set/english.txt \
-a ../sample-data/sample-submission-version/TM-training-set/Alignment.txt
"-tmdir" specifies the target directory for generating various table and model files.
"-s", "-t" and "-a" specify the source sentences, the target sentences and the alignments between them (one sentence per line).
- Output: three files are generated in "NiuTrans/work/model.phrase/":
- me.reordering.table # ME reorder model
- msd.reordering.table # MSD reorder model
- phrase.translation.table # phrase translation model
- Note: Please enter "scripts/" before running the script "NiuTrans-phrase-train-model.pl".
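As background, the phrase translation table is built by collecting phrase pairs that are consistent with the word alignment, i.e., no alignment link crosses a pair's boundary. The following is a minimal sketch of that standard extraction idea on toy data; it is not the actual NiuTrans implementation, which additionally computes translation scores and the reordering models:

```python
# Sketch of phrase extraction during training: collect every phrase pair that
# is "consistent" with the word alignment (no alignment link has exactly one
# end inside the pair's spans). Toy data; real training also estimates
# translation probabilities and the ME/MSD reordering models.

src = "wo xihuan yinyue".split()
tgt = "i like music".split()
links = {(0, 0), (1, 1), (2, 2)}          # "i-j" word alignment links

def extract(src, tgt, links, max_len=3):
    pairs = set()
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            for t1 in range(len(tgt)):
                for t2 in range(t1, min(t1 + max_len, len(tgt))):
                    inside = [(i, j) for i, j in links
                              if s1 <= i <= s2 and t1 <= j <= t2]
                    crossing = [(i, j) for i, j in links
                                if (s1 <= i <= s2) != (t1 <= j <= t2)]
                    if inside and not crossing:
                        pairs.add((" ".join(src[s1:s2 + 1]),
                                   " ".join(tgt[t1:t2 + 1])))
    return pairs

print(sorted(extract(src, tgt, links)))   # six consistent pairs for this toy alignment
```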
- Instructions
$> cd ../
$> mkdir work/lm/
$> cd scripts/
$> perl NiuTrans-training-ngram-LM.pl \
-corpus ../sample-data/sample-submission-version/LM-training-set/e.lm.txt \
-ngram 3 \
-vocab ../work/lm/lm.vocab \
-lmbin ../work/lm/lm.trie.data
"-ngram" specifies the order of n-gram LM. E.g. "-ngram 3" indicates a 3-gram language model.
"-vocab" specifies where the target-side vocabulary is generated.
"-lmbin" specifies where the language model file is generated.
- Output: two files are generated and placed in "NiuTrans/work/lm/":
- lm.vocab # target-side vocabulary
- lm.trie.data # binary-encoded language model
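Conceptually, the 3-gram LM trained above estimates P(w3 | w1 w2) from trigram counts over the monolingual corpus. A minimal counting sketch on toy data follows; the real training additionally applies smoothing and produces the binary trie encoding, which this omits:

```python
from collections import Counter

# Minimal sketch of what a 3-gram LM counts: trigram and bigram-context
# frequencies over the monolingual corpus, giving the maximum-likelihood
# estimate P(w3 | w1 w2). Real LM training adds smoothing and a binary
# trie encoding.

corpus = ["the cat sat", "the cat ran"]   # toy stand-in for e.lm.txt

tri, bi = Counter(), Counter()
for line in corpus:
    words = ["<s>", "<s>"] + line.split() + ["</s>"]
    for a, b, c in zip(words, words[1:], words[2:]):
        tri[(a, b, c)] += 1
        bi[(a, b)] += 1

def p(w3, w1, w2):
    """Unsmoothed ML estimate of P(w3 | w1 w2)."""
    return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

print(p("sat", "the", "cat"))   # 0.5 with the toy corpus
```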
- Instructions
$> cd scripts/
$> perl NiuTrans-phrase-generate-mert-config.pl \
-tmdir ../work/model.phrase/ \
-lmdir ../work/lm/ \
-ngram 3 \
-o ../work/NiuTrans.phrase.user.config
"-tmdir" specifies the directory that holds the translation table and the reordering model files.
"-lmdir" specifies the directory that holds the n-gram language model and the target-side vocabulary.
"-ngram" specifies the order of n-gram language model.
"-o" specifies the output (i.e. a config file).
- Output: a config file is generated and placed in "NiuTrans/work/":
- NiuTrans.phrase.user.config # configuration file for MERT and decoding
If the dev and test data are prepared in advance, a popular trick to improve system efficiency is to prune the translation table by discarding "useless" phrases containing n-grams that never appear in the dev/test sentences. This step is suggested for faster experiments in your research work. Note that it is optional and can be skipped as needed.
- Instructions (perl is required)
$> cd ..
$> cat sample-data/sample-submission-version/Dev-set/Niu.dev.txt \
sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
> sample-data/sample-submission-version/Dev-set/Niu.dev.and.test.txt
$> bin/NiuTrans.PhraseExtractor --FILPD \
-dev sample-data/sample-submission-version/Dev-set/Niu.dev.and.test.txt \
-in work/model.phrase/phrase.translation.table \
-out work/model.phrase/phrase.translation.table.filterDevAndTest \
-maxlen 10 \
-rnum 1
$> perl scripts/filter.msd.model.pl \
work/model.phrase/phrase.translation.table.filterDevAndTest \
work/model.phrase/msd.reordering.table \
> work/model.phrase/msd.reordering.table.filterDevAndTest
"-dev" specifies the dataset used for filtering. Usually the development and test datasets are merged to form such a dataset.
"-in" specifies the translation table to be filtered.
"-out" specifies the resulting (i.e., pruned) translation table.
"-maxlen" specifies the maximum phrase length in the phrase translation table.
"-rnum" specifies how many reference translations per source sentence are provided.
- Output: "NiuTrans/work/NiuTrans.phrase.user.config" is updated. E.g., viewing the file, you will find that the line
param="MSD-Reordering-Model" value="../work/model.phrase/msd.reordering.table"
is replaced with
param="MSD-Reordering-Model" value="../work/model.phrase/msd.reordering.table.filterDevAndTest"
param="Phrase-Table" value="../work/model.phrase/phrase.translation.table"
is replaced with
param="Phrase-Table" value="../work/model.phrase/phrase.translation.table.filterDevAndTest"
- Instructions (perl is required; run the script from within "scripts/")
$> perl NiuTrans-phrase-mert-model.pl \
-dev ../sample-data/sample-submission-version/Dev-set/Niu.dev.txt \
-c ../work/NiuTrans.phrase.user.config \
-nref 1 \
-r 2 \
-l ../work/mert-model.log
"-dev" specifies the development dataset (or tuning set) for weight tuning.
"-c" specifies the configuration file generated in the previous steps.
"-nref" specifies how many reference translations per source-sentence are provided.
"-r" specifies how many rounds the MERT performs (by default, 1 round = 15 MERT iterations).
"-l" specifies the log file generated by MERT.
- Output: the optimized feature weights are recorded in the configuration file "NiuTrans/work/NiuTrans.phrase.user.config". They will then be used in decoding the test sentences.
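To give an intuition for what MERT does, the sketch below tunes a single feature weight by scanning candidate values, re-ranking each sentence's n-best list under that weight, and keeping the value with the lowest error. The n-best data is toy, and real MERT performs an exact line search over BLEU rather than this grid scan:

```python
# Toy sketch of the core MERT move: pick the value of one feature weight
# that minimizes error over the tuning set, where each weight value
# re-ranks every sentence's n-best list. Real MERT optimizes all weights
# with an exact line search over BLEU; this is only an illustration.

# n-best lists: per sentence, ((feature1, feature2), sentence-level quality)
nbest = [
    [((2.0, 0.0), 0.2), ((1.0, 1.0), 0.9)],   # sentence 1 candidates
    [((0.5, 0.2), 0.8), ((1.5, 0.1), 0.3)],   # sentence 2 candidates
]

def error(w):
    """Average (1 - quality) of the candidates selected with weight w."""
    total = 0.0
    for cands in nbest:
        best = max(cands, key=lambda c: c[0][0] + w * c[0][1])
        total += 1.0 - best[1]
    return total / len(nbest)

best_w = min((w / 10.0 for w in range(0, 31)), key=error)
print(best_w)   # 1.1 with this toy data: above w = 1 sentence 1 flips to its better candidate
```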
- Instructions (perl is required)
$> perl NiuTrans-phrase-decoder-model.pl \
-test ../sample-data/sample-submission-version/Test-set/Niu.test.txt \
-c ../work/NiuTrans.phrase.user.config \
-output 1best.out
"-test" specifies the test dataset (one sentence per line).
"-c" specifies the configuration file.
"-output" specifies the translation result file (the result is dumped to "stdout" if this option is not specified).
- Output: a new file is generated in "NiuTrans/scripts/":
- 1best.out # 1-best translation of the test sentences
- Instructions (perl is required)
$> mkdir ../work/model.recasing -p
$> perl NiuTrans-training-recase-model.pl \
-corpus ../sample-data/sample-submission-version/Recaser-training-set/english.keepcase.txt \
-modelDir ../work/model.recasing
$> perl NiuTrans-recaser.pl \
-config ../work/model.recasing/recased.config.file \
-test 1best.out \
-output 1best.out.recased
"-corpus" specifies the training dataset (one sentence per line).
"-modelDir" specifies the directory that holds the model and config files.
- Output: a config file and several model files are generated and placed in "NiuTrans/work/model.recasing":
- recased.config.file # Recasing config file
- recased.lm.trie.data
- recased.lm.vocab
- recased.null
- recased.phrase.translation.table
- Output: a new file is generated in "NiuTrans/scripts/":
- 1best.out.recased
- Note: this step is unnecessary when the target language has no case information (e.g., English-Chinese translation).
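The effect of recasing can be approximated by a simple sketch that maps each lowercased word to its most frequent cased form in the training text. The actual NiuTrans recaser is a phrase-based model with its own language model, so this only shows the intuition:

```python
from collections import Counter, defaultdict

# Sketch of what a recaser learns: for each lowercased word, the most
# frequent surface casing seen in the cased training corpus. The real
# recaser is a phrase-based model with an LM; this is only the intuition.

cased_corpus = ["John lives in New York", "New York is large"]

forms = defaultdict(Counter)
for line in cased_corpus:
    for w in line.split():
        forms[w.lower()][w] += 1

def recase(sentence):
    """Replace each word by its most frequent cased form, if known."""
    return " ".join(forms[w].most_common(1)[0][0] if w in forms else w
                    for w in sentence.split())

print(recase("john lives in new york"))   # "John lives in New York"
```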
- Instructions (perl is required)
$> perl NiuTrans-detokenizer.pl \
-in 1best.out.recased \
-out 1best.out.recased.detoken
"-in" specifies the input file.
"-out" specifies the output file.
- Output: a new file is generated in "NiuTrans/scripts/":
- 1best.out.recased.detoken
- Note: again, this step is only needed for translation tasks where the target language is tokenized during data preparation.
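The idea of detokenization can be illustrated with a minimal regex-based sketch that reattaches punctuation split off by tokenization; the actual rules in "NiuTrans-detokenizer.pl" may differ:

```python
import re

# Minimal detokenization sketch: reattach punctuation that tokenization
# split off. A real detokenizer handles many more cases (quotes,
# contractions, currency); this only shows the idea.

def detokenize(text):
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)   # no space before punctuation
    text = re.sub(r"\(\s+", "(", text)             # no space after "("
    text = re.sub(r"\s+\)", ")", text)             # no space before ")"
    return text

print(detokenize("hello , world ( test ) !"))   # "hello, world (test)!"
```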
- Instructions (perl is required. Suppose that the MT result is in "1best.out.recased.detoken")
$> perl NiuTrans-generate-xml-for-mteval.pl \
-1f 1best.out \
-tf ../sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
-rnum 1
$> perl mteval-v13a.pl \
-r ref.xml \
-s src.xml \
-t tst.xml
"-1f" specifies the file of the 1-best translations of the test dataset.
"-tf" specifies the file of the source sentences and their reference translations of the test dataset.
"-r" specifies the file of the reference translations.
"-s" specifies the file of source sentences.
"-t" specifies the file of (1-best) translations generated by the MT system.
- Output: the IBM-version BLEU score is displayed. If everything goes well, you will obtain a score of about 0.2412 on the sample data set.
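The BLEU metric reported by mteval combines modified n-gram precision (n = 1..4) with a brevity penalty. A simplified single-reference, unsmoothed sketch of that standard definition:

```python
import math
from collections import Counter

# Sketch of the BLEU score: modified n-gram precision (n = 1..4) combined
# with a brevity penalty. Simplified to a single reference and no
# smoothing; mteval-v13a.pl additionally handles XML input and multiple
# references.

def ngram_counts(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
        match = sum(min(c, r[g]) for g, c in h.items())   # clipped matches
        total = max(1, sum(h.values()))
        log_prec += math.log(max(match, 1e-9) / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))      # brevity penalty
    return bp * math.exp(log_prec)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))   # 1.0
```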
- Note: the script mteval-v13a.pl relies on the Perl package XML::Parser. If XML::Parser is not installed on your system, run the following commands to install it (assuming the XML-Parser-2.41 source tarball has been downloaded):
$> su root
$> tar xzf XML-Parser-2.41.tar.gz
$> cd XML-Parser-2.41/
$> perl Makefile.PL
$> make install