A Step-by-Step Manual of NiuTrans.Phrase
The NiuTrans system is a "data-driven" MT system which requires "data" for training and tuning. Users need to prepare the following data files before running the system.
a) Training data: bilingual sentence pairs and word alignments.
b) Tuning data: source sentences, each with one or more reference translations.
c) Test data: some new sentences.
d) Evaluation data: reference translations of the test sentences.
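For illustration, the training data is stored one sentence per line, and word alignments are commonly written as "i-j" index pairs linking the i-th source word to the j-th target word. The sentences and the "i-j" convention below are assumptions for illustration only; check "description-of-the-sample-data" for the exact format NiuTrans expects:

```python
# A sketch of the bilingual training data: one sentence per line in each file,
# with the word alignment for each sentence pair written as "i-j" index pairs.
# The "i-j" convention and the toy sentences are assumptions for illustration.

src_line = "wo xihuan yinyue"          # hypothetical tokenized source sentence
tgt_line = "i like music"              # corresponding target sentence
aln_line = "0-0 1-1 2-2"               # word alignment for this sentence pair

def aligned_pairs(src, tgt, aln):
    """Yield (source_word, target_word) pairs given an 'i-j' alignment string."""
    s, t = src.split(), tgt.split()
    for link in aln.split():
        i, j = map(int, link.split("-"))
        yield s[i], t[j]

print(list(aligned_pairs(src_line, tgt_line, aln_line)))
# With the toy data above: [('wo', 'i'), ('xihuan', 'like'), ('yinyue', 'music')]
```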
In the NiuTrans package, some sample files are offered for experimenting with the system and studying the format requirements. They are located in "NiuTrans/sample-data/sample-submission-version".
sample-submission-version/
-- TM-training-set/ # word-aligned bilingual corpus (100,000 sentence-pairs)
-- chinese.txt # source sentences
-- english.txt # target sentences (case-removed)
-- Alignment.txt # word alignments of the sentence-pairs
-- LM-training-set/
-- e.lm.txt # monolingual corpus for training language model (100K target sentences)
-- Dev-set/
-- Niu.dev.txt # development dataset for weight tuning (400 sentences)
-- Test-set/
-- Niu.test.txt # test dataset (1K sentences)
-- Reference-for-evaluation/
-- Niu.test.reference # references of the test sentences (1K sentences)
-- Recaser-training-set/
-- english.keepcase.txt # monolingual corpus for training recasing model (10K sentences)
-- description-of-the-sample-data # a description of the sample data
- Format: please unpack "NiuTrans/sample-data/sample.tar.gz", and check "description-of-the-sample-data" to find more information about the data formats.
In the following, the above data files are used to illustrate how to run the NiuTrans system (e.g., how to train MT models, tune feature weights, and decode test sentences).
- Instructions (perl is required. Also, Cygwin is required for Windows users)
$> cd NiuTrans/sample-data/
$> tar xzf sample.tar.gz
$> cd ../
$> mkdir -p work/model.phrase/
$> cd scripts/
$> perl NiuTrans-phrase-train-model.pl \
-tmdir ../work/model.phrase/ \
-s ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
-t ../sample-data/sample-submission-version/TM-training-set/english.txt \
-a ../sample-data/sample-submission-version/TM-training-set/Alignment.txt
"-tmdir" specifies the target directory for generating various table and model files.
"-s", "-t" and "-a" specify the source sentences, the target sentences and the alignments between them (one sentence per line).
- Output: three files are generated in "NiuTrans/work/model.phrase/":
- me.reordering.table # ME reorder model
- msd.reordering.table # MSD reorder model
- phrase.translation.table # phrase translation model
- Note: Please enter "scripts/" before running the script "NiuTrans-phrase-train-model.pl".
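As background, the phrase translation table is built by collecting phrase pairs that are consistent with the word alignment, i.e., no alignment link crosses a pair's boundary. The following is a minimal sketch of that standard extraction idea on toy data; it is not the actual NiuTrans implementation, which additionally computes translation scores and the reordering models:

```python
# Sketch of phrase extraction during training: collect every phrase pair that
# is "consistent" with the word alignment (no alignment link has exactly one
# end inside the pair's spans). Toy data; real training also estimates
# translation probabilities and the ME/MSD reordering models.

src = "wo xihuan yinyue".split()
tgt = "i like music".split()
links = {(0, 0), (1, 1), (2, 2)}          # "i-j" word alignment links

def extract(src, tgt, links, max_len=3):
    pairs = set()
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            for t1 in range(len(tgt)):
                for t2 in range(t1, min(t1 + max_len, len(tgt))):
                    inside = [(i, j) for i, j in links
                              if s1 <= i <= s2 and t1 <= j <= t2]
                    crossing = [(i, j) for i, j in links
                                if (s1 <= i <= s2) != (t1 <= j <= t2)]
                    if inside and not crossing:
                        pairs.add((" ".join(src[s1:s2 + 1]),
                                   " ".join(tgt[t1:t2 + 1])))
    return pairs

print(sorted(extract(src, tgt, links)))   # six consistent pairs for this toy alignment
```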
- Instructions
$> cd ../
$> mkdir work/lm/
$> cd scripts/
$> perl NiuTrans-training-ngram-LM.pl \
-corpus ../sample-data/sample-submission-version/LM-training-set/e.lm.txt \
-ngram 3 \
-vocab ../work/lm/lm.vocab \
-lmbin ../work/lm/lm.trie.data
"-ngram" specifies the order of n-gram LM. E.g. "-ngram 3" indicates a 3-gram language model.
"-vocab" specifies where the target-side vocabulary is generated.
"-lmbin" specifies where the language model file is generated.
- Output: two files are generated and placed in "NiuTrans/work/lm/":
- lm.vocab # target-side vocabulary
- lm.trie.data # binary-encoded language model
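Conceptually, the 3-gram LM trained above estimates P(w3 | w1 w2) from trigram counts over the monolingual corpus. A minimal counting sketch on toy data follows; the real training additionally applies smoothing and produces the binary trie encoding, which this omits:

```python
from collections import Counter

# Minimal sketch of what a 3-gram LM counts: trigram and bigram-context
# frequencies over the monolingual corpus, giving the maximum-likelihood
# estimate P(w3 | w1 w2). Real LM training adds smoothing and a binary
# trie encoding.

corpus = ["the cat sat", "the cat ran"]   # toy stand-in for e.lm.txt

tri, bi = Counter(), Counter()
for line in corpus:
    words = ["<s>", "<s>"] + line.split() + ["</s>"]
    for a, b, c in zip(words, words[1:], words[2:]):
        tri[(a, b, c)] += 1
        bi[(a, b)] += 1

def p(w3, w1, w2):
    """Unsmoothed ML estimate of P(w3 | w1 w2)."""
    return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

print(p("sat", "the", "cat"))   # 0.5 with the toy corpus
```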
- Instructions
$> cd scripts/
$> perl NiuTrans-phrase-generate-mert-config.pl \
-tmdir ../work/model.phrase/ \
-lmdir ../work/lm/ \
-ngram 3 \
-o ../work/NiuTrans.phrase.user.config
"-tmdir" specifies the directory that holds the translation table and the reordering model files.
"-lmdir" specifies the directory that holds the n-gram language model and the target-side vocabulary.
"-ngram" specifies the order of n-gram language model.
"-o" specifies the output (i.e. a config file).
- Output: a config file is generated and placed in "NiuTrans/work/":
- NiuTrans.phrase.user.config # configuration file for MERT and decoding
If the dev and test data are prepared in advance, a popular trick to improve system efficiency is to prune the translation table by discarding "useless" phrases containing n-grams that never appear in the dev/test sentences. This step is suggested for faster experiments in your research work. Note that it is optional and can be skipped as needed.
- Instructions (perl is required)
$> cd ..
$> cat sample-data/sample-submission-version/Dev-set/Niu.dev.txt \
sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
> sample-data/sample-submission-version/Dev-set/Niu.dev.and.test.txt
$> bin/NiuTrans.PhraseExtractor --FILPD \
-dev sample-data/sample-submission-version/Dev-set/Niu.dev.and.test.txt \
-in work/model.phrase/phrase.translation.table \
-out work/model.phrase/phrase.translation.table.filterDevAndTest \
-maxlen 10 \
-rnum 1
$> perl scripts/filter.msd.model.pl \
work/model.phrase/phrase.translation.table.filterDevAndTest \
work/model.phrase/msd.reordering.table \
> work/model.phrase/msd.reordering.table.filterDevAndTest
"-dev" specifies the dataset used for filtering. Usually the development and test datasets are merged to form such a dataset.
"-in" specifies the translation table to be filtered.
"-out" specifies the resulting (i.e., pruned) translation table.
"-maxlen" specifies the maximum phrase length in the phrase translation table.
"-rnum" specifies how many reference translations per source sentence are provided.
- Output: "NiuTrans/work/NiuTrans.phrase.user.config" is updated. E.g., viewing the file, you will find that the line
param="MSD-Reordering-Model" value="../work/model.phrase/msd.reordering.table"
is replaced with
param="MSD-Reordering-Model" value="../work/model.phrase/msd.reordering.table.filterDevAndTest"
param="Phrase-Table" value="../work/model.phrase/phrase.translation.table"
is replaced with
param="Phrase-Table" value="../work/model.phrase/phrase.translation.table.filterDevAndTest"
- Instructions (perl is required; run the script from within "scripts/")
$> perl NiuTrans-phrase-mert-model.pl \
-dev ../sample-data/sample-submission-version/Dev-set/Niu.dev.txt \
-c ../work/NiuTrans.phrase.user.config \
-nref 1 \
-r 2 \
-l ../work/mert-model.log
"-dev" specifies the development dataset (or tuning set) for weight tuning.
"-c" specifies the configuration file generated in the previous steps.
"-nref" specifies how many reference translations per source-sentence are provided.
"-r" specifies how many rounds the MERT performs (by default, 1 round = 15 MERT iterations).
"-l" specifies the log file generated by MERT.
- Output: the optimized feature weights are recorded in the configuration file "NiuTrans/work/NiuTrans.phrase.user.config". They will then be used in decoding the test sentences.
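To give an intuition for what MERT does, the sketch below tunes a single feature weight by scanning candidate values, re-ranking each sentence's n-best list under that weight, and keeping the value with the lowest error. The n-best data is toy, and real MERT performs an exact line search over BLEU rather than this grid scan:

```python
# Toy sketch of the core MERT move: pick the value of one feature weight
# that minimizes error over the tuning set, where each weight value
# re-ranks every sentence's n-best list. Real MERT optimizes all weights
# with an exact line search over BLEU; this is only an illustration.

# n-best lists: per sentence, ((feature1, feature2), sentence-level quality)
nbest = [
    [((2.0, 0.0), 0.2), ((1.0, 1.0), 0.9)],   # sentence 1 candidates
    [((0.5, 0.2), 0.8), ((1.5, 0.1), 0.3)],   # sentence 2 candidates
]

def error(w):
    """Average (1 - quality) of the candidates selected with weight w."""
    total = 0.0
    for cands in nbest:
        best = max(cands, key=lambda c: c[0][0] + w * c[0][1])
        total += 1.0 - best[1]
    return total / len(nbest)

best_w = min((w / 10.0 for w in range(0, 31)), key=error)
print(best_w)   # 1.1 with this toy data: above w = 1 sentence 1 flips to its better candidate
```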
- Instructions (perl is required)
$> perl NiuTrans-phrase-decoder-model.pl \
-test ../sample-data/sample-submission-version/Test-set/Niu.test.txt \
-c ../work/NiuTrans.phrase.user.config \
-output 1best.out
"-test" specifies the test dataset (one sentence per line).
"-c" specifies the configuration file.
"-output" specifies the translation result file (the result is dumped to "stdout" if this option is not specified).
- Output: a new file is generated in "NiuTrans/scripts/":
- 1best.out # 1-best translation of the test sentences
- Instructions (perl is required)
$> mkdir ../work/model.recasing -p
$> perl NiuTrans-training-recase-model.pl \
-corpus ../sample-data/sample-submission-version/Recaser-training-set/english.keepcase.txt \
-modelDir ../work/model.recasing
$> perl NiuTrans-recaser.pl \
-config ../work/model.recasing/recased.config.file \
-test 1best.out \
-output 1best.out.recased
"-corpus" specifies the training dataset (one sentence per line).
"-modelDir" specifies the directory that holds the model and config files.
- Output: a config file and several model files are generated and placed in "NiuTrans/work/model.recasing":
- recased.config.file # Recasing config file
- recased.lm.trie.data
- recased.lm.vocab
- recased.null
- recased.phrase.translation.table
- Output: a new file is generated in "NiuTrans/scripts/":
- 1best.out.recased
- Note: this step is unnecessary when the target language has no case information (e.g., English-Chinese translation).
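The effect of recasing can be approximated by a simple sketch that maps each lowercased word to its most frequent cased form in the training text. The actual NiuTrans recaser is a phrase-based model with its own language model, so this only shows the intuition:

```python
from collections import Counter, defaultdict

# Sketch of what a recaser learns: for each lowercased word, the most
# frequent surface casing seen in the cased training corpus. The real
# recaser is a phrase-based model with an LM; this is only the intuition.

cased_corpus = ["John lives in New York", "New York is large"]

forms = defaultdict(Counter)
for line in cased_corpus:
    for w in line.split():
        forms[w.lower()][w] += 1

def recase(sentence):
    """Replace each word by its most frequent cased form, if known."""
    return " ".join(forms[w].most_common(1)[0][0] if w in forms else w
                    for w in sentence.split())

print(recase("john lives in new york"))   # "John lives in New York"
```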
- Instructions (perl is required)
$> perl NiuTrans-detokenizer.pl \
-in 1best.out.recased \
-out 1best.out.recased.detoken
"-in" specifies the input file.
"-out" specifies the output file.
- Output: a new file is generated in "NiuTrans/scripts/":
- 1best.out.recased.detoken
- Note: again, this step is only needed for translation tasks where the target language is tokenized during data preparation.
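The idea of detokenization can be illustrated with a minimal regex-based sketch that reattaches punctuation split off by tokenization; the actual rules in "NiuTrans-detokenizer.pl" may differ:

```python
import re

# Minimal detokenization sketch: reattach punctuation that tokenization
# split off. A real detokenizer handles many more cases (quotes,
# contractions, currency); this only shows the idea.

def detokenize(text):
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)   # no space before punctuation
    text = re.sub(r"\(\s+", "(", text)             # no space after "("
    text = re.sub(r"\s+\)", ")", text)             # no space before ")"
    return text

print(detokenize("hello , world ( test ) !"))   # "hello, world (test)!"
```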
- Instructions (perl is required. Suppose that the MT result is in "1best.out.recased.detoken")
$> perl NiuTrans-generate-xml-for-mteval.pl \
-1f 1best.out \
-tf ../sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
-rnum 1
$> perl mteval-v13a.pl \
-r ref.xml \
-s src.xml \
-t tst.xml
"-1f" specifies the file of the 1-best translations of the test dataset.
"-tf" specifies the file of the source sentences and their reference translations of the test dataset.
"-r" specifies the file of the reference translations.
"-s" specifies the file of source sentences.
"-t" specifies the file of (1-best) translations generated by the MT system.
- Output: the IBM-version BLEU score is displayed. If everything goes well, you will obtain a score of about 0.2412 on the sample data set.
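The BLEU metric reported by mteval combines modified n-gram precision (n = 1..4) with a brevity penalty. A simplified single-reference, unsmoothed sketch of that standard definition:

```python
import math
from collections import Counter

# Sketch of the BLEU score: modified n-gram precision (n = 1..4) combined
# with a brevity penalty. Simplified to a single reference and no
# smoothing; mteval-v13a.pl additionally handles XML input and multiple
# references.

def ngram_counts(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
        match = sum(min(c, r[g]) for g, c in h.items())   # clipped matches
        total = max(1, sum(h.values()))
        log_prec += math.log(max(match, 1e-9) / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))      # brevity penalty
    return bp * math.exp(log_prec)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))   # 1.0
```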
- Note: the script mteval-v13a.pl relies on the Perl package XML::Parser. If XML::Parser is not installed on your system, run the following commands to install it (assuming the XML-Parser-2.41 source tarball has been downloaded):
$> su root
$> tar xzf XML-Parser-2.41.tar.gz
$> cd XML-Parser-2.41/
$> perl Makefile.PL
$> make install