Skip to content

Latest commit

 

History

History
19 lines (10 loc) · 1.6 KB

091075.md

File metadata and controls

19 lines (10 loc) · 1.6 KB

Unsupervised Tokenization for Machine Translation, Chung and Gildea, 2009

Paper, Tags: #machine-translation

In languages like Chinese and Korean, optimal token boundaries for MT are often unclear. Our methods incorporate information available from parallel corpus to determine a good tokenization for MT. We have built an unsupervised method for tokenization with the goal of automatically finding an appropriate tokenization for MT.

A single word in languages like Chinese, Korean or Hungarian os composed of several morphenes, and can also form compound nouns.

We find a benefit from using bilingual information, with unsupervised segmentation rivaling and in some cases surpassing supervised segmentation.

In Chinese, there are no spaces between words but there is less ambiguity where word boundaries lie in a given sentence. In Hungarian, Japanese and Korean, what constitutes an optimal token boundary is more ambiguous.

Results

Consistency in tokenization is important for MT. The bilingual MT model consistently performs better than the monolingual model in all language pairs (English - Chinese, Korean). We can learn better token boundaries by using information from the target langauge.

Our highest unsupervised segmentation result makes use of bilingual information.

We show that unsupervised tokenization for MT is feasible and can outperform rule-based methods that rely on lexical analysis or supervised statistical segmentations. This approach can be applied to both morphological analysis of Korean and the segmentation of sentences into words for Chinese.