Tingchen Fu‡†, Lemao Liu‡, Deng Cai‡, Guoping Huang‡, Shuming Shi‡, Rui Yan†
† Gaoling School of Artificial Intelligence, Renmin University of China ‡ Tencent AI Lab
- THE REASONABLENESS BEHIND UNREASONABLE TRANSLATION CAPABILITY OF LARGE LANGUAGE MODEL
- Contents
- Overview
- Data Release
- Post-train/Pre-train
- Evaluation
- License
- Overview

Large language models (LLMs) exhibit non-trivial or even state-of-the-art capability in neural machine translation, violating the conventional wisdom that translation ability relies heavily on large-scale parallel corpora. To understand the mechanism behind this unreasonable translation ability, we propose that three types of unintentional bilingual data make a crucial contribution to the translation ability of LLMs. Specifically, the three common types of unintentional bilingual data are:
- sentence alignment: the co-occurrence of a sentence and its translation within close proximity in a document.
- word alignment: the co-occurrence of one or more words (though not an entire sentence) and their translations within close proximity in a single document.
- code-switching: the co-occurrence of two languages within close proximity in a document, where the content in the two languages is semantically related rather than bearing a direct translation relationship.
- Data Release

We release the unintentional bilingual data excavated from mC4.en and mC4.zh. The data is available on 🤗 Huggingface Datasets. Each released file is in JSON format and consists of a list of dictionaries, with each dictionary containing the following fields:

- file_no: str, the name of the file in mC4 from which the case is found.
- doc_no: str, the document number within that file.
- text: str, the unintentional bilingual text.
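For example, a released file can be read with the standard json library; the file name below is only a placeholder for whichever file you download from the dataset page:

```python
# Minimal loading sketch; the file name is a placeholder, not shipped with this repo.
import json

with open("sentence_alignment.en.json", encoding="utf-8") as f:
    samples = json.load(f)  # a list of dictionaries

for sample in samples[:3]:
    print(sample["file_no"], sample["doc_no"])
    print(sample["text"][:200])
```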
The statistics of the unintentional bilingual data are listed below:

| | | en | zh |
|---|---|---|---|
| sentence alignment | # Document | 210,931 | 2,462 |
| | # Sequence | 355,320 | 432 |
| word alignment | # Document | 658,643 | 1,972,764 |
| | # Sequence | 500,550 | 659,456 |
| code switch | # Document | 2,021,502 | 5,086,373 |
| | # Sequence | 903,810 | 997,376 |
- Post-train/Pre-train

To post-train BLOOM-560m on unintentional bilingual data, you can use the following command:
export NCCL_IB_GID_INDEX=3
accelerate launch \
--machine_rank 0 \
--num_machines 1 \
--num_processes 8 \
--config_file accelerate_zero2.yaml \
${RUN_DIR}/posttrain_ubd.py \
--debug False \
--streaming True \
--train_file PATH_TO_UNINTENTIONAL_BILINGUAL_DATA \
--from_scratch False \
--config_name PATH_TO_BLOOM_CONFIG \
--model_name_or_path bigscience/bloom-560m \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 32 \
--learning_rate 3e-4 \
--num_warmup_steps 500 \
--window_size 1024 \
--special_name posttrain_wordalign \
--num_train_epochs 1 \
--with_tracking False \
--max_train_steps -1 \
--checkpoint_step 4000 \
--seed 0
To post-train BLOOM-7b1 with the PEFT technique on unintentional bilingual data, you can use the following command:
accelerate launch \
--machine_rank 0 \
--num_machines 1 \
--num_processes 8 \
--config_file accelerate_zero2.yaml \
${RUN_DIR}/posttrain_ubd_peft.py \
--debug False \
--streaming True \
--from_scratch False \
--train_file file_sent.txt \
--config_name PATH_TO_BLOOM_CONFIG \
--model_name_or_path bigscience/bloom-7b1 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--learning_rate 1e-4 \
--num_warmup_steps 0 \
--special_name peft_sentalign \
--num_train_epochs 1 \
--with_tracking False \
--lora_target_module query_key_value,dense \
--max_train_steps -1 \
--checkpoint_step 2000
- Evaluation

For post-training experiments, we measure translation performance with BLEURT and COMET. To generate translation hypotheses, use the following command:
python3 -u llm_generate.py \
--model_name_or_path bigscience/bloom-560m \
--ckpt_path PATH_TO_THE_POST-TRAINED_MODEL \
--n_example 3 \
--source_language en \
--target_language zh \
--dataset WMTnews21
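The generated hypotheses can then be scored with the official COMET and BLEURT packages. The sketch below is not part of this repository; the file names and the COMET/BLEURT checkpoints are placeholders:

```python
# Illustrative scoring sketch: assumes one hypothesis, reference and source
# sentence per line in hyp.txt / ref.txt / src.txt (placeholder file names).
from comet import download_model, load_from_checkpoint
from bleurt import score as bleurt_score

with open("hyp.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("ref.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]
with open("src.txt", encoding="utf-8") as f:
    srcs = [line.strip() for line in f]

# COMET: reference-based neural metric (checkpoint name is an example).
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet_data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
comet_out = comet_model.predict(comet_data, batch_size=8, gpus=1)
print("COMET:", comet_out.system_score)

# BLEURT: requires a downloaded checkpoint, e.g. BLEURT-20.
scorer = bleurt_score.BleurtScorer("BLEURT-20")
bleurt_scores = scorer.score(references=refs, candidates=hyps)
print("BLEURT:", sum(bleurt_scores) / len(bleurt_scores))
```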
For the pre-training experiments, since BLEURT and COMET are insensitive to minor improvements when the translation quality is poor, we measure perplexity using the following command:
python3 -u plm_ppl.py \
--model_name_or_path bigscience/bloom-560m \
--ckpt_path PATH_TO_THE_MODEL_TRAINED_FROM_SCRATCH \
--n_example 3 \
--source_language en \
--target_language zh \
--architecture target-only \
--use_prompt True
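The exact computation is implemented in plm_ppl.py; to illustrate what target-only perplexity with a prompt means, here is a minimal sketch in which the prompt template and the example sentence pair are assumptions rather than the repository's actual ones:

```python
# Illustrative only: perplexity of the reference translation given a prompt,
# with the loss computed on the target tokens only (the "target-only" setting).
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "bigscience/bloom-560m"  # or a checkpoint trained from scratch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def target_only_ppl(source: str, target: str) -> float:
    # Hypothetical prompt template; the real one is defined in plm_ppl.py.
    prompt = f"Translate English to Chinese.\nEnglish: {source}\nChinese: "
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)

    # Mask out the prompt so only the target tokens contribute to the loss.
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(-1)] = -100

    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over target tokens
    return math.exp(loss.item())

print(target_only_ppl("How are you?", "你好吗？"))
```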
- License

The work is intended and licensed for research use only. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.