Tingchen Fu‡†, Lemao Liu‡, Deng Cai‡, Guoping Huang‡, Shuming Shi‡, Rui Yan†
† Gaoling School of Artificial Intelligence, Renmin University of China ‡ Tencent AI Lab
- THE REASONABLENESS BEHIND UNREASONABLE TRANSLATION CAPABILITY OF LARGE LANGUAGE MODEL
- Contents
- Overview
- Data Release
- Post-train/Pre-train
- Evaluation
- License
- Overview

Large language models (LLMs) exhibit non-trivial or even state-of-the-art capability in neural machine translation, violating the conventional wisdom that translation ability relies heavily on large-scale parallel corpora. To understand the mechanism behind this unreasonable translation ability, we propose that three types of unintentional bilingual data make a crucial contribution to the translation ability of LLMs. Specifically, the three common types of unintentional bilingual data are:
- sentence alignment: the co-occurrence of a sentence and its translation within close proximity in a document.
- word alignment: the co-occurrence of one or more words (though not an entire sentence) and their translations within close proximity in a single document.
- code-switching: the co-occurrence of two languages within close proximity in a document, where the content in the two languages is semantically related rather than bearing a direct translation relationship.
- Data Release

We release the unintentional bilingual data excavated from mC4.en and mC4.zh. The data is available on 🤗 Huggingface Datasets. Each released file is in JSON format and consists of a list of dictionaries, with each dictionary containing the following fields:

- file_no: str, the name of the file in mC4 from which the case is found.
- doc_no: str, the document number within that file.
- text: str, the unintentional bilingual text.
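For example, a released file can be read with the standard json library; the file name below is only a placeholder for whichever file you download from the dataset page:

```python
# Minimal loading sketch; the file name is a placeholder, not shipped with this repo.
import json

with open("sentence_alignment.en.json", encoding="utf-8") as f:
    samples = json.load(f)  # a list of dictionaries

for sample in samples[:3]:
    print(sample["file_no"], sample["doc_no"])
    print(sample["text"][:200])
```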
The statistics of the unintentional bilingual data are listed below:

| | | en | zh |
|---|---|---|---|
| sentence alignment | # Document | 210,931 | 2,462 |
| | # Sequence | 355,320 | 432 |
| word alignment | # Document | 658,643 | 1,972,764 |
| | # Sequence | 500,550 | 659,456 |
| code switch | # Document | 2,021,502 | 5,086,373 |
| | # Sequence | 903,810 | 997,376 |
- Post-train/Pre-train

To post-train BLOOM-560m on unintentional bilingual data, you can use the following command:
export NCCL_IB_GID_INDEX=3
accelerate launch \
--machine_rank 0 \
--num_machines 1 \
--num_processes 8 \
--config_file accelerate_zero2.yaml \
${RUN_DIR}/posttrain_ubd.py \
--debug False \
--streaming True \
--train_file PATH_TO_UNINTENTIONAL_BILINGUAL_DATA \
--from_scratch False \
--config_name PATH_TO_BLOOM_CONFIG \
--model_name_or_path bigscience/bloom-560m \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 32 \
--learning_rate 3e-4 \
--num_warmup_steps 500 \
--window_size 1024 \
--special_name posttrain_wordalign \
--num_train_epochs 1 \
--with_tracking False \
--max_train_steps -1 \
--checkpoint_step 4000 \
--seed 0
To post-train BLOOM-7b1 with the PEFT technique on unintentional bilingual data, you can use the following command:
accelerate launch \
--machine_rank 0 \
--num_machines 1 \
--num_processes 8 \
--config_file accelerate_zero2.yaml \
${RUN_DIR}/posttrain_ubd_peft.py \
--debug False \
--streaming True \
--from_scratch False \
--train_file file_sent.txt \
--config_name PATH_TO_BLOOM_CONFIG \
--model_name_or_path bigscience/bloom-7b1 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--learning_rate 1e-4 \
--num_warmup_steps 0 \
--special_name peft_sentalign \
--num_train_epochs 1 \
--with_tracking False \
--lora_target_module query_key_value,dense \
--max_train_steps -1 \
--checkpoint_step 2000
- Evaluation

For post-training experiments, we measure translation performance with BLEURT and COMET. To generate translation hypotheses, use the following command:
python3 -u llm_generate.py \
--model_name_or_path bigscience/bloom-560m \
--ckpt_path PATH_TO_THE_POST-TRAINED_MODEL \
--n_example 3 \
--source_language en \
--target_language zh \
--dataset WMTnews21
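The generated hypotheses can then be scored with the official COMET and BLEURT packages. The sketch below is not part of this repository; the file names and the COMET/BLEURT checkpoints are placeholders:

```python
# Illustrative scoring sketch: assumes one hypothesis, reference and source
# sentence per line in hyp.txt / ref.txt / src.txt (placeholder file names).
from comet import download_model, load_from_checkpoint
from bleurt import score as bleurt_score

with open("hyp.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("ref.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]
with open("src.txt", encoding="utf-8") as f:
    srcs = [line.strip() for line in f]

# COMET: reference-based neural metric (checkpoint name is an example).
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet_data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
comet_out = comet_model.predict(comet_data, batch_size=8, gpus=1)
print("COMET:", comet_out.system_score)

# BLEURT: requires a downloaded checkpoint, e.g. BLEURT-20.
scorer = bleurt_score.BleurtScorer("BLEURT-20")
bleurt_scores = scorer.score(references=refs, candidates=hyps)
print("BLEURT:", sum(bleurt_scores) / len(bleurt_scores))
```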
For the pre-training experiments, since BLEURT and COMET are insensitive to minor improvements when the translation quality is poor, we measure perplexity using the following command:
python3 -u plm_ppl.py \
--model_name_or_path bigscience/bloom-560m \
--ckpt_path PATH_TO_THE_MODEL_TRAINED_FROM_SCRATCH \
--n_example 3 \
--source_language en \
--target_language zh \
--architecture target-only \
--use_prompt True
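The exact computation is implemented in plm_ppl.py; to illustrate what target-only perplexity with a prompt means, here is a minimal sketch in which the prompt template and the example sentence pair are assumptions rather than the repository's actual ones:

```python
# Illustrative only: perplexity of the reference translation given a prompt,
# with the loss computed on the target tokens only (the "target-only" setting).
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "bigscience/bloom-560m"  # or a checkpoint trained from scratch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def target_only_ppl(source: str, target: str) -> float:
    # Hypothetical prompt template; the real one is defined in plm_ppl.py.
    prompt = f"Translate English to Chinese.\nEnglish: {source}\nChinese: "
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)

    # Mask out the prompt so only the target tokens contribute to the loss.
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(-1)] = -100

    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over target tokens
    return math.exp(loss.item())

print(target_only_ppl("How are you?", "你好吗？"))
```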
- License

The work is intended and licensed for research use only. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.