THE REASONABLENESS BEHIND UNREASONABLE TRANSLATION CAPABILITY OF LARGE LANGUAGE MODEL


Tingchen Fu‡†, Lemao Liu, Deng Cai, Guoping Huang, Shuming Shi, Rui Yan

Gaoling School of Artificial Intelligence, Renmin University of China; Tencent AI Lab

Contents

  • Overview
  • Data Release
  • Post-train/Pre-train
  • Evaluation
  • License

Overview

Large language models (LLMs) exhibit non-trivial or even state-of-the-art capability in neural machine translation, violating the conventional wisdom that translation ability relies heavily on large-scale parallel corpora. To understand the mechanism behind this unreasonable translation ability, we propose that three types of unintentional bilingual data make a crucial contribution to the translation ability of LLMs. Specifically, the three common types of unintentional bilingual data are:

  • sentence alignment: the co-occurrence of a sentence and its translation within close proximity in a document.

  • word alignment: the co-occurrence of one or more words (though not an entire sentence) and their translations within close proximity in a single document.

  • code-switching: the co-occurrence of two languages within close proximity in a document, where the content in the two languages is semantically related rather than bearing a direct translation relationship.
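
The sketch below is only a toy illustration (not the mining procedure used in the paper): it flags documents where Chinese and Latin-script characters co-occur in close proximity, which is a necessary but far from sufficient condition for the three categories above.

```python
import re

# A toy check, NOT the paper's mining pipeline: does a document contain
# Chinese and Latin-script characters within `window` characters of each other?
CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN = re.compile(r"[A-Za-z]")

def has_mixed_scripts(text: str, window: int = 200) -> bool:
    """Return True if Chinese and Latin characters co-occur within `window` characters."""
    cjk_positions = [m.start() for m in CJK.finditer(text)]
    latin_positions = [m.start() for m in LATIN.finditer(text)]
    return any(abs(c - l) <= window for c in cjk_positions for l in latin_positions)

# A sentence-alignment-like example: an English sentence followed by its translation.
print(has_mixed_scripts("The weather is nice today. 今天天气很好。"))  # True
```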


Data Release

We release the unintentional bilingual data excavated from mC4.en and mC4.zh. The data is available on 🤗 Huggingface Datasets. Each release is a JSON file consisting of a list of dictionaries, with each dictionary containing the following fields:

  • file_no: str, the name of file in mC4 from which the case is found.
  • doc_no: str, the document number in the current file.
  • text: str, the unintentional bilingual text.
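
A minimal loading sketch is given below; the file name is a placeholder, so substitute the actual file downloaded from the Hugging Face release.

```python
import json

# Hypothetical file name; use the file downloaded from the Hugging Face release.
with open("sentence_alignment.en.json", encoding="utf-8") as f:
    samples = json.load(f)  # a list of dictionaries

for sample in samples[:3]:
    print(sample["file_no"], sample["doc_no"])  # where the case was found in mC4
    print(sample["text"][:200])                 # the unintentional bilingual text
```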

The statistics of the released data are summarized in the following table:


| Type               | Count      | en        | zh        |
|--------------------|------------|-----------|-----------|
| sentence alignment | # Document | 210,931   | 2,462     |
|                    | # Sequence | 355,320   | 432       |
| word alignment     | # Document | 658,643   | 1,972,764 |
|                    | # Sequence | 500,550   | 659,456   |
| code-switching     | # Document | 2,021,502 | 5,086,373 |
|                    | # Sequence | 903,810   | 997,376   |

Post-train/Pre-train

To post-train BLOOM-560m on unintentional bilingual data, you can use the following command:

export NCCL_IB_GID_INDEX=3
accelerate launch  \
--machine_rank 0  \
--num_machines 1  \
--num_processes 8  \
--config_file  accelerate_zero2.yaml  \
 ${RUN_DIR}/posttrain_ubd.py  \
--debug False \
--streaming True   \
--train_file   PATH_TO_UNINTENTIONAL_BILINGUAL_DATA   \
--from_scratch False \
--config_name PATH_TO_BLOOM_CONFIG  \
--model_name_or_path  bigscience/bloom-560m    \
--per_device_train_batch_size  2  \
--gradient_accumulation_steps  32  \
--learning_rate 3e-4    \
--num_warmup_steps  500   \
--window_size 1024          \
--special_name   posttrain_wordalign     \
--num_train_epochs  1      \
--with_tracking False      \
--max_train_steps  -1    \
--checkpoint_step 4000   \
--seed 0

To post-train BLOOM-7b1 with the PEFT (LoRA) technique on unintentional bilingual data, you can use the following command:

accelerate launch  \
--machine_rank 0 \
--num_machines  1 \
--num_processes  8 \
--config_file  accelerate_zero2.yaml  \
 ${RUN_DIR}/posttrain_ubd_peft.py  \
--debug False \
--streaming True   \
--from_scratch False \
--train_file  file_sent.txt   \
--config_name PATH_TO_BLOOM_CONFIG  \
--model_name_or_path bigscience/bloom-7b1  \
--per_device_train_batch_size  1  \
--gradient_accumulation_steps  16  \
--learning_rate 1e-4    \
--num_warmup_steps  0   \
--special_name   peft_sentalign     \
--num_train_epochs  1      \
--with_tracking False      \
--lora_target_module query_key_value,dense   \
--max_train_steps -1  \
--checkpoint_step 2000
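
For reference, the LoRA setup implied by the flags above looks roughly like the sketch below; the rank, alpha, and dropout values are assumptions, and posttrain_ubd_peft.py defines the actual configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1")

# Matches --lora_target_module query_key_value,dense above;
# r / lora_alpha / lora_dropout are illustrative values only.
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules=["query_key_value", "dense"],
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```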

Evaluation

For post-training experiments, we measure the translation performance with BLEURT and COMET. To generate translation hypotheses, use the following command:

python3 -u llm_generate.py  \
--model_name_or_path   bigscience/bloom-560m    \
--ckpt_path  PATH_TO_THE_POST-TRAINED_MODEL   \
--n_example 3  \
--source_language en \
--target_language zh   \
--dataset WMTnews21
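
Scoring the generated hypotheses is not shown above; a possible sketch with the Hugging Face `evaluate` library follows (the file names are placeholders for whatever llm_generate.py writes out, one sentence per line, aligned across files).

```python
import evaluate

# Placeholder file names: one sentence per line, aligned across files.
with open("hypothesis.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("reference.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]
with open("source.txt", encoding="utf-8") as f:
    sources = [line.strip() for line in f]

bleurt = evaluate.load("bleurt", "BLEURT-20")
comet = evaluate.load("comet")

bleurt_scores = bleurt.compute(predictions=hypotheses, references=references)["scores"]
comet_scores = comet.compute(predictions=hypotheses, references=references, sources=sources)["scores"]

print("BLEURT:", sum(bleurt_scores) / len(bleurt_scores))
print("COMET :", sum(comet_scores) / len(comet_scores))
```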

For pre-training experiments, since BLEURT and COMET are insensitive to minor improvements when the translation performance is poor, we measure the perplexity using the following command:

python3 -u plm_ppl.py  \
--model_name_or_path bigscience/bloom-560m  \
--ckpt_path PATH_TO_THE_MODEL_TRAINED_FROM_SCRATCH  \
--n_example  3  \
--source_language en \
--target_language zh \
--architecture target-only  \
--use_prompt True
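
A minimal sketch of target-only perplexity is shown below; the actual plm_ppl.py may batch examples and build few-shot prompts, and the prompt format here is purely illustrative.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def target_perplexity(prompt: str, target: str) -> float:
    """Perplexity of the target tokens only; prompt tokens are excluded from the loss."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + target, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100  # ignored by the cross-entropy loss
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss
    return math.exp(loss.item())

# Illustrative prompt; lower target-side perplexity indicates better translation ability.
print(target_perplexity("English: The cat sat on the mat.\nChinese:", " 猫坐在垫子上。"))
```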

License

The work is intended and licensed for research use only. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
