Showing 13 changed files with 765 additions and 6 deletions.
@@ -1 +1,5 @@
.env
/config/*
data/
config/
src/*/
@@ -1,9 +1,105 @@
-# repo-template
-This repository template contains formatting configurations to be used with [pre-commit](https://pre-commit.com/)
-
-# Pre-commit usage
-Install the requirements located inside requirements/dev.txt and before making commits use the following command:
-
-`pre-commit run --all-files`
-
-It will format the staged files using the hooks detailed inside the .pre-commit-config.yaml file

# Multilingual Translator
This repository exposes the code to train multilingual translators using either NLLB or mT5 models.

# How to train
Inside `src`, there is a `train.py` script that you can run as a command, providing a configuration file.

To train on a single CUDA device:
```bash
python train.py <config-file>
```

To train on multiple CUDA devices, we use [accmt](https://github.com/ghanvert/AcceleratorModule):
```bash
accmt launch train.py <config-file>
```

# Configuration file
The configuration file is a YAML file with settings to adjust your training. It supports the following keys:

| Key | Definition |
|--------------------------|---------------------------------------------------------------------|
| `track_name` | Track/experiment name on MLFlow. |
| `run_name` | Run name within the experiment on MLFlow. |
| `log_every` | Log to MLFlow every N steps. |
| `evaluate_every_n_steps` | Run evaluation every N steps. |
| `model` | Path to the model to finetune. |
| `model_path` | Output path where the best model and training progress are saved. |
| `model_type` | Model type: **nllb** or **mt5**. |
| `tokenizer` | Path to the tokenizer to use. |
| `compile` | Compile the model for training. |
| `dropout` | Dropout rate. |
| `rdrop` | Enable the RDROP regularization technique. |
| `rdrop_alpha` | RDROP alpha value. |
| `label_smoothing` | Label smoothing value. |
| `max_length` | Max length for model inputs/outputs during training. |
| `train_dataset` | Path to the train JSON dataset. |
| `validation_dataset` | Path to the validation JSON dataset. |
| `maps` | Maps JSON keys in the dataset to the corresponding language tokens. |
| `directions` | Directions to train, as a list. Example: `eng-spa`, `spa-eng`, etc. |
| `resume` | Resume training. If not specified, this is detected automatically. |
| `hps` | Hyperparameters for training. See `examples/example_config.yaml`. |

See `examples/example_config.yaml` for more details.
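As a quick orientation, the sketch below shows only the keys that the configuration loader treats as mandatory (`model`, `tokenizer`, `model_path`, `track_name`, `train_dataset`, `validation_dataset`); all values are placeholders:
```yaml
# Minimal sketch -- every value below is a placeholder.
track_name: my-experiment
model: facebook/nllb-200-distilled-600M
tokenizer: facebook/nllb-200-distilled-600M
model_path: output/my-nllb-model
train_dataset: data/train.json
validation_dataset: data/validation.json
```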

# Dataset format
Here we show the dataset format for both train and validation.

## Train dataset
This must be a JSON file containing a list of pairs: each entry has exactly two keys, a source sentence and one of its translations. See `examples/example_train_dataset.jsonl`.

## Validation dataset
This must be a JSON file containing a list of entries, where each entry holds a single sentence together with its translations into the other languages. See `examples/example_validation_dataset.jsonl`.

Only the `directions` listed in the configuration file will be evaluated; all other directions will be ignored.

# MLFlow Setup
You can set up MLFlow locally:
```bash
mlflow server --host=localhost --port=5000
```
Then you can open http://localhost:5000/ in your browser.

Also, you must have a `.env` file in this directory (`multilingual-translator/`) with the `MLFLOW_TRACKING_URI` variable defined. This can be `http://localhost:5000` or the address of any other MLFlow server.
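For reference, a minimal `.env` only needs that single variable; the value below assumes the local server started above:
```bash
MLFLOW_TRACKING_URI=http://localhost:5000
```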

# mT5
## Training
Before training mT5 models, make sure to add the language tokens to both the tokenizer and the model's embeddings. For this, you can use the `convert_model.py` script:
```bash
python convert_model.py <path-to-mt5-model> --new-tokens <token-1> <token-2> ... -O <output-path>
```
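For example, a hypothetical invocation (the base model name, token list, and output path are placeholders, using the NLLB-style language codes from the example config) could look like:
```bash
python convert_model.py google/mt5-base --new-tokens spa_Latn eng_Latn -O mt5-base-translator
```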

## Inference
Example:
```python
from transformers import T5TokenizerFast, AutoModelForSeq2SeqLM

tokenizer = T5TokenizerFast.from_pretrained("path-to-your-model-or-tokenizer")
model = AutoModelForSeq2SeqLM.from_pretrained("path-to-your-model")


def translate(sentence: str, translate_from="spa_Latn", translate_to="eng_Latn") -> str:
    inputs = tokenizer(translate_from + sentence, return_tensors="pt")
    result = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids(translate_to))
    decoded = tokenizer.batch_decode(result, skip_special_tokens=True)[0]
    return decoded
```
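Called with its defaults (Spanish to English), the helper above can be used directly; the sentence and expected output below are illustrative only:
```python
print(translate("Ejemplo número 1"))  # should print something like "Example number 1"
```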

# NLLB
## Training
For language tokens, make sure to check the available languages in [NLLB's tokenizer](https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json).

## Inference
```python
from transformers import NllbTokenizerFast, AutoModelForSeq2SeqLM

tokenizer = NllbTokenizerFast.from_pretrained("path-to-your-model-or-tokenizer")
model = AutoModelForSeq2SeqLM.from_pretrained("path-to-your-model")


def translate(sentence: str, translate_from="spa_Latn", translate_to="eng_Latn") -> str:
    tokenizer.src_lang = translate_from
    tokenizer.tgt_lang = translate_to

    inputs = tokenizer(sentence, return_tensors="pt")
    result = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids(translate_to))
    decoded = tokenizer.batch_decode(result, skip_special_tokens=True)[0]
    return decoded
```
@@ -0,0 +1,47 @@
#resume: false  # uncomment to force the resume behavior instead of auto-detecting a checkpoint

######### MLFlow Setup #########
track_name: your-experiment-name
run_name: your-run-name

log_every: 10 # steps
evaluate_every_n_steps: 400

######### Model Configuration #########
model: facebook/nllb-200-distilled-600M
tokenizer: facebook/nllb-200-distilled-600M
model_path: your-new-nllb-model
compile: true

dropout: 0.1
rdrop: false
rdrop_alpha: 5
label_smoothing: 0.1
max_length: 200

######### Dataset setup #########
train_dataset: path-to-your-training-data
validation_dataset: path-to-your-validation-data

maps:
  spa: spa_Latn
  eng: eng_Latn

# If not specified, directions to evaluate will be inferred automatically.
# Directions will be based on the keys of the validation dataset.
directions:
  # Directions are separated by the '-' character, meaning: 'source-target'.
  - spa-eng
  - eng-spa

######### Hyperparameters configuration #########
# Check https://github.com/ghanvert/AcceleratorModule for the different optimizers and schedulers available.
hps:
  epochs: 10
  batch_size: 32
  optim:
    type: Adam
    lr: 1e-3
  scheduler:
    type: LinearWithWarmup
    warmup_ratio: 0.2
@@ -0,0 +1,26 @@
[
  {
    "spa": "Ejemplo número 1",
    "eng": "Example number 1"
  },
  {
    "spa": "Ejemplo número 1",
    "jap": "例番号 1"
  },
  {
    "spa": "Ejemplo número 1",
    "deu": "Beispiel Nummer 1"
  },
  {
    "spa": "Ejemplo número 2",
    "eng": "Example number 2"
  },
  {
    "spa": "Ejemplo número 2",
    "jap": "例番号 2"
  },
  {
    "spa": "Ejemplo número 2",
    "deu": "Beispiel Nummer 2"
  }
]
@@ -0,0 +1,14 @@
[
  {
    "spa": "Ejemplo número 1",
    "eng": "Example number 1",
    "deu": "Beispiel Nummer 1",
    "jap": "例番号 1"
  },
  {
    "spa": "Ejemplo número 2",
    "eng": "Example number 2",
    "deu": "Beispiel Nummer 2",
    "jap": "例番号 2"
  }
]
@@ -0,0 +1,113 @@
from argparse import ArgumentParser
from typing import Optional

import yaml


class TrainingArguments:
    """
    Get training arguments from a YAML or JSON file.

    The following arguments are defined inside this class:
        RESUME (`bool`, *optional*):
            If set to `True` or `False`, it forces resuming training. If not defined,
            automatically detects if a checkpoint exists.
        MODEL (`str`):
            Model to be finetuned. This should be a path to a directory containing the
            model itself in HuggingFace's format.
        MODEL_TYPE (`str`, *optional*, defaults to `nllb`):
            Model type. Available types are: `nllb` and `mt5`.
        TOKENIZER (`str`):
            Tokenizer path.
        MODEL_PATH (`str`):
            Output model path where to save the model.
        TRAIN_DATASET (`str`):
            Train dataset path.
        VALIDATION_DATASET (`str`):
            Validation dataset path.
        TRACK_NAME (`str`):
            Track name in MLFlow.
        RUN_NAME (`str`, *optional*):
            Run name inside `TRACK_NAME` in MLFlow. If not defined, a name will be
            generated for this run.
        DROPOUT (`float`, *optional*):
            Set a custom Dropout value for the model.
        RDROP (`bool`):
            Applies the RDROP regularization technique (https://arxiv.org/abs/2106.14448).
        RDROP_ALPHA (`float`, *optional*, defaults to `5`):
            Applies an alpha factor to the RDROP regularization.
        LABEL_SMOOTHING (`float`, *optional*):
            Applies the Label Smoothing regularization technique
            (https://arxiv.org/pdf/1512.00567).
        MAPS (`dict`, *optional*):
            Dictionary key to corresponding language token.
        DIRECTIONS (`list`, *optional*):
            List of directions.
        LOG_EVERY (`int`, *optional*, defaults to `10`):
            Log train loss to MLFlow every N steps.
        EVALUATE_EVERY_N_STEPS (`int`, *optional*):
            Evaluate every N steps.
        COMPILE (`bool`, *optional*, defaults to `False`):
            Compile model.
        IGNORE (`set`, *optional*):
            Ignore certain pairs.

    Args:
        path (`str`):
            Path to the YAML or JSON file.
    """

    _mandatory_keys = {
        "model",
        "tokenizer",
        "model_path",
        "track_name",
        "train_dataset",
        "validation_dataset",
    }

    def __init__(self, path: str):
        # Load the configuration and validate that all mandatory keys are present.
        with open(path) as f:
            self._data: dict = yaml.safe_load(f)
        self._check_arguments()

        self.RESUME: Optional[bool] = self._data.get("resume")
        self.MODEL: str = self._data.get("model")
        self.MODEL_TYPE: str = self._data.get("model_type", "nllb").lower()
        self.TOKENIZER: str = self._data.get("tokenizer")
        self.MODEL_PATH: str = self._data.get("model_path")
        self.TRACK_NAME: str = self._data.get("track_name")
        self.RUN_NAME: Optional[str] = self._data.get("run_name")
        self.MAX_LENGTH: int = self._data.get("max_length", 200)
        self.DROPOUT: Optional[float] = self._data.get("dropout")
        self.RDROP: bool = self._data.get("rdrop", False)
        self.RDROP_ALPHA: float = self._data.get("rdrop_alpha", 5)
        self.LABEL_SMOOTHING: Optional[float] = self._data.get("label_smoothing")
        self.TRAIN_DATASET: str = self._data.get("train_dataset")
        self.VALIDATION_DATASET: str = self._data.get("validation_dataset")
        self.MAPS: dict = self._data.get("maps")
        self.DIRECTIONS: list = self._data.get("directions")
        self.LOG_EVERY: int = self._data.get("log_every", 10)
        self.EVALUATE_EVERY_N_STEPS: Optional[int] = self._data.get(
            "evaluate_every_n_steps"
        )
        self.COMPILE: Optional[bool] = self._data.get("compile", False)
        self.IGNORE: Optional[list[str]] = self._data.get("ignore")

    def _check_arguments(self):
        # Ensure every mandatory key is present in the configuration file.
        keys = set(self._data.keys())
        intersection = keys & self._mandatory_keys

        if intersection != self._mandatory_keys:
            missing_keys = list(self._mandatory_keys - intersection)
            raise ValueError(f"Missing keys: {missing_keys}")


def get_config():
    parser = ArgumentParser(
        description="Train a translation model with a given configuration file."
    )

    parser.add_argument("config", type=str, help="Configuration YAML or JSON file.")
    args = parser.parse_args()

    return args.config
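A minimal usage sketch of this module follows; the module name `config` is an assumption about where the file lives next to `train.py`:
```python
# Hypothetical usage -- the module name is assumed, not confirmed by this commit.
from config import TrainingArguments, get_config

args = TrainingArguments(get_config())  # parses the CLI and loads the YAML/JSON configuration
print(args.MODEL, args.MODEL_TYPE, args.DIRECTIONS)
```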
@@ -0,0 +1,33 @@
from argparse import ArgumentParser

from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast

from utils import add_new_languages

parser = ArgumentParser(description="Convert model and tokenizer.")
parser.add_argument("model", type=str, help="Path to the model.")
parser.add_argument(
    "--tokenizer",
    type=str,
    help="Path to a tokenizer. If not provided, 'model' path will be used.",
)
parser.add_argument(
    "--new-tokens", type=str, nargs="+", required=True, help="New tokens to add."
)
parser.add_argument("--type", type=str, default="mt5", help="Model type.")
parser.add_argument("--output", "-O", required=True, type=str, help="Output path.")
args = parser.parse_args()

available_types = ["mt5"]
model_type = args.type.lower()
assert model_type in available_types, f"Available model types are: {available_types}"

# TODO: We only support mT5 models for now.
model_path = args.model
tokenizer_path = model_path if args.tokenizer is None else args.tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = T5TokenizerFast.from_pretrained(tokenizer_path)

add_new_languages(tokenizer, model, args.new_tokens)
model.save_pretrained(args.output, safe_serialization=False)
tokenizer.save_pretrained(args.output)
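`utils.add_new_languages` is not included in this commit; a minimal sketch of what such a helper could do, assuming it only needs to register the language codes as special tokens and resize the model's embedding matrix:
```python
# Hypothetical sketch of utils.add_new_languages -- not the actual implementation.
def add_new_languages(tokenizer, model, new_tokens: list[str]) -> None:
    # Register the language codes as new special tokens on the tokenizer.
    tokenizer.add_tokens(new_tokens, special_tokens=True)
    # Grow the model's embedding matrix to cover the enlarged vocabulary.
    model.resize_token_embeddings(len(tokenizer))
```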