
Commit

uploaded files
ghanvert committed Dec 16, 2024
1 parent 8c3bcf5 commit 8412e9e
Showing 13 changed files with 765 additions and 6 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -1 +1,5 @@
.env
/config/*
data/
config/
src/*/
108 changes: 102 additions & 6 deletions README.md
@@ -1,9 +1,105 @@
# repo-template
This repository template contains formatting configurations to be used with [pre-commit](https://pre-commit.com/)
# Multilingual Translator
This repository provides the code to train multilingual translation models based on either NLLB or mT5.

# Pre-commit usage
Install the requirements located inside requirements/dev.txt and before making commits use the following command:
# How to train
Inside `src`, there is a `train.py` script that you can run, passing a configuration file.

`pre-commit run --all-files`
To train on a single CUDA device:
```bash
python train.py <config-file>
```

It will format the staged files using the hooks detailed inside the .pre-commit-config.yaml file
To train on multiple CUDA devices, we use [accmt](https://github.com/ghanvert/AcceleratorModule):
```bash
accmt launch train.py <config-file>
```

# Configuration file
The configuration file is a YAML file containing the settings that adjust your training. It supports the following keys:
| Key | Definition |
|--------------------------|---------------------------------------------------------------------|
| `track_name` | Track/Experiment name on MLFlow. |
| `run_name` | Run name in experiment on MLFlow. |
| `log_every` | Log every N steps to MLFlow. |
| `evaluate_every_n_steps` | Do evaluation every N steps. |
| `model` | Path to model to finetune. |
| `model_path`             | Output path where the best model and training progress are saved.    |
| `model_type` | Model type: **nllb** or **mt5**. |
| `tokenizer` | Path to tokenizer to use. |
| `compile` | Compile model for training. |
| `dropout` | Dropout rate. |
| `rdrop` | Enable RDROP regularization technique. |
| `rdrop_alpha` | RDROP alpha value. |
| `label_smoothing` | Label Smoothing value. |
| `max_length` | Max length for model inputs/outputs during training. |
| `train_dataset` | Train JSON dataset path. |
| `validation_dataset` | Validation JSON dataset path. |
| `maps` | Map JSON keys in dataset to the corresponding language tokens. |
| `directions` | Directions to train as a list. Example: `eng-spa`, `spa-eng`, etc. |
| `resume`                 | Resume training. If not specified, resuming is detected automatically. |
| `hps` | Hyperparameters for training. See `example_config.yaml`. |

See `examples/example_config.yaml` for more details.

# Dataset format
Here we show the dataset format for both training and validation.

## Train dataset
This must be a JSON file containing a list of translation pairs (each entry holds exactly one source/target pair). See `examples/example_train_dataset.jsonl`.

## Validation dataset
This must be a JSON file containing a list of entries, each with a single sentence and its various translations. See `examples/example_val_dataset.jsonl`.

Only the `directions` listed in the configuration file will be evaluated; all other directions will be ignored.
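As an illustration, this is roughly how directions can be derived from a validation entry and filtered against the configuration (a hedged sketch with a hypothetical `candidate_directions` helper, not the repository's actual implementation):
```python
from itertools import permutations


def candidate_directions(entry: dict, directions: list[str]) -> list[tuple[str, str]]:
    # Every ordered pair of language keys in an entry is a possible direction...
    pairs = permutations(entry.keys(), 2)
    # ...but only the pairs listed in the configuration (e.g. "spa-eng") are kept.
    allowed = {tuple(direction.split("-")) for direction in directions}
    return [pair for pair in pairs if pair in allowed]


# An entry with spa/eng/deu keys, evaluated only on spa-eng and eng-spa:
entry = {"spa": "Ejemplo número 1", "eng": "Example number 1", "deu": "Beispiel Nummer 1"}
print(candidate_directions(entry, ["spa-eng", "eng-spa"]))
# [('spa', 'eng'), ('eng', 'spa')]
```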

# MLFlow Setup
You can set up MLFlow locally:
```bash
mlflow server --host=localhost --port=5000
```
Then you can open http://localhost:5000/ in your browser.

You must also have a `.env` file in this directory (`multilingual-translator/`) with the `MLFLOW_TRACKING_URI` variable defined. This can be `http://localhost:5000` or the address of any other MLFlow server.
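For reference, this is one way the training code could pick up that variable (a minimal sketch assuming `python-dotenv` is installed; the repository's actual loading mechanism is not shown in this commit):
```python
import os

import mlflow
from dotenv import load_dotenv

# Read MLFLOW_TRACKING_URI (and anything else) from the local .env file,
# then point the MLFlow client at that server.
load_dotenv()
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
```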

# mT5
## Training
Before training mT5 models, make sure to add the language tokens to both the tokenizer and the model's embeddings. You can use the `convert_model.py` script for this:
```bash
python convert_model.py <path-to-mt5-model> --new-tokens <token-1> <token-2> ... -O <output-path>
```
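Conceptually, the conversion extends the tokenizer's vocabulary and resizes the model's embedding matrix accordingly. The `add_new_languages` utility is not shown in this commit, but a plausible equivalent looks like this (a sketch only; the base model name and output path are placeholders):
```python
from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast

# Placeholder base checkpoint; substitute your own mT5 model path.
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
tokenizer = T5TokenizerFast.from_pretrained("google/mt5-small")

# Register the language tokens as special tokens so they are never split,
# then grow the embedding table to cover the enlarged vocabulary.
new_tokens = ["spa_Latn", "eng_Latn"]
tokenizer.add_tokens(new_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained("mt5-small-translator")
tokenizer.save_pretrained("mt5-small-translator")
```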

## Inference
Example:
```python
from transformers import T5TokenizerFast, AutoModelForSeq2SeqLM

tokenizer = T5TokenizerFast.from_pretrained("path-to-your-model-or-tokenizer")
model = AutoModelForSeq2SeqLM.from_pretrained("path-to-your-model")

def translate(sentence: str, translate_from="spa_Latn", translate_to="eng_Latn") -> str:
    inputs = tokenizer(translate_from + sentence, return_tensors="pt")
    result = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids(translate_to))
    decoded = tokenizer.batch_decode(result, skip_special_tokens=True)[0]
    return decoded
```
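Assuming the model was converted with the `spa_Latn` and `eng_Latn` tokens, a call looks like:
```python
print(translate("Hola, ¿cómo estás?"))  # expected output along the lines of "Hello, how are you?"
```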

# NLLB
## Training
For language tokens, make sure to check available languages in [NLLB's tokenizer](https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json).

## Inference
```python
from transformers import NllbTokenizerFast, AutoModelForSeq2SeqLM

tokenizer = NllbTokenizerFast.from_pretrained("path-to-your-model-or-tokenizer")
model = AutoModelForSeq2SeqLM.from_pretrained("path-to-your-model")

def translate(sentence: str, translate_from="spa_Latn", translate_to="eng_Latn") -> str:
    tokenizer.src_lang = translate_from
    tokenizer.tgt_lang = translate_to

    inputs = tokenizer(sentence, return_tensors="pt")
    result = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids(translate_to))
    decoded = tokenizer.batch_decode(result, skip_special_tokens=True)[0]
    return decoded
```
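Here, `tokenizer.src_lang` tells the NLLB tokenizer which source-language token to insert when encoding the input, and `forced_bos_token_id` forces the decoder to start generating with the target-language token.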
47 changes: 47 additions & 0 deletions examples/example_config.yaml
@@ -0,0 +1,47 @@
#resume: false # <-- force resuming from checkpoint

######### MLFlow Setup #########
track_name: your-experiment-name
run_name: your-run-name

log_every: 10 # steps
evaluate_every_n_steps: 400

######### Model Configuration #########
model: facebook/nllb-200-distilled-600M
tokenizer: facebook/nllb-200-distilled-600M
model_path: your-new-nllb-model
compile: true

dropout: 0.1
rdrop: false
rdrop_alpha: 5
label_smoothing: 0.1
max_length: 200

######### Dataset setup #########
train_dataset: path-to-your-training-data
validation_dataset: path-to-your-validation-data

maps:
  spa: spa_Latn
  eng: eng_Latn

# If not specified, directions to evaluate will be inferred automatically.
# Directions will be based on the key of the validation dataset.
directions:
  # Directions are separated by '-' character, meaning: 'source-target'.
  - spa-eng
  - eng-spa

######### Hyper Parameters configuration #########
# Check https://github.com/ghanvert/AcceleratorModule for different optimizers and schedulers available.
hps:
  epochs: 10
  batch_size: 32
  optim:
    type: Adam
    lr: 1e-3
  scheduler:
    type: LinearWithWarmup
    warmup_ratio: 0.2
26 changes: 26 additions & 0 deletions examples/example_train_dataset.jsonl
@@ -0,0 +1,26 @@
[
  {
    "spa": "Ejemplo número 1",
    "eng": "Example number 1"
  },
  {
    "spa": "Ejemplo número 1",
    "jap": "例番号 1"
  },
  {
    "spa": "Ejemplo número 1",
    "deu": "Beispiel Nummer 1"
  },
  {
    "spa": "Ejemplo número 2",
    "eng": "Example number 2"
  },
  {
    "spa": "Ejemplo número 2",
    "jap": "例番号 2"
  },
  {
    "spa": "Ejemplo número 2",
    "deu": "Beispiel Nummer 2"
  }
]
14 changes: 14 additions & 0 deletions examples/example_val_dataset.jsonl
@@ -0,0 +1,14 @@
[
  {
    "spa": "Ejemplo número 1",
    "eng": "Example number 1",
    "deu": "Beispiel Nummer 1",
    "jap": "例番号 1"
  },
  {
    "spa": "Ejemplo número 2",
    "eng": "Example number 2",
    "deu": "Beispiel Nummer 2",
    "jap": "例番号 2"
  }
]
113 changes: 113 additions & 0 deletions src/arguments.py
@@ -0,0 +1,113 @@
from argparse import ArgumentParser
from typing import Optional

import yaml


class TrainingArguments:
"""
Get training arguments from a YAML or JSON file.
The following arguments are defined inside this class:
RESUME (`bool`, *optional*):
If set to `True` or `False`, it forces to resume training. If not defined,
automatically detects if a checkpoint exists.
MODEL (`str`):
Model to be finetuned. This should be a path to a directory containing the
model itself in HuggingFace's format.
MODEL_TYPE (`str`, *optional*, defaults to `nllb`):
Model type. Available types are: `nllb` and `mt5`.
TOKENIZER (`str`):
Tokenizer path.
MODEL_PATH (`str`):
Output model path where to save the model.
TRAIN_DATASET (`str`):
Train dataset path.
VALIDATION_DATASET (`str`):
Validation dataset path.
TRACK_NAME (`str`):
Track name in MLFlow.
RUN_NAME (`str`, *optional*):
Run name inside `TRACK_NAME` in MLFlow. If not defined, a name will be
generated for this run.
DROPOUT (`float`, *optional*):
Set a custom Dropout value for the model.
RDROP (`bool`):
Applies RDROP regularization technique (https://arxiv.org/abs/2106.14448).
RDROP_ALPHA (`float`, *optional*, defaults to `5`):
Applies an alpha factor to the RDROP regularization.
LABEL_SMOOTHING (`float`, *optional*):
Applies Label Smoothing regularization technique
(https://arxiv.org/pdf/1512.00567).
MAPS (`dict`, *optional*):
Dictionary key to corresponding language token.
DIRECTIONS (`list`, *optional*):
List of directions.
LOG_EVERY (`int`, *optional*, defaults to `10`):
Log train loss to MLFlow every N steps.
EVALUATE_EVERY_N_STEPS (`int`, *optional*):
Evaluate every N steps.
COMPILE (`bool`, *optional*, defaults to `False`):
Compile model.
IGNORE (`set`, *optional*):
Ignore certain pairs.
Args:
path (`str`):
Path to the YAML or JSON file.
"""

_mandatory_keys = {
"model",
"tokenizer",
"model_path",
"track_name",
"train_dataset",
"validation_dataset",
}

def __init__(self, path: str):
self._data: dict = yaml.safe_load(open(path))
self._check_arguments()

self.RESUME: Optional[bool] = self._data.get("resume")
self.MODEL: str = self._data.get("model")
self.MODEL_TYPE: str = self._data.get("model_type", "nllb").lower()
self.TOKENIZER: str = self._data.get("tokenizer")
self.MODEL_PATH: str = self._data.get("model_path")
self.TRACK_NAME: str = self._data.get("track_name")
self.RUN_NAME: Optional[str] = self._data.get("run_name")
self.MAX_LENGTH: int = self._data.get("max_length", 200)
self.DROPOUT: Optional[float] = self._data.get("dropout")
self.RDROP: bool = self._data.get("rdrop", False)
self.RDROP_ALPHA: float = self._data.get("rdrop_alpha", 5)
self.LABEL_SMOOTHING: Optional[float] = self._data.get("label_smoothing")
self.TRAIN_DATASET: str = self._data.get("train_dataset")
self.VALIDATION_DATASET: str = self._data.get("validation_dataset")
self.MAPS: dict = self._data.get("maps")
self.DIRECTIONS: list = self._data.get("directions")
self.LOG_EVERY: int = self._data.get("log_every", 10)
self.EVALUATE_EVERY_N_STEPS: Optional[int] = self._data.get(
"evaluate_every_n_steps"
)
self.COMPILE: Optional[bool] = self._data.get("compile", False)
self.IGNORE: Optional[list[str]] = self._data.get("ignore")

def _check_arguments(self):
keys = set(self._data.keys())
intersection = keys & self._mandatory_keys

if intersection != self._mandatory_keys:
missing_keys = list(self._mandatory_keys - intersection)
raise ValueError(f"Missing keys: {missing_keys}")


def get_config():
    parser = ArgumentParser(
        description="Train a translation model with a given configuration file."
    )

    parser.add_argument("config", type=str, help="Configuration YAML or JSON file.")
    args = parser.parse_args()

    return args.config
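A hypothetical usage sketch of these helpers (the actual `train.py` is not included in the files shown here):
```python
# Hypothetical wiring -- train.py itself is not part of this excerpt.
from arguments import TrainingArguments, get_config

config_path = get_config()             # parse the positional <config-file> argument
args = TrainingArguments(config_path)  # validate mandatory keys and expose the settings
print(args.MODEL, args.MODEL_TYPE, args.TRAIN_DATASET)
```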
33 changes: 33 additions & 0 deletions src/convert_model.py
@@ -0,0 +1,33 @@
from argparse import ArgumentParser

from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast

from utils import add_new_languages

parser = ArgumentParser(description="Convert model and tokenizer.")
parser.add_argument("model", type=str, help="Path to the model.")
parser.add_argument(
    "--tokenizer",
    type=str,
    help="Path to a tokenizer. If not provided, 'model' path will be used.",
)
parser.add_argument(
    "--new-tokens", type=str, nargs="+", required=True, help="New tokens to add."
)
parser.add_argument("--type", type=str, default="mt5", help="Model type.")
parser.add_argument("--output", "-O", required=True, type=str, help="Output path.")
args = parser.parse_args()

available_types = ["mt5"]
model_type = args.type.lower()
assert model_type in available_types, f"Available model types are: {available_types}"

# TODO We're only supporting mT5 models for now.
model_path = args.model
tokenizer_path = model_path if args.tokenizer is None else args.tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = T5TokenizerFast.from_pretrained(tokenizer_path)

add_new_languages(tokenizer, model, args.new_tokens)
model.save_pretrained(args.output, safe_serialization=False)
tokenizer.save_pretrained(args.output)
