
Commit

uploaded files
ghanvert committed Dec 16, 2024
1 parent 8c3bcf5 commit 8412e9e
Showing 13 changed files with 765 additions and 6 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -1 +1,5 @@
.env
/config/*
data/
config/
src/*/
108 changes: 102 additions & 6 deletions README.md
@@ -1,9 +1,105 @@
# repo-template
This repository template contains formatting configurations to be used with [pre-commit](https://pre-commit.com/)
# Multilingual Translator
This repository provides the code to train multilingual translation models based on either NLLB or mT5.

# Pre-commit usage
Install the requirements located inside requirements/dev.txt and before making commits use the following command:
# How to train
Inside `src`, there is a `train.py` script that you can run, passing a configuration file.

`pre-commit run --all-files`
To train on a single CUDA device:
```bash
python train.py <config-file>
```

It will format the staged files using the hooks detailed inside the .pre-commit-config.yaml file
To train on multiple CUDA devices, we use [accmt](https://github.com/ghanvert/AcceleratorModule):
```bash
accmt launch train.py <config-file>
```

# Configuration file
The configuration file is a YAML file containing the settings that adjust your training. It supports the following keys:
| Key | Definition |
|--------------------------|---------------------------------------------------------------------|
| `track_name` | Track/Experiment name on MLFlow. |
| `run_name` | Run name in experiment on MLFlow. |
| `log_every` | Log every N steps to MLFlow. |
| `evaluate_every_n_steps` | Do evaluation every N steps. |
| `model` | Path to model to finetune. |
| `model_path`             | Output path where the best model and training progress are saved.    |
| `model_type` | Model type: **nllb** or **mt5**. |
| `tokenizer` | Path to tokenizer to use. |
| `compile` | Compile model for training. |
| `dropout` | Dropout rate. |
| `rdrop` | Enable RDROP regularization technique. |
| `rdrop_alpha` | RDROP alpha value. |
| `label_smoothing` | Label Smoothing value. |
| `max_length` | Max length for model inputs/outputs during training. |
| `train_dataset` | Train JSON dataset path. |
| `validation_dataset` | Validation JSON dataset path. |
| `maps` | Map JSON keys in dataset to the corresponding language tokens. |
| `directions` | Directions to train as a list. Example: `eng-spa`, `spa-eng`, etc. |
| `resume`                 | Resume training. If not specified, resuming is detected automatically. |
| `hps` | Hyperparameters for training. See `example_config.yaml`. |

See `examples/example_config.yaml` for more details.

# Dataset format
Here we show the dataset format for both training and validation.

## Train dataset
This must be a JSON file containing a list of translation pairs (each entry holds exactly one source/target pair). See `examples/example_train_dataset.jsonl`.

## Validation dataset
This must be a JSON file containing a list of entries, each with a single sentence and its various translations. See `examples/example_val_dataset.jsonl`.

Only the `directions` listed in the configuration file will be evaluated; all other directions will be ignored.
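As an illustration, this is roughly how directions can be derived from a validation entry and filtered against the configuration (a hedged sketch with a hypothetical `candidate_directions` helper, not the repository's actual implementation):
```python
from itertools import permutations


def candidate_directions(entry: dict, directions: list[str]) -> list[tuple[str, str]]:
    # Every ordered pair of language keys in an entry is a possible direction...
    pairs = permutations(entry.keys(), 2)
    # ...but only the pairs listed in the configuration (e.g. "spa-eng") are kept.
    allowed = {tuple(direction.split("-")) for direction in directions}
    return [pair for pair in pairs if pair in allowed]


# An entry with spa/eng/deu keys, evaluated only on spa-eng and eng-spa:
entry = {"spa": "Ejemplo número 1", "eng": "Example number 1", "deu": "Beispiel Nummer 1"}
print(candidate_directions(entry, ["spa-eng", "eng-spa"]))
# [('spa', 'eng'), ('eng', 'spa')]
```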

# MLFlow Setup
You can set up MLFlow locally:
```bash
mlflow server --host=localhost --port=5000
```
Then you can open http://localhost:5000/ in your browser.

You must also have a `.env` file in this directory (`multilingual-translator/`) with the `MLFLOW_TRACKING_URI` variable defined. This can be `http://localhost:5000` or the address of any other MLFlow server.
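For reference, this is one way the training code could pick up that variable (a minimal sketch assuming `python-dotenv` is installed; the repository's actual loading mechanism is not shown in this commit):
```python
import os

import mlflow
from dotenv import load_dotenv

# Read MLFLOW_TRACKING_URI (and anything else) from the local .env file,
# then point the MLFlow client at that server.
load_dotenv()
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
```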

# mT5
## Training
Before training mT5 models, make sure to add the language tokens to both the tokenizer and the model's embeddings. You can use the `convert_model.py` script for this:
```bash
python convert_model.py <path-to-mt5-model> --new-tokens <token-1> <token-2> ... -O <output-path>
```
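Conceptually, the conversion extends the tokenizer's vocabulary and resizes the model's embedding matrix accordingly. The `add_new_languages` utility is not shown in this commit, but a plausible equivalent looks like this (a sketch only; the base model name and output path are placeholders):
```python
from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast

# Placeholder base checkpoint; substitute your own mT5 model path.
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
tokenizer = T5TokenizerFast.from_pretrained("google/mt5-small")

# Register the language tokens as special tokens so they are never split,
# then grow the embedding table to cover the enlarged vocabulary.
new_tokens = ["spa_Latn", "eng_Latn"]
tokenizer.add_tokens(new_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained("mt5-small-translator")
tokenizer.save_pretrained("mt5-small-translator")
```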

## Inference
Example:
```python
from transformers import T5TokenizerFast, AutoModelForSeq2SeqLM

tokenizer = T5TokenizerFast.from_pretrained("path-to-your-model-or-tokenizer")
model = AutoModelForSeq2SeqLM.from_pretrained("path-to-your-model")

def translate(sentence: str, translate_from="spa_Latn", translate_to="eng_Latn") -> str:
    inputs = tokenizer(translate_from + sentence, return_tensors="pt")
    result = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids(translate_to))
    decoded = tokenizer.batch_decode(result, skip_special_tokens=True)[0]
    return decoded
```
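Assuming the model was converted with the `spa_Latn` and `eng_Latn` tokens, a call looks like:
```python
print(translate("Hola, ¿cómo estás?"))  # expected output along the lines of "Hello, how are you?"
```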

# NLLB
## Training
For language tokens, make sure to check available languages in [NLLB's tokenizer](https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json).

## Inference
```python
from transformers import NllbTokenizerFast, AutoModelForSeq2SeqLM

tokenizer = NllbTokenizerFast.from_pretrained("path-to-your-model-or-tokenizer")
model = AutoModelForSeq2SeqLM.from_pretrained("path-to-your-model")

def translate(sentence: str, translate_from="spa_Latn", translate_to="eng_Latn") -> str:
    tokenizer.src_lang = translate_from
    tokenizer.tgt_lang = translate_to

    inputs = tokenizer(sentence, return_tensors="pt")
    result = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids(translate_to))
    decoded = tokenizer.batch_decode(result, skip_special_tokens=True)[0]
    return decoded
```
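Here, `tokenizer.src_lang` tells the NLLB tokenizer which source-language token to insert when encoding the input, and `forced_bos_token_id` forces the decoder to start generating with the target-language token.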
47 changes: 47 additions & 0 deletions examples/example_config.yaml
@@ -0,0 +1,47 @@
#resume: false # <-- force resuming from checkpoint

######### MLFlow Setup #########
track_name: your-experiment-name
run_name: your-run-name

log_every: 10 # steps
evaluate_every_n_steps: 400

######### Model Configuration #########
model: facebook/nllb-200-distilled-600M
tokenizer: facebook/nllb-200-distilled-600M
model_path: your-new-nllb-model
compile: true

dropout: 0.1
rdrop: false
rdrop_alpha: 5
label_smoothing: 0.1
max_length: 200

######### Dataset setup #########
train_dataset: path-to-your-training-data
validation_dataset: path-to-your-validation-data

maps:
  spa: spa_Latn
  eng: eng_Latn

# If not specified, directions to evaluate will be inferred automatically.
# Directions will be based on the key of the validation dataset.
directions:
  # Directions are separated by '-' character, meaning: 'source-target'.
  - spa-eng
  - eng-spa

######### Hyper Parameters configuration #########
# Check https://github.com/ghanvert/AcceleratorModule for different optimizers and schedulers available.
hps:
  epochs: 10
  batch_size: 32
  optim:
    type: Adam
    lr: 1e-3
  scheduler:
    type: LinearWithWarmup
    warmup_ratio: 0.2
26 changes: 26 additions & 0 deletions examples/example_train_dataset.jsonl
@@ -0,0 +1,26 @@
[
  {
    "spa": "Ejemplo número 1",
    "eng": "Example number 1"
  },
  {
    "spa": "Ejemplo número 1",
    "jap": "例番号 1"
  },
  {
    "spa": "Ejemplo número 1",
    "deu": "Beispiel Nummer 1"
  },
  {
    "spa": "Ejemplo número 2",
    "eng": "Example number 2"
  },
  {
    "spa": "Ejemplo número 2",
    "jap": "例番号 2"
  },
  {
    "spa": "Ejemplo número 2",
    "deu": "Beispiel Nummer 2"
  }
]
14 changes: 14 additions & 0 deletions examples/example_val_dataset.jsonl
@@ -0,0 +1,14 @@
[
  {
    "spa": "Ejemplo número 1",
    "eng": "Example number 1",
    "deu": "Beispiel Nummer 1",
    "jap": "例番号 1"
  },
  {
    "spa": "Ejemplo número 2",
    "eng": "Example number 2",
    "deu": "Beispiel Nummer 2",
    "jap": "例番号 2"
  }
]
113 changes: 113 additions & 0 deletions src/arguments.py
@@ -0,0 +1,113 @@
from argparse import ArgumentParser
from typing import Optional

import yaml


class TrainingArguments:
"""
Get training arguments from a YAML or JSON file.
The following arguments are defined inside this class:
RESUME (`bool`, *optional*):
If set to `True` or `False`, it forces to resume training. If not defined,
automatically detects if a checkpoint exists.
MODEL (`str`):
Model to be finetuned. This should be a path to a directory containing the
model itself in HuggingFace's format.
MODEL_TYPE (`str`, *optional*, defaults to `nllb`):
Model type. Available types are: `nllb` and `mt5`.
TOKENIZER (`str`):
Tokenizer path.
MODEL_PATH (`str`):
Output model path where to save the model.
TRAIN_DATASET (`str`):
Train dataset path.
VALIDATION_DATASET (`str`):
Validation dataset path.
TRACK_NAME (`str`):
Track name in MLFlow.
RUN_NAME (`str`, *optional*):
Run name inside `TRACK_NAME` in MLFlow. If not defined, a name will be
generated for this run.
DROPOUT (`float`, *optional*):
Set a custom Dropout value for the model.
RDROP (`bool`):
Applies RDROP regularization technique (https://arxiv.org/abs/2106.14448).
RDROP_ALPHA (`float`, *optional*, defaults to `5`):
Applies an alpha factor to the RDROP regularization.
LABEL_SMOOTHING (`float`, *optional*):
Applies Label Smoothing regularization technique
(https://arxiv.org/pdf/1512.00567).
MAPS (`dict`, *optional*):
Dictionary key to corresponding language token.
DIRECTIONS (`list`, *optional*):
List of directions.
LOG_EVERY (`int`, *optional*, defaults to `10`):
Log train loss to MLFlow every N steps.
EVALUATE_EVERY_N_STEPS (`int`, *optional*):
Evaluate every N steps.
COMPILE (`bool`, *optional*, defaults to `False`):
Compile model.
IGNORE (`set`, *optional*):
Ignore certain pairs.
Args:
path (`str`):
Path to the YAML or JSON file.
"""

_mandatory_keys = {
"model",
"tokenizer",
"model_path",
"track_name",
"train_dataset",
"validation_dataset",
}

def __init__(self, path: str):
self._data: dict = yaml.safe_load(open(path))
self._check_arguments()

self.RESUME: Optional[bool] = self._data.get("resume")
self.MODEL: str = self._data.get("model")
self.MODEL_TYPE: str = self._data.get("model_type", "nllb").lower()
self.TOKENIZER: str = self._data.get("tokenizer")
self.MODEL_PATH: str = self._data.get("model_path")
self.TRACK_NAME: str = self._data.get("track_name")
self.RUN_NAME: Optional[str] = self._data.get("run_name")
self.MAX_LENGTH: int = self._data.get("max_length", 200)
self.DROPOUT: Optional[float] = self._data.get("dropout")
self.RDROP: bool = self._data.get("rdrop", False)
self.RDROP_ALPHA: float = self._data.get("rdrop_alpha", 5)
self.LABEL_SMOOTHING: Optional[float] = self._data.get("label_smoothing")
self.TRAIN_DATASET: str = self._data.get("train_dataset")
self.VALIDATION_DATASET: str = self._data.get("validation_dataset")
self.MAPS: dict = self._data.get("maps")
self.DIRECTIONS: list = self._data.get("directions")
self.LOG_EVERY: int = self._data.get("log_every", 10)
self.EVALUATE_EVERY_N_STEPS: Optional[int] = self._data.get(
"evaluate_every_n_steps"
)
self.COMPILE: Optional[bool] = self._data.get("compile", False)
self.IGNORE: Optional[list[str]] = self._data.get("ignore")

def _check_arguments(self):
keys = set(self._data.keys())
intersection = keys & self._mandatory_keys

if intersection != self._mandatory_keys:
missing_keys = list(self._mandatory_keys - intersection)
raise ValueError(f"Missing keys: {missing_keys}")


def get_config():
    parser = ArgumentParser(
        description="Train a translation model with a given configuration file."
    )

    parser.add_argument("config", type=str, help="Configuration YAML or JSON file.")
    args = parser.parse_args()

    return args.config
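A hypothetical usage sketch of these helpers (the actual `train.py` is not included in the files shown here):
```python
# Hypothetical wiring -- train.py itself is not part of this excerpt.
from arguments import TrainingArguments, get_config

config_path = get_config()             # parse the positional <config-file> argument
args = TrainingArguments(config_path)  # validate mandatory keys and expose the settings
print(args.MODEL, args.MODEL_TYPE, args.TRAIN_DATASET)
```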
33 changes: 33 additions & 0 deletions src/convert_model.py
@@ -0,0 +1,33 @@
from argparse import ArgumentParser

from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast

from utils import add_new_languages

parser = ArgumentParser(description="Convert model and tokenizer.")
parser.add_argument("model", type=str, help="Path to the model.")
parser.add_argument(
    "--tokenizer",
    type=str,
    help="Path to a tokenizer. If not provided, 'model' path will be used.",
)
parser.add_argument(
    "--new-tokens", type=str, nargs="+", required=True, help="New tokens to add."
)
parser.add_argument("--type", type=str, default="mt5", help="Model type.")
parser.add_argument("--output", "-O", required=True, type=str, help="Output path.")
args = parser.parse_args()

available_types = ["mt5"]
model_type = args.type.lower()
assert model_type in available_types, f"Available model types are: {available_types}"

# TODO We're only supporting mT5 models for now.
model_path = args.model
tokenizer_path = model_path if args.tokenizer is None else args.tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = T5TokenizerFast.from_pretrained(tokenizer_path)

add_new_languages(tokenizer, model, args.new_tokens)
model.save_pretrained(args.output, safe_serialization=False)
tokenizer.save_pretrained(args.output)
