Supertrainer is a unified repository for various trainers, inspired by projects like Axolotl, Torchtune, and Unsloth. It aims to provide a flexible and extensible framework for training, inferencing, and benchmarking different types of models (e.g., BERT, LLM, MLLM) using a configuration-driven approach.
- For Moondream training, my main resource is this YouTube channel.
Read the How to Run section first before reading this. This is just a debugging section, but you'd probably skip it if I put it below How to Run. BUT THIS IS IMPORTANT.
To see what the config is before running the model, you can use this command:
python src/supertrainer/main.py --cfg job
Make sure these values are correct before running the model. Hydra uses OmegaConf for the config, but I cast it into `addict.Dict` since OmegaConf can't hold arbitrary Python objects. This introduces another problem, though: if you access a key that doesn't exist, like `config.a.b.c.d.e.f.g`, it returns an empty dictionary (I plan to change this to `None` instead), since `addict.Dict` is very dynamic and doesn't force you to define the key beforehand. Maybe it's better to make it static, or to require some kind of instantiation before adding a new value to a key? That's a good idea tbh, I'll try to find a way to do it.
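A minimal sketch of what that cast looks like (hypothetical; `to_addict` is not necessarily the actual helper name in the codebase):

```python
from addict import Dict
from omegaconf import OmegaConf

def to_addict(cfg) -> Dict:
    # Resolve interpolations and convert the OmegaConf config to plain
    # containers first, then wrap it in addict.Dict so that arbitrary
    # Python objects (which OmegaConf rejects) can be attached later.
    return Dict(OmegaConf.to_container(cfg, resolve=True))
```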
[NOTE: I'm supposed to have already handled the error below as of my 0.0.2 commit, using `StrictDict`. But I'll keep the description here temporarily since I haven't fully tested `StrictDict`.]
Oh, and because it keeps returning an empty dictionary for keys that don't exist, you will run into a lot of errors like `TypeError: Dict object is not subscriptable` or `TypeError: Dict object does not have an attribute X`. These mean that the key you're trying to access doesn't exist, so make sure you're accessing the correct key. As long as you understand that these errors mean the key doesn't exist in the `addict.Dict`, you're good to go.
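A quick demo of this failure mode (values are illustrative):

```python
from addict import Dict

config = Dict({"model": {"name": "bert-base-uncased"}})

config.model.name    # "bert-base-uncased"
config.model.nmae    # {}  <- typo, but addict happily returns an empty Dict
config.model.nmae.x  # {}  <- chained access on the typo keeps "working"

# The error only surfaces later, when the empty Dict is used as a real value:
"models/" + config.model.nmae  # TypeError: can only concatenate str (not "Dict") to str
```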
To install the environment, run `make install_conda_env CONDA_ENV=<YOUR PREFERRED NAME>` to create the conda environment. By default, `CONDA_ENV=supertrainer` if not specified. We assume that you run this code on the MBZUAI HPC cluster; if not, you need to change `CONDA_PATH` in `scripts/install_conda_env.sh` to your own conda path.
The script installs the environment from the `environment.yaml` file. It also installs the `supertrainer` package in editable mode, so you can modify the code and the changes are automatically reflected in the environment.
If you need to install a new library or package, you can do it with either `pip` or `conda`. Just make sure to export your `environment.yaml` afterwards with `make export_conda_env` so that the environment file stays up to date. This command does an OS-agnostic export and removes the `supertrainer` package from the `environment.yaml` file.
Everything you need for training is in `src/supertrainer/main.py`, so you can run this file to train the model:
python src/supertrainer/main.py
Because it's based on Hydra, you can follow Hydra's examples to change any of the parameters. For example, since we will do a lot of training on different cases (maybe training BERT, maybe training an MLLM), you will often be changing the experiment part of the config:
python src/supertrainer/main.py +experiment=training_bert
In Hydra, all configs live in the `configs/` folder, so you can change the parameters there. My plan is to split each task into its own folder; for example, I already implemented `bert/` and `mllm/` here. The next milestone for this project will be:
- Implementing LLM training
Each task mainly has two folders: `dataset/` and `trainer/`. `dataset/` contains the dataset and `trainer/` contains the training loop. Maybe I'll add `model/` here too if we need a really custom model or want to tweak the architecture. Generally you don't, but I am not too sure.
What's the difference between `trainer/` and `model/`? Well, you can think of `trainer/` as the `Trainer` in HuggingFace, where we only set up things like `batch_size` or `learning_rate`. We don't really modify any architecture inside it. `model/`, on the other hand, is where we might want to change the activation function or the dimensions of the model.
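A rough illustration of that split (the values and fields below are just examples, not the actual Supertrainer configs):

```python
from transformers import BertConfig, BertModel

# model/-style change: the architecture itself is different.
config = BertConfig(hidden_act="relu", hidden_size=512, num_attention_heads=8)
model = BertModel(config)

# trainer/-style change: only training knobs; the architecture is untouched.
trainer_knobs = {"per_device_train_batch_size": 8, "learning_rate": 2e-5}
```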
For `dataset/`, I generally found it simple: mostly I only need to specify the dataset path, plus the tokenizer if needed. But I'm open to suggestions for a better design here.
For `trainer/`, I found it a bit more complex. Since everything here is based on QLoRA, there are `bitsandbytes_kwargs/` which specify what kind of quantization we want to use (the "Q" in QLoRA), `peft_kwargs/` which specify the layers and ranks of the LoRA, `model_kwargs/` which specify the settings of the actual model, and `training_kwargs/` which specify the training loop.
- `bitsandbytes_kwargs/` -> I found this rare to modify. Maybe we could add something like 8-bit LoRA, but generally it's not recommended. Since probably all settings will use the same config, I am thinking of making a `default/` folder and just using the settings from there.
- `model_kwargs/` -> I found that I never change this much, so I am also thinking of moving this into a global config or something.
- `peft_kwargs/` -> This is where we specify the layers and ranks of the LoRA. I found this the most important part of the training, since different models use different sets of these. I highly suggest naming each of these files after its model. But I myself didn't apply that yet LOL.
- `training_kwargs/` -> This is where we specify the training loop. I also didn't change this much per architecture so far, but we will 100% change a lot of this per architecture eventually. We can utilize Hydra's capabilities for this.
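To make the mapping concrete, here is roughly what each config group ends up instantiating (a sketch with illustrative values; `target_modules` in particular differs per architecture):

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# bitsandbytes_kwargs/ -> the quantization side of QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# peft_kwargs/ -> which layers the LoRA adapters attach to, and their ranks
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # model-dependent!
)

# training_kwargs/ -> the training loop itself
training_args = TrainingArguments(
    output_dir="outputs",
    num_train_epochs=3,
    per_device_train_batch_size=4,
)

# model_kwargs/ -> settings passed when loading the actual model,
# e.g. attn_implementation="flash_attention_2" or torch_dtype.
```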
For each task category, I put a config file that combines the configs above into one file; for example, in `mllm/` I put `mllm_kwargs.yaml`. But I keep putting custom config in there, which is probably not modular at all? So maybe I should separate things per architecture, but then it would overlap with `peft_kwargs/`?? Maybe we can set default settings for `peft_kwargs/` and only override certain parts per architecture? (We can do this in Hydra.)
Let's say I want to modify something in the config, for example use `sdpa` as my attention implementation instead of `flash_attention_2`, which is the default. What I can do is modify the training CLI like this:
python src/supertrainer/train.py +experiment=train_mllm trainer/common/[email protected]_kwargs=sdpa
It will automatically use the `sdpa` file inside the `model_kwargs` folder.
Note that the `@` symbol is used because we want to map the `sdpa` file from `trainer/common/model_kwargs` to `trainer.model_kwargs` (note the usage of `/` versus `.`; I know it's weird, but I can't find another solution for this).
You can also override a single parameter like this:
python src/supertrainer/main.py +experiment=train_mllm trainer.training_kwargs.num_train_epochs=10
Now our run will run for 10 epochs.
We assume that the dataset you feed in here is ready to be fed into the model. By "ready to be fed" we mean it is already in the form of text that can go straight into the tokenizer; the tokenization itself happens in each model's respective class! We will write a guide on which columns are needed for each task.
We will also provide a program for transforming each dataset into the correct form. This is important, especially for LLMs!
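For example, a "ready to be fed" LLM dataset might look like this (the column name here is illustrative; the per-task guide will define the real ones):

```python
from datasets import Dataset

ds = Dataset.from_dict({
    "text": [
        "### Instruction:\nTranslate to French: hello\n### Response:\nbonjour",
    ],
})
# No tokenization here: the model's own dataset class tokenizes `text` later.
```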
**Structure Plan**

```
supertrainer/
├── configs/
│ ├── experiment/
│ │ ├── train_mllm.yaml
│ │ ├── train_bert.yaml
│ │ └── train_llm.yaml
│ ├── trainer/
│ │ ├── common/
│ │ │ ├── hf_trainer_kwargs.yaml
│ │ │ └── bitsandbytes_kwargs.yaml
│ │ ├── mllm/
│ │ │ ├── model_kwargs.yaml
│ │ │ ├── peft_kwargs.yaml
│ │ │ └── training_kwargs.yaml
│ │ ├── bert/
│ │ │ ├── model_kwargs.yaml
│ │ │ ├── peft_kwargs.yaml
│ │ │ └── training_kwargs.yaml
│ │ └── llm/
│ │ ├── model_kwargs.yaml
│ │ ├── peft_kwargs.yaml
│ │ └── training_kwargs.yaml
│ ├── dataset/
│ │ ├── mllm/
│ │ ├── bert/
│ │ └── llm/
│ ├── model/
│ │ ├── mllm/
│ │ ├── bert/
│ │ └── llm/
│ └── mode/
│ ├── sanity_check.yaml
│ ├── toy.yaml
│ └── full.yaml
├── src/
│ ├── data/
│ │ ├── base.py
│ │ ├── mllm.py
│ │ ├── bert.py
│ │ ├── llm.py
│ │ └── __init__.py
│ ├── models/
│ │ ├── base.py
│ │ ├── mllm.py
│ │ ├── bert.py
│ │ ├── llm.py
│ │ └── __init__.py
│ ├── trainers/
│ │ ├── base.py
│ │ ├── mllm_trainer.py
│ │ ├── bert.py
│ │ ├── llm_trainer.py
│ │ └── __init__.py
│ ├── inference/
│ │ ├── base_inferencer.py
│ │ ├── mllm_inferencer.py
│ │ ├── bert_inferencer.py
│ │ ├── llm_inferencer.py
│ │ └── engines/
│ │ ├── hf_vanilla.py
│ │ ├── vllm.py
│ │ └── tensorrt.py
│ ├── benchmark/
│ │ ├── base_benchmark.py
│ │ ├── mllm_benchmark.py
│ │ ├── bert_benchmark.py
│ │ └── llm_benchmark.py
│ ├── metrics/
│ │ ├── base_metric.py
│ │ ├── classification_metrics.py
│ │ ├── generation_metrics.py
│ │ └── custom_metrics.py
│ ├── utils/
│ │ └── ...
│ ├── train.py
│ ├── inference.py
│ └── eval.py
└── ...
```
- **Configuration-Driven Approach**
  - Why: Allows for easy experimentation and reproducibility without changing code.
  - How: YAML files in the `configs/` directory, leveraging Hydra for composition.
- **Separation of Concerns**
  - Why: Improves maintainability and allows for easy extension of functionality.
  - How: Separate directories for different components (data, models, trainers, etc.).
- **Task-Specific Implementations**
  - Why: Different model types (BERT, LLM, MLLM) have unique requirements.
  - How: Task-specific classes (e.g., `BERTInferencer`, `LLMTrainer`) inherit from base classes (see the sketch after this list).
- **Flexible Inference Engines**
  - Why: Different inference engines (e.g., HuggingFace, vLLM, TensorRT) have varying performance characteristics.
  - How: Engine-specific configurations and implementations in `src/inference/engines/`.
- **Unified Benchmarking**
  - Why: Consistent evaluation across different model types and inference engines.
  - How: Common benchmarking interface with task-specific implementations.
- **Extensible Metrics**
  - Why: Different tasks require different evaluation metrics.
  - How: Modular metric implementations in `src/metrics/`.
- **Per-Operation Entry Points**
  - Why: Provides clear separation between different operations and allows for operation-specific configurations and logic.
  - How: Separate scripts (`train.py`, `inference.py`, `benchmark.py`) in the `scripts/` directory, one for each main operation.
- **Model Source Flexibility**
  - Why: Support for both pre-trained models from HuggingFace and locally saved models.
  - How: Configuration options for model source and path.
- **Customizable Training Loops**
  - Why: Different tasks and models may require specific training procedures.
  - How: Task-specific trainer implementations with shared base functionality.
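As a sketch of the inheritance pattern mentioned in the list above (hypothetical method names, loosely following the structure plan):

```python
from abc import ABC, abstractmethod

class BaseTrainer(ABC):
    """Shared setup lives here; subclasses only override what differs."""

    def __init__(self, config):
        self.config = config

    @abstractmethod
    def train(self) -> None:
        ...

class LLMTrainer(BaseTrainer):
    def train(self) -> None:
        # Task-specific training loop for LLMs goes here.
        print(f"training an LLM with config: {self.config}")
```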
- Unified Interface: Train, infer, and benchmark different model types using a consistent interface.
- Configurability: Easily switch between models, datasets, and training/inference settings using YAML configs.
- Extensibility: Add new model types, inference engines, or metrics with minimal changes to existing code.
- Reproducibility: Experiments can be easily reproduced by sharing configuration files.
- Performance Optimization: Support for various inference engines allows for optimized deployment.
- Implement LLM Training with Unsloth
- Refactor dataset to follow the structure plan
- Implement LLM Training with HuggingFace Trainer?
- I implemented `self.config.is_testing` but I don't really use it for testing. Maybe I should think this through better!
- In `mllm` we actually need to pass the column name `image_col` to `DataCollatorWithPadding`; sadly we only pass the config name to `dataset/` and not to `trainers/`. This means I have to explicitly pass `image_col` in the `postprocess_config` of `mllm`. One solution I can think of is to not separate the config between `trainer` and `dataset` (even though it's currently not split in the `.yaml`, it gets split in `src/supertrainer/train.py`).
- Every single `push_to_hub` is still buggy :(. What I want is for it to be kind of like the Unsloth codebase, where you can push the model to the hub with every single problem already solved (basically ready for inference).
  - One "simple" example (only simple if you know the thing; if not, you are screwed!) is to do padding to the left, as in the sketch after this list. Their repo is so gooodddd!!!!
- I need to think more about `supertrainer/data/base.py`, where for now I commented out the `prepare_dataset` function since it clashes with `ConversationLLMDataset` in `supertrainer/data/llm.py`. I need a better design for this!
- We haven't used the instruct `tokenizer` for the `llm`, which is bad since we want that `chat_template`!
- When running `sanity_check`, we should remove the evaluation dataset (maybe?), which means we need to remove certain configs like `eval_dataset`, `auto_batch_size`, etc.
- Fix `FA2` in the `environment.yaml`.
- Hardcoded settings in `train.py` for `bert`, since I have to postprocess the config for the dataset and the model from the same config. One way I can think of to solve this is to create a postprocess-config class and specify in the YAML config which postprocess type we want to use (just like `dataset` and `trainer`). I already created a boilerplate and just need to execute this, which means removing the postprocess config from the trainer entirely (which fixes the comment I left there saying it's weird to do this, etc.).
  - But what happens when the `postprocess_config` instantiates something like `LoraConfig`, which I always instantiate in the trainer's `postprocess_config`? Need to think more about that.
    - Fixed by modifying the config in each class. Apparently, `StrictDict` is a mutable object, so if we modify it in place the change persists outside the current scope. Therefore we can just modify the `config` directly whenever we want.
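The left-padding trick referenced in the `push_to_hub` item above looks roughly like this (a sketch; the actual Unsloth code is not reproduced here):

```python
from transformers import AutoTokenizer

# Decoder-only models must be padded on the LEFT for batched generation;
# otherwise the model generates continuations from pad tokens.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
tokenizer.padding_side = "left"
```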
- Fused Kernels
- Early Fusion training support
- Distributed training support
- Automated hyperparameter tuning
- Support for additional model architectures and tasks
By following this structure and philosophy, Supertrainer aims to provide a flexible and powerful tool for researchers and practitioners working with a variety of machine learning models and tasks.