feat: Upgrade Weights & Biases callback #29125

Closed

Changes from all commits · 62 commits
bb7e5fd
feat: add peft config to wandb if it exists in the model
parambharat Oct 25, 2023
2b386bb
feat: add model parameter count to wandb config and model metadata
parambharat Oct 25, 2023
665f284
feat: add metrics on prediction to wandb
parambharat Oct 25, 2023
d0f3176
feat: add model architecture to the model artifact
parambharat Oct 27, 2023
46d0115
feat: add initial model and architecture to the model artifact on setup
parambharat Oct 27, 2023
72480ff
Merge branch 'main' into wandb/callback-upgrade
parambharat Jan 10, 2024
7a3b476
feat: add markdown badge to model card
parambharat Jan 11, 2024
44a4226
feat: add parameters for peft models and model card badge
parambharat Jan 15, 2024
e59e15e
Merge branch 'main' into wandb/callback-upgrade
parambharat Feb 19, 2024
bf93923
refactor: change checkpoints to log and model and rename initial to base
parambharat Feb 19, 2024
8ab50ad
feat: add step and epoch aliases to the checkpoints
parambharat Feb 20, 2024
08ced55
chore: run fixup and style fixes
parambharat Feb 20, 2024
f0bcb24
Merge branch 'main' into wandb/callback-upgrade
parambharat Feb 20, 2024
62155b2
Merge branch 'main' into wandb/callback-upgrade
parambharat Mar 12, 2024
b1a3110
fix: address review comments related to DRY and naming consistency
parambharat Mar 21, 2024
9042c82
Merge branch 'main' of github.com:parambharat/transformers into wandb…
parambharat Mar 21, 2024
b50e13b
Merge branch 'main' of github.com:parambharat/transformers into wandb…
parambharat Apr 1, 2024
096f304
[docs] Big model loading (#29920)
stevhliu Apr 2, 2024
83b26dd
[`generate`] fix breaking change for patch (#29976)
ArthurZucker Apr 2, 2024
416711c
Fix 29807 sinusoidal positional encodings in Flaubert, Informer and X…
hovnatan Apr 2, 2024
33288ff
[bnb] Fix bug in `_replace_with_bnb_linear` (#29958)
SunMarc Apr 2, 2024
fed27ff
Adding FlaxNoRepeatNGramLogitsProcessor (#29677)
giganttheo Apr 2, 2024
0d04b1e
Add Flash Attention 2 support to Musicgen and Musicgen Melody (#29939)
ylacombe Apr 2, 2024
cb5927c
[Docs] Make an ordered list prettier in add_tensorflow_model.md (#29949)
windsonsea Apr 2, 2024
15cd687
Fix `skip_special_tokens` for `Wav2Vec2CTCTokenizer._decode` (#29311)
msublee Apr 2, 2024
9b0a8ea
Hard error when ignoring tensors. (#27484) (#29906)
Narsil Apr 2, 2024
5080ab1
Generate: fix logits processors doctests (#29718)
gante Apr 2, 2024
fce52ce
Fix `remove_columns` in `text-classification` example (#29351)
mariosasko Apr 2, 2024
b44df05
Update `tests/utils/tiny_model_summary.json` (#29941)
ydshieh Apr 3, 2024
81642d2
Make EncodecModel.decode ONNX exportable (#29913)
fxmarty Apr 3, 2024
17b06e2
Fix Swinv2ForImageClassification NaN output (#29981)
miguelm-almeida Apr 3, 2024
851f253
Fix Qwen2Tokenizer (#29929)
jklj077 Apr 3, 2024
bcd42c4
Fix `kwargs` handling in `generate_with_fallback` (#29225)
cifkao Apr 3, 2024
240e106
Fix probability computation in `WhisperNoSpeechDetection` when recomp…
cifkao Apr 3, 2024
cc75f1a
Fix vipllava for generation (#29874)
zucchini-nlp Apr 3, 2024
34bfe95
[docs] Fix audio file (#30006)
stevhliu Apr 3, 2024
c10b5dd
Superpoint imports fix (#29898)
zucchini-nlp Apr 3, 2024
695d823
[`Main CIs`] Fix the red cis (#30022)
ArthurZucker Apr 3, 2024
863e256
Make clearer about zero_init requirements (#29879)
muellerzr Apr 3, 2024
03732de
Enable multi-device for efficientnet (#29989)
jla524 Apr 3, 2024
4e6c5eb
Add a converter from mamba_ssm -> huggingface mamba (#29705)
byi8220 Apr 4, 2024
75b76a5
[`ProcessingIdefics`] Attention mask bug with padding (#29449)
byi8220 Apr 4, 2024
517a3e6
Refactor Cohere Model (#30027)
saurabhdash2512 Apr 4, 2024
24d787c
Add `whisper` to `IMPORTANT_MODELS` (#30046)
ydshieh Apr 5, 2024
8b52fa6
skip `test_encode_decode_fast_slow_all_tokens` for now (#30044)
ydshieh Apr 5, 2024
79d62b2
if output is tuple like facebook/hf-seamless-m4t-medium, waveform is …
sywangyi Apr 5, 2024
d704c0b
Fix mixtral ONNX Exporter Issue. (#29858)
AdamLouly Apr 5, 2024
1ab7136
[Trainer] Allow passing image processor (#29896)
NielsRogge Apr 5, 2024
ec7e47a
feat: add peft config to wandb if it exists in the model
parambharat Oct 25, 2023
d1717c6
feat: add model parameter count to wandb config and model metadata
parambharat Oct 25, 2023
042d1aa
feat: add metrics on prediction to wandb
parambharat Oct 25, 2023
cf31c9a
feat: add model architecture to the model artifact
parambharat Oct 27, 2023
13a4d43
feat: add initial model and architecture to the model artifact on setup
parambharat Oct 27, 2023
940f296
chore: update and rebase with upstream main
parambharat Apr 5, 2024
859b414
feat: add parameters for peft models and model card badge
parambharat Jan 15, 2024
f43dd42
refactor: change checkpoints to log and model and rename initial to base
parambharat Feb 19, 2024
a98ffeb
feat: add step and epoch aliases to the checkpoints
parambharat Feb 20, 2024
e80a34e
chore: run fixup and style fixes
parambharat Feb 20, 2024
b25675b
fix: address review comments related to DRY and naming consistency
parambharat Mar 21, 2024
4e5e2a4
chore: update and rebase with upstream main
parambharat Apr 5, 2024
e5ad376
chore: update and rebase with upstream main
parambharat Apr 5, 2024
10c1142
chore: run make fixup
parambharat Apr 5, 2024
2 changes: 1 addition & 1 deletion docs/source/en/_toctree.yml
@@ -172,7 +172,7 @@
title: GPU inference
title: Optimizing inference
- local: big_models
title: Instantiating a big model
title: Instantiate a big model
- local: debugging
title: Debugging
- local: tf_xla
62 changes: 31 additions & 31 deletions docs/source/en/add_tensorflow_model.md
@@ -109,52 +109,52 @@ instructions below to set up your environment and open a draft PR.

2. Clone your `transformers` fork to your local disk, and add the base repository as a remote:

```bash
git clone https://github.com/[your Github handle]/transformers.git
cd transformers
git remote add upstream https://github.com/huggingface/transformers.git
```
```bash
git clone https://github.com/[your Github handle]/transformers.git
cd transformers
git remote add upstream https://github.com/huggingface/transformers.git
```

3. Set up a development environment, for instance by running the following command:
3. Set up a development environment, for instance by running the following commands:

```bash
python -m venv .env
source .env/bin/activate
pip install -e ".[dev]"
```
```bash
python -m venv .env
source .env/bin/activate
pip install -e ".[dev]"
```

Depending on your OS, and since the number of optional dependencies of Transformers is growing, you might get a
failure with this command. If that's the case make sure to install TensorFlow then do:
Depending on your OS, and since the number of optional dependencies of Transformers is growing, you might get a
failure with this command. If that's the case make sure to install TensorFlow then do:

```bash
pip install -e ".[quality]"
```
```bash
pip install -e ".[quality]"
```

**Note:** You don't need to have CUDA installed. Making the new model work on CPU is sufficient.
**Note:** You don't need to have CUDA installed. Making the new model work on CPU is sufficient.

4. Create a branch with a descriptive name from your main branch
4. Create a branch with a descriptive name from your main branch:

```bash
git checkout -b add_tf_brand_new_bert
```
```bash
git checkout -b add_tf_brand_new_bert
```

5. Fetch and rebase to current main
5. Fetch and rebase to current main:

```bash
git fetch upstream
git rebase upstream/main
```
```bash
git fetch upstream
git rebase upstream/main
```

6. Add an empty `.py` file in `transformers/src/models/brandnewbert/` named `modeling_tf_brandnewbert.py`. This will
be your TensorFlow model file.

7. Push the changes to your account using:

```bash
git add .
git commit -m "initial commit"
git push -u origin add_tf_brand_new_bert
```
```bash
git add .
git commit -m "initial commit"
git push -u origin add_tf_brand_new_bert
```

8. Once you are satisfied, go to the webpage of your fork on GitHub. Click on “Pull request”. Make sure to add the
GitHub handle of some members of the Hugging Face team as reviewers, so that the Hugging Face team gets notified for
192 changes: 142 additions & 50 deletions docs/source/en/big_models.md
@@ -14,110 +14,202 @@ rendered properly in your Markdown viewer.

-->

# Instantiating a big model
# Instantiate a big model

When you want to use a very big pretrained model, one challenge is to minimize the use of the RAM. The usual workflow
from PyTorch is:
A barrier to accessing very large pretrained models is the amount of memory required. When loading a pretrained PyTorch model, you usually:

1. Create your model with random weights.
1. Create a model with random weights.
2. Load your pretrained weights.
3. Put those pretrained weights in your random model.
3. Put those pretrained weights in the model.

Step 1 and 2 both require a full version of the model in memory, which is not a problem in most cases, but if your model starts weighing several GigaBytes, those two copies can make you get out of RAM. Even worse, if you are using `torch.distributed` to launch a distributed training, each process will load the pretrained model and store these two copies in RAM.
The first two steps both require a full version of the model in memory and if the model weighs several GBs, you may not have enough memory for two copies of it. This problem is amplified in distributed training environments because each process loads a pretrained model and stores two copies in memory.
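
As a rough sketch of that workflow in plain PyTorch and Transformers (assuming a local `pytorch_model.bin` checkpoint saved from the same architecture; the file name is only illustrative), the two full copies come from steps 1 and 2:

```py
import torch
from transformers import AutoConfig, AutoModel

# Step 1: build a model with randomly initialized weights (first full copy in memory)
config = AutoConfig.from_pretrained("google-bert/bert-base-cased")
model = AutoModel.from_config(config)

# Step 2: load the pretrained weights from disk (second full copy in memory);
# assumes the checkpoint keys match the architecture created above
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# Step 3: copy the pretrained weights into the randomly initialized model
model.load_state_dict(state_dict)
```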

<Tip>
> [!TIP]
> The randomly created model is initialized with "empty" tensors, which take space in memory without filling it. The random values are whatever was in this chunk of memory at the time. To improve loading speed, the [`_fast_init`](https://github.com/huggingface/transformers/blob/c9f6e5e35156e068b227dd9b15521767f6afd4d2/src/transformers/modeling_utils.py#L2710) parameter is set to `True` by default to skip the random initialization for all weights that are correctly loaded.

Note that the randomly created model is initialized with "empty" tensors, which take the space in memory without filling it (thus the random values are whatever was in this chunk of memory at a given time). The random initialization following the appropriate distribution for the kind of model/parameters instantiated (like a normal distribution for instance) is only performed after step 3 on the non-initialized weights, to be as fast as possible!

</Tip>
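
As a tiny illustration of what such an "empty" tensor looks like (plain PyTorch, not the internal Transformers code path):

```py
import torch

# torch.empty allocates memory without initializing it, so the printed values are
# whatever happened to be in that chunk of memory at the time
t = torch.empty(2, 3)
print(t)
```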

In this guide, we explore the solutions Transformers offer to deal with this issue. Note that this is an area of active development, so the APIs explained here may change slightly in the future.
This guide will show you how Transformers can help you load large pretrained models despite their memory requirements.

## Sharded checkpoints

Since version 4.18.0, model checkpoints that end up taking more than 10GB of space are automatically sharded in smaller pieces. In terms of having one single checkpoint when you do `model.save_pretrained(save_dir)`, you will end up with several partial checkpoints (each of which being of size < 10GB) and an index that maps parameter names to the files they are stored in.
From Transformers v4.18.0, a checkpoint larger than 10GB is automatically sharded by the [`~PreTrainedModel.save_pretrained`] method. It is split into several smaller partial checkpoints and creates an index file that maps parameter names to the files they're stored in.

You can control the maximum size before sharding with the `max_shard_size` parameter, so for the sake of an example, we'll use a normal-size models with a small shard size: let's take a traditional BERT model.
The maximum shard size is controlled with the `max_shard_size` parameter. It defaults to 5GB because smaller shards are easier to load on free-tier GPU instances without running out of memory.

```py
from transformers import AutoModel

model = AutoModel.from_pretrained("google-bert/bert-base-cased")
```

If you save it using [`~PreTrainedModel.save_pretrained`], you will get a new folder with two files: the config of the model and its weights:
For example, let's shard [BioMistral/BioMistral-7B](https://hf.co/BioMistral/BioMistral-7B).

```py
>>> import os
>>> import tempfile

>>> with tempfile.TemporaryDirectory() as tmp_dir:
... model.save_pretrained(tmp_dir)
... model.save_pretrained(tmp_dir, max_shard_size="5GB")
... print(sorted(os.listdir(tmp_dir)))
['config.json', 'pytorch_model.bin']
['config.json', 'generation_config.json', 'model-00001-of-00006.safetensors', 'model-00002-of-00006.safetensors', 'model-00003-of-00006.safetensors', 'model-00004-of-00006.safetensors', 'model-00005-of-00006.safetensors', 'model-00006-of-00006.safetensors', 'model.safetensors.index.json']
```

Now let's use a maximum shard size of 200MB:
The sharded checkpoint is reloaded with the [`~PreTrainedModel.from_pretrained`] method.

```py
>>> with tempfile.TemporaryDirectory() as tmp_dir:
... model.save_pretrained(tmp_dir, max_shard_size="200MB")
... print(sorted(os.listdir(tmp_dir)))
['config.json', 'pytorch_model-00001-of-00003.bin', 'pytorch_model-00002-of-00003.bin', 'pytorch_model-00003-of-00003.bin', 'pytorch_model.bin.index.json']
... model.save_pretrained(tmp_dir, max_shard_size="5GB")
... new_model = AutoModel.from_pretrained(tmp_dir)
```

On top of the configuration of the model, we see three different weights files, and an `index.json` file which is our index. A checkpoint like this can be fully reloaded using the [`~PreTrainedModel.from_pretrained`] method:
The main advantage of sharded checkpoints for big models is that each shard is loaded after the previous one, which caps the memory usage to only the model size and the largest shard size.

You could also directly load a sharded checkpoint inside a model without the [`~PreTrainedModel.from_pretrained`] method (similar to PyTorch's `load_state_dict()` method for a full checkpoint). In this case, use the [`~modeling_utils.load_sharded_checkpoint`] method.

```py
>>> from transformers.modeling_utils import load_sharded_checkpoint

>>> with tempfile.TemporaryDirectory() as tmp_dir:
... model.save_pretrained(tmp_dir, max_shard_size="200MB")
... new_model = AutoModel.from_pretrained(tmp_dir)
... model.save_pretrained(tmp_dir, max_shard_size="5GB")
... load_sharded_checkpoint(model, tmp_dir)
```

The main advantage of doing this for big models is that during step 2 of the workflow shown above, each shard of the checkpoint is loaded after the previous one, capping the memory usage in RAM to the model size plus the size of the biggest shard.
### Shard metadata

Behind the scenes, the index file is used to determine which keys are in the checkpoint, and where the corresponding weights are stored. We can load that index like any json and get a dictionary:
The index file determines which keys are in the checkpoint and where the corresponding weights are stored. This file is loaded like any other JSON file and you can get a dictionary from it.

```py
>>> import json

>>> with tempfile.TemporaryDirectory() as tmp_dir:
... model.save_pretrained(tmp_dir, max_shard_size="200MB")
... with open(os.path.join(tmp_dir, "pytorch_model.bin.index.json"), "r") as f:
... model.save_pretrained(tmp_dir, max_shard_size="5GB")
... with open(os.path.join(tmp_dir, "model.safetensors.index.json"), "r") as f:
... index = json.load(f)

>>> print(index.keys())
dict_keys(['metadata', 'weight_map'])
```

The metadata just consists of the total size of the model for now. We plan to add other information in the future:
The `metadata` key provides the total model size.

```py
>>> index["metadata"]
{'total_size': 433245184}
{'total_size': 28966928384}
```
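
For readability, the byte count can be converted to gigabytes (a quick sketch reusing the `index` dictionary loaded above; 1024³ bytes per GiB):

```py
>>> print(f"{index['metadata']['total_size'] / 1024**3:.1f} GiB")
27.0 GiB
```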

The weights map is the main part of this index, which maps each parameter name (as usually found in a PyTorch model `state_dict`) to the file it's stored in:
The `weight_map` key maps each parameter name (as found in a PyTorch model's `state_dict`) to the shard it's stored in.

```py
>>> index["weight_map"]
{'embeddings.LayerNorm.bias': 'pytorch_model-00001-of-00003.bin',
'embeddings.LayerNorm.weight': 'pytorch_model-00001-of-00003.bin',
{'lm_head.weight': 'model-00006-of-00006.safetensors',
'model.embed_tokens.weight': 'model-00001-of-00006.safetensors',
'model.layers.0.input_layernorm.weight': 'model-00001-of-00006.safetensors',
'model.layers.0.mlp.down_proj.weight': 'model-00001-of-00006.safetensors',
...
}
```
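
The same index can also show how the tensors are distributed across shards, for example by counting how many parameter tensors each shard file holds (a quick sketch; the exact counts depend on the checkpoint):

```py
from collections import Counter

# Count how many parameter tensors each shard file contains
shard_counts = Counter(index["weight_map"].values())
for shard_file, num_tensors in sorted(shard_counts.items()):
    print(shard_file, num_tensors)
```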

If you want to directly load such a sharded checkpoint inside a model without using [`~PreTrainedModel.from_pretrained`] (like you would do `model.load_state_dict()` for a full checkpoint) you should use [`~modeling_utils.load_sharded_checkpoint`]:
## Accelerate's Big Model Inference

> [!TIP]
> Make sure you have Accelerate v0.9.0 or later and PyTorch v1.9.0 or later installed.

From Transformers v4.20.0, the [`~PreTrainedModel.from_pretrained`] method is supercharged with Accelerate's [Big Model Inference](https://hf.co/docs/accelerate/usage_guides/big_modeling) feature to efficiently handle really big models! Big Model Inference creates a *model skeleton* on PyTorch's [**meta**](https://pytorch.org/docs/main/meta.html) device. The randomly initialized parameters are only created when the pretrained weights are loaded. This way, you aren't keeping two copies of the model in memory at the same time (one for the randomly initialized model and one for the pretrained weights), and the maximum memory consumed is only the full model size.
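
A minimal illustration of the *meta* device idea in plain PyTorch (not the internal Transformers code path, just the underlying concept): a module created on `meta` has shapes and dtypes but no storage, so it costs essentially no memory.

```py
import torch

# Parameters on the meta device carry only metadata (shape, dtype), no actual data
skeleton = torch.nn.Linear(4096, 4096, device="meta")
print(skeleton.weight.shape)    # torch.Size([4096, 4096])
print(skeleton.weight.is_meta)  # True
```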

To enable Big Model Inference in Transformers, set `low_cpu_mem_usage=True` in the [`~PreTrainedModel.from_pretrained`] method.

```py
>>> from transformers.modeling_utils import load_sharded_checkpoint
from transformers import AutoModelForCausalLM

>>> with tempfile.TemporaryDirectory() as tmp_dir:
... model.save_pretrained(tmp_dir, max_shard_size="200MB")
... load_sharded_checkpoint(model, tmp_dir)
gemma = AutoModelForCausalLM.from_pretrained("google/gemma-7b", low_cpu_mem_usage=True)
```

Accelerate automatically dispatches the model weights across all available devices, starting with the fastest device (GPU) first and then offloading to the slower devices (CPU and even hard drive). This is enabled by setting `device_map="auto"` in the [`~PreTrainedModel.from_pretrained`] method. When you pass the `device_map` parameter, `low_cpu_mem_usage` is automatically set to `True` so you don't need to specify it.

```py
from transformers import AutoModelForCausalLM

# these loading methods are equivalent
gemma = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto")
gemma = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto", low_cpu_mem_usage=True)
```
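
If you want to cap how much of each device the automatic dispatch may use, [`~PreTrainedModel.from_pretrained`] also accepts a `max_memory` dictionary (not covered above; the limits below are only illustrative):

```py
from transformers import AutoModelForCausalLM

# Limit GPU 0 to ~10GiB and the CPU to ~30GiB; whatever doesn't fit is offloaded further down
gemma = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b",
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "30GiB"},
)
```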

## Low memory loading
You can also write your own `device_map` by mapping each layer to a device. It should map all model parameters to a device, but you don't have to detail where all the submodules of a layer go if the entire layer is on the same device.

Sharded checkpoints reduce the memory usage during step 2 of the workflow mentioned above, but in order to use that model in a low memory setting, we recommend leveraging our tools based on the Accelerate library.
```python
device_map = {"model.layers.1": 0, "model.layers.14": 1, "model.layers.31": "cpu", "lm_head": "disk"}
```

Access the `hf_device_map` attribute to see how Accelerate split the model across devices.

```py
gemma.hf_device_map
```

```python out
{'model.embed_tokens': 0,
'model.layers.0': 0,
'model.layers.1': 0,
'model.layers.2': 0,
'model.layers.3': 0,
'model.layers.4': 0,
'model.layers.5': 0,
'model.layers.6': 0,
'model.layers.7': 0,
'model.layers.8': 0,
'model.layers.9': 0,
'model.layers.10': 0,
'model.layers.11': 0,
'model.layers.12': 0,
'model.layers.13': 0,
'model.layers.14': 'cpu',
'model.layers.15': 'cpu',
'model.layers.16': 'cpu',
'model.layers.17': 'cpu',
'model.layers.18': 'cpu',
'model.layers.19': 'cpu',
'model.layers.20': 'cpu',
'model.layers.21': 'cpu',
'model.layers.22': 'cpu',
'model.layers.23': 'cpu',
'model.layers.24': 'cpu',
'model.layers.25': 'cpu',
'model.layers.26': 'cpu',
'model.layers.27': 'cpu',
'model.layers.28': 'cpu',
'model.layers.29': 'cpu',
'model.layers.30': 'cpu',
'model.layers.31': 'cpu',
'model.norm': 'cpu',
'lm_head': 'cpu'}
```

Please read the following guide for more information: [Large model loading using Accelerate](./main_classes/model#large-model-loading)
## Model data type

PyTorch model weights are normally instantiated as torch.float32, which can be an issue if you want to load a model in a different data type. For example, loading the weights in torch.float32 first and then again in your desired data type, like torch.float16, requires twice as much memory.

> [!WARNING]
> Due to how PyTorch is designed, the `torch_dtype` parameter only supports floating data types.

To avoid wasting memory like this, explicitly set the `torch_dtype` parameter to the desired data type, or set `torch_dtype="auto"` to load the weights with the optimal memory pattern (the data type is automatically derived from the model weights).

<hfoptions id="dtype">
<hfoption id="specific dtype">

```py
import torch
from transformers import AutoModelForCausalLM

gemma = AutoModelForCausalLM.from_pretrained("google/gemma-7b", torch_dtype=torch.float16)
```

</hfoption>
<hfoption id="auto dtype">

```py
from transformers import AutoModelForCausalLM

gemma = AutoModelForCausalLM.from_pretrained("google/gemma-7b", torch_dtype="auto")
```

</hfoption>
</hfoptions>

You can also set the data type to use for models instantiated from scratch.

```python
import torch
from transformers import AutoConfig, AutoModel

my_config = AutoConfig.from_pretrained("google/gemma-2b", torch_dtype=torch.float16)
model = AutoModel.from_config(my_config)
```