feat: add sequence packing support for DPO (#423)
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: NeMo-Aligner CI <[email protected]>
Signed-off-by: abukharin <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: arendu <[email protected]>
Co-authored-by: Terry Kong <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Alexander Bukharin <[email protected]>
Co-authored-by: abukharin <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: Adi Renduchintala <[email protected]>
7 people authored Dec 6, 2024
1 parent 9e515ce commit 7a2d427
Showing 17 changed files with 1,086 additions and 193 deletions.
1 change: 1 addition & 0 deletions .github/workflows/cicd-main.yml
@@ -92,6 +92,7 @@ jobs:
- ppo-llama3-pp2-reshard
- reinforce-llama3-pp2-reshard
- dpo-llama3
- dpo-llama3-pack
- kd-llama3
- sft-llama3
- rm-llama3
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -18,6 +18,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
## [Next Version]

### New Features and Optimizations
- Sequence packing is now supported when running DPO.
- Added support for Knowledge Distillation with SFT. See the [tutorial](docs/user-guide/knowledge-distillation.rst) for details.
- Added support for Megatron Core’s distributed optimizer, which can be configured using `++model.optim.name=mcore_distributed_optim`.
- Introduced `ScopedTimer` as a successor to `SyncedTimer`. `SyncedTimer` is marked for deprecation and will be removed in the next version.
61 changes: 61 additions & 0 deletions docs/user-guide/dpo.rst
@@ -27,6 +27,7 @@ The algorithm is identified with the ``dpo.preference_loss`` config variable. We

To use the RPO algorithm, each dataset example should have ``chosen_reward`` and ``rejected_reward``, which might come from human labelers or reward models. If ``chosen_reward`` and ``rejected_reward`` are not present in the data, ``dpo.default_chosen_reward`` and ``dpo.default_rejected_reward`` are used.


Obtain a Pretrained Model
#########################
To start, we must first get a pretrained model to align. There are two models we recommend to get started. The rest of the tutorial will work with either model, but for demonstration purposes, we will use the smaller 2B model.
@@ -80,6 +81,9 @@ For best DPO training performance, it is recommended that you start with a SFT m
DPO Model Training
##################

Prepare your Dataset
====================

Before running the core DPO training, you must prepare your training and validation data in the format required for DPO training. DPO expects ``.jsonl`` files where each line is a JSON dict corresponding to a single, complete sample, as shown below::

{"prompt": "Which year was the Magna Carta signed?", "chosen_response": "1215", "rejected_response": "I refuse to answer this question."}
@@ -94,6 +98,63 @@ Always follow the prompt-response template format used during your SFT training

Your JSONL file must contain at least as many samples as the Global Batch Size (GBS) you plan to use during training. For example, if GBS = 64, ensure that both your training and validation files include at least 64 samples. Using a file with fewer samples than the GBS will result in a crash.
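
The two requirements above (one JSON dict per line with the expected keys, and at least GBS samples per file) can be checked before launching a job. The following is a minimal sketch, not part of NeMo-Aligner; the helper name ``check_dpo_jsonl`` is hypothetical, and the required keys are taken from the sample line shown earlier.

```python
import json

# Keys taken from the sample record shown in this guide.
REQUIRED_KEYS = ("prompt", "chosen_response", "rejected_response")


def check_dpo_jsonl(path, global_batch_size):
    """Hypothetical helper: return the sample count, or raise ValueError.

    Raises if a line is not valid JSON, if a record lacks a required key,
    or if the file holds fewer samples than the global batch size.
    """
    num_samples = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            # json.JSONDecodeError is a subclass of ValueError.
            sample = json.loads(line)
            missing = [key for key in REQUIRED_KEYS if key not in sample]
            if missing:
                raise ValueError(f"{path}:{line_no} is missing keys {missing}")
            num_samples += 1
    if num_samples < global_batch_size:
        raise ValueError(
            f"{path} has {num_samples} samples, fewer than GBS={global_batch_size}"
        )
    return num_samples
```

For example, ``check_dpo_jsonl("/path/to/train_dpo_format.jsonl", 64)`` would raise before training starts if the file holds fewer than 64 samples.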

Sequence Packing with DPO
=========================

We also support packed sequence training with DPO. Sequence packing is a training technique in which multiple training examples are concatenated to create one longer sequence. This approach greatly reduces the need for padding and improves GPU utilization.
Refer to the `sequence packing documentation <https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/optimizations/sequence_packing.html?highlight=packing#>`_ for a detailed overview of sequence packing and its advantages. That document discusses sequence packing for SFT in particular, but the same benefits apply to DPO.

Packing your DPO dataset is done as a preprocessing step in NeMo and NeMo-Aligner. We provide a `script <https://github.com/NVIDIA/NeMo-Aligner/blob/ashors/dpo-packing/examples/nlp/data/dpo/prepare_packed_dpo_dataset.py>`_ to pack your DPO dataset. This script assumes you already have a prepared DPO-format dataset. The script runs three main steps:

#. The online processing code in ``DPOModelDataset`` is run. This includes tasks such as prompt template manipulation and tokenization. The result is an array of tokenized sequences, represented by indices.
#. Chosen and rejected sequences are concatenated.
#. The tokenized sequences are grouped by length and a packing algorithm is run.
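
Step 2 can be illustrated with a small sketch. This is a simplification and not the script's exact code: the function name ``combine_pair`` is hypothetical, and the real script also carries labels and rewards for each pair. The point is that the chosen and rejected token ids become one sequence, and the index where the rejected part starts is stored so the pair can be split again later.

```python
import numpy as np


def combine_pair(chosen_ids, rejected_ids):
    """Concatenate one chosen/rejected pair (hypothetical helper).

    "boundary" is the index at which the rejected tokens begin.
    """
    chosen = np.asarray(chosen_ids)
    rejected = np.asarray(rejected_ids)
    return {
        "input_ids": np.concatenate((chosen, rejected)),
        "lengths": np.array([len(chosen), len(rejected)]),
        "boundary": len(chosen),
    }


example = combine_pair([5, 6, 7], [5, 9])
print(example["input_ids"].tolist(), example["boundary"])  # [5, 6, 7, 5, 9] 3
```

Because every packed item holds both halves of a pair, the pack size has to accommodate two truncated sequences at once.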


You can read more about packing algorithms `here <https://en.wikipedia.org/wiki/Bin_packing_problem#Offline_algorithms>`_. Currently, two variants of ``first_fit`` are supported:

#. ``first_fit_decreasing``: sorts the sequences in decreasing order of length before applying the first-fit algorithm. It generates a tighter packing, but it tends to keep all short sequences together, which may affect convergence.
#. ``first_fit_shuffle``: runs first-fit in a random order. Packing is less efficient, but the dataset order stays random.

We recommend running ``first_fit_shuffle`` first and checking the packed sequence lengths. If they are close to the target length (i.e., packing is efficient), use shuffle. Otherwise, try ``first_fit_decreasing``.
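
To make the difference between the two variants concrete, here is a minimal, self-contained sketch of first-fit over sequence lengths. It is an illustration only and is not the implementation in ``nemo.utils.sequence_packing_utils``; the function names and the ``seed`` parameter below are assumptions chosen for this example.

```python
import random


def first_fit(lengths, pack_size):
    """Put each length into the first bin with room, else open a new bin."""
    bins = []
    for length in lengths:
        if length > pack_size:
            raise ValueError(f"length {length} exceeds pack size {pack_size}")
        for current in bins:
            if sum(current) + length <= pack_size:
                current.append(length)
                break
        else:
            # No existing bin had enough room.
            bins.append([length])
    return bins


def first_fit_decreasing(lengths, pack_size):
    """Sort longest-first, then apply first-fit."""
    return first_fit(sorted(lengths, reverse=True), pack_size)


def first_fit_shuffle(lengths, pack_size, seed=0):
    """Shuffle with a fixed seed, then apply first-fit."""
    shuffled = list(lengths)
    random.Random(seed).shuffle(shuffled)
    return first_fit(shuffled, pack_size)


print(first_fit_decreasing([3, 5, 1, 4, 2, 3], pack_size=9))
# [[5, 4], [3, 3, 2, 1]]
```

In the decreasing variant, the short sequences end up grouped in the later bins, which is the clustering effect described above. The shuffled variant mixes lengths across bins at the cost of potentially leaving more unused space.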


The following is an example of running the packing script to prepare your DPO dataset:

.. code-block:: bash

   python examples/nlp/data/dpo/prepare_packed_dpo_dataset.py \
      model.data.data_prefix=/path/to/training.jsonl \
      +model.encoder_seq_length=2048 \
      +tokenizer_path=/path/to/tokenizer/model \
      +output_dir=/path/to/output_folder \
      +pack_sizes=[4096] \
      +tokenizer_type=<huggingface or sentencepiece> \
      [ +packing_algorithm=first_fit_shuffle \ ]
      [ ++model.seed=0 ]

Because this script packs chosen and rejected sequences together, each value in ``pack_sizes`` should always be at least double ``model.encoder_seq_length``.
When running training using the packed dataset, ``model.encoder_seq_length`` should be set to the pack size that was used to create the packed dataset.

To use the packed dataset during training, add the following line to your train command:

.. code-block:: bash

   ++model.data.data_impl=packed_jsonl

A few notes to keep in mind when running training with sequence packing:

#. Make sure to pack your train, validation, and test datasets.
#. Sequence packing can only be run with a micro batch size of 1.
#. Sequence packing is supported via Transformer Engine, so be sure to enable Transformer Engine in your config by setting ``++model.transformer_engine=True``.
#. Sequence packing increases the number of examples processed per global batch. To compensate, set the new global batch size to approximately ``unpacked_global_batch_size / avg_num_sequences_per_pack``. The average number of sequences per pack is printed to stdout after ``prepare_packed_dpo_dataset.py`` completes.
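
The batch size adjustment in the last note is simple arithmetic, sketched below. The helper names are hypothetical and not part of NeMo-Aligner. The sketch assumes that one "sequence" means one chosen/rejected pair and that each packed item stores a ``seq_boundaries`` list with a leading ``0`` followed by two entries per pair, as written by ``prepare_packed_dpo_dataset.py``. The optional ``multiple_of`` argument is an added convenience for setups where the global batch size must be a multiple of some value.

```python
import numpy as np


def load_packed(path):
    """Load a packed .npy file; it stores dicts, so pickling must be allowed."""
    return np.load(path, allow_pickle=True)


def avg_sequences_per_pack(packed_examples):
    """Average number of chosen/rejected pairs per packed sequence."""
    # seq_boundaries = [0, end_chosen_1, end_rejected_1, end_chosen_2, ...]
    counts = [(len(ex["seq_boundaries"]) - 1) // 2 for ex in packed_examples]
    return float(np.mean(counts))


def scaled_global_batch_size(unpacked_gbs, avg_per_pack, multiple_of=1):
    """Approximate unpacked_gbs / avg_per_pack, rounded to a multiple."""
    target = unpacked_gbs / avg_per_pack
    return max(multiple_of, round(target / multiple_of) * multiple_of)


print(scaled_global_batch_size(64, 4.0))  # 16
```

With an unpacked global batch size of 64 and an average of 4 pairs per pack, the packed run would use a global batch size of about 16, so each optimizer step still sees roughly 64 pairs.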


Begin Training
==============

Once your data is processed into the correct format, you are ready to begin DPO training. You must start with a pretrained or SFT trained model. For this section, we will use the SFT model trained in the previous step to train the DPO model.
For the purposes of the following sections, we assume that your training ``.jsonl`` file is located in ``/path/to/train_dpo_format.jsonl`` and your validation ``.jsonl`` file is located in ``/path/to/valid_dpo_format.jsonl``.

270 changes: 270 additions & 0 deletions examples/nlp/data/dpo/prepare_packed_dpo_dataset.py
@@ -0,0 +1,270 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import os
from dataclasses import dataclass
from typing import TYPE_CHECKING, Dict, List, Tuple

import numpy as np
import torch
from tqdm import tqdm

from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset import GPTSFTDataset
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.sequence_packing_utils import create_hist, create_packing_strategy
from nemo_aligner.data.nlp.builders import build_train_valid_test_dpo_datasets
from nemo_aligner.data.nlp.datasets import DPOModelDataset

if TYPE_CHECKING:
    from omegaconf import DictConfig

"""
Script to prepare a packed dataset from a DPO dataset in the jsonl format.
Three main steps are run in this script:
1. The online processing code in DPOModelDataset is run (including prompt template manipulation,
   sequence length truncation, tokenization, etc.). The result is an array of tokenized sequences,
   represented by indices.
2. Chosen and rejected sequences are concatenated for each example.
3. The sequences are grouped by length, and a packing algorithm is run.
   (https://en.wikipedia.org/wiki/Bin_packing_problem#Offline_algorithms)
   Currently, two variants of "first fit" are supported.
   "first_fit_decreasing" sorts the sequences in decreasing order before applying first-fit.
   It generates a tighter packing, but it tends to keep all short sequences together, which may affect convergence.
   "first_fit_shuffle" runs first-fit in a random order. Packing is less efficient, but it keeps the dataset order random.
   The recommendation is to run "first_fit_shuffle" and check the packed sequence lengths in the printout.
   If they are similar to the target length (i.e. packing is efficient), then use shuffle. Otherwise try first_fit_decreasing.

Example usage:

python examples/nlp/data/dpo/prepare_packed_dpo_dataset.py \
   model.data.data_prefix=/path/to/training.jsonl \
   model.encoder_seq_length=1024 \
   +tokenizer_path=<see note 1 below> \
   +tokenizer_type=sentencepiece \
   +output_dir=/path/to/output_folder \
   +pack_sizes=[2048,4096,8192]

Note:
- Tokenizer path supports SentencePiece tokenizer and HF tokenizer.
  For SentencePiece tokenizer, specify the file /path/to/tokenizer.model
  For HF tokenizer, specify a folder /path/to/hf_folder which contains tokenizer.json, tokenizer_config.json
  and special_tokens_map.json or the HF name of the tokenizer to use (e.g. "meta-llama/Meta-Llama-3-8B")
- If your model or dataset requires non-default configs for DPO training in NeMo, you will
  need to pass in the same configs to ``model.data`` as you would for training with an unpacked dataset.
- ``model.encoder_seq_length`` is the length to truncate each sequence to before packing multiple sequences
  to the size of the packed sequence (``pack_size``).
- ``pack_sizes`` is a list of packed sequence lengths. In this example, there will be three output files, one for
  each pack size. The output files are named ``<output_folder>/packed_{pack_size}_seed{seed}.npy``.
  This argument is a list because you will likely want to experiment with a few ``pack_sizes`` to find out which length
  can fill the GPU memory without exceeding it. Adjusting ``pack_size`` is analogous to adjusting the micro batch size in
  the unpacked case.
- **important**: ``pack_sizes`` should be at least double the value of model.encoder_seq_length in order to guarantee
  that chosen and rejected sequences for a given example can be packed together.
"""


def tokenize_dataset(cfg: "DictConfig", tokenizer_type):
    """
    Tokenizes a dataset using the same configuration file as DPOModelDataset.
    This function reads a dataset and tokenizes it based on the provided configuration.

    Args:
        cfg: A Hydra configuration object containing parameters for tokenization.
        tokenizer_type: The tokenizer library to use, either "huggingface" or "sentencepiece".

    Returns:
        A NumPy array of dicts, one per example, holding the concatenated chosen and rejected
        token ids along with their labels, rewards, lengths, and the chosen/rejected boundary.
    """

    logging.info("Tokenizing dataset...")

    if tokenizer_type == "huggingface":
        # pass in either a local Hugging Face folder which contains tokenizer.json or a path to the tokenizer on huggingface
        tokenizer = get_nmt_tokenizer(library="huggingface", model_name=cfg.tokenizer_path, use_fast=True)
    elif tokenizer_type == "sentencepiece":
        tokenizer = get_nmt_tokenizer(library="sentencepiece", tokenizer_model=cfg.tokenizer_path)
    else:
        raise ValueError(f"unsupported tokenizer type {tokenizer_type}")

    with open(cfg.model.data.data_prefix, "r", encoding="utf_8") as fr:
        data_payload = [json.loads(line.strip()) for line in fr]
    documents = np.arange(len(data_payload), step=1, dtype=np.int32)
    dataset = DPOModelDataset(
        cfg=cfg.model,
        name="packing_dataset",
        tokenizer=tokenizer,
        data_prefix=cfg.model.data.data_prefix,
        documents=documents,
        data=data_payload,
        seq_length=cfg.model.data.seq_length,
        seed=cfg.model.get("seed", 1234),
        drop_last=True,  ## False not currently supported
        pad_chosen_rejected_to_max=False,
    )

    combined_dataset = []
    for item in dataset:
        if item["ignore_example"]:
            continue
        # chosen and rejected are stored back to back; "boundary" marks where rejected starts
        input_ids = torch.cat((item["chosen"], item["rejected"])).numpy()
        labels = torch.cat((item["chosen_labels"], item["rejected_labels"])).numpy()
        reward = torch.tensor([item["chosen_reward"], item["rejected_reward"]]).numpy()
        boundary = len(item["chosen"])
        lengths = np.array([item["chosen_length"], item["rejected_length"]])
        new_item = {
            "input_ids": input_ids,
            "labels": labels,
            "reward": reward,
            "lengths": lengths,
            "boundary": boundary,
        }
        combined_dataset.append(new_item)

    return np.array(combined_dataset)


## modified version of https://github.com/NVIDIA/NeMo/blob/main/nemo/utils/sequence_packing_utils.py#L178 for DPO
## pack size should be at least 2*encoder_seq_length since the packed sequences include both the chosen and rejected sequences
## for a given example
def fill_packing_strategy(
    assignments: List[List[int]], sequences: Dict[int, List[Dict]], pack_size: int
) -> List[Dict]:
    """
    Fills the packing strategy with actual sequence data based on assignments and sequence information.
    This function takes the assignments generated by the packing algorithm (containing sequence length indices),
    the original sequences data, and the pack size. It iterates through the assignments, retrieves the corresponding
    sequences from the sequences dictionary, and constructs the final output data structure with input IDs, labels,
    rewards, lengths, and the chosen/rejected boundaries for each sequence in a packed sequence.

    Args:
        assignments: A list of lists, where each inner list represents a bin and contains the indices of the
            sequence lengths assigned to that bin (output of 'create_packing_strategy').
        sequences: A dictionary where keys are sequence lengths and values are lists of corresponding sequences
            from the dataset (output of 'create_hist').
        pack_size: The maximum capacity of each bin.

    Returns:
        output_data: A list of dictionaries, where each dictionary represents a packed sequence with its input IDs,
            labels, rewards, lengths, and sequence boundaries.
    """
    ifile_handles = dict()
    for seq_len in tqdm(range(pack_size + 1)):
        per_seq_data = sequences[seq_len]
        if len(per_seq_data) > 0:
            # shuffle the examples that share this length
            perm = np.random.permutation(len(per_seq_data))
            input_ids = np.array([x["input_ids"] for x in per_seq_data])[perm].tolist()
            labels = np.array([x["labels"] for x in per_seq_data])[perm].tolist()
            reward = np.array([x["reward"] for x in per_seq_data])[perm].tolist()
            lengths = np.array([x["lengths"] for x in per_seq_data])[perm].tolist()
            boundary = np.array([x["boundary"] for x in per_seq_data])[perm].tolist()

            ifile_handles[seq_len] = (input_ids, labels, reward, lengths, boundary)

    input_ids, labels, reward, lengths, seq_boundaries = {}, {}, {}, {}, {}

    for oindex, assignment in tqdm(enumerate(assignments), total=len(assignments)):
        _input_ids, _labels, _reward, _lengths, _seq_boundaries = [], [], [], [], [0]

        for seq_length in assignment:

            previous_seq_len = len(_input_ids)

            _input_ids.extend(ifile_handles[seq_length][0].pop())
            _labels.extend(ifile_handles[seq_length][1].pop())
            _reward.extend(ifile_handles[seq_length][2].pop())
            _lengths.extend(ifile_handles[seq_length][3].pop())

            ## store the boundaries for the chosen, rejected sequences
            _seq_boundaries.append(previous_seq_len + ifile_handles[seq_length][4].pop())
            _seq_boundaries.append(len(_input_ids))

        input_ids[oindex] = _input_ids
        labels[oindex] = _labels
        reward[oindex] = _reward
        lengths[oindex] = _lengths
        seq_boundaries[oindex] = _seq_boundaries

    output_data = []
    for i in range(len(input_ids)):
        item_dict = {
            "input_ids": input_ids[i],
            "labels": labels[i],
            "reward": reward[i],
            "lengths": lengths[i],
            "seq_boundaries": seq_boundaries[i],
        }
        output_data.append(item_dict)

    # (input_ids, labels, reward, lengths, boundary) = length 5
    for i in range(5):
        assert all(
            not seq[i] for seq in ifile_handles.values()
        ), "Error: There are items left over from the assignment"
    return output_data


@dataclass
class PackingArgs:
    output_dir: str = "output"
    pack_sizes: Tuple[int, ...] = (2048,)
    packing_algorithm: str = "first_fit_shuffle"
    tokenizer_type: str = "sentencepiece"  ## one of "huggingface" or "sentencepiece"

    def from_config(self, cfg: "DictConfig"):
        for required_arg in ("output_dir", "pack_sizes"):
            assert cfg.get(required_arg, None), f"Please specify +{required_arg}=..."
        self.output_dir = cfg.output_dir
        self.pack_sizes = cfg.pack_sizes
        self.packing_algorithm = cfg.get("packing_algorithm", "first_fit_shuffle")
        self.tokenizer_type = cfg.tokenizer_type
        return self


@hydra_runner(config_path="../../gpt/conf", config_name="gpt_dpo")
def main(cfg: "DictConfig") -> None:
    args = PackingArgs().from_config(cfg)
    dataset = tokenize_dataset(cfg, args.tokenizer_type)
    sequences, histogram = create_hist(
        dataset, 2 * cfg.model.data.seq_length
    )  ## multiply by 2 because packed sequences include chosen and rejected
    for pack_size in args.pack_sizes:
        assignments = create_packing_strategy(histogram, pack_size, args.packing_algorithm)
        output_data = fill_packing_strategy(assignments, sequences, pack_size)

        # save output data
        os.makedirs(args.output_dir, exist_ok=True)
        output_path = os.path.join(args.output_dir, f"packed_{pack_size}_seed{cfg.model.get('seed', 1234)}.npy")
        np.save(output_path, output_data)
        logging.info(f"Done, output written to {output_path}")

    logging.info(
        f"""
✅ Packed datasets with pack sizes {args.pack_sizes} are prepared successfully.
To train with packed sequences, you need to make changes to the DPO config file.
See the NeMo-Aligner sequence packing documentation for more details:
https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/dpo.rst#sequence-packing-with-dpo
"""
    )


if __name__ == "__main__":
    main()