🚅 How To Train

AD Training

Step 1: Download Training Data

Download nuScenes dataset and place it into /path/to/nuscenes

Step 2: Prepare Data in Specified Format

Generate per-scene resized (and optionally FOV-adjusted) frames

Run this command to save resized (and optionally adjusted) CAM_FRONT frames for every scene.

```bash
python -m data.nuscenes.generate_scene_frames_folders --dataroot /path/to/nuscenes --version v1.0-trainval --cam-type CAM_FRONT --output-path /path/to/train/data --scale-factor 0.4 --save-adjusted-fov --fov-from 120 --fov-to 94

# Example
python -m data.nuscenes.generate_scene_frames_folders --dataroot /mnt/d/AD/datasets/nuscenes --version v1.0-trainval --cam-type CAM_FRONT --output-path /mnt/d/z004x7dn/datasets/nuscenes/scenes_frames --scale-factor 0.4 --save-adjusted-fov --fov-from 120 --fov-to 94
```
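
For intuition on the `--save-adjusted-fov` options: under a pinhole camera model, reducing the horizontal FOV from 120° to 94° corresponds to a center crop that keeps roughly 62% of the image width. The sketch below only illustrates this geometry; it is not necessarily the exact implementation used by `generate_scene_frames_folders`.

```python
import math

from PIL import Image


def fov_center_crop(img: Image.Image, fov_from_deg: float, fov_to_deg: float) -> Image.Image:
    """Center-crop horizontally so the cropped frame spans the narrower field of view.

    Pinhole model: image_width = 2 * focal_length * tan(fov / 2), so with a fixed focal
    length the width ratio is tan(fov_to / 2) / tan(fov_from / 2).
    """
    width, height = img.size
    ratio = math.tan(math.radians(fov_to_deg) / 2) / math.tan(math.radians(fov_from_deg) / 2)
    new_width = int(round(width * ratio))
    left = (width - new_width) // 2
    return img.crop((left, 0, left + new_width, height))


# For --fov-from 120 --fov-to 94: tan(47°) / tan(60°) ≈ 0.62, i.e. ~62% of the width is kept.
# cropped = fov_center_crop(Image.open("CAM_FRONT_frame.jpg"), 120, 94)
```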

Generate training segments and captions

Run this command to generate the training segments (.mp4 files) and captions (a CSV file saved under ./sample_data):

```bash
python -m data.nuscenes.data_preparation --dataroot /path/to/nuscenes --version v1.0-trainval --cam-type CAM_FRONT --json-path /path/to/nuscenes/predictions/mllm/results_nusc_mllm.json --input-path /path/to/nuscenes/scenes_frames --output-path /path/to/nuscenes/scenes_videos_segments --use-adjusted-fov --generate-segments --augment-captions --csv-filename video_captions_nuscenes.csv --segment-length 16

# Example
python -m data.nuscenes.data_preparation --dataroot /mnt/d/AD/datasets/nuscenes --version v1.0-trainval --cam-type CAM_FRONT --json-path /mnt/d/z004x7dn/datasets/nuscenes/predictions/mllm/results_nusc_mllm.json --input-path /mnt/d/z004x7dn/datasets/nuscenes/scenes_frames --output-path /mnt/d/z004x7dn/datasets/nuscenes/scenes_videos_segments --use-adjusted-fov --generate-segments --augment-captions --csv-filename video_captions_nuscenes.csv --segment-length 16
```
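
As a rough mental model of what this step produces, the sketch below chunks a scene's frames into non-overlapping 16-frame .mp4 segments and records one caption row per segment. The CSV column names (`video_path`, `caption`), the fps value, and the use of imageio are illustrative assumptions; the authoritative layout is whatever `data_preparation` actually writes to ./sample_data.

```python
import csv
import glob
import os

import imageio.v2 as imageio  # assumes imageio + imageio-ffmpeg are installed for mp4 output

SEGMENT_LENGTH = 16  # matches --segment-length 16


def write_scene_segments(scene_dir: str, out_dir: str, caption: str, csv_writer) -> None:
    """Split one scene's frames into 16-frame segments and log a (video_path, caption) row each."""
    frames = sorted(glob.glob(os.path.join(scene_dir, "*.jpg")))
    os.makedirs(out_dir, exist_ok=True)
    for start in range(0, len(frames) - SEGMENT_LENGTH + 1, SEGMENT_LENGTH):
        segment_path = os.path.join(out_dir, f"{os.path.basename(scene_dir)}_{start:04d}.mp4")
        clip = [imageio.imread(path) for path in frames[start:start + SEGMENT_LENGTH]]
        imageio.mimwrite(segment_path, clip, fps=12)  # fps is a guess, not from the guideline
        csv_writer.writerow([segment_path, caption])


with open("video_captions_nuscenes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video_path", "caption"])  # hypothetical header; mirror the real sample CSV
    # write_scene_segments("/path/to/scenes_frames/scene-0001", "/path/to/segments", "...", writer)
```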

Step 3: Run Training

Here is the command we used to start training on I2VGen-XL with a segmentation map as the control condition on nuScenes driving scenes. The training scripts for I2VGen-XL and SVD are roughly the same.

```bash
sh train_scripts/i2vgenxl/i2vgenxl_train_segmentation_nuscenes.sh
```

Specifically, in the training scripts:

--yaml_file: The configuration file for all hyper-parameters related to training.

The rest of the hyper-parameters in the training script are for evaluation, which can help you monitor the training process better.

--save_n_steps: Save the trained adapter checkpoints every n training steps.

--save_starting_step: Start saving the trained adapter checkpoints only after this many training steps.

--validate_every_steps: Perform evaluation every x training steps. The evaluation data are placed under ./assets/evaluation. If you prefer to evaluate different samples, you can replace them by following the same file structure.

--num_inference_steps: The number of denoising steps used during evaluation inference. You can simply set it to the backbone model's default number of inference steps.

--extract_control_conditions: If you already have condition images/frames extracted from the evaluation images/videos (see the Inference Data Structure section above), set it to False. If you only have the raw images/frames, set it to True and our code will automatically extract the control conditions from them. The default setting is False.

--control_guidance_end: As mentioned above, this is the most important parameter for balancing generated image/video quality against control strength. Since at this stage we only want to verify that the training code works, we recommend setting it to 1.0 so that control is applied across all inference steps; a short illustration follows below. You can lower the value later once you have a trained model.
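
To illustrate the role of --control_guidance_end: in diffusers-style pipelines, a fractional end value typically gates how long the control condition stays active during denoising. This is a conceptual sketch, not the repository's actual pipeline code.

```python
def control_is_active(step_index: int, num_inference_steps: int, control_guidance_end: float) -> bool:
    """Return True while the control condition should still be applied at this denoising step."""
    return (step_index + 1) / num_inference_steps <= control_guidance_end


# control_guidance_end = 1.0 -> control applied at all 50 of 50 inference steps (recommended while debugging training)
# control_guidance_end = 0.6 -> control applied only for the first 30 of 50 steps (weaker control, often better quality)
```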

Ctrl-Adapter

Step 1: Download Training Data

  • For Ctrl-Adapter training on image backbones (e.g., SDXL), we use 300k images from the LAION POP dataset. You can download a subset of this dataset here.

  • For Ctrl-Adapter training on video backbones (e.g., I2VGen-XL, SVD), we download a subset of videos (around 1.5M) from the 10M training set of the Panda70M dataset. You can follow their instructions to download the dataset. To ensure the videos contain enough movement, we further filter out videos whose optical flow score falls below a threshold of 0.3 (a sketch of one possible filtering approach is shown below).

⚠️ The dataset we used here might not be optimal for all control conditions. We recommend that users train our Ctrl-Adapter on the dataset that best suits their use cases.
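
If you want to reproduce the motion filtering, a sketch along the following lines can work. The exact flow estimator and normalization behind the 0.3 threshold are not specified in this guideline, so the Farneback-based mean flow magnitude below is only one plausible stand-in; calibrate the threshold to whatever score you actually compute.

```python
import cv2
import numpy as np


def mean_flow_score(video_path: str, stride: int = 8, max_pairs: int = 16) -> float:
    """Average optical-flow magnitude over a few sampled frame pairs (illustrative scoring only)."""
    cap = cv2.VideoCapture(video_path)
    prev_gray, magnitudes, frame_idx = None, [], 0
    while len(magnitudes) < max_pairs:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                magnitudes.append(float(np.linalg.norm(flow, axis=-1).mean()))
            prev_gray = gray
        frame_idx += 1
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0


# keep = mean_flow_score("clip.mp4") >= 0.3  # threshold from the guideline; scale depends on the scorer used
```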

Step 2: Prepare Data in Specified Format

We provide some sample training images/videos under the folder ./sample_data.

  • For an image dataset, put the raw png/jpg images under the folder ./sample_data/images and create a CSV file with the same rows/columns format as ./sample_data/image_captions.csv.

  • For a video dataset, put the raw mp4 videos under the folder ./sample_data/videos and create a CSV file with the same rows/columns format as ./sample_data/video_captions.csv (a quick format check is sketched below).
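
Before pointing the configs at your own caption file, it is worth confirming that it matches the sample layout; a minimal check could look like this (the path of your own CSV is a placeholder):

```python
import pandas as pd

# Compare your caption file against the provided sample so the data loader sees identical columns.
sample = pd.read_csv("./sample_data/video_captions.csv")
mine = pd.read_csv("/path/to/your_video_captions.csv")  # placeholder path: replace with your own file

assert list(mine.columns) == list(sample.columns), (
    f"expected columns {list(sample.columns)}, got {list(mine.columns)}"
)
print(sample.head())
```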

If you want to use a different input data format, you can modify ./utils/data_loader.py to suit your needs.

If your data is stored in different paths, you can change the train_data_path and train_prompt_path in the training configuration files listed under ./configs.

In addition, please set the DATA_PATH in the training configuration files to the path where you want all training checkpoints to be stored.
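
A quick sanity check of these keys before launching training might look like the following; the configuration filename is a hypothetical example, so substitute whichever file under ./configs you actually train with.

```python
import yaml  # PyYAML

cfg_path = "./configs/i2vgenxl_train_segmentation_nuscenes.yaml"  # hypothetical filename under ./configs

with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# The guideline asks you to point these at your data, captions, and checkpoint locations.
for key in ("train_data_path", "train_prompt_path", "DATA_PATH"):
    print(f"{key}: {cfg.get(key, '<missing>')}")
```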

Step 3: Control Conditions Extractors

To simplify the start-up process, our codebase automatically performs all control condition extractions during training. This eliminates the need for pre-processing the control images from input images/videos! (Note that this may slightly reduce training speed, but it is generally worthwhile since the Ctrl-Adapter converges quite rapidly.)

Most of our control condition extractors are directly utilized from the transformers or controlnet_aux libraries. You can check ./model/ctrl_helper.py to see the extractors used during training. If you wish to use different condition extractors, you can modify the python script accordingly.
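
For reference, extracting conditions with controlnet_aux looks roughly like the snippet below. Which extractors the training loop actually instantiates is defined in ./model/ctrl_helper.py; Canny and HED here are just common examples from that library.

```python
from PIL import Image
from controlnet_aux import CannyDetector, HEDdetector

image = Image.open("frame.jpg").convert("RGB")

canny = CannyDetector()  # no pretrained weights needed
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")  # downloads annotator weights from the Hub

canny_condition = canny(image)  # edge map used as the control condition
hed_condition = hed(image)      # soft-edge map used as the control condition
```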

⚠️ One major change we made concerns the depth estimator: we found that the default depth estimator from the transformers library is relatively slow, so we recommend using dpt_swin2_large_384 from the MiDaS library for depth estimation. We have already added code under the ./utils folder to use this depth estimator. All you need to do is download the dpt_swin2_large_384 checkpoint and place it at {DATA_PATH}/ckpts/DepthMidas/dpt_swin2_large_384.pt.
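
A tiny check that the checkpoint ended up where the code expects it (DATA_PATH below stands for the same value you set in the training configuration):

```python
import os

DATA_PATH = "/path/to/data"  # same DATA_PATH as in your training configuration file
midas_ckpt = os.path.join(DATA_PATH, "ckpts", "DepthMidas", "dpt_swin2_large_384.pt")

assert os.path.isfile(midas_ckpt), f"MiDaS checkpoint not found at {midas_ckpt}"
print(f"Found MiDaS depth checkpoint: {midas_ckpt}")
```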

Step 4: Set Up Training Scripts

All training configuration files and training scripts are placed under ./configs and ./train_scripts, respectively.

4.1 Controllable Image/Video Generation

Here is the command we used to start training on SDXL with a depth map as the control condition. The training scripts for I2VGen-XL and SVD are roughly the same.

```bash
sh train_scripts/sdxl/sdxl_train_depth.sh
```

Specifically, in the training scripts:

--yaml_file: The configuration file for all hyper-parameters related to training.

The rest of the hyper-parameters in the training script are for evaluation, which can help you monitor the training process better.

--save_n_steps: Save the trained adapter checkpoints every n training steps.

--save_starting_step: Start saving the trained adapter checkpoints only after this many training steps.

--validate_every_steps: Perform evaluation every x training steps. The evaluation data are placed under ./assets/evaluation. If you prefer to evaluate different samples, you can replace them by following the same file structure.

--num_inference_steps: The number of denoising steps used during evaluation inference. You can simply set it to the backbone model's default number of inference steps.

--extract_control_conditions: If you already have condition images/frames extracted from the evaluation images/videos (see the Inference Data Structure section above), set it to False. If you only have the raw images/frames, set it to True and our code will automatically extract the control conditions from them. The default setting is False.

--control_guidance_end: As mentioned above, this is the most important parameter for balancing generated image/video quality against control strength. Since at this stage we only want to verify that the training code works, we recommend setting it to 1.0 so that control is applied across all inference steps. You can lower the value later once you have a trained model.

4.2 Multi-Condition Control

Here is the command we used to do multi-condition control training on I2VGen-XL.

```bash
sh train_scripts/i2vgenxl/i2vgenxl_train_multi_condition.sh
```

Please note that we currently only support I2VGen-XL for multi-condition control. If you are interested in trying it with other backbones, you can modify the code accordingly.

In the training configuration file, here are the hyper-parameters specific to multi-condition control:

--control_types: You can put the list of control conditions you want here, such as [depth, canny, normal, segmentation].

--router_type: We currently support equal weight and linear weight routers in our codebase. The definitions of the different router types are given in our paper.

--multi_source_random_select_control_types and --max_num_multi_source_train: Since we need to load a ControlNet for each control type in the --control_types list above, the training code will run out of memory if there are too many control conditions. Therefore, we add a binary hyper-parameter --multi_source_random_select_control_types that, at each training step, randomly selects k control conditions with k drawn from the range [1, --max_num_multi_source_train]; see the sketch below. If the training script runs without going out of memory on your GPUs, you can simply set --multi_source_random_select_control_types to False.
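
To make the per-step selection concrete, here is a sketch of the behaviour described above. The equal-weight fusion is just a literal reading of the "equal weight" router; see the paper for the precise router definitions.

```python
import random

control_types = ["depth", "canny", "normal", "segmentation"]  # --control_types
max_num_multi_source_train = 2                                # --max_num_multi_source_train


def sample_control_types(randomly_select: bool) -> list:
    """Pick the control conditions used for the current training step."""
    if not randomly_select:  # --multi_source_random_select_control_types set to False
        return control_types  # every listed condition (and its ControlNet) is used each step
    k = random.randint(1, max_num_multi_source_train)
    return random.sample(control_types, k)


def equal_weight_router(controlnet_features: list):
    """'Equal weight' routing read literally: average the per-condition ControlNet features."""
    return sum(controlnet_features) / len(controlnet_features)
```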

Step 5: Train Ctrl-Adapter on a New Backbone Model (Optional)

To train Ctrl-Adapter on a new backbone model, these are the main steps you need to take:

  • Create a new folder (similar to i2vgenxl) and put the unet and pipeline files for the new backbone under this repo.

  • Modify train.py, inference.py, ./utils/data_loader.py (and sometimes ./model/ctrl_adapter.py). We have already highlighted the code that needs your attention in these files with ### markers. You can modify/add code by following those instructions.

  • Create new training scripts, inference scripts, and configuration files.

🚩 Now you are ready to start training!! 😆