Building on the maitrix-org/Pandora project on GitHub, we have open-sourced the training code and models for Pandora. Training has two main stages: alignment and finetuning. We have also released the latest Pandora model weights, which were trained for 600,000 steps on the WebVid dataset.
You can control the model in real time with text prompts; it currently supports 5 rounds of autoregressive prediction to generate 10-second videos. Alternatively, you can generate a single video.
- [2024/09/24] 🎉 We have released the first version of the model weights, available on Hugging Face. This model can be directly used for inference on the original Pandora project.
- [2024/09/24] 🎉 The training code for the alignment and finetuning stages is available.
- [2024/09/24] 🎉 Supports video output at 576×1024 resolution.
conda create -n pandora python=3.11.0
conda activate pandora
conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -U xformers==0.0.24+cu121 --index-url https://download.pytorch.org/whl/cu121
bash build_envs.sh
If your GPU doesn't support CUDA 12.1, you can also install with CUDA 11.8:
conda create -n pandora python=3.11.0
conda activate pandora
conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -U xformers==0.0.24+cu118 --index-url https://download.pytorch.org/whl/cu118
bash build_envs.sh
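After either install path, it can help to confirm that the pinned versions actually ended up in the environment. The following is a minimal sketch, not part of the repository: the `check_versions` helper and the `EXPECTED` table are hypothetical names, and the version numbers are copied from the install commands above. It assumes the packages expose standard Python package metadata, which may not hold for every conda build.

```python
from importlib import metadata

# Version pins copied from the install commands above. Local build tags such
# as "+cu121" or "+cu118" are ignored when comparing.
EXPECTED = {
    "torch": "2.2.2",
    "torchvision": "0.17.2",
    "torchaudio": "2.2.2",
    "xformers": "0.0.24",
}


def check_versions(expected=EXPECTED):
    """Return {package: status}; status is 'ok', 'missing', or 'found <version>'."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Drop the local build tag (everything after "+") before comparing.
        base = installed.split("+")[0]
        report[name] = "ok" if base == wanted else f"found {installed}"
    return report


if __name__ == "__main__":
    for name, status in check_versions().items():
        print(f"{name}: {status}")
```

Run it inside the activated `pandora` environment; any line that is not `ok` points at a package to reinstall.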
- Download the model checkpoint from Hugging Face.
- Run the following command in your terminal:
CUDA_VISIBLE_DEVICES={cuda_id} python gradio_app.py --ckpt_path {path_to_ckpt}
You can then interact with the model through the Gradio interface.
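If you prefer to start the demo from a Python script, the same launch can be sketched as below. This is an illustration only: `build_gradio_launch` and `launch` are hypothetical helpers, not part of the repository. They reproduce the command above by setting `CUDA_VISIBLE_DEVICES` and passing `--ckpt_path`, and they assume the script runs from the repository root, where `gradio_app.py` lives.

```python
import os
import subprocess
import sys


def build_gradio_launch(cuda_id, ckpt_path):
    """Return (argv, env) equivalent to the shell command above."""
    # Copy the current environment and restrict the process to one GPU.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(cuda_id))
    argv = [sys.executable, "gradio_app.py", "--ckpt_path", str(ckpt_path)]
    return argv, env


def launch(cuda_id, ckpt_path):
    """Fail early if the checkpoint is missing, then start the Gradio app."""
    if not os.path.exists(ckpt_path):
        raise FileNotFoundError(f"checkpoint not found: {ckpt_path}")
    argv, env = build_gradio_launch(cuda_id, ckpt_path)
    return subprocess.run(argv, env=env, check=True)
```

Checking the checkpoint path before launching avoids waiting for the model code to load only to fail on a mistyped path.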
Before training, make sure you have downloaded our model locally. Set $MODEL_DIR to the model path and $HOST_GPU_NUM to the number of GPUs. Then run the following command to align the outputs of the Large Language Model (LLM) and the Text Encoder:
python3 -m torch.distributed.launch \
--nproc_per_node=$HOST_GPU_NUM --nnodes=1 --master_addr=127.0.0.1 --master_port=10042 --node_rank=0 \
trainer.py \
--model_path $MODEL_DIR \
--base config/config.yaml \
--train \
--do_alignment \
--logdir output/ckp \
--devices $HOST_GPU_NUM \
lightning.trainer.num_nodes=1
Then, use the following command to finetune the model to obtain the final version:
python3 -m torch.distributed.launch \
--nproc_per_node=$HOST_GPU_NUM --nnodes=1 --master_addr=127.0.0.1 --master_port=10042 --node_rank=0 \
trainer.py \
--model_path $MODEL_DIR \
--base config/config.yaml \
--train \
--logdir output/ckp \
--devices $HOST_GPU_NUM \
lightning.trainer.num_nodes=1
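The two commands above differ only in the `--do_alignment` flag. A small sketch that assembles both from the same inputs makes this explicit; `build_train_command` is a hypothetical helper, not part of the repository, and the `/path/to/model` default is a placeholder. All flags are copied verbatim from the commands above.

```python
import os
import shlex


def build_train_command(model_dir, gpu_num, do_alignment, master_port=10042):
    """Assemble the single-node launch command for one training stage."""
    cmd = [
        "python3", "-m", "torch.distributed.launch",
        f"--nproc_per_node={gpu_num}", "--nnodes=1",
        "--master_addr=127.0.0.1", f"--master_port={master_port}",
        "--node_rank=0",
        "trainer.py",
        "--model_path", str(model_dir),
        "--base", "config/config.yaml",
        "--train",
    ]
    if do_alignment:
        # Stage 1 only: align the LLM outputs with the text encoder.
        cmd.append("--do_alignment")
    cmd += [
        "--logdir", "output/ckp",
        "--devices", str(gpu_num),
        "lightning.trainer.num_nodes=1",
    ]
    return cmd


if __name__ == "__main__":
    model_dir = os.environ.get("MODEL_DIR", "/path/to/model")  # placeholder
    gpu_num = int(os.environ.get("HOST_GPU_NUM", "1"))
    for stage, flag in (("alignment", True), ("finetuning", False)):
        print(f"# {stage}")
        print(shlex.join(build_train_command(model_dir, gpu_num, flag)))
```

The script only prints the commands so they can be inspected before being run; pass the returned list to `subprocess.run` to execute a stage.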
The project is continuously improving, and we look forward to your contributions and participation.
- Repository: maitrix-org/Pandora
- Related Article: Pandora: Towards General World Model with Natural Language Actions and Video States
If you find our work useful in your research, please cite us using the following BibTeX entry:
@misc{OpenPandora2024,
author = {OpenSparseLLMs},
title = {{Open-Pandora: An Open World Video Generation Model}},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/OpenSparseLLMs/Open-Pandora}},
}