OutOfMemoryError: CUDA out of memory with 24GB. Is it possible to apply torchao quantization? #20
Comments
It is already fp16. We have also used tiled and sliced VAE decoding.
Okay, thank you. If I use 4x 24 GB GPUs, should it work, considering slicing is enabled, etc.?
Not sure. I think the VAE part is more likely to induce OOM. You can reduce the tile size (f, h, w) in VEnhancer/video_to_video/video_to_video_model_parallel.py, lines 172-174. You can also change the chunk size (the frame length for one chunk); it is currently set to 32, and you can use 24 or lower. But for frame lengths under 32, we only use one chunk. There are some restrictions in VEnhancer/video_to_video/video_to_video_model_parallel.py; please comment them out (see VEnhancer/video_to_video/utils/util.py, line 31 at 80ffaa3).
It's quite inconvenient; I will make these visible to users by providing more configuration parameters in the command script.
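For anyone who wants to try this before those parameters are exposed, the change amounts to editing a few literals. A minimal sketch, assuming the variables named later in this thread (frame_chunk_size, tile_img_height, tile_img_width, max_chunk_len) are the ones at those lines:

# in VEnhancer/video_to_video/video_to_video_model_parallel.py (around lines 172-174)
self.frame_chunk_size = 3    # frames decoded per VAE pass; lower = less VRAM
self.tile_img_height = 576   # spatial tile height for tiled VAE decoding
self.tile_img_width = 768    # spatial tile width for tiled VAE decoding

# chunk size for sampling: drop from 32 to 24 (or lower) to reduce peak memory
max_chunk_len = 24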
Thank you. While adjusting the chunk size, I ran a multi-GPU test:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 48C P0 216W / 300W | 20644MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 44C P0 220W / 300W | 20644MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 43C P0 223W / 300W | 20644MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 44C P0 215W / 300W | 20644MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

I got an OOM, but only after some processing:
The whole stack trace was:

vae/config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 609/609 [00:00<00:00, 7.14MB/s]
diffusion_pytorch_model.fp16.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 196M/196M [00:00<00:00, 385MB/s]
2024-09-17 16:09:34,646 - video_to_video - INFO - processing video 0, file_path: prompts/astronaut.mp4
2024-09-17 16:09:34,646 - video_to_video - INFO - text: An astronaut flying in space, featuring a steady and smooth perspective
2024-09-17 16:09:34,693 - video_to_video - INFO - input frames length: 16
2024-09-17 16:09:34,693 - video_to_video - INFO - input fps: 10.0
2024-09-17 16:09:34,693 - video_to_video - INFO - target_fps: 20.0
2024-09-17 16:09:34,733 - video_to_video - INFO - input resolution: (320, 512)
2024-09-17 16:09:34,733 - video_to_video - INFO - target resolution: (1214, 1942)
2024-09-17 16:09:34,733 - video_to_video - INFO - noise augmentation: 250
2024-09-17 16:09:34,733 - video_to_video - INFO - scale s is set to: 8
2024-09-17 16:09:34,781 - video_to_video - INFO - video_data shape: torch.Size([31, 3, 1214, 1942])
[... the same initialization log lines are repeated by the other three GPU ranks ...]
2024-09-17 16:10:00,041 - video_to_video - INFO - step: 0
2024-09-17 16:10:00,863 - video_to_video - INFO - step: 0
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:10:00,871 - video_to_video - INFO - step: 0
2024-09-17 16:10:00,887 - video_to_video - INFO - step: 0
2024-09-17 16:10:01,277 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:10:13,071 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
[... the same per-step log lines repeat for steps 1 through 13, identical apart from timestamps ...]
2024-09-17 16:15:17,860 - video_to_video - INFO - sampling, finished.
2024-09-17 16:15:17,935 - video_to_video - INFO - sampling, finished.
2024-09-17 16:15:17,935 - video_to_video - INFO - sampling, finished.
2024-09-17 16:15:17,937 - video_to_video - INFO - sampling, finished.
It seems that you have already finished sampling for the diffusion part, so the OOM is caused by VAE decoding.
For example, you can make these modifications:

self.frame_chunk_size = 3
self.tile_img_height = 576
self.tile_img_width = 768
Would those modifications reduce the quality of the output, or just slow down processing?
I don't see obvious quality loss, but I have only tested several samples.
@hejingwenhejingwen 🥇 it worked! (Attached: iron_man.mp4, astronaut.mp4)
Thank you. I have applied these modifications. I passed max_chunk_len through:

def make_chunks(f_num, interp_f_num, chunk_overlap_ratio=0.5, max_chunk_len=32):
    MAX_O_LEN = max_chunk_len * chunk_overlap_ratio
    chunk_len = int((max_chunk_len - 1) // (1 + interp_f_num) * (interp_f_num + 1) + 1)
    o_len = int((MAX_O_LEN - 1) // (1 + interp_f_num) * (interp_f_num + 1) + 1)
    chunk_inds = sliding_windows_1d(f_num, chunk_len, o_len)
    return chunk_inds

in order to adjust it, for example to 24:

max_chunk_len = 24  # was 32
torch.cuda.empty_cache()
chunk_inds = make_chunks(frames_num, interp_f_num, max_chunk_len=max_chunk_len)

In the same way, as you suggested, after logger.info("sampling, finished.") I passed:

frame_chunk_size = 3
tile_img_height = 576
tile_img_width = 768
gen_video = self.tiled_chunked_decode(
    gen_vid,
    frame_chunk_size=frame_chunk_size,
    tile_img_height=tile_img_height,
    tile_img_width=tile_img_width,
)
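For reference, a self-contained sketch of how those chunk indices come out; sliding_windows_1d below is a hypothetical stand-in for the helper in VEnhancer/video_to_video/utils/util.py, written only to make the example runnable, not the project's actual implementation:

def sliding_windows_1d(length, window_len, overlap_len):
    # hypothetical helper: overlapping (start, end) windows covering `length` frames
    stride = window_len - overlap_len
    inds, start = [], 0
    while True:
        end = min(start + window_len, length)
        inds.append((start, end))
        if end >= length:
            break
        start += stride
    return inds

# with interp_f_num=1 and max_chunk_len=24: chunk_len=23, o_len=11, stride=12
print(make_chunks(97, 1, max_chunk_len=24))
# -> [(0, 23), (12, 35), (24, 47), (36, 59), (48, 71), (60, 83), (72, 95), (84, 97)]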
@hejingwenhejingwen, some other tests. I'm trying the CogVideoX generation now, and, as per your detailed description above, the OOM is caused by the higher number of frames (CogVideoX defaults to 49 frames),
and the image frames info:
In this case I get the OOM, I would say in the VAE decode step:
stacktrace:
In this case I tried to handle the 49 frames on 4x 24GB; the setup was to use a chunking of 4/12 frames per chunk, with max_chunk_len = 12:

max_chunk_len = 12
torch.cuda.empty_cache()
chunk_inds = make_chunks(frames_num, interp_f_num, max_chunk_len=max_chunk_len)

while I kept the decoding setup as:

frame_chunk_size = 3
tile_img_height = 576
tile_img_width = 768
gen_video = self.tiled_chunked_decode(
    gen_vid,
    frame_chunk_size=frame_chunk_size,
    tile_img_height=tile_img_height,
    tile_img_width=tile_img_width,
)

So, is there a rule of thumb for calculating the VRAM requirements from the image H and W, given FPS and STEPS, in order to set these parameters?
It is because of VAE encoding; I will add sliced encoding to avoid this.
Great, in fact I was looking at this implementation for the sliced encoding.
Thanks. Actually it is okay to process 31 frames, but it OOMs with 97 frames, so the problem is too many frames. We already encode the frames one by one, but it still OOMs. Besides sliced and tiled VAE encoding, you can make chunks of these frames and process each chunk separately, for both VAE encoding and all sampling steps. The existing code only makes chunks for the sampling steps; that is, all frames are split before denoising and then merged back together after denoising.
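A rough sketch of that fully chunked scheme (all names here are illustrative, not VEnhancer's actual API): each chunk of frames goes through VAE encoding, every denoising step, and VAE decoding on its own, so peak memory scales with the chunk length rather than the total frame count.

def process_in_chunks(frames, chunk_inds, vae_encode, denoise_all_steps, vae_decode):
    # frames: [F, C, H, W]; chunk_inds: list of (start, end) index pairs with overlap
    decoded_chunks = []
    for start, end in chunk_inds:
        chunk = frames[start:end]
        latent = vae_encode(chunk)          # encode only this chunk
        latent = denoise_all_steps(latent)  # run all sampling steps on this chunk
        decoded_chunks.append((start, end, vae_decode(latent)))
    # merging the overlapped frames (e.g., by linear blending) is left out for brevity
    return decoded_chunks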
Ok, thank you very much @hejingwenhejingwen! This is the sliced encode:

@apply_forward_hook
def encode(
    self, x: torch.Tensor, return_dict: bool = True
) -> Union[AutoencoderKLOutput, Tuple[DiagonalGaussianDistribution]]:
    """
    Encode a batch of images into latents.

    Args:
        x (`torch.Tensor`): Input batch of images.
        return_dict (`bool`, *optional*, defaults to `True`):
            Whether to return a [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] instead of a plain
            tuple.

    Returns:
        The latent representations of the encoded images. If `return_dict` is True, a
        [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] is returned, otherwise a plain `tuple` is
        returned.
    """
    # LP: added slicing - see https://github.com/Vchitect/VEnhancer/issues/20
    # h = self.encoder(x)
    if self.use_slicing and x.shape[0] > 1:
        # encode one batch element at a time to cap peak memory
        encoded_slices = [self.encoder(x_slice) for x_slice in x.split(1)]
        h = torch.cat(encoded_slices)
    else:
        h = self.encoder(x)
    moments = self.quant_conv(h)
    posterior = DiagonalGaussianDistribution(moments)
    if not return_dict:
        return (posterior,)
    return AutoencoderKLOutput(latent_dist=posterior)

while the decode is:

@apply_forward_hook
def decode(
    self,
    z: torch.Tensor,
    num_frames: int,
    return_dict: bool = True,
) -> Union[DecoderOutput, torch.Tensor]:
    """
    Decode a batch of images.

    Args:
        z (`torch.Tensor`): Input batch of latent vectors.
        return_dict (`bool`, *optional*, defaults to `True`):
            Whether to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.

    Returns:
        [`~models.vae.DecoderOutput`] or `tuple`:
            If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is
            returned.
    """
    batch_size = z.shape[0] // num_frames
    image_only_indicator = torch.zeros(batch_size, num_frames, dtype=z.dtype, device=z.device)
    # LP: added slicing - see https://github.com/Vchitect/VEnhancer/issues/20
    # decoded = self.decoder(z, num_frames=num_frames, image_only_indicator=image_only_indicator)
    if self.use_slicing and z.shape[0] > 1:
        # decode one latent slice at a time to cap peak memory
        decoded_slices = [
            self.decoder(z_slice, num_frames=num_frames, image_only_indicator=image_only_indicator)
            for z_slice in z.split(1)
        ]
        decoded = torch.cat(decoded_slices)
    else:
        decoded = self.decoder(z, num_frames=num_frames, image_only_indicator=image_only_indicator)
    if not return_dict:
        return (decoded,)
    return DecoderOutput(sample=decoded)

So I think my error was due to the fact that I have to consider num_frames when slicing.
Okay, the decode now splits by num_frames:

@apply_forward_hook
def decode(
    self,
    z: torch.Tensor,
    num_frames: int,
    return_dict: bool = True,
) -> Union[DecoderOutput, torch.Tensor]:
    """
    Decode a batch of images.

    Args:
        z (`torch.Tensor`): Input batch of latent vectors.
        return_dict (`bool`, *optional*, defaults to `True`):
            Whether to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.

    Returns:
        [`~models.vae.DecoderOutput`] or `tuple`:
            If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is
            returned.
    """
    batch_size = z.shape[0] // num_frames
    image_only_indicator = torch.zeros(batch_size, num_frames, dtype=z.dtype, device=z.device)
    if self.use_slicing and z.shape[0] > 1:
        # Split the tensor based on the number of frames, not into individual slices
        z_slices = torch.split(z, num_frames)
        decoded_slices = [
            self.decoder(z_slice, num_frames=num_frames, image_only_indicator=image_only_indicator)
            for z_slice in z_slices
        ]
        decoded = torch.cat(decoded_slices, dim=0)  # Concatenate along the batch dimension
    else:
        decoded = self.decoder(z, num_frames=num_frames, image_only_indicator=image_only_indicator)
    if not return_dict:
        return (decoded,)
    return DecoderOutput(sample=decoded)
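For intuition, a minimal runnable check of what that torch.split does, using the latent shapes from the logs above (the two 31-frame clips stacked along dim 0 are illustrative values):

import torch

z = torch.randn(62, 4, 154, 248)  # e.g. two 31-frame latent clips stacked on dim 0
chunks = torch.split(z, 31)       # -> two tensors of shape [31, 4, 154, 248]
assert all(c.shape == (31, 4, 154, 248) for c in chunks)
# the decoder is now called once per 31-frame chunk instead of once on all 62 frames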
@hejingwenhejingwen I implemented the encode like this:

@apply_forward_hook
def encode(
    self, x: torch.Tensor, return_dict: bool = True
) -> Union[AutoencoderKLOutput, Tuple[DiagonalGaussianDistribution]]:
    """
    Encode a batch of images into latents.

    Args:
        x (`torch.Tensor`): Input batch of images.
        return_dict (`bool`, *optional*, defaults to `True`):
            Whether to return a [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] instead of a plain
            tuple.

    Returns:
        The latent representations of the encoded images. If `return_dict` is True, a
        [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] is returned, otherwise a plain `tuple` is
        returned.
    """
    # x.shape [1, 3, 1296, 1920] [B, C, W, H]
    w_patch_size = 128
    h_patch_size = 128
    if self.use_slicing and (x.shape[2] > w_patch_size or x.shape[3] > h_patch_size):
        h_slices = []
        height_splits = x.split(h_patch_size, dim=2)
        for h_slice in height_splits:
            width_splits = h_slice.split(w_patch_size, dim=3)
            encoded_width_slices = [self.encoder(w_slice) for w_slice in width_splits]
            h_slices.append(torch.cat(encoded_width_slices, dim=3))
        h = torch.cat(h_slices, dim=2)
    else:
        h = self.encoder(x)
    moments = self.quant_conv(h)
    posterior = DiagonalGaussianDistribution(moments)
    if not return_dict:
        return (posterior,)
    return AutoencoderKLOutput(latent_dist=posterior)

While it runs to completion and keeps memory under control, the quality degrades, as you can see here: iron_man.2.mp4
The patch size should be larger (e.g., 512x512), and the patches should overlap (e.g., by 128).
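A minimal sketch of what overlapping patch coordinates could look like (illustrative helper, not code from this repo); the overlapped regions would then be blended, e.g. with linear ramps, to hide seams:

def tile_starts(size, tile=512, overlap=128):
    # overlapping tile start offsets along one spatial axis
    stride = tile - overlap
    starts = list(range(0, max(size - tile, 0) + 1, stride))
    if starts[-1] + tile < size:  # make sure the last tile reaches the edge
        starts.append(size - tile)
    return starts

print(tile_starts(1296))  # -> [0, 384, 768, 784]
print(tile_starts(1920))  # -> [0, 384, 768, 1152, 1408]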
I'm getting an OOM running the
using the
Error:
Is it possible to apply BF16 quantization? My approach for running CogVideoX in 24GB is to use Tiled VAE Decoding and Sliced VAE Decoding, plus CPU offload, and to run the pipe in BF16. To quantize the model I use torchao. Not sure whether this can be applied to your model too.
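For reference, this is roughly the torchao pattern I use with CogVideoX; it is a sketch of my own setup, not something tested on VEnhancer, and whether VEnhancer's UNet tolerates weight-only int8 is an open question:

import torch
from diffusers import CogVideoXPipeline
from torchao.quantization import quantize_, int8_weight_only

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
quantize_(pipe.transformer, int8_weight_only())  # int8 weight-only quantization via torchao
pipe.enable_model_cpu_offload()                  # move modules to GPU only when needed
pipe.vae.enable_tiling()                         # tiled VAE decoding
pipe.vae.enable_slicing()                        # sliced VAE decoding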