Approximate release date of MULTI-GPU inferencing #11

Open
SamitM1 opened this issue Aug 23, 2024 · 18 comments


SamitM1 commented Aug 23, 2024

Hey guys, great work with this. We were wondering if, and approximately when, you will be releasing multi-GPU inferencing. Also, what is the time taken with default settings to run inference on a 6-second CogVideoX-generated video using an H100 (which is more powerful and efficient than an A100)? And once multi-GPU inferencing is implemented, approximately how long would it take to process a CogVideoX-generated video with 8 A100s or 8 H100s?

Thanks in advance.

@hejingwenhejingwen (Collaborator)

Thanks for your questions! Multi-GPU inference will be supported next week, and the corresponding inference times on A100 (1~8 GPUs) will be recorded.


SamitM1 commented Sep 6, 2024

Great! So it will be released today or tomorrow?

I see you guys have added an open-source plan, which is awesome by the way. But there is no official release date for the multi-GPU inference.

Thanks again @hejingwenhejingwen

@hejingwenhejingwen (Collaborator)

> Great! So it will be released today or tomorrow?
>
> I see you guys have added an open-source plan, which is awesome by the way. But there is no official release date for the multi-GPU inference.
>
> Thanks again @hejingwenhejingwen

It is released now.


SamitM1 commented Sep 12, 2024

Awesome @hejingwenhejingwen

Will you guys be releasing the corresponding inference times on A100 (1~8 GPUs)? Right now it still seems kind of slow. We tried with 8 A100 (80 GB) GPUs, and it takes 56 to 62 GB of memory per GPU and roughly 45 minutes to enhance a 6-second CogVideoX video.

I am running multi_gpu_inference.sh; is there anything else we need to do?

@ChrisLiu6 (Collaborator)

My apologies 🙃 There was a bug in the previous version where the non-parallel model implementation was always used, even during multi-GPU inference. We have now fixed the bug; could you please pull the latest version and try again with the run_VEnhancer_MultiGPU.sh script?


SamitM1 commented Sep 12, 2024

@ChrisLiu6 I tried with 8 and 4 A100 (40 GB) GPUs and now I get this error:

2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 25, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 25, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 25, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 25, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 1, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 25])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 25])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 25])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 25])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 170, 248])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 25])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 25])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 25])
2024-09-12 03:54:14,648 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
[rank7]: Traceback (most recent call last):
[rank7]:   File "/home/ubuntu/VEnhancer/enhance_a_video_MultiGPU.py", line 202, in <module>
[rank7]:     main()
[rank7]:   File "/home/ubuntu/VEnhancer/enhance_a_video_MultiGPU.py", line 198, in main
[rank7]:     venhancer.enhance_a_video(file_path, prompt, up_scale, target_fps, noise_aug)
[rank7]:   File "/home/ubuntu/VEnhancer/enhance_a_video_MultiGPU.py", line 80, in enhance_a_video
[rank7]:     output = self.model.test(
[rank7]:   File "/home/ubuntu/VEnhancer/video_to_video/video_to_video_model_parallel.py", line 129, in test
[rank7]:     gen_vid = self.diffusion.sample(
[rank7]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank7]:     return func(*args, **kwargs)
[rank7]:   File "/home/ubuntu/VEnhancer/video_to_video/diffusion/diffusion_sdedit.py", line 256, in sample
[rank7]:     x0 = solver_fn(noise, fn, sigmas, show_progress=show_progress, **kwargs)
[rank7]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank7]:     return func(*args, **kwargs)
[rank7]:   File "/home/ubuntu/VEnhancer/video_to_video/diffusion/solvers_sdedit.py", line 145, in sample_dpmpp_2m_sde
[rank7]:     denoised = model(x * c_in, sigmas[i])
[rank7]:   File "/home/ubuntu/VEnhancer/video_to_video/diffusion/diffusion_sdedit.py", line 195, in model_chunk_fn
[rank7]:     x0_chunk = self.denoise(
[rank7]:   File "/home/ubuntu/VEnhancer/video_to_video/diffusion/diffusion_sdedit.py", line 59, in denoise
[rank7]:     y_out = model(
[rank7]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank7]:     return self._call_impl(*args, **kwargs)
[rank7]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank7]:     return forward_call(*args, **kwargs)
[rank7]:   File "/home/ubuntu/VEnhancer/video_to_video/modules/unet_v2v_parallel.py", line 1114, in forward
[rank7]:     x = x[get_context_parallel_rank()]
[rank7]: IndexError: tuple index out of range

I did this:
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
pip install -r requirements.txt

and then installed xformers:
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121

For 4 GPUs, it just prints this out many times and doesn't seem to be stopping:

2024-09-12 03:59:03,103 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 640, 4, 86, 124])
2024-09-12 03:59:03,103 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 640, 6, 86, 124])
2024-09-12 03:59:03,103 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 640, 6, 86, 124])
2024-09-12 03:59:03,108 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 640, 6, 86, 124])
2024-09-12 03:59:03,108 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 640, 4, 86, 124])
2024-09-12 03:59:03,109 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 640, 6, 86, 124])
2024-09-12 03:59:03,109 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 640, 6, 86, 124])
2024-09-12 03:59:03,114 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 640, 6, 86, 124])
2024-09-12 03:59:03,114 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 640, 4, 86, 124])
2024-09-12 03:59:03,114 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 640, 6, 86, 124])
2024-09-12 03:59:03,114 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 640, 6, 86, 124])
2024-09-12 03:59:03,141 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 640, 6, 10664])
2024-09-12 03:59:03,141 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 640, 6, 10664])
2024-09-12 03:59:03,141 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 640, 4, 10664])
2024-09-12 03:59:03,141 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 640, 6, 10664])
2024-09-12 03:59:03,152 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 640, 22, 2666])
2024-09-12 03:59:03,152 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 640, 22, 2666])
2024-09-12 03:59:03,152 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 640, 22, 2666])
2024-09-12 03:59:03,152 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 640, 22, 2666])
2024-09-12 03:59:03,154 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 640, 22, 2666])
2024-09-12 03:59:03,154 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 640, 22, 2666])
2024-09-12 03:59:03,154 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 640, 22, 2666])
2024-09-12 03:59:03,155 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 640, 22, 2666])
2024-09-12 03:59:03,173 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 640, 6, 10664])
2024-09-12 03:59:03,173 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 640, 4, 10664])
2024-09-12 03:59:03,173 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 640, 6, 10664])
2024-09-12 03:59:03,173 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 640, 6, 10664])
2024-09-12 03:59:03,209 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 6, 170, 248])
2024-09-12 03:59:03,209 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 4, 170, 248])
2024-09-12 03:59:03,209 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 6, 170, 248])
2024-09-12 03:59:03,209 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 6, 170, 248])
2024-09-12 03:59:03,218 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 6, 170, 248])
2024-09-12 03:59:03,218 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 4, 170, 248])
2024-09-12 03:59:03,219 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 6, 170, 248])
2024-09-12 03:59:03,219 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 6, 170, 248])
2024-09-12 03:59:03,228 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 6, 170, 248])
2024-09-12 03:59:03,228 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 4, 170, 248])
2024-09-12 03:59:03,228 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 6, 170, 248])
2024-09-12 03:59:03,228 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 6, 170, 248])
2024-09-12 03:59:03,237 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 6, 170, 248])
2024-09-12 03:59:03,237 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 4, 170, 248])
2024-09-12 03:59:03,237 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 6, 170, 248])
2024-09-12 03:59:03,237 - video_to_video.modules.parallel_modules - INFO - paraconv out shape: torch.Size([1, 320, 6, 170, 248])
2024-09-12 03:59:03,343 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 320, 6, 42160])
2024-09-12 03:59:03,343 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 320, 6, 42160])
2024-09-12 03:59:03,343 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 320, 4, 42160])
2024-09-12 03:59:03,343 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 320, 6, 42160])
2024-09-12 03:59:03,364 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 320, 22, 10540])
2024-09-12 03:59:03,364 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 320, 22, 10540])
2024-09-12 03:59:03,364 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 320, 22, 10540])
2024-09-12 03:59:03,364 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 320, 22, 10540])
2024-09-12 03:59:03,367 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 320, 22, 10540])
2024-09-12 03:59:03,367 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 320, 22, 10540])
2024-09-12 03:59:03,367 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 320, 22, 10540])
2024-09-12 03:59:03,368 - video_to_video.modules.parallel_modules - INFO - all_to_all_input: torch.Size([1, 320, 22, 10540])
2024-09-12 03:59:03,404 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 320, 6, 42160])
2024-09-12 03:59:03,404 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 320, 4, 42160])
2024-09-12 03:59:03,404 - video_to_video.modules.parallel_modules - INFO - all_to_all_output: torch.Size([1, 320, 6, 42160])

@ChrisLiu6 (Collaborator)

> For 4 GPUs, it just prints this out many times and doesn't seem to be stopping:

If it continuously prints these logs, it is working as expected.

For the 8-GPU case, the error is likely because there are too few frames to be parallel-processed by that many GPUs; I'm looking into it.


SamitM1 commented Sep 12, 2024

@ChrisLiu6 thank you for your response.

Have you fixed the 8-GPU inferencing in the latest pushes, or are you still working on it?

@ChrisLiu6 (Collaborator)

@SamitM1
Well, the problem with 8 GPUs is as follows: your input to the diffusion model has a total of 25 frames. When 8 GPUs are used, each GPU becomes responsible for up to 4 frames. However, when doing so, the number of frames allocated to each GPU would be [4, 4, 4, 4, 4, 4, 1, 0], which means the last GPU is allocated no frames at all, and the second-to-last GPU also has 3 empty frame slots. This case was not covered when I wrote the parallel inference code, as I assumed only the last GPU could have empty frames.

This means that when using 8 GPUs to process 25 frames, one GPU is destined to be idle. Therefore, as a workaround, you can use 7 GPUs. We will soon add logic so that for cases like yours we automatically use fewer GPUs.
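
For illustration, here is a minimal sketch (in Python, not the actual VEnhancer code) of a ceil-based split that reproduces the allocation described above:

import math

def split_frames(num_frames: int, num_gpus: int) -> list[int]:
    # Each rank takes up to ceil(num_frames / num_gpus) frames;
    # later ranks may receive fewer frames, or none at all.
    chunk = math.ceil(num_frames / num_gpus)
    counts, remaining = [], num_frames
    for _ in range(num_gpus):
        take = min(chunk, remaining)
        counts.append(take)
        remaining -= take
    return counts

print(split_frames(25, 8))  # [4, 4, 4, 4, 4, 4, 1, 0] -> the last rank is idle
print(split_frames(25, 7))  # [4, 4, 4, 4, 4, 4, 1]    -> every rank gets work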

By the way, does the 4-GPU run work fine? Thx


SamitM1 commented Sep 13, 2024

@ChrisLiu6
I am actually inputting 49 frames (a 6-second video at 8 FPS, plus 1 starting frame).

The thing is, we want to use more GPUs, ideally 8 (assuming that leads to faster inferencing). So based on what you said above, does that mean we have to pass in a video with a total number of frames divisible by 8, for example 48 frames (which would mean 6 frames per GPU)?

Basically, is there any way for us to run the following input with 8 GPUs:

2024-09-13 04:42:38,905 - video_to_video - INFO - input frames length: 49
2024-09-13 04:42:38,905 - video_to_video - INFO - input fps: 8.0
2024-09-13 04:42:38,905 - video_to_video - INFO - target_fps: 24.0
2024-09-13 04:42:39,204 - video_to_video - INFO - input resolution: (480, 720)
2024-09-13 04:42:39,205 - video_to_video - INFO - target resolution: (1320, 1982)
2024-09-13 04:42:39,205 - video_to_video - INFO - noise augmentation: 250
2024-09-13 04:42:39,205 - video_to_video - INFO - scale s is set to: 8
2024-09-13 04:42:39,251 - video_to_video - INFO - video_data shape: torch.Size([145, 3, 1320, 1982])
2024-09-13 04:42:39,294 - video_to_video - INFO - input resolution: (480, 720)
2024-09-13 04:42:39,294 - video_to_video - INFO - target resolution: (1320, 1982)
2024-09-13 04:42:39,294 - video_to_video - INFO - noise augmentation: 250
2024-09-13 04:42:39,294 - video_to_video - INFO - scale s is set to: 8
2024-09-13 04:42:39,335 - video_to_video - INFO - video_data shape: torch.Size([145, 3, 1320, 1982])
2024-09-13 04:42:39,464 - video_to_video - INFO - Load model path ./ckpts/venhancer_paper.pt, with local status <All keys matched successfully>
2024-09-13 04:42:39,466 - video_to_video - INFO - Build diffusion with GaussianDiffusion

Thanks in advance!

Edit:

7 GPUs does not work:

[rank6]: Traceback (most recent call last):
[rank6]:   File "/home/ubuntu/VEnhancer/enhance_a_video_MultiGPU.py", line 215, in <module>
[rank6]:     main()
[rank6]:   File "/home/ubuntu/VEnhancer/enhance_a_video_MultiGPU.py", line 211, in main
[rank6]:     venhancer.enhance_a_video(file_path, prompt, up_scale, target_fps, noise_aug)
[rank6]:   File "/home/ubuntu/VEnhancer/enhance_a_video_MultiGPU.py", line 89, in enhance_a_video
[rank6]:     output = self.model.test(
[rank6]:   File "/home/ubuntu/VEnhancer/video_to_video/video_to_video_model_parallel.py", line 129, in test
[rank6]:     gen_vid = self.diffusion.sample(
[rank6]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank6]:     return func(*args, **kwargs)
[rank6]:   File "/home/ubuntu/VEnhancer/video_to_video/diffusion/diffusion_sdedit.py", line 256, in sample
[rank6]:     x0 = solver_fn(noise, fn, sigmas, show_progress=show_progress, **kwargs)
[rank6]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank6]:     return func(*args, **kwargs)
[rank6]:   File "/home/ubuntu/VEnhancer/video_to_video/diffusion/solvers_sdedit.py", line 145, in sample_dpmpp_2m_sde
[rank6]:     denoised = model(x * c_in, sigmas[i])
[rank6]:   File "/home/ubuntu/VEnhancer/video_to_video/diffusion/diffusion_sdedit.py", line 195, in model_chunk_fn
[rank6]:     x0_chunk = self.denoise(
[rank6]:   File "/home/ubuntu/VEnhancer/video_to_video/diffusion/diffusion_sdedit.py", line 59, in denoise
[rank6]:     y_out = model(
[rank6]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank6]:     return self._call_impl(*args, **kwargs)
[rank6]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank6]:     return forward_call(*args, **kwargs)
[rank6]:   File "/home/ubuntu/VEnhancer/video_to_video/modules/unet_v2v_parallel.py", line 1114, in forward
[rank6]:     x = x[get_context_parallel_rank()]
[rank6]: IndexError: tuple index out of range
/home/ubuntu/VEnhancer/video_to_video/video_to_video_model_parallel.py:110: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with amp.autocast(enabled=True):
/home/ubuntu/VEnhancer/video_to_video/video_to_video_model_parallel.py:110: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with amp.autocast(enabled=True):


SamitM1 commented Sep 15, 2024

@ChrisLiu6
What's the max number of GPUs I can use for a video with a total of 49 frames, and is there any way I can increase the number of GPUs I can use without it erroring?


ChrisLiu6 commented Sep 16, 2024

> @ChrisLiu6 What's the max number of GPUs I can use for a video with a total of 49 frames, and is there any way I can increase the number of GPUs I can use without it erroring?

Hi, I've just pushed a commit, and now you should be able to use any number of GPUs without error (however, in some cases some GPUs may be idle). Generally speaking, 4 or 8 GPUs is a good choice in most cases.


SamitM1 commented Sep 16, 2024

@ChrisLiu6
I ran with 8 A100 (40 GB) GPUs and it gave me a memory error AFTER it had basically finished:

2024-09-16 23:12:22,184 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 40
2024-09-16 23:12:24,191 - video_to_video - INFO - step: 13
2024-09-16 23:12:24,191 - video_to_video - INFO - step: 13
2024-09-16 23:12:24,192 - video_to_video - INFO - step: 13
2024-09-16 23:12:24,192 - video_to_video - INFO - step: 13
2024-09-16 23:12:24,192 - video_to_video - INFO - step: 13
2024-09-16 23:12:24,193 - video_to_video - INFO - step: 13
2024-09-16 23:12:24,193 - video_to_video - INFO - step: 13
2024-09-16 23:12:24,194 - video_to_video - INFO - step: 13
2024-09-16 23:12:24,217 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:24,217 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:24,217 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:24,217 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:24,217 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:24,217 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:24,220 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:26,110 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:26,110 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:26,110 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:26,110 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:26,110 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:26,110 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:26,111 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:27,886 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:27,886 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:27,887 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:27,887 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:27,887 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:27,887 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:27,887 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:29,663 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:29,663 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:29,663 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:29,663 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:29,663 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:29,663 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:29,664 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:31,440 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:31,440 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:31,440 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:31,440 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:31,440 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:31,440 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:31,440 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:33,217 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:33,218 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:33,218 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:33,218 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:33,218 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:33,218 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:33,218 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:34,995 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:34,996 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:34,996 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:34,996 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:34,996 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:34,996 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:34,996 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:36,786 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:36,786 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:36,787 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:36,787 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:36,787 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:36,787 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:36,787 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:38,571 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:38,572 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:38,572 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:38,572 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:38,572 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:38,572 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:38,572 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:40,346 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:40,346 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:40,346 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:40,346 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:40,346 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:40,346 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:40,347 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:42,124 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:42,125 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:42,125 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:42,125 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:42,125 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:42,125 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:42,125 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:43,904 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:43,905 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:43,905 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:43,905 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:43,905 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:43,905 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:43,905 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:45,683 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:45,684 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:45,684 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:45,684 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:45,684 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:45,684 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:45,684 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:47,460 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 162, 240])
2024-09-16 23:12:47,461 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 162, 240])
2024-09-16 23:12:47,461 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:47,461 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:47,461 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-16 23:12:47,461 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:47,461 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-16 23:12:49,243 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 40, 162, 240])
2024-09-16 23:12:49,243 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 5, 162, 240])
2024-09-16 23:12:49,243 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:49,243 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:49,243 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 40])
2024-09-16 23:12:49,243 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:49,243 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 40
2024-09-16 23:12:51,144 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 40, 162, 240])
2024-09-16 23:12:51,144 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 5, 162, 240])
2024-09-16 23:12:51,145 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 162, 240])
2024-09-16 23:12:51,145 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-16 23:12:51,145 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 40])
2024-09-16 23:12:51,145 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-16 23:12:51,145 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 40
2024-09-16 23:12:53,130 - video_to_video - INFO - sampling, finished.
2024-09-16 23:12:53,370 - video_to_video - INFO - sampling, finished.
2024-09-16 23:12:53,370 - video_to_video - INFO - sampling, finished.
2024-09-16 23:12:53,377 - video_to_video - INFO - sampling, finished.
2024-09-16 23:12:53,377 - video_to_video - INFO - sampling, finished.
2024-09-16 23:12:53,377 - video_to_video - INFO - sampling, finished.
2024-09-16 23:12:53,377 - video_to_video - INFO - sampling, finished.
2024-09-16 23:12:53,452 - video_to_video - INFO - sampling, finished.
[rank4]: Traceback (most recent call last):
[rank4]:   File "/home/ubuntu/VEnhancer/enhance_a_video_MultiGPU.py", line 215, in <module>
[rank4]:     main()
[rank4]:   File "/home/ubuntu/VEnhancer/enhance_a_video_MultiGPU.py", line 211, in main
[rank4]:     venhancer.enhance_a_video(file_path, prompt, up_scale, target_fps, noise_aug)
[rank4]:   File "/home/ubuntu/VEnhancer/enhance_a_video_MultiGPU.py", line 89, in enhance_a_video
[rank4]:     output = self.model.test(
[rank4]:   File "/home/ubuntu/VEnhancer/video_to_video/video_to_video_model_parallel.py", line 146, in test
[rank4]:     gen_video = self.tiled_chunked_decode(gen_vid)
[rank4]:   File "/home/ubuntu/VEnhancer/video_to_video/video_to_video_model_parallel.py", line 209, in tiled_chunked_decode
[rank4]:     tile = self.temporal_vae_decode(tile, tile_f_num)
[rank4]:   File "/home/ubuntu/VEnhancer/video_to_video/video_to_video_model_parallel.py", line 157, in temporal_vae_decode
[rank4]:     return self.vae.decode(z / self.vae.config.scaling_factor, num_frames=num_f).sample
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_temporal_decoder.py", line 366, in decode
[rank4]:     decoded = self.decoder(z, num_frames=num_frames, image_only_indicator=image_only_indicator)
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank4]:     return self._call_impl(*args, **kwargs)
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank4]:     return forward_call(*args, **kwargs)
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_temporal_decoder.py", line 147, in forward
[rank4]:     sample = up_block(sample, image_only_indicator=image_only_indicator)
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank4]:     return self._call_impl(*args, **kwargs)
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank4]:     return forward_call(*args, **kwargs)
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/diffusers/models/unets/unet_3d_blocks.py", line 1007, in forward
[rank4]:     hidden_states = upsampler(hidden_states)
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank4]:     return self._call_impl(*args, **kwargs)
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank4]:     return forward_call(*args, **kwargs)
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/diffusers/models/upsampling.py", line 180, in forward
[rank4]:     hidden_states = self.conv(hidden_states)
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank4]:     return self._call_impl(*args, **kwargs)
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank4]:     return forward_call(*args, **kwargs)
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 458, in forward
[rank4]:     return self._conv_forward(input, self.weight, self.bias)
[rank4]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
[rank4]:     return F.conv2d(input, weight, bias, self.stride,
[rank4]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 720.00 MiB. GPU 4 has a total capacity of 39.39 GiB of which 619.94 MiB is free. Including non-PyTorch memory, this process has 38.78 GiB memory in use. Of the allocated memory 29.64 GiB is allocated by PyTorch, and 1.63 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank3]: Traceback (most recent call last):

Why is it taking up more GPU memory right as it finishes? How can I fix this?

I am assuming I have to add torch.cuda.empty_cache(), but we are unsure where.

Thanks for all your help!

@hejingwenhejingwen (Collaborator)

The OOM comes from temporal VAE decoding; its parallel inference is not supported yet. But you can reduce the memory by modifying the tile size in VEnhancer/video_to_video/video_to_video_model.py, lines 172~174.


SamitM1 commented Sep 17, 2024

@hejingwenhejingwen I would love your recommendation for how much to reduce the tile size for our particular case, as we have 40 GB of memory per GPU.

@hejingwenhejingwen (Collaborator)

Please try self.frame_chunk_size = 3, self.tile_img_height = 576, self.tile_img_width = 768.
We cannot give a definitive answer at this time. We will work on it later.
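
For context, a minimal runnable sketch (not the repo's actual class; the attribute names come from this thread, and the "pixels per decode" proxy is only a rough assumption) of what these settings control:

class TileSettings:
    # Smaller tiles/chunks lower peak memory during temporal VAE decoding,
    # at the cost of more (and slower) decode calls.
    def __init__(self, frame_chunk_size=3, tile_img_height=576, tile_img_width=768):
        self.frame_chunk_size = frame_chunk_size  # frames per temporal-VAE decode call
        self.tile_img_height = tile_img_height    # spatial tile height (output pixels)
        self.tile_img_width = tile_img_width      # spatial tile width (output pixels)

    def pixels_per_decode(self) -> int:
        # Rough proxy for decode-time memory: pixels produced by one VAE call.
        return self.frame_chunk_size * self.tile_img_height * self.tile_img_width

print(TileSettings().pixels_per_decode())   # suggested settings above
print(TileSettings(2, 384, 512).pixels_per_decode())   # even smaller tiles, lower peak memory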


SamitM1 commented Sep 17, 2024

@hejingwenhejingwen
I took your suggestion and updated video_to_video_model_parallel.py (not video_to_video_model.py, since I am using 8 GPUs).

I tried everything:

  • reducing the batch size (which causes memory issues earlier on)
  • reducing the tile size even further, to self.frame_chunk_size = 2, self.tile_img_height = 385, self.tile_img_width = 512

but it still failed:

2024-09-17 19:47:53,265 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 19:47:53,265 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 19:47:53,265 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 19:47:55,166 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 170, 248])
2024-09-17 19:47:55,167 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 4, 170, 248])
2024-09-17 19:47:55,167 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 170, 248])
2024-09-17 19:47:55,167 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 19:47:55,167 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 19:47:55,167 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 19:47:55,167 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 19:47:57,070 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 40, 170, 248])
2024-09-17 19:47:57,070 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 5, 170, 248])
2024-09-17 19:47:57,070 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 170, 248])
2024-09-17 19:47:57,070 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 19:47:57,070 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 40])
2024-09-17 19:47:57,070 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 19:47:57,070 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 40
2024-09-17 19:47:59,310 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 40, 170, 248])
2024-09-17 19:47:59,310 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 5, 170, 248])
2024-09-17 19:47:59,310 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 49, 170, 248])
2024-09-17 19:47:59,310 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 19:47:59,310 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 40])
2024-09-17 19:47:59,310 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 19:47:59,310 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 40
2024-09-17 19:48:01,460 - video_to_video - INFO - sampling, finished.
2024-09-17 19:48:01,461 - video_to_video - INFO - sampling, finished.
2024-09-17 19:48:01,461 - video_to_video - INFO - sampling, finished.
2024-09-17 19:48:01,461 - video_to_video - INFO - sampling, finished.
2024-09-17 19:48:01,461 - video_to_video - INFO - sampling, finished.
2024-09-17 19:48:01,461 - video_to_video - INFO - sampling, finished.
2024-09-17 19:48:01,461 - video_to_video - INFO - sampling, finished.
2024-09-17 19:48:01,683 - video_to_video - INFO - sampling, finished.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/ubuntu/VEnhancer/enhance_a_video_MultiGPU.py", line 215, in <module>
[rank1]:     main()
[rank1]:   File "/home/ubuntu/VEnhancer/enhance_a_video_MultiGPU.py", line 211, in main
[rank1]:     venhancer.enhance_a_video(file_path, prompt, up_scale, target_fps, noise_aug)
[rank1]:   File "/home/ubuntu/VEnhancer/enhance_a_video_MultiGPU.py", line 89, in enhance_a_video
[rank1]:     output = self.model.test(
[rank1]:   File "/home/ubuntu/VEnhancer/video_to_video/video_to_video_model_parallel.py", line 146, in test
[rank1]:     gen_video = self.tiled_chunked_decode(gen_vid)
[rank1]:   File "/home/ubuntu/VEnhancer/video_to_video/video_to_video_model_parallel.py", line 209, in tiled_chunked_decode
[rank1]:     tile = self.temporal_vae_decode(tile, tile_f_num)
[rank1]:   File "/home/ubuntu/VEnhancer/video_to_video/video_to_video_model_parallel.py", line 157, in temporal_vae_decode
[rank1]:     return self.vae.decode(z / self.vae.config.scaling_factor, num_frames=num_f).sample
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_temporal_decoder.py", line 366, in decode
[rank1]:     decoded = self.decoder(z, num_frames=num_frames, image_only_indicator=image_only_indicator)
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_temporal_decoder.py", line 147, in forward
[rank1]:     sample = up_block(sample, image_only_indicator=image_only_indicator)
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/diffusers/models/unets/unet_3d_blocks.py", line 1007, in forward
[rank1]:     hidden_states = upsampler(hidden_states)
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/diffusers/models/upsampling.py", line 180, in forward
[rank1]:     hidden_states = self.conv(hidden_states)
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 458, in forward
[rank1]:     return self._conv_forward(input, self.weight, self.bias)
[rank1]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
[rank1]:     return F.conv2d(input, weight, bias, self.stride,
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 324.00 MiB. GPU 1 has a total capacity of 39.39 GiB of which 263.94 MiB is free. Including non-PyTorch memory, this process has 39.12 GiB memory in use. Of the allocated memory 30.26 GiB is allocated by PyTorch, and 1.36 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank5]: Traceback (most recent call last):
[rank5]:   File "/home/ubuntu/VEnhancer/enhance_a_video_MultiGPU.py", line 215, in <module>

@hejingwenhejingwen (Collaborator)

You can chunk all the latents before passing them to the VAE decoder. After you finish decoding one chunk, move it to the CPU.
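
For example, a minimal sketch of that idea (illustrative only; the function name, argument names, and latent layout are assumptions rather than the repo's actual tiled_chunked_decode, though the vae.decode call mirrors the one in the traceback above):

import torch

@torch.no_grad()
def chunked_vae_decode(vae, latents, chunk_size=3):
    # latents: (num_frames, C, H, W) latent tensor on the GPU (batch of 1 assumed).
    # Decode a few frames at a time and move each decoded chunk to the CPU,
    # so only one chunk of full-resolution frames sits in GPU memory at once.
    decoded = []
    for start in range(0, latents.shape[0], chunk_size):
        chunk = latents[start:start + chunk_size]
        frames = vae.decode(chunk / vae.config.scaling_factor,
                            num_frames=chunk.shape[0]).sample
        decoded.append(frames.cpu())  # offload finished frames to system memory
        del frames
        torch.cuda.empty_cache()      # release freed GPU blocks before the next chunk
    return torch.cat(decoded, dim=0)  # full decoded video, assembled on the CPU

This also answers the earlier question about torch.cuda.empty_cache(): calling it once per decoded chunk, after the chunk has been moved off the GPU, is where it helps.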
