
Caching latents and Text Encoder outputs with multiple GPUs #1690

Merged
kohya-ss merged 11 commits into sd3 from multi-gpu-caching on Oct 13, 2024

Conversation

kohya-ss
Owner

No description provided.

@FurkanGozukara

Awesome. Will this work automatically when multiple GPUs are used?

@kohya-ss
Owner Author

kohya-ss commented Oct 12, 2024

> Awesome. Will this work automatically when multiple GPUs are used?

Yes😀 In FLUX.1, the Text Encoder cache also takes time, so we've made it compatible with multiple GPUs. We'd appreciate it if you could test it.

Please note that --highvram is needed for faster caching.
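
Roughly, the idea is that each process started by `accelerate launch` caches its own share of the dataset. The sketch below is only an illustration of that scheme, not the code in this PR; `vae`, `image_batches`, and `cache_dir` are placeholders, and the encode call assumes a diffusers-style autoencoder.

```python
# Illustrative sketch: shard latent caching across GPUs with Hugging Face accelerate.
# Assumption: `vae` follows the diffusers AutoencoderKL interface; paths are placeholders.
import torch
from accelerate import Accelerator

def cache_latents_multi_gpu(vae, image_batches, cache_dir):
    accelerator = Accelerator()
    vae.to(accelerator.device)
    for i, batch in enumerate(image_batches):
        # Round-robin assignment: each process encodes every Nth batch,
        # so the dataset is covered once with no duplicated work.
        if i % accelerator.num_processes != accelerator.process_index:
            continue
        with torch.no_grad():
            latents = vae.encode(batch.to(accelerator.device)).latent_dist.sample()
        torch.save(latents.cpu(), f"{cache_dir}/latents_{i:06d}.pt")
    # Block until every GPU has finished writing its share of the cache.
    accelerator.wait_for_everyone()
```

Run under `accelerate launch --num_processes <number of GPUs>`, so each process receives its own `process_index`.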

kohya-ss changed the title Caching latents with multiple GPUs → Caching latents and Text Encoder outputs with multiple GPUs on Oct 12, 2024
@FurkanGozukara

@kohya-ss Those --highvram and --lowvram flags made zero impact in my previous tests. What do they actually do? I tested both for FLUX fine-tuning and FLUX LoRA training.

I can test FLUX LoRA multi-GPU caching. Fine-tuning still requires 80 GB GPUs; the fused backward pass is not working.

@kohya-ss
Owner Author

> @kohya-ss Those --highvram and --lowvram flags made zero impact in my previous tests. What do they actually do? I tested both for FLUX fine-tuning and FLUX LoRA training.

Currently, --highvram only affects caching of latents, and --lowvram only affects model loading. Training speed remains unchanged.
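
As a rough illustration of what the high-VRAM path buys during caching (this is just a sketch with a placeholder `vae`, not the actual sd-scripts code): keeping the encoder resident on the GPU avoids repeated CPU↔GPU transfers between batches.

```python
# Illustrative sketch only: the kind of behavior a high-VRAM switch enables
# during latent caching. `vae` is assumed to follow the diffusers interface.
import torch

def encode_for_cache(vae, batch, device, highvram: bool):
    vae.to(device)
    with torch.no_grad():
        latents = vae.encode(batch.to(device)).latent_dist.sample().cpu()
    if not highvram:
        # Low-VRAM path: give the memory back between batches (slower overall).
        vae.to("cpu")
        torch.cuda.empty_cache()
    return latents
```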

> I can test FLUX LoRA multi-GPU caching. Fine-tuning still requires 80 GB GPUs; the fused backward pass is not working.

I've done some more research recently, but so far I don't know of any way to improve memory usage with multi-GPU fine-tuning other than DeepSpeed or FSDP.
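
For reference, this is roughly what the FSDP route looks like with plain PyTorch rather than our scripts; `build_flux_model()` is a hypothetical placeholder for loading the transformer, and in practice one would drive this through accelerate's FSDP or DeepSpeed configuration.

```python
# Rough sketch of FSDP sharding with plain PyTorch (launch with torchrun).
# `build_flux_model()` is a hypothetical helper, not part of sd-scripts.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_flux_model()  # hypothetical loader for the transformer to fine-tune
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, and optimizer state
    device_id=torch.cuda.current_device(),
)
# From here the usual training loop runs; each GPU only holds its own shard.
```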

@FurkanGozukara

@kohya-ss You are awesome, thank you so much!

@SoonNOON

I hope the full version of o1 will help find the right solution very soon.

@sdbds
Contributor

sdbds commented Oct 12, 2024

I hope it will be possible to push the cached data directly to HF, so that pulling the latents directly on a cloud platform doesn't take up caching time, and so that more disk space can be saved with large datasets.
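
Something like the following, using the huggingface_hub client (the repo id and cache paths are placeholders, not anything sd-scripts supports today): upload the cache once, then pull it on the cloud machine instead of re-encoding.

```python
# Sketch of the proposed workflow with huggingface_hub; repo id and paths are placeholders.
from huggingface_hub import HfApi, snapshot_download

api = HfApi()
# One-time upload from the machine that did the caching.
api.create_repo("your-name/flux-latent-cache", repo_type="dataset", exist_ok=True)
api.upload_folder(
    folder_path="cache/latents",              # local cache directory (placeholder)
    repo_id="your-name/flux-latent-cache",    # dataset repo (placeholder)
    repo_type="dataset",
)

# On the cloud platform: fetch the cache instead of running the VAE again.
snapshot_download(
    repo_id="your-name/flux-latent-cache",
    repo_type="dataset",
    local_dir="cache/latents",
)
```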

@kohya-ss
Owner Author

Certainly, how to handle large datasets is a big challenge. I don't have much experience working with large-scale datasets, but I think we should also consider options like WebDataset.

Also, since the relative cost of AE/VAE processing decreases during large-scale training, it may not be necessary to cache latents.
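
For the WebDataset idea, a minimal sketch of streaming tar shards so nothing needs to be cached to disk first; the shard URL pattern and sample keys below are placeholders.

```python
# Minimal sketch: stream training images from WebDataset tar shards and
# compute latents on the fly instead of reading them from a local cache.
# The shard URL pattern and keys ("jpg", "json") are placeholders.
import webdataset as wds

dataset = (
    wds.WebDataset("https://example.com/shards/train-{000000..000099}.tar")
    .decode("torchrgb")          # decode images straight to CHW float tensors
    .to_tuple("jpg", "json")     # yield (image, metadata) pairs per sample
)

for image, meta in dataset:
    # latents would be computed here each step instead of being loaded from disk
    ...
```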

kohya-ss marked this pull request as ready for review on October 13, 2024, 10:23
kohya-ss merged commit 1275e14 into sd3 on Oct 13, 2024
2 checks passed
kohya-ss deleted the multi-gpu-caching branch on October 13, 2024, 10:26