Add Immiscible Noise algorithm #1395
base: dev
Conversation
This sounds interesting. I fetched your branch and ran one of my standard training runs (110 images, mostly high quality/resolution, with decent captions) at these learning rates: Tenc: 1e-10. Those are very slow learning rates, but the images still became 'wobbly' almost immediately, and even after 1500 iterations they hadn't recovered. Do other people see something similar?

Edit: Re-ran the same training run without --immiscible_noise and the images were sharp again, so the low-quality images I saw are associated with --immiscible_noise, and not with that cudnn warning.
@v0xie Your loss graph says these were trained with batch size 1, so there's nothing to assign. The fact that it's still affecting the loss tells me something is wrong with the implementation.
The immiscible noise is supposed to replace the original random noise, but the code is adding both to the latents. Based on the paper, we only need to:
Something like this (I don't know if my distance calculation is efficient, but it does work in fp16):

```diff
 def get_noise_noisy_latents_and_timesteps(args, noise_scheduler, latents):
     # Sample noise that we'll add to the latents
-    noise = torch.randn_like(latents, device=latents.device)
+    if args.immiscible_diffusion:
+        # Immiscible Diffusion https://arxiv.org/abs/2406.12303
+        from scipy.optimize import linear_sum_assignment
+        n = args.immiscible_diffusion  # arg is an integer for how many noise tensors to generate
+        size = [n] + list(latents.shape[1:])
+        noise = torch.randn(size, dtype=latents.dtype, layout=latents.layout, device=latents.device)
+        # find similar latent-noise pairs
+        latents_expanded = latents.half().unsqueeze(1).expand(-1, n, *latents.shape[1:])
+        noise_expanded = noise.half().unsqueeze(0).expand(latents.shape[0], *noise.shape)
+        dist = (latents_expanded - noise_expanded) ** 2
+        dist = dist.mean(list(range(2, dist.dim()))).cpu()
+        noise = noise[linear_sum_assignment(dist)[1]]
+    else:
+        noise = torch.randn_like(latents, device=latents.device)
```
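For readers who want to see the assignment step in isolation, here is a minimal NumPy/SciPy sketch of the same idea. The function name and shapes are illustrative, not from the PR; it only demonstrates the cost-matrix-plus-`linear_sum_assignment` pattern the diff above uses:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_noise(latents, noise):
    """Permute `noise` so each latent is paired with a nearby noise sample.

    latents: (B, ...) array; noise: (N, ...) array with N >= B.
    Returns the B noise samples chosen by minimum-cost matching.
    """
    b, n = latents.shape[0], noise.shape[0]
    # Mean squared distance between every latent/noise pair -> (B, N) cost matrix
    lat = latents.reshape(b, -1)[:, None, :]   # (B, 1, D)
    noi = noise.reshape(n, -1)[None, :, :]     # (1, N, D)
    dist = ((lat - noi) ** 2).mean(axis=2)     # (B, N)
    row_ind, col_ind = linear_sum_assignment(dist)
    return noise[col_ind]

rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 4, 8, 8))
noise = rng.normal(size=(16, 4, 8, 8))   # n > batch size, as in the diff above
assigned = assign_noise(latents, noise)

# The matched pairing is never worse (in total distance) than a naive pairing
matched = ((latents - assigned) ** 2).mean(axis=(1, 2, 3)).sum()
naive = ((latents - noise[:4]) ** 2).mean(axis=(1, 2, 3)).sum()
assert matched <= naive
```

Note that `linear_sum_assignment` accepts a rectangular cost matrix, which is what allows generating more noise candidates than latents.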
Hey @feffy380, my first impression is that your code seems to be working. I set n = 32 for my first run with it (because I hadn't yet read the part of the paper where they recommend 1024), and I think I saw quality improvements even at that low value. I'm restarting a new run with n = 1024 now. Maybe make the default 1024, so people don't need to know what value to pass in?

One thing I noticed is that even though my training images are all real-world images, the sample renders continue to show cartoon-styled images for longer than usual; I saw one even at iteration 550. I don't think that's an issue, since it looks like the model will learn to stop doing that, but I found it interesting to note. (I stopped at iteration 650, so I don't know if I'd have gotten any more cartoon-style samples.)
Thank you for testing @araleza, and thank you for the detailed review @feffy380! I incorporated the suggested changes and I'm running some tests now.
My test run with noise batch size 1024 has reached 11000 iterations now with feffy380's code (I haven't tried the new updated version from v0xie yet), and it's looking good. My sample images differ in quality (better lighting, and fewer facial distortions on the difficult training images) from how they usually look without the immiscible noise parameter set. I'd like to try more training runs at different learning rates to be more confident, but as far as I can tell, this is a positive change.
Hi, so I grabbed the latest code in your branch again, @v0xie . I'm still seeing lots of very noisy, damaged images. When I look at the code, it seems there are two sections, the part that feffy380 wrote, and a second section that looks like this:
If I comment out the call to immiscible_diffusion() (which still leaves the call to immiscible_diffusion_get_noise() in place), then the noisy corruption on the images goes away. Looking at the paper you linked, I can see why you added that second call, but I think there must be a bug in that implementation. :(

@feffy380: I've now done lots of runs with just the section of code you provided in place. These are the BEST runs of SDXL training that I've done to date. The quality gains are amazing; it's like a new model. And thanks go to @v0xie for finding this great paper.
Like I said before, adding noise to the latents like this is wrong because the noise_scheduler already does that a few lines later. You get noisy results because the latents now carry 2x noise, but the unet is only removing 1x noise. The extra noise has effectively become part of the ground truth, which completely corrupts the dataset.
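The double-noise effect described above can be checked numerically. This is a toy NumPy sketch (scalar stand-ins for latents; `a` plays the role of the scheduler's alpha_bar at some timestep, and the value 0.5 is arbitrary): pre-adding the assigned noise to the latents and then letting the scheduler add it again greatly inflates the effective noise power, while the UNet's target still accounts for only one unit of noise.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)    # stand-in for latents (unit variance)
eps = rng.normal(size=100_000)  # assigned noise
a = 0.5                         # alpha_bar at some timestep (illustrative)

# Correct: the scheduler adds the noise exactly once
correct = np.sqrt(a) * x + np.sqrt(1 - a) * eps          # variance ~ 1.0

# Buggy: noise pre-added to the latents, then added again by the scheduler
buggy = np.sqrt(a) * (x + eps) + np.sqrt(1 - a) * eps    # variance ~ 2.5

print(correct.var(), buggy.var())
```

With a = 0.5 the noise coefficient in the buggy path becomes sqrt(a) + sqrt(1 - a) ≈ 1.41, so the total variance is roughly 2.5 instead of 1.0, matching the visibly noisy samples reported in the thread.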
@feffy380, I think that call is there to try to implement Step 3 in this part of the paper:

Is there some other way of doing that step that might be correct, and better than just picking the closest noise to the current latent?
You're absolutely correct about the double noise add, @feffy380. Removed it and it's much improved. What's funny is that even with the double noise add I was getting pretty good results, which might speak to the effectiveness of this method. Results after removing the double noise add; I also trained a test with immiscible_noise=4096, which didn't add any noticeable delay to training, at least at 512^2.
@araleza Step 3 is adding noise to the latents, which is what noise_scheduler.add_noise() already does a few lines later.
@feffy380, thanks for helping me understand; I don't have a very strong knowledge of PyTorch commands. The bit that still confuses me, though, is that the code that's now been removed has this section:

And that looks exactly like Step 3 in the paper. But the bit we've kept doesn't have anything that looks like that equation. So how come it still works? Does the code section that's still around (i.e. the immiscible_diffusion_get_noise() function) implement that function with the two square roots in some way that isn't so obviously written out explicitly?

Edit: Or maybe those square roots are inside noise_scheduler.add_noise()?
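On the edit's question: yes, those square roots live inside the scheduler. For example, diffusers' DDPMScheduler.add_noise computes essentially x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, which is the equation from Step 3. A minimal NumPy sketch of that computation (the function and schedule value are illustrative, not the library's actual code):

```python
import numpy as np

def add_noise(x0, noise, alpha_bar_t):
    """Forward diffusion step: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*noise.

    This is the two-square-root equation from Step 3 of the paper; schedulers
    such as diffusers' DDPMScheduler.add_noise compute the same thing, which
    is why the kept code never needs to spell it out explicitly.
    """
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

x0 = np.full(3, 2.0)
noise = np.zeros(3)
# With alpha_bar = 1 (t = 0) the sample is untouched
assert np.allclose(add_noise(x0, noise, 1.0), x0)
# With alpha_bar = 0.25, the clean signal is scaled by sqrt(0.25) = 0.5
assert np.allclose(add_noise(x0, noise, 0.25), 1.0)
```

So the kept code only chooses *which* noise tensor to use; the scheduler still applies the noising equation itself.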
After doing some testing I'm actually getting consistently slightly worse results with the latest iteration of this PR compared to 7b487ce. Certain high frequency details that appeared consistently with the original are lost when reusing the same settings and dataset. Not really sure why. |
Figured I'll put this out there, since there appears to have been an update for immiscible diffusion (v2 on the arXiv?), along with code examples. I've simplified it down for a single-process use case (I think this works as intended?). A notable change is the distance calculation, which seems to be rather different. In any case, it worked rather well on a test run, so I felt the need to share.

```python
import torch
from scipy.optimize import linear_sum_assignment

# https://github.com/yhli123/Immiscible-Diffusion/blob/main/stable_diffusion/conditional_ft_train_sd.py#L941
def immiscible_diffusion_get_noise_v2(latents, n=None):
    """
    Generates noise for immiscible diffusion, simplified for single process.
    """
    with torch.no_grad():
        batch_size = latents.shape[0] if n is None else n
        size = [batch_size] + list(latents.shape[1:])
        noise = torch.randn(size, dtype=latents.dtype, layout=latents.layout, device=latents.device)  # [B, C, H, W]
        # Distance calculation
        distance = torch.linalg.vector_norm(
            0.10 * latents.to(torch.float16).flatten(start_dim=1).unsqueeze(1) -
            0.10 * noise.to(torch.float16).flatten(start_dim=1).unsqueeze(0),
            dim=2
        )  # [B, B]
        _, col_ind = linear_sum_assignment(distance.cpu().numpy())
        noise = noise[col_ind].to(latents.device)  # Assign the permuted noise
        return noise
```

In get_noise_noisy_latents_and_timesteps (or your model-specific noisy-latent function), replace the existing noise generation with a call to this function.
This PR implements the algorithm from "Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment" (Li et al., 2024): https://arxiv.org/abs/2406.12303

The algorithm assigns each training image's latents to nearby noise before the noise is added, so each image is projected onto only nearby noise. This is supposed to speed up convergence and capture more fine detail in the trained model.

The noise assignment operation adds some overhead to training time, but the paper reports it adding only 22.8 ms when training with a batch size of 1024.
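As a rough sanity check on that overhead claim, the assignment solver can be timed in isolation. This sketch only measures SciPy's `linear_sum_assignment` on a random 1024x1024 cost matrix (a stand-in for the real latent-to-noise distance matrix); it does not include the distance computation, and the timing will vary by machine:

```python
import time
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
cost = rng.random((1024, 1024))  # stand-in distance matrix for batch size 1024

t0 = time.perf_counter()
row_ind, col_ind = linear_sum_assignment(cost)
elapsed_ms = (time.perf_counter() - t0) * 1000
print(f"assignment for 1024x1024 cost matrix: {elapsed_ms:.1f} ms")

# The result is a permutation: every noise sample is used exactly once
assert sorted(col_ind) == list(range(1024))
```

The per-step cost stays small because the matrix scales with batch size (or the `--immiscible_noise` value), not with image resolution.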
Use by adding the argument `--immiscible_noise`.

2024/06/27 - Outdated results:
Here are some experimental results trained on the "monster_toy" dataset from the Dreambooth repository (https://github.com/google/dreambooth/blob/main/dataset/monster_toy/00.jpg). Keep in mind the dataset is only 5 images, so by Epoch 30 the model is already starting to be overtrained.

Training with Huber loss:

Training with no Huber loss:
The loss/epoch graph looks like the FID/Training Steps graphs from the paper:
Thank you for your consideration!