Allow unknown schedule-free optimizers to continue to module loader #1811
Conversation
Thank you, this is great!
It would be great if you could also add documentation here about the best way to use it.
This PR unblocks prodigyplus, but I don't have a recommendation for how best to use it. If anyone has suggestions, I will update the original description so users can follow them. Initially it behaves like Prodigy, so set the LR to 1.0 for the text encoders and UNet. It is schedule-free, so no LR schedulers are used or required.
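For concreteness, a minimal sketch of how that looks as sd-scripts flags (the constant scheduler is the sd-scripts default; it is shown explicitly here only to stress that no schedule is needed):

```
--learning_rate 1.0 --text_encoder_lr 1.0 --unet_lr 1.0 --lr_scheduler constant
```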
Hey, I gave this a try, combined with Huber loss, and I wanted to report my results. I have around 200 training images, I'm training a rank 52 / alpha 26 LoRA, and I'm using batch size 1 with no gradient accumulation. I used your default settings as suggested (learning rate 1.0, not training the text encoders).

The image quality was superb, and the LoRA seemed to be learning slowly over time. But around step 1000, despite the images still being very high quality in appearance, I started getting the (incredibly annoying) grid artifact. It appeared most strongly down the right-hand side of the image, as it usually does when it appears, and it's mostly vertical lines rather than horizontal lines. I might restart and set the learning rate to 5e-1 instead of 1.0.

By the way, I'm getting pretty suspicious of the fact that this artifact appears most strongly at one particular side of the image and is mostly made of vertical lines rather than horizontal ones. An artifact like this shouldn't affect vertical lines more than horizontal ones, and shouldn't favour one side over the other three. Maybe there is some sort of offset bug somewhere? If someone were able to hunt down the true model-level cause of this artifact, that would be valuable. I don't think it's simply "your learning rate is too high"; lowering the learning rate may just mask the problem rather than cure it.

Edit: One other artifact comes to mind: I've noticed that sample images (which I make with flux_minimal_inference.py) often have a slightly brighter line down the left side of the image. That brighter edge tends not to appear at the other three sides. It's not a single row of pixels; it tends to form part of a brighter edge structure, which implies it occurs early in the image generation steps and then gets incorporated into the surrounding pixels on the left side during further iterations.
Just in case that previous message of mine came across as a bit negative: the sample images I saw before the artifact showed up were very sharp and had lots of rich detail. Scene complexity is greatly increased, a sign of the network being well balanced with itself, so most attention keys are activating in the same proportions as they did in the untrained base model. I don't think this grid artifact is due to any error @rockerBOO has made; it's a pre-existing issue. I think this optimizer might be very useful for raising image quality.
Yeah, even dropping the LR down to 0.3 still produces the vertical stripes near the right-hand edge at around 1000 steps. But apart from that, the images are stunningly high quality and beautiful. 😭

Edit: Reading through the schedule-free Prodigy docs, I found this:
Given that …
I have removed the suggestion for prodigy_steps here, as it depends entirely on how the training is going and how many steps your run has. The module suggests maybe 25% or 5-10%, but it's a parameter you can set if you want to stop the optimizer from continuing to modify its d after that percentage of your training steps. See the linked repo for the actual recommendations, as this PR only unblocks this optimizer. Any suggestions or improvements to the algorithm are best directed to its developer.
Yeah, knocking prodigy_steps down from 700 to 550 has gotten rid of that corruption for me. My total intended step count is 7400, so that's down from 9.5% to 7.4%. Maybe it's lower for me than for most people because most of my images use an alpha_mask of around 5%, with typically only small parts at 100% alpha.
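For anyone reproducing this, a hedged sketch of how such a value would be passed through sd-scripts' --optimizer_args (the parameter name comes from the discussion above; 550 is the value from this particular 7400-step run, not a general recommendation):

```
--optimizer_args "prodigy_steps=550"
```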
I looked at the Prodigy docs, rather than just the schedule-free version, and I found this:
If people are experimenting with this optimizer, it might be worth trying to set use_bias_correction to true, as recommended there, although I'm not certain whether these pages mean DiT when they say 'diffusion models' or specifically UNet-based diffusion. You could also try setting that weight decay value, as the default is zero. You can set these values in the --optimizer_args.

Edit: One other thing that's working well for me now is pushing the LR higher than 1.0. I couldn't get it to learn the fine detail of an object I was trying to train, but it seems to be learning better now with LR = 2.0.

Edit2: The docs say:
so maybe try experimenting with setting d_coef instead of a non-1.0 LR.
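A hedged sketch of passing those settings through --optimizer_args (the parameter names are the ones quoted in this thread; the weight_decay and d_coef values are placeholders, with d_coef=2.0 mirroring the LR = 2.0 experiment above via d_coef instead of the LR):

```
--optimizer_args "use_bias_correction=True" "weight_decay=0.01" "d_coef=2.0"
```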
I'm very sorry, but I don't know much about code. Could you explain step by step how to implement this optimizer in sd-scripts and use its arguments in a batch file? All my attempts were unsuccessful, in both dev and sd3; I only get ValueError: too many values to unpack.
The arguments are passed via --optimizer_type and --optimizer_args.
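As an illustrative sketch only (the module path below is an assumption based on the optimizer's package name, so check the linked repo for the exact import path; note that each key=value pair is a separate quoted string after --optimizer_args):

```
--optimizer_type prodigyplus.prodigy_plus_schedulefree.ProdigyPlusScheduleFree \
--optimizer_args "prodigy_steps=550" "use_bias_correction=True"
```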
Thank you for the answer, I will check it later! For the future, how can I find out from an optimizer's settings how to write the arguments the right way? Btw, could you tell me how to make the original ADOPT optimizer from https://github.com/iShohei220/adopt work with sd-scripts? I can see that ADOPT can be enabled in the ProdigyPSF settings, but I'd like to try the original as well.

UPD: By the way, the training on the default SF is over, and the result is flawless (perfect similarity at 700 steps out of 3000), but there is slight overcooking in some generations in my particular use case (I use distillation LoRAs to speed up generations). I basically know how to fix it, but I would like to clarify what exactly will reduce the aggressiveness of training for the weights and the encoder in SF: --lr_warmup_steps, a lowered --prior_loss_weight, a lowered d_coef, or something else? In my experience with the original Prodigy, using these particular arguments took away the overcooking entirely.
See the train_network docs, though you may need to translate them.
See the linked repo for the arguments you need to pass.
Mm okay.
I don't understand how, that's the problem. Can you also clarify the UPD in my recent post, please? There is another question: in the original Prodigy you need to use decouple=True use_bias_correction=True safeguard_warmup=True; does that work with SF? Well, except for bias correction, which is in SF by default but turned off.
And the last question for now: can I lower the separate LR for the TE, or is it not important / does it make no sense when using SF?
To support ProdigyPlusScheduleFree
Initially it behaves like Prodigy, so set the LR to 1.0 for the text encoders and UNet. It is schedule-free, so no LR schedulers are used or required.
--text_encoder_lr 1.0 --unet_lr 1.0
Recommended usage
Full set of usage options: https://github.com/LoganBooker/prodigy-plus-schedule-free?tab=readme-ov-file#usage
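Putting the pieces from this thread together, a hedged end-to-end sketch of the optimizer-related flags for a LoRA training run (the module path is an assumption, and the --optimizer_args values are examples drawn from the conversation rather than recommendations; the rest of the command stays as in your usual setup):

```
--optimizer_type prodigyplus.prodigy_plus_schedulefree.ProdigyPlusScheduleFree \
--learning_rate 1.0 --text_encoder_lr 1.0 --unet_lr 1.0 \
--lr_scheduler constant \
--optimizer_args "prodigy_steps=550" "use_bias_correction=True"
```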
Related #1796 #1799