
Allow unknown schedule-free optimizers to continue to module loader #1811

Merged
3 commits merged into kohya-ss:sd3 on Dec 1, 2024

Conversation

rockerBOO
Contributor

@rockerBOO rockerBOO commented Nov 29, 2024

To support ProdigyPlusScheduleFree

pip install prodigy-plus-schedule-free

Command line:
--optimizer_type prodigyplus.ProdigyPlusScheduleFree

Config file:
optimizer_type = "prodigyplus.ProdigyPlusScheduleFree"

Initially it behaves like Prodigy, so set the LR to 1.0 for the text encoders and UNet. It is schedule-free, so no LR scheduler is used or required.

--text_encoder_lr 1.0 --unet_lr 1.0

Recommended usage

Full set of usage options: https://github.com/LoganBooker/prodigy-plus-schedule-free?tab=readme-ov-file#usage
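For illustration, a minimal invocation sketch (assuming the sd3 branch's flux_train_network.py; everything other than the optimizer flags is a placeholder for your own setup, not an official recommendation):

accelerate launch flux_train_network.py \
  --optimizer_type prodigyplus.ProdigyPlusScheduleFree \
  --unet_lr 1.0 --text_encoder_lr 1.0
  # plus your usual model / dataset / network / saving arguments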

Related #1796 #1799

@kohya-ss kohya-ss merged commit 14c9ba9 into kohya-ss:sd3 Dec 1, 2024
1 check passed
@kohya-ss
Owner

kohya-ss commented Dec 1, 2024

Thank you, this is great!
Is the reason you install directly from git instead of from PyPI because it's better to use the latest version? Also, do you think it's okay to close #1796?

@kohya-ss kohya-ss mentioned this pull request Dec 1, 2024
@rockerBOO
Contributor Author

Thank you, this is great! Is the reason you install directly from git instead of from PyPI because it's better to use the latest version? Also, do you think it's okay to close #1796?

It wasn't on PyPI when I made this PR, so using the PyPI version now would be advisable. Closes #1796.

@StableLlama

It would be great if you also added documentation here about the best way to use it.
Linking from this PR to a different source will not transfer the relevant information to the users of this repository.

@rockerBOO
Contributor Author

It would be great if you also added documentation here about the best way to use it.

This PR unblocks ProdigyPlusScheduleFree, but I do not have a recommendation for how best to use it. If suggestions come in, I will update the original description so users can follow them.

Initially it behaves like Prodigy, so set the LR to 1.0 for the text encoders and UNet. It is schedule-free, so no LR scheduler is used or required.

--text_encoder_lr 1.0 --unet_lr 1.0

@kohya-ss kohya-ss mentioned this pull request Dec 2, 2024
@araleza

araleza commented Dec 3, 2024

Hey, I gave this a try, combined with Huber loss, and I wanted to report my results.

I have around 200 training images, I'm training a rank 52 / alpha 26 LoRA, and I'm using batch size 1 with no gradient accumulation. I used your default settings as suggested (learning rate 1.0, I'm not training the text encoders).

The image quality was superb, and seemed to be learning slowly over time. But around step 1000, despite the images still being very high quality in appearance, I started getting the (incredibly annoying) grid artifact:

[image: sample output showing the grid artifact]

It appeared most strongly down the right hand side of the image, as it usually does when it appears. It's also mostly vertical lines rather than horizontal lines.

I might restart and set the learning rate to 5e-1 instead of 1e-0 (1.0).

By the way, I'm getting pretty suspicious of the fact that this artifact appears most strongly at one side of the image in particular, and is mostly made of vertical lines but not horizontal lines. Artifacts like this shouldn't affect vertical lines more than horizontal ones, and shouldn't be at one side particularly more than the other three. Maybe some sort of offset bug somewhere? If someone was able to hunt down the true model-level cause of this artifact, that would be valuable. I don't think it's just 'your learning rate is too high', as I think using a lower learning rate might just be masking the problem, rather than the cure for it.

Edit: One other artifact comes to mind: I've noticed that sample images (which I make with flux_minimal_inference.py) often have a slightly brighter line down the left side of the image. That brighter edge tends not to appear on the other three sides. It's not a single row of pixels; rather, it tends to form part of a brighter edge structure, implying that it appears early in the image generation steps and then gets incorporated into the surrounding pixels on the left side during further iterations.

@recris

@araleza

araleza commented Dec 3, 2024

Just in case that previous message of mine came across as a bit negative, the sample images I saw before the artifact showed up were very sharp and had lots of rich detail. Scene complexity is greatly increased, a sign of the network staying well balanced, with most attention keys activating in roughly the same proportions as in the untrained base model.

I don't think this grid artifact is due to any error that @rockerBOO has made; it's just a pre-existing issue. I think this optimizer might be very useful for raising image quality.

@araleza

araleza commented Dec 3, 2024

Yeah, even dropping the LR down to 0.3 still produces the vertical stripes near the right hand edge at around 1000 steps:

[image: sample output showing vertical stripes near the right edge]

But apart from that, the images are stunningly high quality and beautiful. 😭

Edit: Reading through the schedule-free Prodigy docs, I found this:

In some scenarios, it can be advantageous to freeze Prodigy's adaptive stepsize after a certain number of steps. This can be controlled via the prodigy_steps settings. It's been suggested that all Prodigy needs to do is achieve "escape velocity" in terms of finding a good LR, which it usually achieves after ~25% of training, though this is very dependent on batch size and epochs.

This setting can be particularly helpful when training diffusion models, which have very different gradient behaviour than what most optimisers are tuned for. Prodigy in particular will increase the LR forever if it is not stopped or capped in some way (usually via a decaying LR scheduler).

Given that "prodigy_steps=700" has been recommended as an optimizer parameter, and I see the stripes happen around around step 1000, I'm currently testing setting the prodigy_steps to a lower value, based on the theory that the LR has grown too high, and that's what might be assisting the stripes to appear.

@rockerBOO
Contributor Author

Given that "prodigy_steps=700" has been recommended as an optimizer parameter, and I see the stripes happen around around step 1000

I have removed the suggestion of prodigy_steps here, as it is entirely dependent on how your training behaves and how many steps it runs for. The module suggests maybe 25% (or 5-10%) of total steps, but it's a parameter to use if you want to stop the optimizer from continuing to adapt its d after that fraction of your training.

See the linked repo for the actual recommendations, as this PR only unblocks this optimizer. Any suggestions or improvements to the algorithms would be best directed to the developer.

@araleza

araleza commented Dec 3, 2024

Yeah, knocking the prodigy_steps down from 700 to 550 has gotten rid of that corruption for me. My total intended step count is 7400, so that's down from 9.5% to 7.4%.

Maybe it's lower for me than for most people, as most of my images use an alpha_mask of around 5%, with typically only small parts at 100% alpha.
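For illustration only (the 550 here is the value from the run described above, not a general recommendation):

--optimizer_args "prodigy_steps=550"
# 550 of 7400 total steps ≈ 7.4%; roughly 5-25% of total steps has been suggested above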

@araleza

araleza commented Dec 5, 2024

I looked at the Prodigy docs, rather than just the schedule-free version, and I found this:

[screenshot of the Prodigy README's recommended settings for diffusion models]
(From https://github.com/konstmish/prodigy/blob/main/README.md )

If people are experimenting with this optimizer, it might be worth trying use_bias_correction=True, as recommended there. Although I'm not certain whether these pages mean DiT when they say 'diffusion models', or if they specifically mean UNet-based diffusion. You could also try setting the weight decay value, as the default is zero.

You can set that value to true in the --optimizer_args.

Edit: One other thing that's working well for me now is pushing the LR higher than 1.0. I couldn't get the model to learn the fine detail of an object I was training, but it seems to be learning better now with LR = 2.0.

Edit2: The docs say:

We recommend using lr=1. (default) for all networks. If you want to force the method to estimate a smaller or larger learning rate, it is better to change the value of d_coef (1.0 by default). Values of d_coef above 1, such as 2 or 10, will force a larger estimate of the learning rate; set it to 0.5 or even 0.1 if you want a smaller learning rate.

so maybe experiment with d_coef rather than a non-1.0 LR.
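As a hedged sketch of those experiments (the values are illustrative, not recommendations; check the prodigy-plus-schedule-free README for the exact option names):

--optimizer_args "use_bias_correction=True" "weight_decay=0.01" "d_coef=0.5"
# use_bias_correction and weight_decay as suggested in the Prodigy README;
# d_coef below 1.0 asks for a smaller effective LR, above 1.0 for a larger one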

@deGENERATIVE-SQUAD

deGENERATIVE-SQUAD commented Dec 7, 2024

I'm very sorry, but I don't know much about code. Could you explain step by step how to implement this optimizer in sd-scripts and use its arguments in a batch file? All my attempts failed on both the dev and sd3 branches with ValueError: too many values to unpack.
UPD: The problem is with --optimizer_args when I pass it arguments. Without --optimizer_args the training process runs. How can I control the arguments without --optimizer_args?

@rockerBOO
Contributor Author

rockerBOO commented Dec 7, 2024

UPD: The problem is with --optimizer_args when I pass it arguments. Without --optimizer_args the training process runs. How can I control the arguments without --optimizer_args?

The arguments are passed like --optimizer_args "prodigy_steps=1000" "weight_decay=0.01", or --optimizer_args prodigy_steps=1000 weight_decay=0.01. Note that the key=value pairs are separated by spaces and must not contain spaces themselves; quoting each pair lets you include spaces in a value if you need to.
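For example, a hedged sketch of the optimizer-related part of a training command (combine with the rest of your arguments as usual):

--optimizer_type prodigyplus.ProdigyPlusScheduleFree --optimizer_args "prodigy_steps=1000" "weight_decay=0.01"
# each key=value pair is its own space-separated (optionally quoted) token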

@deGENERATIVE-SQUAD

deGENERATIVE-SQUAD commented Dec 7, 2024

The arguments are passed like --optimizer_args "prodigy_steps=1000" "weight_decay=0.01", or --optimizer_args prodigy_steps=1000 weight_decay=0.01. Note that the key=value pairs are separated by spaces and must not contain spaces themselves; quoting each pair lets you include spaces in a value if you need to.

Thank you for the answer, I will check later! For the future, how can I find out from the optimizer's settings how to write the arguments the right way?

Btw, could you tell me how to make the original ADOPT optimizer from https://github.com/iShohei220/adopt work with sd-scripts? I can see that ADOPT can be enabled in the ProdigyPlusScheduleFree settings, but I'd like to try the original as well.

UPD: By the way, the training run with the default SF settings is finished and the result is flawless (perfect similarity at 700 steps out of 3000), but there is slight overcooking in some generations in my particular use case (I use distillation LoRAs to speed up generation). I basically know how to fix it, but I would like to clarify what exactly will reduce the aggressiveness of training for the weights and the text encoder in SF: --lr_warmup_steps, a lowered --prior_loss_weight, a lowered d_coef, or something else? In my experience with the original Prodigy, using these particular arguments took away the overcooking entirely.

@rockerBOO
Contributor Author

Thank you for the answer, I will check later! For the future, how can I find out from the optimizer's settings how to write the arguments the right way?

See the train_network docs, though you may need to translate them.

Btw, could you tell me how to make the original ADOPT optimizer from https://github.com/iShohei220/adopt work with sd-scripts? I can see that ADOPT can be enabled in the ProdigyPlusScheduleFree settings, but I'd like to try the original as well.

See the linked repo for the argument you need to pass it.

@deGENERATIVE-SQUAD

deGENERATIVE-SQUAD commented Dec 7, 2024

See the train_network docs, though you may need to translate them.

Mm okay.

See the linked repo for the argument you need to pass it.

I don't understand how, that's the problem.

Can you clarify my UPD in my recent post also, please?

There is also a question: in the original Prodigy you need to use decouple=True use_bias_correction=True safeguard_warmup=True; does that work with SF? Well, except for bias correction, which already exists in SF but is turned off by default.

@deGENERATIVE-SQUAD

And the last question for now: can I lower the separate LR for the text encoder, or is that not important / does it make no sense when using SF?
