Replies: 3 comments 4 replies
-
RTX 4070 Ti. For SDXL my fastest result is 2.73 s/it (768x768, Network Dim 64, Alpha 32), and 1.53 it/s for SD. As I understand it, the situation will not change until sd-scripts supports cuDNN 9.
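If you want to see which cuDNN build your PyTorch install is actually using, a minimal sketch (run it inside the same virtual environment that Kohya SS / sd-scripts uses):

```python
# Sketch: report the CUDA/cuDNN versions the installed PyTorch was built against.
import torch

print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())  # e.g. 8902 means cuDNN 8.9.2
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```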
-
The effective throughput for batch size 5 at 2.5 s/it is 2 it/s when normalized to batch size 1 (5 images / 2.5 s). Adafactor is slower than AdamW.
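To compare numbers across batch sizes, the normalization is just this arithmetic (a small sketch; the helper name is mine, not from the thread):

```python
# Sketch: convert seconds-per-iteration at a given batch size
# into effective iterations-per-second normalized to batch size 1.
def effective_it_per_s(sec_per_it: float, batch_size: int) -> float:
    return batch_size / sec_per_it

print(effective_it_per_s(2.5, 5))   # 2.0 it/s, as stated above
print(effective_it_per_s(2.73, 1))  # ~0.37 it/s for the 4070 Ti SDXL figure
```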
-
It's because you're trying to use this: scale_parameter=False, relative_step=False, warmup_init=False
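For reference, those flags correspond to keyword arguments of the Hugging Face transformers Adafactor implementation. A minimal sketch of what they mean when the optimizer is constructed directly (the model and learning rate here are placeholders, not values from this thread):

```python
# Sketch: Adafactor with the flags discussed above.
# With relative_step=False an explicit lr must be supplied.
import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(8, 8)  # placeholder for the network being trained
optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,                # illustrative; required because relative_step=False
    scale_parameter=False,  # no per-parameter LR scaling
    relative_step=False,    # use the fixed lr above instead of the built-in schedule
    warmup_init=False,      # no warmup-based initial step size
)
```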
-
Hi! I'm new to the party. Yesterday I was finally able to run Kohya SS on Win11 for the first time and trained some models. The speed I saw was no better than 2.30-2.50 s/it (SDXL training, batch size 5), which, from what I've googled, is slower than a 3090. But in all the discussions I've found there is no clear guide or solution. On Reddit some people claim to get speeds measured in it/s, while others say they get 1.5 s/it on their 3090... Please share your speeds and your experience: what can be optimized on a 4090? I know the best option is to move to Linux/Ubuntu, but I'll leave that for later.
An old driver? Which one?
Triton?
Xformers or SDPA?
Which optimizer?
Bitsandbytes?
System > Display > Graphics > Hardware-accelerated GPU scheduling?
Someone advises scale_parameter=False, relative_step=False, warmup_init=False?
Disabling gradient checkpointing?
Disabling bucketing?
What are some things to try? I really don't want to jump between operating systems.
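Not an answer, but a quick environment check covering several of the items above can be done with a small sketch like this, run in the Kohya SS virtual environment (package names are the usual PyPI ones; this only reports what is installed, it doesn't change anything):

```python
# Sketch: report torch/CUDA/cuDNN versions, SDPA availability,
# and whether xformers / bitsandbytes / triton are installed.
import importlib.util
import torch

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda,
      "| cuDNN:", torch.backends.cudnn.version())
print("SDPA available:", hasattr(torch.nn.functional, "scaled_dot_product_attention"))
for pkg in ("xformers", "bitsandbytes", "triton"):
    print(f"{pkg} installed:", importlib.util.find_spec(pkg) is not None)
```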