ArgentVASIMR edited this page Mar 15, 2024 · 6 revisions

Absolute Basics

  • Steps are how a LoRA is trained: every step changes the LoRA's weights slightly through a gradient update.
  • 2000 steps is generally considered a good starting point (at unet and text encoder LRs of 1e-4 and 5e-5 respectively); adjust from there based on your results.
  • Each step performs a "gradient update". The "gradient" is the direction that moves the LoRA from point A, where it knows the concept less, towards point B, where it knows the concept better, and the update nudges the LoRA a small distance in that direction. Repeat this for enough steps and it reaches the destination: your LoRA knowing the concept really well.
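The idea of repeated gradient updates can be sketched with a toy example. This is only an illustration, not how a real trainer is implemented: the "LoRA" here is a single number, and the loss function, `gradient` and `train` are hypothetical names made up for this sketch.

```python
def gradient(weight, target):
    # Gradient of the toy loss (weight - target)**2 with respect to weight.
    return 2.0 * (weight - target)


def train(weight, target, learning_rate, steps):
    # Each step nudges the weight a small distance against the gradient,
    # i.e. from "point A" towards "point B".
    for _ in range(steps):
        weight -= learning_rate * gradient(weight, target)
    return weight


start, target = 0.0, 1.0
after_10 = train(start, target, learning_rate=0.1, steps=10)
after_100 = train(start, target, learning_rate=0.1, steps=100)
print(abs(after_10 - target), abs(after_100 - target))
```

More steps leave the weight closer to the target, which is the same reason a longer run leaves a LoRA knowing its concept better.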

Warmup

If warmup is used, the learning rate(s) ramp up from 0 to your set learning rate(s) over a relatively small number of steps at the very start of the training run. This makes training more reliable.
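A minimal sketch of what a warmup schedule might look like. It assumes the ramp is linear and that the learning rate simply stays at the set value afterwards; real schedulers may use a different ramp shape or decay the rate later. `warmup_lr` is a hypothetical helper, not part of any training script.

```python
def warmup_lr(step, target_lr, warmup_steps):
    # Assumption: linear ramp from 0 to target_lr over warmup_steps,
    # then constant at target_lr.
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    return target_lr * step / warmup_steps


# Print the learning rate at a few points of a run with 100 warmup steps.
for step in (0, 25, 50, 100, 2000):
    print(step, warmup_lr(step, target_lr=1e-4, warmup_steps=100))
```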

Batch Size

Recall that the "gradient" is the change from point A to point B. Normally, a gradient is calculated from each image in the dataset. Gradients calculated from multiple images can be averaged together into a single gradient, which we call "gradient batching". We use this to avoid getting stuck in local minima, and it also makes training more reliable in general.
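As a rough illustration of gradient batching, the sketch below averages a few made-up per-image gradients into one. The numbers and the `average_gradients` helper are hypothetical; a real trainer does this on tensors with millions of values.

```python
def average_gradients(per_image_gradients):
    # Element-wise mean of several gradients of the same length.
    count = len(per_image_gradients)
    size = len(per_image_gradients[0])
    return [
        sum(grad[i] for grad in per_image_gradients) / count
        for i in range(size)
    ]


# Four made-up per-image gradients, each with two components.
grads = [[1.0, -2.0], [3.0, 0.0], [-1.0, 2.0], [1.0, 4.0]]
batched = average_gradients(grads)
print(batched)
```

The individual gradients disagree with each other, but the average is a single, steadier direction to step in.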

When the batch size is raised to n, n gradients are averaged together for each update step. Raising the batch size takes more VRAM, but has a negligible penalty on training speed. Contrary to popular belief, changing the batch size does not reduce quality; however, you must change the learning rate and training length to compensate. use-me.ps1 does this automatically, but with linear scaling, which may not be correct for your case, so feel free to set $scale_lr_batch to false and change the LR manually.
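A hedged sketch of what scaling the learning rate with batch size could look like. It assumes "linear scaling" means multiplying the base LR by the effective batch size; this has not been checked against use-me.ps1. The square-root rule is included only as a commonly discussed alternative to try when tuning by hand, and `scaled_learning_rate` is a hypothetical helper.

```python
import math


def scaled_learning_rate(base_lr, effective_batch_size, rule="linear"):
    # Assumption: "linear" multiplies the LR by the batch size.
    if rule == "linear":
        return base_lr * effective_batch_size
    # Alternative sometimes suggested for large batches: square-root scaling.
    if rule == "sqrt":
        return base_lr * math.sqrt(effective_batch_size)
    raise ValueError(f"unknown scaling rule: {rule}")


print(scaled_learning_rate(1e-4, 4))               # linear
print(scaled_learning_rate(1e-4, 4, rule="sqrt"))  # square root
```

If the linear result trains poorly at your batch size, a value between the two results is a reasonable place to start manual tuning.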

Additional notes:

  • A larger batch size also reduces the number of steps required to train a LoRA, so it can be used to make training faster. use-me.ps1 automatically divides the step count by your effective batch size.
  • Training may run more efficiently with batch sizes that are multiples of 8 (8, 16, 24, 32, etc.), because these match GPU tensor core layouts. Further explanation is beyond the scope of this wiki.
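The step count division mentioned above can be sketched as follows. Two things here are assumptions rather than confirmed behaviour of use-me.ps1: that "effective batch size" means batch size multiplied by gradient accumulation steps, and that the result is rounded to the nearest whole step. `scaled_step_count` is a hypothetical name.

```python
def scaled_step_count(base_steps, batch_size, grad_accum_steps=1):
    # Assumption: effective batch size = batch size * accumulation steps.
    effective_batch = batch_size * grad_accum_steps
    # Assumption: round to the nearest step, never going below 1.
    return max(1, round(base_steps / effective_batch))


print(scaled_step_count(2000, batch_size=4))
print(scaled_step_count(2000, batch_size=2, grad_accum_steps=4))
```

With a base of 2000 steps and a batch size of 4, the run takes 500 update steps but still sees 2000 images in total.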

Gradient Accumulation

Gradient accumulation is an alternative to batch size, used only when you do not have the VRAM for your desired batch size. Instead of processing multiple images at the same time, as batch size does, each image (and thus each gradient) is processed sequentially; the accumulated gradients are then averaged together and one update step is taken. This method has a significant speed penalty, because it takes n gradient accumulation steps to reach one actual update step. The upside is that raising it costs only a negligible amount of VRAM. Just like with batch size, however, you will need to re-tune your learning rate and training length to make the best use of this setting.
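To illustrate why accumulation can stand in for batch size, the toy sketch below takes one update step both ways and compares the results. The single-number "model", the samples and all function names are hypothetical; the point is only that averaging gradients one at a time gives the same update as averaging them all at once.

```python
def gradient(weight, sample):
    # Gradient of the toy loss (weight - sample)**2 with respect to weight.
    return 2.0 * (weight - sample)


def batched_step(weight, samples, lr):
    # All samples are handled together: more memory, one pass.
    avg = sum(gradient(weight, s) for s in samples) / len(samples)
    return weight - lr * avg


def accumulated_step(weight, samples, lr):
    # Samples are handled one at a time: less memory, n passes per update.
    accumulated = 0.0
    for s in samples:
        accumulated += gradient(weight, s) / len(samples)
    return weight - lr * accumulated


samples = [1.0, 2.0, 3.0, 6.0]
from_batch = batched_step(0.0, samples, lr=0.1)
from_accum = accumulated_step(0.0, samples, lr=0.1)
print(from_batch, from_accum)
```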
