
Preparing a Dataset


All of this advice is intended for Kohya's sd-scripts. If you are training LoRAs through a different program or setup, you will have to adapt the advice provided here accordingly.

Absolute Basics:

Images:

  • All pictures need diversity in the things you want to be variable (e.g., training a character needs different expressions, poses, backgrounds, and styles, but a consistent visual design).
  • A minimum of 5 pictures is recommended, though more is usually better.

Captions:

  • It is recommended that you use a dedicated program for your captions. starik222's Booru Dataset Tag Manager is good for this.
  • Ensure that every image has a .txt file of the same name. (If you are using a program like above, this should be done automatically.)
  • The first tag of all caption files should be your 'instance token', usually the name of your concept.
  • Each caption file must describe the associated image, using terms that are understood by the model used for training.
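
To sanity-check these basics across a whole dataset, a short script can help. Below is a minimal sketch in Python (the "dataset" folder path and the instance token are placeholder assumptions; adjust them to your setup):

from pathlib import Path

DATASET = Path("dataset")      # assumed dataset folder
INSTANCE = "shepherdbreed"     # hypothetical instance token
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

for image in sorted(DATASET.iterdir()):
    if image.suffix.lower() not in IMAGE_EXTS:
        continue
    caption = image.with_suffix(".txt")
    if not caption.exists():
        print(f"{image.name}: missing caption file")
        continue
    tags = [t.strip() for t in caption.read_text(encoding="utf-8").split(",")]
    if tags[0] != INSTANCE:
        print(f"{caption.name}: first tag is '{tags[0]}', not '{INSTANCE}'")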

Images

Optimising Images

Your dataset will often contain images with features that can damage training reliability, whether that is compression artefacts, an overlaying element, low resolution, or an inconsistency, in what would otherwise be perfectly good images. In these situations, you can either prune these images from your dataset, OR you can clean them up to ensure your LoRA trains as reliably as it can. If your dataset is limited in size, then you may be forced to do the latter, in addition to finding as many images as you possibly can.

Compression Artefacts

Compression artefacts can interfere with LoRA training by making details look very muddy/blurry, and can sometimes show up in your LoRA's generation results. Potential solutions are as follows:

  • A: Find the original version of the image, as it likely has less compression.
  • B: Run the image through an AI denoiser if the compression isn't too extreme.
  • C: Put the image through Stable Diffusion img2img, with 0.1 - 0.3 denoising strength.
  • D: Reduce the image to a low resolution (if it isn't already), then apply solution B from the "Low Resolution" section below.

Overlaying Elements

AI has a tendency to focus on anything it finds mildly interesting, which includes anything not part of the 'environment' of a particular image. This can include speech bubbles, text, overlays, a border, or anything else not actually part of the environment depicted in the image. Just one image in a dataset with an overlaying element can make the LoRA generate that particular thing when it isn't wanted or relevant. This section may not be relevant to you if you intend to gen any of these things, but otherwise, they are usually worth removing from any dataset. Potential solutions are as follows:

  • A: Find an existing alt/edit of the image where any overlaying elements are not present.
  • B: Edit out the overlaying element yourself, either through Photoshop or inpainting. If inpainting, prompt for what's supposed to be there and either:
    • i. Set masked content to Latent Noise (or equivalent), with 1.0 denoising strength.
    • ii. Sketch over the overlaying element in a painting program of choice, then set masked content to Original (or equivalent), with 0.5 - 0.8 denoising strength.

Low Resolution

Self-explanatory. Small images have a lot of detail loss (similar to compression artefacts), which may necessitate upscaling them. Potential solutions are as follows:

  • A: Find the original version of the image; you may have saved a downscaled repost.
  • B: Run the image through an AI upscaler. My personal recommendation is R-ESRGAN 4x+, but feel free to use whatever upscaler you are comfortable with. This can double as a way of cleaning up compression artefacts.
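
To find which images are small enough to need attention, a quick scan can help. Here is a minimal sketch using Pillow (the "dataset" path and the 768-pixel threshold are assumptions; match the threshold to your training resolution):

from pathlib import Path
from PIL import Image  # pip install Pillow

MIN_SIDE = 768  # assumed threshold

for path in sorted(Path("dataset").iterdir()):
    if path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
        continue
    with Image.open(path) as im:
        if min(im.size) < MIN_SIDE:
            print(f"{path.name}: {im.width}x{im.height} may need upscaling")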

Slight Inconsistencies

Let's say you are training a character. The character's design seems consistent at a glance, but upon closer inspection, slight inconsistencies crop up: sometimes there are three piercings on one ear, sometimes two; sometimes a tattoo is one shape, then another; sometimes there's inner ear fluff, sometimes there isn't. You should ensure that details remain consistent across all images, unless such differences are intentional or entirely ignorable, for example eyes changing colour as part of an emotional response, surfaces appearing different due to the artstyle, or parts of a body glowing at different intensities in different images. If it's an unwarranted inconsistency, edit it out. Potential solutions are as follows:

  • A: Prune the dataset to focus on the most consistent images.
  • B: Edit inconsistencies with your tool of choice. If unsure, inpainting can do well.

Captions

Structuring Captions

The essential structure of LoRA captions is usually this:

[instance], [all other tags]

Organising the tags beyond the instance token is not necessary, nor does it have an effect on LoRA training reliability. However, sorting the tags in your captions can make dataset curation easier, since you become familiar with where to find each tag, which is much harder when captions are in a random order.

Here's one system for sorting captions that you could use:

[instance], [species], [body], [clothes], [action], [environment], [composition]
  • [instance]: Your instance token. Usually the name of your concept.
    Examples: shepherdbreed, ninjaturtle, by weirdstyle
  • [species]: A tag or two describing your species. (More relevant to furry models than other models.)
    Examples: dog, anthro, scalie, avian, mammal, human
  • [body]: Tags describing the body and its features.
    Examples: female, muscular, paws, white fur, wide hips, long tail, green eyes, countershading
  • [clothes]: Tags describing clothing and accessories.
    Examples: blue shirt, sports bra, black hat, black shorts, boots, piercing
  • [action]: Tags describing poses and expressions.
    Examples: sitting, neutral expression, squinting, tongue out
  • [environment]: Tags describing the setting.
    Examples: detailed background, ocean, sunset, cloud
  • [composition]: Tags describing framing and lighting.
    Examples: full-length portrait, backlighting, dutch angle

Whenever you want to add a new tag to your captions, locating where it belongs should be rather easy. If used for a character, it is also very easy to re-use your [species], [body], and [clothes] tags across your entire dataset (assuming they remain the same in all images; some images may have slight variation in any of these).

Another sorting method could be to sort by alphabetical order:

[instance], a tag, ba tag, bb tag, bc tag, c tag, x tag, y tag, z tag

If your knowledge of your model's tags is extensive, then the above could also be suitable. It also has the advantage that many tools out there can sort text in alphabetical order.
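
The sorting itself is easy to automate. Here is a minimal sketch that keeps the instance token pinned first and sorts every other tag alphabetically (the folder path and instance token are placeholder assumptions):

from pathlib import Path

DATASET = Path("dataset")   # assumed dataset folder
INSTANCE = "shepherdbreed"  # hypothetical instance token

for caption in DATASET.glob("*.txt"):
    text = caption.read_text(encoding="utf-8")
    tags = [t.strip() for t in text.split(",") if t.strip()]
    rest = sorted(t for t in tags if t != INSTANCE)  # alphabetical order
    caption.write_text(", ".join([INSTANCE] + rest), encoding="utf-8")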

The Lazy Way to Caption

Captioning can involve a lot of hard work and refinement, often being one of the most tedious parts of the entire process. This is warranted as it is perhaps the most important part of LoRA training, even more important than the images. However, even on small datasets, some people may lack the focus or determination it takes to do it all. For those people, there is a dubious method for captioning that you can try. Do not expect perfect results from this.

Before doing this, it is recommended (though not absolutely necessary) that you optimise your dataset images. Keep in mind that this captioning method cannot be used for style LoRAs; only for objects/characters. To use this method, do the following:

  1. Ensure there are no text files in your dataset.
  2. Rename your folder(s) in the following format (dachshund is used as your instance token in this example):
    #_instance, tag1, tag2, tag3
    (1_dachshund, dog, canid, brown body)
  3. If you have prior preservation images, name those folders in the following format (same as above, but without your instance token):
    #_tag1, tag2, tag3
    (1_dog, canid, brown body)

That's all there is to it.
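
If you have many folders to name, the format can be assembled programmatically. A minimal sketch (lazy_caption_folder is a made-up helper, not part of sd-scripts):

from pathlib import Path

def lazy_caption_folder(root, repeats, tags):
    # Create a Kohya-style folder like "1_dachshund, dog, canid, brown body".
    # For training images, the first tag should be your instance token;
    # for prior preservation images, leave the instance token out.
    folder = Path(root) / f"{repeats}_{', '.join(tags)}"
    folder.mkdir(parents=True, exist_ok=True)
    return folder

print(lazy_caption_folder("dataset", 1, ["dachshund", "dog", "canid", "brown body"]))
# dataset/1_dachshund, dog, canid, brown body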


Other Bits

Some aspects of preparing a dataset involve more than just the images and captions directly.

Repeats Balancing (Multi-Concept LoRAs Only)

There may be a case where you want to train two (or more) concepts into a LoRA (also known as multi-concept training), but the number of images you have between your different concepts is imbalanced, resulting in the trained LoRA favouring one thing over another. You should always have the training put the same number of steps into each concept you want to train into your LoRA. We do this through "repeats", which are the numbers next to your concept folders' names. (This is actually the intended purpose of repeats, though they often get abused to pad the number of steps when controlling training by epoch count instead of step count.)

Let's go through a worked example to show how repeats are used to balance multiple concepts. You have three dog breeds you want to train into a single LoRA: "Borzoi", "Pug", and "Akita".

For the images, you must ensure that no folder has more than double the number of images of any other folder. For example, if Borzoi has 50 images while Pug has 10 images, Borzoi has 5 times as many images as Pug, so prune some images from Borzoi and/or add images to Pug. (If pruning, prioritise pruning problematic images in your dataset.)

Let's say you ended up with this:

  • Borzoi = 43 images (34% more than Pug, 16% fewer than Akita)
  • Pug = 32 images (26% fewer than Borzoi, 37% fewer than Akita)
  • Akita = 51 images (19% more than Borzoi, 59% more than Pug)

The amount of training time is going to favour Akita over the other two breeds, meaning that attempting to gen borzois may gen akita-borzoi hybrids instead, while attempting to gen pugs may gen borzoi-akita-pug hybrids instead. Since there are 3 concepts, we need one third of each epoch dedicated to each breed. Your dataset folders may look something like this to begin with:

1_borzoi
1_pug
1_akita

Balancing datasets can be done through the following steps, with examples:

  1. Put both image counts in a ratio.
    (26:31, concept 'xylophone' and concept 'yacht')
  2. Slightly adjust one or both integers so they share a common factor.
    (26:31 ≈ 26:32)
  3. Simplify the ratio by dividing both numbers by that factor.
    (26:32, ÷2 = 13:16)
  4. Repeat steps 2 and 3 until satisfied.
    (13:16 ≈ 12:16, ÷4 = 3:4)
  5. Flip the ratio, so that each repeats value pairs with the opposite concept.
    (3:4 -> 4:3, giving 4_xylophone and 3_yacht)
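
The rounding in steps 2-4 can also be done in one call with Python's fractions module; a minimal sketch using the xylophone/yacht counts above (the denominator cap of 5 is an arbitrary choice):

from fractions import Fraction

xylophone, yacht = 26, 31  # image counts from the example above

# Steps 2-4: approximate 26:31 with a small-integer ratio.
ratio = Fraction(xylophone, yacht).limit_denominator(5)
print(ratio)  # 4/5

# Step 5: flip the ratio so each repeats value pairs with the opposite folder.
print(f"{ratio.denominator}_xylophone, {ratio.numerator}_yacht")  # 5_xylophone, 4_yacht

Note this lands on 5:4 rather than the hand-rounded 4:3; both are acceptable approximations (5:4 gives 130 vs. 124 steps, 4:3 gives 104 vs. 93).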

Let's focus on balancing the Borzoi and Pug folders first with the above process:

43:32 ≈ 40:30, ÷10 = 4:3
Flipped: 3:4
Final repeats: [borzoi, pug] = [3, 4]

Proof they have the same number of steps:
[3, 4] * [43, 32] = [129, 128]
129/128 ≈ 1.008 ≈ 1

So now we have this:

3_borzoi
4_pug
1_akita

Now let's balance Borzoi and Pug against Akita. The process is similar to before, except on the Borzoi side we use the image count multiplied by its repeats, and compare that against the Akita folder.

borzoi:akita
(3*43):51 = 129:51 ≈ 130:52
130:52, ÷13 = 10:4
Flipped: 4:10
4 * [3, 4] = [12, 16] (borzoi, pug); 10 * [1] = [10] (akita)

Since all three repeat values are divisible by 2, further simplification is possible:
[borzoi, pug, akita]
[12, 16, 10] / 2 = [6, 8, 5]
Final repeats: [6, 8, 5]

Proof they have the same number of steps:
[6, 8, 5] * [43, 32, 51] = [258, 256, 255]
258/255 ≈ 1.012 ≈ 1
256/255 ≈ 1.004 ≈ 1

After all that, we are left with this:

6_borzoi
8_pug
5_akita

Note that the bigger the folder, the lower the number of repeats, and vice versa.
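
The whole balancing procedure can also be automated by searching small scaling factors for the most even step counts. A minimal sketch (balance_repeats and its max_scale cap are illustrative assumptions, not part of sd-scripts):

from fractions import Fraction
from functools import reduce
from math import gcd

def balance_repeats(image_counts, max_scale=10):
    # Suggest per-folder repeats so that repeats * images is roughly equal
    # across folders. Ideal repeats are proportional to 1/images; we search
    # small integer approximations of those ideals.
    largest = max(image_counts.values())
    ideal = {name: Fraction(largest, n) for name, n in image_counts.items()}
    best = None
    for scale in range(1, max_scale + 1):
        repeats = {name: max(1, round(f * scale)) for name, f in ideal.items()}
        steps = [r * image_counts[name] for name, r in repeats.items()]
        imbalance = max(steps) / min(steps)
        if best is None or imbalance < best[0]:
            best = (imbalance, repeats)
    repeats = best[1]
    common = reduce(gcd, repeats.values())  # strip any shared factor
    return {name: r // common for name, r in repeats.items()}

print(balance_repeats({"borzoi": 43, "pug": 32, "akita": 51}))
# {'borzoi': 6, 'pug': 8, 'akita': 5} -> 258 / 256 / 255 steps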
