Preparing a Dataset
All of this advice is intended for Kohya's sd-scripts. If you are training LoRAs through a different program or setup, you will have to adapt the advice provided here. Also note that much of it assumes you are training a character or object rather than a more abstract concept like a style, pose, or trait, unless stated otherwise; adapt it to the specific concept you intend to train.
- A character dataset should have variety in expressions, poses, backgrounds, and styles; the only consistent part should be the concept or character you are training.
- A style dataset should have variety in everything except the style.
- At least 5 pictures are recommended, but more is usually better.
- It is recommended you use a dedicated program for doing your captions. starik222's Booru Dataset Tag Manager is good for this.
- Ensure that every image has a .txt file of the same name. (If you are using a program like above, this should be done automatically.)
- The first tag of all caption files should be your 'instance token', usually the name of your concept.
- Each caption file must contain a description of the associated image, using terms that the model used for training understands.
- Characters, poses, objects, etc.: Always tag what changes between images. More complex concepts should have their complex elements tagged. For whether to tag anything else about the concept itself, see To Caption, or Not to Caption?
- Styles: Caption everything in every image, OR caption nothing.
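The caption checklist above can be spot-checked with a short script. This is a minimal sketch, not part of sd-scripts: the function name `find_caption_problems`, the `IMAGE_EXTENSIONS` set, and the flat folder layout are all assumptions for illustration. It only applies to datasets that are meant to have captions.

```python
from pathlib import Path

# Assumption: these are the image extensions present in your dataset.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def find_caption_problems(dataset_dir, instance_token):
    """Return (images without a caption file, captions not starting with the token)."""
    missing, wrong_first_tag = [], []
    for image in sorted(Path(dataset_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        # The caption must share the image's name, with a .txt extension.
        caption = image.with_suffix(".txt")
        if not caption.is_file():
            missing.append(image.name)
            continue
        # The first comma-separated tag should be the instance token.
        first_tag = caption.read_text(encoding="utf-8").split(",")[0].strip()
        if first_tag != instance_token:
            wrong_first_tag.append(caption.name)
    return missing, wrong_first_tag
```

Calling `find_caption_problems("path/to/folder", "mydog")` returns two lists of file names to fix by hand; the path and token here are placeholders.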
This is usually done for images that are almost exact duplicates. This includes alternate variants of an image that change only something slight, like a single tattoo, or two images that are basically the same apart from eye colour. Near-duplicate images in a dataset can cause the model to overtrain on those images and memorize them instead of learning the concept.
If your dataset is large, you can find duplicates with dupeGuru: https://dupeguru.voltaicideas.net/
You can deal with duplicate images by:
- Just removing them
- Putting them in a separate folder with fewer repeats than your other images.
- Cropping or editing the duplicates to make them new images.
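As a rough first pass before reaching for a dedicated tool, byte-identical copies can be found with the standard library alone. Note the limitation: this sketch hashes file contents, so it only catches exact duplicates (for example, the same file saved twice under different names), not the near-duplicates described above. The function name `find_exact_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(dataset_dir):
    """Group image files whose bytes are identical; returns lists of file names."""
    by_hash = defaultdict(list)
    for path in sorted(Path(dataset_dir).iterdir()):
        # Skip caption files so identical captions are not reported.
        if path.is_file() and path.suffix.lower() != ".txt":
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path.name)
    # Keep only hashes shared by more than one file.
    return [names for names in by_hash.values() if len(names) > 1]
```

Each returned group is a set of files where all but one can be removed, moved to a lower-repeat folder, or edited, as listed above.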
You should look through your dataset and remove images that negatively impact training. For characters, make sure the character is alone for the majority of images in your dataset (occasional exceptions for duos and if you know what you're doing), isn't cropped strangely, and isn't upside-down or in an overly strange pose. If your character appears multiple times in the same image, consider cutting the image apart and turning it into more than one image.
Compression artefacts can interfere with LoRA training by making details look muddy or blurry, and can sometimes show up in your LoRA's generation results. Potential solutions are as follows:
- A: Find the original version of the image, as it likely has less compression.
- B: Run the image through an AI denoiser if the compression isn't too extreme.
- C: Put the image through Stable Diffusion img2img, with 0.1 - 0.3 denoising strength.
- D: Reduce the image to a low res (if it isn't already), then do "Low Resolution", solution B.
AI has a tendency to focus on anything it finds mildly interesting, which includes anything not part of the 'environment' of a particular image. This can include speech bubbles, text, overlays, a border, or anything else not actually part of the environment depicted in the image. Just one image with an overlaying element may be enough for the LoRA to start generating that element when it isn't wanted or relevant. This section may not be relevant to you if you intend to gen any of these things, but otherwise, they are usually worth removing from any dataset. Potential solutions are as follows:
- A: Find an existing alt/edit of the image where any overlaying elements are not present.
- B: Edit out the overlaying element yourself, either through Photoshop or inpainting. If inpainting, then prompt for what's supposed to be there and either:
  - i. Set masked content to `Latent Noise` (or equivalent), with 1.0 denoising strength.
  - ii. Sketch over the overlaying element in a painting program of choice, then set masked content to `Original` (or equivalent), with 0.5 - 0.8 denoising strength.
Make sure images are equal to or larger than your training resolution. If they are smaller, they will still be trained on, but will cause the LoRA to generate blurrier images. Potential solutions are as follows:
- A. Find the original version of the image; you may have saved a downscaled repost.
- B. Run the image through an AI upscaler. My personal recommendation is R-ESRGAN 4x+, but feel free to use whatever upscaler you are comfortable with. This can double as a way of cleaning up compression artefacts.
Let's say you are gathering images for your dataset, and you notice some inconsistencies in how different images depict your concept. For characters, sometimes there are 3 piercings on one ear, sometimes 2. Sometimes a tattoo is one shape, sometimes another. Sometimes there's inner ear fluff, sometimes there isn't. For styles, the lines could be clean, or messy. Shading could be cel, or smooth. You should check that important details remain consistent across all images, because this kind of inconsistency can impair the learning of your character or concept. If you notice an unwanted inconsistency, you can:
- A. Prune dataset to focus on the most consistent images.
- B. Edit inconsistencies with your tool of choice. If unsure, inpainting can do well.
- C. Simply caption the difference; sometimes, this can be enough.
You can improve your dataset by making new images out of the ones you already have. Examples below are all for characters specifically; if you are not training a character, adapt the advice below to your case.
- Multiple Crops: If you have a full body image of your character, you can crop the image into various crops of your character (e.g., headshot, half-body portrait, bust portrait, butt shot). Even though they come from the same image, they will improve your LoRA almost as much as a new image.
- Isolating: If the character you are training is in a group, you should crop to just the character you wish to train (unless you are training the LoRA to be group-resistant). You can either do this in an image editing program of your choice, or through A1111's SAM extension.
- Tilt Correction: If the character appears to be tilted relative to the camera, you could tilt the image itself so the character appears upright.
- Synthetic Datasets: If you have a bad dataset you can try making a LoRA anyway, then try to generate more images using that LoRA to add to that dataset to make a better one. Some people have managed to make LoRAs this way from just one image.
Class images, also known as regularization images or prior-preservation images, are additional images of your character's "class" that you can add to your dataset. The class of your character is likely to be "human", "anthro", or "feral". The class images would be a varied assortment of images in that class.
When using class images, a large number of them is recommended. For example, if you have 50 training images, you should make at least 200 class images. Make sure to set the repeats on your training images so that repeats times training images matches the number of class images you made. (For the previous example of 50 images and 200 class images, you would set the repeats of your training images to 4.)
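The repeat count in that example is just the class image count divided by the training image count. A minimal sketch of the arithmetic follows; the function name `training_repeats` is hypothetical, and rounding up when the numbers do not divide evenly is an assumption, not something the guidance above specifies.

```python
import math


def training_repeats(num_training_images, num_class_images):
    """Repeats so that repeats * training images roughly matches the class images."""
    # Assumption: round up when the division is not exact, and never go below 1.
    return max(1, math.ceil(num_class_images / num_training_images))


# The example from the text: 50 training images and 200 class images.
print(training_repeats(50, 200))  # 4
```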
Class images are completely optional and can be safely ignored if you are new to LoRA training. Class images help guide your LoRA to the specified class and prevent it from learning concepts unrelated to your target. If your LoRA is inflexible in style or posing, class images may help alleviate that.
You can create class images by generating a variety of images in the model you are training on. You can also download a bunch of class images from any other source (e.g., stock photos, image boorus, class images made by others).
- You must turn off flip augmentation (`--flip_aug` in sd-scripts, `$flip_aug` in use-me.ps1) if your concept is asymmetrical or your captions are directional (e.g., `facing left`, `facing right`).
The essential structure of LoRA captions is usually this:
[instance], [all other tags]
As long as the instance token is first, the order of the remaining tags doesn't matter for training reliability, as they get shuffled anyway. However, sorting the tags in your captions can make dataset curation easier.
Here's an idea for sorting captions, particularly for a character:
[instance], [species], [body], [clothes], [action], [environment], [composition]
- `[species]` examples: `dog`, `anthro`, `canid`, `canine`, `mammal`
- `[body]` examples: `muscular`, `paws`, `white fur`, `long tail`, `green eyes`, `countershading`
- `[clothes]` examples: `blue shirt`, `black hat`, `black shorts`, `boots`, `piercing`
- `[action]` examples: `sitting`, `neutral expression`, `squinting`, `tongue out`
- `[environment]` examples: `detailed background`, `ocean`, `sunset`, `cloud`
- `[composition]` examples: `full-length portrait`, `backlighting`, `dutch angle`
Whenever you want to add a new tag to a caption, finding where it belongs should be easy. For a character, it is also easy to re-use your `[species]`, `[body]`, and `[clothes]` tags across your entire dataset, provided they stay the same in every image. Bear in mind that species, body, or clothes can each differ in some images.
Another sorting method could be to sort by alphabetical order:
[instance], a tag, ba tag, bb tag, bc tag, c tag, x tag, y tag
If your knowledge of your model's captions is extensive, then the above could also be suitable.
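Alphabetical sorting is easy to automate. The following is a minimal sketch that keeps the instance token first and sorts the remaining tags; the function name `sort_caption` is hypothetical, and it assumes captions are comma-separated tags as in the structure shown earlier.

```python
def sort_caption(caption):
    """Keep the first tag (the instance token) in place and sort the rest A-Z."""
    tags = [tag.strip() for tag in caption.split(",") if tag.strip()]
    if not tags:
        return ""
    instance, rest = tags[0], tags[1:]
    # Case-insensitive sort so capitalised tags do not jump to the front.
    return ", ".join([instance] + sorted(rest, key=str.lower))


print(sort_caption("mydog, white fur, dog, sitting, anthro"))
# mydog, anthro, dog, sitting, white fur
```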
There are two major approaches to captioning datasets. The first approach is to describe absolutely everything in every picture, including details relevant to the concept itself. The second approach is to describe only what changes between pictures, with no description of the concept itself. An example of the second approach would be a green dog character standing in a city, sitting in a school, and lying down in a bedroom. You would tag standing in the city, sitting in a school, and lying down in a bedroom, but you would not tag the body being green or the species being dog. There are advantages and disadvantages to both approaches:
- First approach (caption everything):
  - Pros:
    - Works best when things need to be specific, like if a character has some complex design or element(s).
    - Always done for styles
    - LoRA is more flexible (e.g., describing the aforementioned green dog as having a green body and being a dog allows the use of other body colours or species more easily)
  - Cons:
    - In order to generate the concept, all common words used to describe the concept are required. (Irrelevant with styles)
- Second approach (caption only what changes):
  - Pros:
    - Works best for simpler concepts, like some characters.
    - For when the concept itself is not intended to be changed (e.g., you will always intend to generate a green dog and nothing else)
    - Some specifics work more reliably (e.g., the green dog's specific shade of green will show up more reliably as opposed to explicitly tagging it as green)
  - Cons:
    - More rigid/inflexible. If the green dog has some specific outfit that you don't caption, it will be harder to get rid of that outfit.
    - May have more difficulty training reliably.
Which approach is best? Ultimately, it depends on the case. Training artstyles usually requires captioning everything (or, counter-intuitively, captioning nothing at all). For anything else, one rule worth considering is "if you caption it, you can change it". In other words, there's nothing stopping you from taking a mixed approach: caption everything that should be able to change (even if it happens to be the same in all images), and don't caption what would absolutely never change. It's up to you and your testing to find out what works best for you.
Captioning can involve a lot of hard work and refinement, and is often one of the most tedious parts of the entire process. There is a way to avoid captioning altogether that can work if your character or concept is simple and you have a large enough dataset. If there are no caption files in your dataset folder, the name of the folder is used as the caption.
- Ensure there are no text files in your dataset.
- Rename your folder(s) in the following format (leave out the class tag(s) if you aren't using class images): `#_instance, class`. For example: `1_dachshund, dog`
- If you have class images, name those folders in the following format (same as above, but without your instance token): `#_class`. For example: `1_dog`
That's all there is to it.
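The naming format is simple enough to build by hand, but a small helper avoids typos when there are many folders. This is only a sketch of the string format described above; the function name `dataset_folder_name` is hypothetical and it does not create or rename anything on disk.

```python
def dataset_folder_name(repeats, instance=None, class_tag=None):
    """Build a folder name such as '1_dachshund, dog' or '1_dog'."""
    # Keep whichever of the instance token and class tag were provided, in that order.
    parts = [part for part in (instance, class_tag) if part]
    if not parts:
        raise ValueError("need an instance token, a class tag, or both")
    return f"{repeats}_{', '.join(parts)}"


print(dataset_folder_name(1, "dachshund", "dog"))  # 1_dachshund, dog
print(dataset_folder_name(1, class_tag="dog"))     # 1_dog
```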
Some aspects of preparing a dataset involve more than just the images and captions directly.
There may be a case where you want to train two (or more) concepts into a LoRA (also known as multi-concept training), but the number of images you have for your different concepts is imbalanced, resulting in the trained LoRA favouring one concept over another. You should always have the training put the same number of steps into each concept you want to train into your LoRA. We do this through "repeats", which are the numbers at the start of your concept folders' names.
For example, you have three dog breeds you want to train into a single LoRA: "Borzoi", "Pug", and "Akita".
For the images, you must ensure that no folder has more than double the number of images of any other folder. For example, Borzoi has 50 images while Pug has 10 images. Borzoi has 5 times as many images as Pug, so prune some images from Borzoi and/or add images to Pug. (If pruning, prioritise pruning problematic images in your dataset.)
Let's say you ended up with this:
- Borzoi = 43 images (34% more than Pug, 16% fewer than Akita)
- Pug = 32 images (26% fewer than Borzoi, 37% fewer than Akita)
- Akita = 51 images (19% more than Borzoi, 59% more than Pug)
The training time is going to favour Akita over the other two breeds, meaning that attempting to gen borzois may gen akita-borzoi hybrids instead, while attempting to gen pugs may gen borzoi-akita-pug hybrids instead. Since there are 3 concepts, we need one third of each epoch dedicated to each breed. Your dataset folders may look something like this to begin with:
1_borzoi
1_pug
1_akita
Balancing datasets can be done through the following steps, with examples:
1. Put both image counts in a ratio.
   (26:31, concept 'xylophone' and concept 'yacht')
2. Slightly adjust one or both integers so they're divisible by the same factor.
   (26:31 ≈ 26:32)
3. Simplify the ratio by dividing the two numbers by the chosen factor.
   (26:32 / 2 == 13:16)
4. Repeat steps 2 and 3 until satisfied.
   (13:16 ≈ 12:16, /4 = 3:4)
5. Flip the ratio, so that the concept with fewer images gets the larger number of repeats.
   (3:4 -> 4:3, 4_xylophone and 3_yacht)
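For two concepts, the same idea can be approximated in code. The sketch below does not follow the manual steps literally: it swaps in `Fraction.limit_denominator` from Python's standard library to find a nearby ratio of small integers, then flips it. The function name `two_concept_repeats` and the default cap of 4 are assumptions, and it assumes neither folder is more than double the size of the other, as recommended earlier.

```python
from fractions import Fraction


def two_concept_repeats(count_a, count_b, max_repeats=4):
    """Return (repeats_a, repeats_b) for two folders with the given image counts."""
    # Approximate count_a:count_b by a ratio whose second number is at most max_repeats.
    approx = Fraction(count_a, count_b).limit_denominator(max_repeats)
    # Flip the ratio: the folder with more images receives fewer repeats.
    return approx.denominator, approx.numerator


print(two_concept_repeats(26, 31))  # (4, 3) -> 4_xylophone and 3_yacht
```

Raising `max_repeats` allows a closer match at the cost of more steps per epoch, so it is worth checking the result by multiplying each count by its repeats.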
Let's focus on balancing the Borzoi and Pug folders first with the above process:
43:32 ≈ 40:30 = 4:3
Final repeats: [3, 4]
Proof they have same number of steps:
[3, 4] * [43, 32] = [129, 128]
129/128 ≈ 1.008 ≈ 1
So now we have this:
3_borzoi
4_pug
1_akita
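Now let's balance Borzoi and Pug against Akita. The process is similar to before, but we compare the Borzoi folder multiplied by its repeats against the Akita folder. Flipping the resulting 5:2 ratio means the existing Borzoi and Pug repeats are multiplied by 2, while Akita gets 5 repeats.
borzoi:akita
(3*43):51 = 129:51 ≈ 130:52
130:52, /13 = 10:4 = 5:2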
2 * [3, 4] = [6, 8]
Final repeats: [6, 8, 5]
Proof they have same number of steps:
[6, 8, 5] * [43, 32, 51] = [258, 256, 255]
258/255 ≈ 1.012 ≈ 1
256/255 ≈ 1.004 ≈ 1
After all that, we are left with this:
6_borzoi
8_pug
5_akita
Note that the bigger the folder, the lower the number of repeats, and vice versa.
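The manual process above can be cross-checked with a brute-force search. This is a sketch of a different technique, not the ratio method itself: it tries every combination of repeats up to a cap and keeps the one whose steps per concept are closest to equal. The function names and the default cap of 8 are assumptions for illustration.

```python
from itertools import product


def steps_per_epoch(image_counts, repeats):
    """Images multiplied by repeats, per concept folder."""
    return [count * rep for count, rep in zip(image_counts, repeats)]


def imbalance(image_counts, repeats):
    """Ratio of the most-trained to the least-trained concept (1.0 is perfect)."""
    steps = steps_per_epoch(image_counts, repeats)
    return max(steps) / min(steps)


def balance_repeats(image_counts, max_repeats=8):
    """Try every repeats combination up to max_repeats and keep the most even one."""
    candidates = product(range(1, max_repeats + 1), repeat=len(image_counts))
    # Prefer the lowest imbalance; break ties with the smallest total repeats.
    return min(candidates, key=lambda reps: (imbalance(image_counts, reps), sum(reps)))


counts = [43, 32, 51]  # Borzoi, Pug, Akita
print(balance_repeats(counts))             # (6, 8, 5)
print(steps_per_epoch(counts, [6, 8, 5]))  # [258, 256, 255]
```

The search space grows quickly with the number of concepts and the cap, so this is only practical for a handful of folders.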