Ideas for Training Text Encoder in SDXL #1354
setothegreat started this conversation in Ideas
Replies: 2 comments · 2 replies
-
I completely agree. Being able to train only the U-Net really handicaps SDXL LoRAs in comparison to SD 1.5 LoRAs, so I'm surprised this hasn't gotten any comments or gained more traction. What you suggested is the obvious solution to this problem, and it doesn't seem like it would be that complicated to implement, either. In fact, I went ahead and submitted this as a feature request, so hopefully that will get some attention.
-
Hah. And just today I thought I'd had the same idea.
-
Feel free to correct anything I say if it's wrong or wouldn't work, as most of this is just based on my understanding as a layman, drawn from second-hand sources.
It seems like the issue with training the text encoder as described in this project's README is that SDXL has two separate CLIP models that work fundamentally differently from one another. From my understanding, the first CLIP model is a natural-language encoder, whereas the second is a tag-based encoder.
As an example, if the prompt for the first CLIP model was
"A photograph of a man smiling in a park wearing a white hat during the day time"
Then the second CLIP prompt should be something like
"photograph, realistic, man, smiling, park, white hat, sunny, day"
If this is indeed the case, then it seems like the solution would be to train both encoders by, as a hypothetical example, having the first line of an image's accompanying text file be the prompt the first CLIP model is trained on, and having the second line be the list of tags the second CLIP model is trained on.
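To make the idea a bit more concrete, here's a rough sketch of how such a two-line caption file might be parsed and tokenized for the two encoders. The caption layout and the `load_dual_caption()` helper are made up for illustration; nothing like this exists in the repo as far as I know:

```python
# Hypothetical sketch of the two-line caption idea above; the caption layout
# and the load_dual_caption() helper are assumptions, not an existing feature.
from pathlib import Path
from transformers import CLIPTokenizer

# SDXL ships two tokenizers (diffusers layout: "tokenizer" and "tokenizer_2").
tokenizer_1 = CLIPTokenizer.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer"
)
tokenizer_2 = CLIPTokenizer.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer_2"
)

def load_dual_caption(caption_path: Path) -> tuple[str, str]:
    """Line 1 = natural-language caption, line 2 = tag list (hypothetical format)."""
    lines = caption_path.read_text(encoding="utf-8").splitlines()
    caption = lines[0].strip()
    tags = lines[1].strip() if len(lines) > 1 else caption  # fall back to line 1
    return caption, tags

caption, tags = load_dual_caption(Path("0001.txt"))

# Each line is tokenized for "its" encoder; during LoRA/DreamBooth training the
# first set of ids would feed text_encoder and the second would feed text_encoder_2.
ids_1 = tokenizer_1(caption, padding="max_length", truncation=True, return_tensors="pt").input_ids
ids_2 = tokenizer_2(tags, padding="max_length", truncation=True, return_tensors="pt").input_ids
```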
Again, I am a layman, but is there any reason why a solution like this couldn't be implemented for DreamBooth or LoRA training on SDXL? Without proper text encoder training, the models just seem somewhat handicapped in a lot of scenarios.