What are the purposes of axes_factor, ignore_factor_on_trunc, unlimited_area_hack? #4
-
Good catch, I have not documented those vars yet because axes_factor can be a bit finicky outside of the values 1 and 2, and unlimited_area_hack drastically increases VRAM usage when its intended use case comes up. I'll document them "officially" in the README at a later date; I've got a lot on my plate in terms of features and bug fixes at the moment. I'll be very verbose in my responses here, just so I have it in writing.

For axes_factor and ignore_factor_on_trunc: yep, they control how the input is rearranged in groupnorm_mm_forward. The reason it's finicky (and why ignore_factor_on_trunc exists) is that ComfyUI has VRAM optimizations that kick in at certain resolution/batch_size combinations, which (I think) cause it to not add the uncond portion of the latents, cutting the expected latent chunks in half. Example: at something like 512x512 with batch_size 16, the latent's first dimension is 32, while at 1024x1024 with batch_size 16 the first dimension is 16, with all the other dimensions unchanged. Because it gets cut in half, if the code were to keep using axes_factor 2 internally, the image seems to lose fidelity (from my experimentation) with the way it's normalized. So, when the code detects that the expected latent chunks are not twice the expected video frame count, ignore_factor_on_trunc lets it use a value of 1 instead of the default 2 (or whatever value was chosen), to keep the result more consistent with what the user would expect.

And funnily enough, this halving behavior means that when ignore_factor_on_trunc is False, if you were to select, say, a batch_size of 15 at a higher resolution (or have a second upscaling pass that pushes the resolution past the optimization threshold), the axes_factor of 2 that normally works would attempt to rearrange the tensor into two halves, but it can't: instead of getting the latent in 30 chunks, it gets 15, which isn't divisible by 2.
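The rearrange-and-fallback behavior described above can be sketched roughly like this. This is a hypothetical NumPy illustration, not the actual repo code (the real groupnorm_mm_forward operates on torch tensors and applies group norm in the middle); the function name and shapes are my own for demonstration:

```python
import numpy as np

def groupnorm_mm_forward_sketch(latent, video_length, axes_factor=2,
                                ignore_factor_on_trunc=True):
    """Illustrative sketch of the axes_factor rearrange logic (not real code)."""
    factor = axes_factor
    if latent.shape[0] != axes_factor * video_length and ignore_factor_on_trunc:
        # ComfyUI's VRAM optimization halved the batch (uncond not added),
        # so fall back to a factor of 1 instead of the chosen axes_factor.
        factor = 1
    b, c, h, w = latent.shape
    # "(g f) c h w -> g c f h w" style rearrange (einops notation), via numpy:
    x = latent.reshape(factor, video_length, c, h, w).transpose(0, 2, 1, 3, 4)
    # ... normalization over the grouped layout would happen here ...
    # rearrange back to the original "(g f) c h w" layout
    return x.transpose(0, 2, 1, 3, 4).reshape(b, c, h, w)
```

With ignore_factor_on_trunc=False, a halved latent (e.g. the 15-chunks-instead-of-30 case above) makes the reshape fail, because 15 chunks can't be split into 2 groups of 15 frames.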
This optimization behavior is actually the original reason why the original ComfyUI AnimateDiff repo had the issue where too high a resolution would cause the animation to be "cut in half" and rendered as two separate groups of latents: the code in motion_module.py originally derived the expected video_frame length from half the latent chunks. And yep, it would also throw an error with an odd number of frames when the optimization kicked in.

As for unlimited_area_hack, it overrides the maximum_batch_area function, which is what determines when that halving optimization kicks in during sampling. The override makes that function return the maximum integer supported by Python 3 rather than doing any actual math. As expected, it pretty much doubles VRAM requirements when sampling at resolutions/batch_sizes that would trigger the optimization. From some subjective tests, it honestly does not affect the output significantly at the resolutions that trigger the optimization (at least on my machine, but I have 24GB of VRAM). So unlimited_area_hack can be kept off unless the image is getting chunked in half at resolutions low enough for it to be noticeable, though it's hard to even know when that happens unless I add code to log it. It's kind of the opposite of an optimization when set to True.
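The override described here amounts to something very simple. A hedged sketch of its shape (the function name comes from the description above; the body, including the placeholder limit, is my guess at the idea, not the actual ComfyUI code):

```python
import sys

def maximum_batch_area(unlimited_area_hack=False):
    """Illustrative sketch of the unlimited_area_hack override (not real code)."""
    if unlimited_area_hack:
        # Return the largest Python int so the area check never trips and the
        # latent batch is never halved, at the cost of roughly double VRAM.
        return sys.maxsize
    # ... the real function would compute a VRAM-based area limit here;
    # this placeholder value is for illustration only:
    return 512 * 512 * 16
```

When the area of resolution x batch_size stays under the returned limit, the halving optimization never kicks in, which is exactly the trade-off described above.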
-
Update: I talked with comfy and now have a better understanding of how to derive the appropriate axes_factor value, to the point where I can probably remove those variables entirely. unlimited_area_hack might also be unnecessary; I will be refactoring some code to hopefully get rid of a couple of issues as well.
-
From the code, it looks like axes_factor and ignore_factor_on_trunc are for rearranging the input in groupnorm_mm_forward, though I'm not sure what unlimited_area_hack is for (possibly for low-VRAM devices?).