[GPU] group normalization optimization to reduce 5 kernels to 3 kernels #28339
Conversation
The two extra kernels used to calculate variance can be eliminated by computing the variance as mean(pow(input_data, 2)) - pow(mean, 2). This avoids reading the input buffer twice and performing an almost identical calculation separately for the mean and the variance.
Could you explain how we can calculate the variance as pow(input_data, 2) - pow(mean, 2)?
To simplify the idea: first, compute the average of the square of all input data. After that, the variance can be computed as (average of square of all input data) - mean * mean.
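As an illustration, here is a minimal sketch (plain C, illustrative only and not the PR's kernel code) of that single-pass identity, variance = E[x^2] - (E[x])^2:

```c
#include <stddef.h>

/* Single pass over the data: accumulate the sum and the sum of squares,
 * then derive the variance as E[x^2] - (E[x])^2. */
static float mean_and_variance(const float *data, size_t count, float *variance) {
    float sum = 0.0f, sq_sum = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        sum += data[i];
        sq_sum += data[i] * data[i];   /* squares accumulated in the same pass */
    }
    float mean = sum / (float)count;
    *variance = sq_sum / (float)count - mean * mean;
    return mean;
}
```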
build_jenkins
Are you aware that RMS and GroupNorm are basically the same thing, and that RMS is just a specific case of GroupNorm where scale is 1.0 and bias is 0.0? RMS takes a different approach with just 1 kernel. Plus, they could actually both be merged into the same thing; compile-time optimization would just get rid of the 1.0 scale or 0.0 bias if needed.
LGTM but with minor issues
}
}
- #elif GROUP_NORM_KERNEL_GROUP_MEAN
+ #elif GROUP_NORM_KERNEL_GROUP_MEAN_VARIANCE
#if !IS_DYNAMIC
__attribute__((reqd_work_group_size(LWS0, LWS1, LWS2)))
#endif
KERNEL(calc_mean_per_group)(
I don't think this name is accurate anymore with the changes?
Now it calculates the mean and variance as well.
The same goes for the other file too.
src/plugins/intel_gpu/src/kernel_selector/cl_kernels/group_normalization_gpu_b_fs_yx_fsv16.cl
Also, have you tested the accuracy? Though mathematically this should be the same, the order of the floating-point operations is different. Really, it looks like the win here is on memory, since you don't have to read/write buffers many times, but I wonder whether the RMS approach with just 1 kernel would actually be superior or not.
No, I'm not aware of this. I will see in the next PR whether this can be used for Group Normalization.
I compared the vae_decoder model output before and after the change. There is no difference in the floating-point values. Yes, using one kernel will certainly be superior in performance. I will look into this in the next PR.
Oh, sorry to mislead you, but RMSNorm actually does not need to compute the mean, just the RMS, which means it's not really the same. Sorry, I misremembered it.
Now I understand it. I also checked that SDXL generates correct outputs with your PR.
Looks good to me
Even though RMS is not actually a specific case of GroupNorm (as I wrongly believed), the approach used in RMS may be tried here too. It may be good to try.
GroupNorm has another parameter called the number of groups, which divides the channels into sub-groups. For example, {1, 8, 2, 2} with number of groups = 2 becomes {1, 4, 2, 2, 2}. I don't think RMS does this.
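To make the grouping concrete, here is a hedged sketch (plain C with illustrative names, not the actual kernel) of per-group mean/variance for an NCHW tensor, where the channels are split into num_groups consecutive blocks and the same single-pass trick is applied per group:

```c
#include <stddef.h>

/* For each (batch, group) pair, reduce over C/num_groups channels and all
 * spatial positions, producing one mean and one variance per group. */
static void group_mean_variance(const float *input, size_t N, size_t C, size_t H, size_t W,
                                size_t num_groups, float *mean, float *variance) {
    const size_t ch_per_group = C / num_groups;
    const size_t spatial = H * W;
    for (size_t n = 0; n < N; ++n) {
        for (size_t g = 0; g < num_groups; ++g) {
            float sum = 0.0f, sq_sum = 0.0f;
            for (size_t c = g * ch_per_group; c < (g + 1) * ch_per_group; ++c) {
                for (size_t i = 0; i < spatial; ++i) {
                    const float v = input[(n * C + c) * spatial + i];
                    sum += v;
                    sq_sum += v * v;
                }
            }
            const float count = (float)(ch_per_group * spatial);
            const float m = sum / count;
            mean[n * num_groups + g] = m;
            variance[n * num_groups + g] = sq_sum / count - m * m; /* E[x^2] - (E[x])^2 */
        }
    }
}
```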
Thanks for checking.
Sorry, I meant MVN instead of RMS; I misremembered. Which means that this optimization can probably be used for MVN as well?
Yes, it looks possible to reuse a similar method, as MVN also reads the input twice to perform almost the same calculation.
I did the same thing on the MVN BFYX OPT kernel and got a 50% performance improvement.
build_jenkins
}
- ACCUMULATOR_TYPE variance = work_group_reduce_add(my_variance);
+ ACCUMULATOR_TYPE mean = work_group_reduce_add(mean_sum);
+ ACCUMULATOR_TYPE variance = work_group_reduce_add(variance_sum);
These two work_group_reduce_add calls can actually be merged into one loop. The compiler will emit a full loop for each of these reductions, when you can actually merge them both. This is out of scope for this PR; I tested this with MVN, saw the same pattern, and merging the work_group_reduce_add calls improved performance. However, this requires doing it manually without work_group_reduce_add, since there is no OpenCL support for such a thing. I just mention this for info.
I am not sure the compiler is smart enough to figure out the merge of two subsequent work_group_reduce operations that don't depend on each other.
Yes, I agree that merging both into one would be better, but it would need to be done manually. However, for the group normalization kernel, I think merging both might not yield much improvement: this second kernel only runs on a number of threads based on the number of channels, so its run time is very small.
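For reference, here is a minimal standalone OpenCL C sketch of the manually merged reduction discussed above; WG_SIZE, the buffer names, and the power-of-two work-group size are assumptions for illustration, not the actual OpenVINO kernel:

```c
// Assumed power-of-two work-group size, for illustration only.
#define WG_SIZE 256

__kernel __attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
void merged_reduce(__global const float *partial_mean,
                   __global const float *partial_var,
                   __global float *out_mean,
                   __global float *out_var) {
    __local float slm_mean[WG_SIZE];
    __local float slm_var[WG_SIZE];

    const uint lid = get_local_id(0);
    slm_mean[lid] = partial_mean[get_global_id(0)];
    slm_var[lid]  = partial_var[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // One tree reduction handles both sums, so the work-group synchronizes
    // once per step instead of once per work_group_reduce_add call.
    for (uint stride = WG_SIZE / 2; stride > 0; stride /= 2) {
        if (lid < stride) {
            slm_mean[lid] += slm_mean[lid + stride];
            slm_var[lid]  += slm_var[lid + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0) {
        out_mean[get_group_id(0)] = slm_mean[0];
        out_var[get_group_id(0)]  = slm_var[0];
    }
}
```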
Details:
The two extra kernels used to calculate variance can be eliminated by computing the variance as mean(pow(input_data, 2)) - pow(mean, 2). This avoids reading the input buffer twice and performing an almost identical calculation separately for the mean and the variance.
Achieves about a 30% performance improvement.
Tickets:
CVS-158816