
[GPU] group normalization optimization to reduce 5 kernels to 3 kernels #28339

Merged

Conversation

@clee30 clee30 commented Jan 9, 2025

Details:

The two extra kernels that calculate the variance can be eliminated by computing the variance as pow(input_data, 2) - pow(mean, 2), i.e. the mean of the squared input minus the square of the mean. This avoids reading the input buffer twice, since the mean and the variance then share almost the same calculation.
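In other words, this relies on the standard variance identity (not written out explicitly in the description):

$$\operatorname{Var}(x) = \mathbb{E}[x^2] - (\mathbb{E}[x])^2$$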

This achieves about a 30% performance improvement.

Tickets:

CVS-158816

@clee30 clee30 requested review from a team as code owners January 9, 2025 09:45
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Jan 9, 2025
@sys-openvino-ci sys-openvino-ci added the ExternalIntelPR External contributor from Intel label Jan 9, 2025
e-ddykim commented Jan 9, 2025

Could you explain how we can calculate the variance by pow(input_data, 2) - pow(mean, 2)?

clee30 commented Jan 9, 2025

> Could you explain how we can calculate the variance by pow(input_data, 2) - pow(mean, 2)?

This is a simplification of the idea. First, compute the average of the square of all the input data. The variance can then be computed as:

average of square of all input data - mean * mean
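A minimal sketch of the idea, assuming a single work-item over a placeholder `input` buffer of `data_size` elements (illustrative only, not the actual kernel from this PR):

```c
// Illustrative single-pass mean/variance (placeholder names, not the PR's kernel).
__kernel void mean_variance_single_pass(__global const float* input,
                                        __global float* mean_out,
                                        __global float* variance_out,
                                        const uint data_size) {
    float sum = 0.0f;
    float sum_sq = 0.0f;
    for (uint i = 0; i < data_size; ++i) {
        const float v = input[i];
        sum    += v;      // accumulate x
        sum_sq += v * v;  // accumulate x^2 in the same pass over the buffer
    }
    const float mean = sum / (float)data_size;
    *mean_out     = mean;
    *variance_out = sum_sq / (float)data_size - mean * mean;  // E[x^2] - (E[x])^2
}
```

The point is that both accumulations happen in the same read of the input, so the separate variance kernels (and the extra pass over the buffer) are no longer needed.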

@p-durandin

build_jenkins

dnkurek commented Jan 9, 2025

Are you aware that RMS and GroupNorm are basically the same thing, except that RMS is just a specific case of GroupNorm where the scale is 1.0 and the bias is 0.0?

RMS has a different approach with just one kernel.

Plus, they could actually both be merged into the same implementation; a compile-time optimization would just get rid of the 1.0 scale or 0.0 bias if needed.

@dnkurek dnkurek left a comment

LGTM but with minor issues

```diff
 }
 }
-#elif GROUP_NORM_KERNEL_GROUP_MEAN
+#elif GROUP_NORM_KERNEL_GROUP_MEAN_VARIANCE
 #if !IS_DYNAMIC
 __attribute__((reqd_work_group_size(LWS0, LWS1, LWS2)))
 #endif
 KERNEL(calc_mean_per_group)(
```

I don't think this name is accurate anymore with the changes?

Now it calculates the variance as well as the mean.

Same goes for the other file too.

dnkurek commented Jan 9, 2025

Also, have you tested the accuracy? Though mathematically this should be the same, the way these operations are performed in floating point is different.

Really, it looks like the win here is on memory, since you don't have to read/write the buffers as many times, but I wonder whether the RMS approach with just one kernel would actually be superior or not.

clee30 commented Jan 9, 2025

> Are you aware that RMS and GroupNorm are basically the same thing, except that RMS is just a specific case of GroupNorm where the scale is 1.0 and the bias is 0.0?
>
> RMS has a different approach with just one kernel.
>
> Plus, they could actually both be merged into the same implementation; a compile-time optimization would just get rid of the 1.0 scale or 0.0 bias if needed.

No, I wasn't aware of this. I will see in the next PR whether this can be used for Group Normalization.

clee30 commented Jan 9, 2025

> Also, have you tested the accuracy? Though mathematically this should be the same, the way these operations are performed in floating point is different.
>
> Really, it looks like the win here is on memory, since you don't have to read/write the buffers as many times, but I wonder whether the RMS approach with just one kernel would actually be superior or not.

I compared the vae_decoder model output before and after the change; there is no difference in the floating-point values. Yes, using one kernel would certainly perform better. I will look into this in the next PR.

dnkurek commented Jan 9, 2025

Oh, sorry to mislead, but RMSNorm does not actually need to compute the mean, just the RMS, which means it's not really the same.

Sorry, I misremembered it.

e-ddykim commented Jan 9, 2025

> Could you explain how we can calculate the variance by pow(input_data, 2) - pow(mean, 2)?
>
> This is a simplification of the idea. First, compute the average of the square of all the input data. The variance can then be computed as:
>
> average of square of all input data - mean * mean

Now I understand it. I also checked that SDXL generates correct outputs with your PR.

@e-ddykim e-ddykim left a comment

Looks good to me

dnkurek commented Jan 9, 2025

Even though RMS is not actually a specific case of GroupNorm (as I wrongly believed), the approach used in RMS could still be tried here too. It may be good to try.

clee30 commented Jan 9, 2025

> Oh, sorry to mislead, but RMSNorm does not actually need to compute the mean, just the RMS, which means it's not really the same.
>
> Sorry, I misremembered it.

GroupNorm has another parameter called the number of groups, which divides the channels into sub-groups. For example, {1, 8, 2, 2} with number of groups = 2 becomes {1, 4, 2, 2, 2}, i.e. two groups of 4 channels each. I don't think RMS does this.
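Purely for illustration (my own sketch with assumed names, not code from this PR), the per-channel group assignment behind that split looks like:

```c
// Illustrative only: which normalization group a channel belongs to.
// With C = 8 channels and num_groups = 2, channels 0..3 fall into group 0
// and channels 4..7 into group 1; mean/variance are computed per (batch, group).
const uint channels_per_group = C / num_groups;           // 8 / 2 = 4
const uint group_idx = channel_idx / channels_per_group;  // 0 or 1
```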

clee30 commented Jan 9, 2025

> Could you explain how we can calculate the variance by pow(input_data, 2) - pow(mean, 2)?
>
> This is a simplification of the idea. First, compute the average of the square of all the input data. The variance can then be computed as:
>
> average of square of all input data - mean * mean
>
> Now I understand it. I also checked that SDXL generates correct outputs with your PR.

Thanks for checking.

dnkurek commented Jan 9, 2025

Sorry, I meant MVN instead of RMS; I misremembered.

Which means that this optimization can probably be used for MVN as well?

clee30 commented Jan 9, 2025

> Sorry, I meant MVN instead of RMS; I misremembered.
>
> Which means that this optimization can probably be used for MVN as well?

Yes, it looks possible to reuse a similar method, since MVN also reads the input twice to perform an almost identical calculation.

dnkurek commented Jan 9, 2025

I did the same thing on the MVN BFYX OPT kernel and I got a 50% performance improvement.

@p-durandin

build_jenkins

@yeonbok yeonbok added this pull request to the merge queue Jan 11, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Jan 11, 2025
```diff
 }

-    ACCUMULATOR_TYPE variance = work_group_reduce_add(my_variance);
+    ACCUMULATOR_TYPE mean = work_group_reduce_add(mean_sum);
+    ACCUMULATOR_TYPE variance = work_group_reduce_add(variance_sum);
```
Contributor

These two work_group_reduce_add calls can actually be merged into one loop. The compiler will do a full loop for each of these reductions, when actually you can merge them both. This is out of scope for this PR; I tested this with MVN, where I saw the same pattern, and merging the work_group_reduce_add calls improved performance. However, this requires doing it manually, without work_group_reduce_add, since there is no OpenCL support for such a thing. I just mention this for info.
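A rough sketch of what such a manually merged reduction could look like (my own illustration, not code from this PR): it reuses `mean_sum`, `variance_sum`, `LWS0`, and `ACCUMULATOR_TYPE` from the kernel above, assumes a 1-D work-group whose size `LWS0` is a power of two, and the local buffers `slm_mean` / `slm_var` are hypothetical names.

```c
// Hypothetical merged reduction: both partial sums are combined in a single
// local-memory tree instead of two separate work_group_reduce_add() calls.
__local ACCUMULATOR_TYPE slm_mean[LWS0];
__local ACCUMULATOR_TYPE slm_var[LWS0];
const uint lid = get_local_id(0);
slm_mean[lid] = mean_sum;
slm_var[lid]  = variance_sum;
barrier(CLK_LOCAL_MEM_FENCE);
for (uint stride = LWS0 / 2; stride > 0; stride >>= 1) {
    if (lid < stride) {
        slm_mean[lid] += slm_mean[lid + stride];  // reduce x partial sums
        slm_var[lid]  += slm_var[lid + stride];   // reduce x^2 partial sums in the same pass
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}
ACCUMULATOR_TYPE mean     = slm_mean[0];
ACCUMULATOR_TYPE variance = slm_var[0];
```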

Contributor

I am not sure whether the compiler is smart enough to figure out how to merge two subsequent work_group_reduce operations that don't depend on each other.

Contributor Author

Yes, I agree that merging both into one would be better, but it would need to be done manually. However, for the group normalization kernel, I think merging both might not yield much improvement: this second kernel only runs on a number of threads based on the number of channels, so the time to run this second kernel is very small.

@dnkurek dnkurek added this pull request to the merge queue Jan 13, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 14, 2025
@dnkurek dnkurek added this pull request to the merge queue Jan 14, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 14, 2025
@dnkurek dnkurek added this pull request to the merge queue Jan 14, 2025
Merged via the queue into openvinotoolkit:master with commit 1522455 Jan 14, 2025
168 checks passed
MirceaDan99 pushed a commit to MirceaDan99/openvino that referenced this pull request Jan 22, 2025
…ls (openvinotoolkit#28339)
