
[GPU] group normalization optimization to reduce 5 kernels to 3 kernels #28339

Merged

Conversation

@clee30 clee30 commented Jan 9, 2025

Details:

The two extra kernels that calculate the variance can be eliminated by computing the variance as pow(input_data, 2) - pow(mean, 2), i.e. the mean of the squared input minus the square of the mean. This avoids reading the input buffer twice, since the mean and the variance then share almost the same calculation.
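In other words, this relies on the standard variance identity (not written out explicitly in the description):

$$\operatorname{Var}(x) = \mathbb{E}[x^2] - (\mathbb{E}[x])^2$$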

This achieves about a 30% performance improvement.

Tickets:

CVS-158816

@clee30 clee30 requested review from a team as code owners January 9, 2025 09:45
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Jan 9, 2025
@sys-openvino-ci sys-openvino-ci added the ExternalIntelPR External contributor from Intel label Jan 9, 2025
e-ddykim commented Jan 9, 2025

Could you explain how we can calculate the variance by pow(input_data, 2) - pow(mean, 2)?

clee30 commented Jan 9, 2025

> Could you explain how we can calculate the variance by pow(input_data, 2) - pow(mean, 2)?

This is a simplification of the idea. First, compute the average of the square of all the input data. The variance can then be computed as:

average of square of all input data - mean * mean
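A minimal sketch of the idea, assuming a single work-item over a placeholder `input` buffer of `data_size` elements (illustrative only, not the actual kernel from this PR):

```c
// Illustrative single-pass mean/variance (placeholder names, not the PR's kernel).
__kernel void mean_variance_single_pass(__global const float* input,
                                        __global float* mean_out,
                                        __global float* variance_out,
                                        const uint data_size) {
    float sum = 0.0f;
    float sum_sq = 0.0f;
    for (uint i = 0; i < data_size; ++i) {
        const float v = input[i];
        sum    += v;      // accumulate x
        sum_sq += v * v;  // accumulate x^2 in the same pass over the buffer
    }
    const float mean = sum / (float)data_size;
    *mean_out     = mean;
    *variance_out = sum_sq / (float)data_size - mean * mean;  // E[x^2] - (E[x])^2
}
```

The point is that both accumulations happen in the same read of the input, so the separate variance kernels (and the extra pass over the buffer) are no longer needed.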

@p-durandin

build_jenkins

dnkurek commented Jan 9, 2025

Are you aware that RMS and GroupNorm are basically the same thing, except that RMS is just a specific case of GroupNorm where the scale is 1.0 and the bias is 0.0?

RMS has a different approach with just one kernel.

Plus, they could actually both be merged into the same implementation; a compile-time optimization would just get rid of the 1.0 scale or 0.0 bias if needed.

@dnkurek dnkurek left a comment

LGTM but with minor issues

```diff
 }
 }
-#elif GROUP_NORM_KERNEL_GROUP_MEAN
+#elif GROUP_NORM_KERNEL_GROUP_MEAN_VARIANCE
 #if !IS_DYNAMIC
 __attribute__((reqd_work_group_size(LWS0, LWS1, LWS2)))
 #endif
 KERNEL(calc_mean_per_group)(
```

I don't think this name is accurate anymore with the changes?

Now it calculates the variance as well as the mean.

Same goes for the other file too.

dnkurek commented Jan 9, 2025

Also, have you tested the accuracy? Though mathematically this should be the same, the way these operations are performed in floating point is different.

Really, it looks like the win here is on memory, since you don't have to read/write the buffers as many times, but I wonder whether the RMS approach with just one kernel would actually be superior or not.

clee30 commented Jan 9, 2025

> Are you aware that RMS and GroupNorm are basically the same thing, except that RMS is just a specific case of GroupNorm where the scale is 1.0 and the bias is 0.0?
>
> RMS has a different approach with just one kernel.
>
> Plus, they could actually both be merged into the same implementation; a compile-time optimization would just get rid of the 1.0 scale or 0.0 bias if needed.

No, I wasn't aware of this. I will see in the next PR whether this can be used for Group Normalization.

clee30 commented Jan 9, 2025

> Also, have you tested the accuracy? Though mathematically this should be the same, the way these operations are performed in floating point is different.
>
> Really, it looks like the win here is on memory, since you don't have to read/write the buffers as many times, but I wonder whether the RMS approach with just one kernel would actually be superior or not.

I compared the vae_decoder model output before and after the change; there is no difference in the floating-point values. Yes, using one kernel would certainly perform better. I will look into this in the next PR.

dnkurek commented Jan 9, 2025

Oh, sorry to mislead, but RMSNorm does not actually need to compute the mean, just the RMS, which means it's not really the same.

Sorry, I misremembered it.

e-ddykim commented Jan 9, 2025

> Could you explain how we can calculate the variance by pow(input_data, 2) - pow(mean, 2)?
>
> This is a simplification of the idea. First, compute the average of the square of all the input data. The variance can then be computed as:
>
> average of square of all input data - mean * mean

Now I understand it. I also checked that SDXL generates correct outputs with your PR.

@e-ddykim e-ddykim left a comment

Looks good to me

dnkurek commented Jan 9, 2025

Even though RMS is not actually a specific case of GroupNorm (as I wrongly believed), the approach used in RMS could still be tried here too. It may be good to try.

clee30 commented Jan 9, 2025

> Oh, sorry to mislead, but RMSNorm does not actually need to compute the mean, just the RMS, which means it's not really the same.
>
> Sorry, I misremembered it.

GroupNorm has another parameter called the number of groups, which divides the channels into sub-groups. For example, {1, 8, 2, 2} with number of groups = 2 becomes {1, 4, 2, 2, 2}, i.e. two groups of 4 channels each. I don't think RMS does this.
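Purely for illustration (my own sketch with assumed names, not code from this PR), the per-channel group assignment behind that split looks like:

```c
// Illustrative only: which normalization group a channel belongs to.
// With C = 8 channels and num_groups = 2, channels 0..3 fall into group 0
// and channels 4..7 into group 1; mean/variance are computed per (batch, group).
const uint channels_per_group = C / num_groups;           // 8 / 2 = 4
const uint group_idx = channel_idx / channels_per_group;  // 0 or 1
```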

clee30 commented Jan 9, 2025

> Could you explain how we can calculate the variance by pow(input_data, 2) - pow(mean, 2)?
>
> This is a simplification of the idea. First, compute the average of the square of all the input data. The variance can then be computed as:
>
> average of square of all input data - mean * mean
>
> Now I understand it. I also checked that SDXL generates correct outputs with your PR.

Thanks for checking.

dnkurek commented Jan 9, 2025

Sorry, I meant MVN instead of RMS; I misremembered.

Which means that this optimization can probably be used for MVN as well?

clee30 commented Jan 9, 2025

> Sorry, I meant MVN instead of RMS; I misremembered.
>
> Which means that this optimization can probably be used for MVN as well?

Yes, it looks possible to reuse a similar method, since MVN also reads the input twice to perform an almost identical calculation.

dnkurek commented Jan 9, 2025

I did the same thing on the MVN BFYX OPT kernel and I got a 50% performance improvement.

@p-durandin

build_jenkins

@yeonbok yeonbok added this pull request to the merge queue Jan 11, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Jan 11, 2025
```diff
 }

-    ACCUMULATOR_TYPE variance = work_group_reduce_add(my_variance);
+    ACCUMULATOR_TYPE mean = work_group_reduce_add(mean_sum);
+    ACCUMULATOR_TYPE variance = work_group_reduce_add(variance_sum);
```
Contributor

These two work_group_reduce_add calls can actually be merged into one loop. The compiler will do a full loop for each of these reductions, when actually you can merge them both. This is out of scope for this PR; I tested this with MVN, where I saw the same pattern, and merging the work_group_reduce_add calls improved performance. However, this requires doing it manually, without work_group_reduce_add, since there is no OpenCL support for such a thing. I just mention this for info.
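A rough sketch of what such a manually merged reduction could look like (my own illustration, not code from this PR): it reuses `mean_sum`, `variance_sum`, `LWS0`, and `ACCUMULATOR_TYPE` from the kernel above, assumes a 1-D work-group whose size `LWS0` is a power of two, and the local buffers `slm_mean` / `slm_var` are hypothetical names.

```c
// Hypothetical merged reduction: both partial sums are combined in a single
// local-memory tree instead of two separate work_group_reduce_add() calls.
__local ACCUMULATOR_TYPE slm_mean[LWS0];
__local ACCUMULATOR_TYPE slm_var[LWS0];
const uint lid = get_local_id(0);
slm_mean[lid] = mean_sum;
slm_var[lid]  = variance_sum;
barrier(CLK_LOCAL_MEM_FENCE);
for (uint stride = LWS0 / 2; stride > 0; stride >>= 1) {
    if (lid < stride) {
        slm_mean[lid] += slm_mean[lid + stride];  // reduce x partial sums
        slm_var[lid]  += slm_var[lid + stride];   // reduce x^2 partial sums in the same pass
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}
ACCUMULATOR_TYPE mean     = slm_mean[0];
ACCUMULATOR_TYPE variance = slm_var[0];
```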

Contributor

I am not sure whether the compiler is smart enough to figure out how to merge two subsequent work_group_reduce operations that don't depend on each other.

Contributor Author

Yes, I agree that merging both into one would be better, but it would need to be done manually. However, for the group normalization kernel, I think merging both might not yield much improvement: this second kernel only runs on a number of threads based on the number of channels, so the time to run this second kernel is very small.

@dnkurek dnkurek added this pull request to the merge queue Jan 13, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 14, 2025
@dnkurek dnkurek added this pull request to the merge queue Jan 14, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 14, 2025
@dnkurek dnkurek added this pull request to the merge queue Jan 14, 2025
Merged via the queue into openvinotoolkit:master with commit 1522455 Jan 14, 2025
168 checks passed
MirceaDan99 pushed a commit to MirceaDan99/openvino that referenced this pull request Jan 22, 2025
…ls (openvinotoolkit#28339)
