Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add control code scf.forall to scf.for pass #916

Merged
merged 1 commit into from
Nov 21, 2024

Conversation

jtuyls
Copy link
Collaborator

@jtuyls jtuyls commented Nov 21, 2024

Adds a pass that converts control code scf.forall ops into scf.for ops. This enables more loop subsumption opportunities as scf.forall needs to be either fully subsumed into a DMA or not at all, while converting to scf.for allows the loop to be partially subsumed.

With this pass added, the 4096x512x512 matmul on 2x2 lowers to 6532 words of control code instead of 52228 before this pass. For this configuration with ukernel, the latency is now 10ms compared with 19ms earlier.

Copy link
Contributor

@newling newling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just nit comments which can be ignored

Question though. Might the order of the scf.for loops matter? For example, if the final test only had affine.apply #map(%l) might be it be better to have that for loop be the outer-most loop? Not sure if there's an easy way to do this analysis and subsequent loop inversion (and not sure if 'inversion' is the correct name)

@jtuyls
Copy link
Collaborator Author

jtuyls commented Nov 21, 2024

Question though. Might the order of the scf.for loops matter? For example, if the final test only had affine.apply #map(%l) might be it be better to have that for loop be the outer-most loop? Not sure if there's an easy way to do this analysis and subsequent loop inversion (and not sure if 'inversion' is the correct name)

Yes, it does certainly matter for different shapes, but I think this should be done at the tiling level instead.

@yzhang93
Copy link
Contributor

Question though. Might the order of the scf.for loops matter? For example, if the final test only had affine.apply #map(%l) might be it be better to have that for loop be the outer-most loop? Not sure if there's an easy way to do this analysis and subsequent loop inversion (and not sure if 'inversion' is the correct name)

Yes, it does certainly matter for different shapes, but I think this should be done at the tiling level instead.

Yes, we can control the order of loops when generating tiling strategy. The question is how to define such strategy. Previously we were thinking the simple way is to compare the M and N sizes, and make the outer loop with the smaller input size. But from the latest results, it looks like 4096x512x512 has better performance than 512x4096x512.

@jtuyls jtuyls merged commit a528a85 into nod-ai:main Nov 21, 2024
6 checks passed
@jtuyls jtuyls deleted the controlcode-forall-to-for branch November 21, 2024 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants