Add control code scf.forall to scf.for pass #916

jtuyls · 2024-11-21T15:38:20Z

Adds a pass that converts control code scf.forall ops into scf.for ops. This enables more loop subsumption opportunities: an scf.forall has to be subsumed into a DMA either fully or not at all, whereas an scf.for can be partially subsumed.
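As a rough illustration of the conversion described above (a minimal Python model, not the pass itself, which is a C++ MLIR transform): an scf.forall is modeled as one op covering a multi-dimensional iteration space, and the converted form as a nest with one scf.for per dimension. The function names and the 2x3 bounds are hypothetical. The sketch only checks that the nest visits the same induction-variable tuples, using one valid sequential order for the forall, while exposing each dimension as a separate loop.

```python
from itertools import product

def forall_points(lbs, ubs, steps):
    """Model an scf.forall: a single op covering the whole multi-dim space."""
    ranges = [range(lb, ub, st) for lb, ub, st in zip(lbs, ubs, steps)]
    return list(product(*ranges))

def for_nest_points(lbs, ubs, steps):
    """Model the converted form: one scf.for-like loop per dimension."""
    points = []

    def emit(dim, ivs):
        if dim == len(lbs):
            # Innermost body: record the current induction-variable tuple.
            points.append(tuple(ivs))
            return
        for iv in range(lbs[dim], ubs[dim], steps[dim]):
            emit(dim + 1, ivs + [iv])

    emit(0, [])
    return points

# Hypothetical 2x3 iteration space with unit steps.
lbs, ubs, steps = [0, 0], [2, 3], [1, 1]
forall_order = forall_points(lbs, ubs, steps)
nest_order = for_nest_points(lbs, ubs, steps)
```

Because each dimension is its own loop in the nest, a later transformation can in principle handle the inner loop and leave the outer one in place, which is the partial subsumption mentioned above.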

With this pass added, the 4096x512x512 matmul on 2x2 lowers to 6532 words of control code, down from 52228. For this configuration with ukernel, the latency is now 10 ms, down from 19 ms.
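To put the reported numbers in relative terms, here is a short arithmetic sketch. It uses only the four figures quoted above; the variable names are illustrative.

```python
# Figures quoted in the PR description.
words_before, words_after = 52228, 6532
latency_before_ms, latency_after_ms = 19, 10

# Relative control code size and latency improvements.
size_ratio = words_before / words_after
size_reduction_pct = 100 * (1 - words_after / words_before)
speedup = latency_before_ms / latency_after_ms

print(f"{size_ratio:.1f}x fewer words "
      f"({size_reduction_pct:.1f}% reduction), "
      f"{speedup:.1f}x lower latency")
```

That is roughly an 8x reduction in control code size and a little under a 2x reduction in latency for this one configuration.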

newling

Looks good, just nit comments which can be ignored

Question though. Might the order of the scf.for loops matter? For example, if the final test only had affine.apply #map(%l), might it be better to have that for loop be the outermost loop? Not sure if there's an easy way to do this analysis and subsequent loop inversion (and not sure if 'inversion' is the correct name).

compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIEControlCodeForallToFor.cpp

jtuyls · 2024-11-21T18:24:31Z

> Question though. Might the order of the scf.for loops matter? For example, if the final test only had affine.apply #map(%l), might it be better to have that for loop be the outermost loop? Not sure if there's an easy way to do this analysis and subsequent loop inversion (and not sure if 'inversion' is the correct name).

Yes, it does certainly matter for different shapes, but I think this should be done at the tiling level instead.

yzhang93 · 2024-11-21T18:45:35Z

> Question though. Might the order of the scf.for loops matter? For example, if the final test only had affine.apply #map(%l), might it be better to have that for loop be the outermost loop? Not sure if there's an easy way to do this analysis and subsequent loop inversion (and not sure if 'inversion' is the correct name).

> Yes, it does certainly matter for different shapes, but I think this should be done at the tiling level instead.

Yes, we can control the order of the loops when generating the tiling strategy. The question is how to define such a strategy. Previously we thought a simple approach would be to compare the M and N sizes and make the loop over the smaller one the outer loop. But from the latest results, it looks like 4096x512x512 has better performance than 512x4096x512.
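The simple heuristic mentioned above could be sketched as follows. This is a hypothetical illustration only: the function name and the tie-breaking rule (M wins when the sizes are equal) are assumptions, and it is not the tiling strategy actually used in the compiler.

```python
def pick_outer_dim(m: int, n: int) -> str:
    """Return which of M or N would become the outer loop.

    Hypothetical heuristic: the smaller dimension goes outermost.
    Tie-breaking in favor of M is an assumption made for this sketch.
    """
    return "M" if m <= n else "N"

# Example sizes, chosen only to exercise both branches.
outer_when_m_smaller = pick_outer_dim(512, 4096)
outer_when_n_smaller = pick_outer_dim(4096, 512)
```

As the comment notes, the latest measurements suggest a size comparison alone may not be enough to pick the better loop order.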

jtuyls requested review from MaheshRavishankar, nirvedhmeshram, yzhang93 and Abhishek-Varma as code owners November 21, 2024 15:38

newling reviewed Nov 21, 2024

newling approved these changes Nov 21, 2024

jtuyls force-pushed the controlcode-forall-to-for branch from b354f08 to e18da58 Compare November 21, 2024 18:22

Add control code forall to for pass

1e1f87e

jtuyls force-pushed the controlcode-forall-to-for branch from e18da58 to 1e1f87e Compare November 21, 2024 18:22

jtuyls merged commit a528a85 into nod-ai:main Nov 21, 2024
6 checks passed

jtuyls deleted the controlcode-forall-to-for branch November 21, 2024 18:47
