dma_task method for runtime sequence operations in programming examples #1919

Conversation
I am seeing some failures with the programming examples when using the dma task interface. There are several separate errors I suspect so far:
Nice, I'm happy to see this feature getting some attention. Apologies in advance if you run into some roadblocks ... I believe this is the first time somebody besides me is seriously using this syntax, so it is unfortunately likely that some troubleshooting will be necessary.
What verification errors are you getting for this? This should pass verification.
This may warrant some discussion. Making this fail was a deliberate choice on my side; I am against handling this as a special case. Here's my reasoning: linear transfers of size 1024+* (i.e. without data layout transformations) are still possible in the DMA task syntax: just leave off the dimensions and only specify the total transfer length. Or, alternatively, you can split up the dimensions. This keeps the mental model clear for users: if you supply dimensions, they will go into the registers that configure data layout transformations, and if you supply dimensions that can't be expressed in hardware, you get an error. Happy to hear other thoughts, but that's the reason I did things the way I did.
*: I'm trusting you on the 1024 number here as I don't know it off the top of my head. My argument should still apply even if the actual number is different.
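For concreteness, here is a small sketch (mine, not from the PR; the 1024 per-dimension limit is assumed, per the footnote above) of what splitting up the dimensions for a purely linear transfer could look like:

```python
# Illustrative only: a linear transfer longer than the assumed per-dimension
# limit (1024 elements) can be split into two dimensions whose product equals
# the total length, while still describing contiguous memory.
total_len = 2048
limit = 1024                         # assumed hardware per-dimension limit
sizes = [total_len // limit, limit]  # e.g. [2, 1024]
strides = [limit, 1]                 # contiguous: outer stride equals inner size
assert sizes[0] * sizes[1] == total_len
```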
Here is an example of repeat count. It worked for me for the matmul, but it is entirely possible that it's buggy. The iteration step would be defined by the highest-dim stride a couple of lines down. The highest dimension of the data layout transformations should dictate the ...
Looks good to me overall. I left some nitpicky comments.
sizes=[1, 1, K, M],
strides=[1, 1, 1, K],
)
EndOp()
Can you confirm if these EndOps are still necessary? I believe I've seen other places where the block terminators could be left off, and that would clean this up, but it may not be possible here.
Can confirm, having tried this in an example: EndOp is currently necessary.
Error without it is:
error: "-":21:9: block with no terminator, has "aie.dma_bd"(%arg0) <{dimensions = #aie<bd_dim_layout_array[<size = 1, stride = 1>, <size = 1, stride = 1>, <size = 32, stride = 1>, <size = 64, stride = 32>]>, len = 2048 : i32, offset = 0 : i32}> : (memref<64x32xi32>) -> ()
note: "-":21:9: see current operation: "aie.dma_bd"(%arg0) <{dimensions = #aie<bd_dim_layout_array[<size = 1, stride = 1>, <size = 1, stride = 1>, <size = 32, stride = 1>, <size = 64, stride = 32>]>, len = 2048 : i32, offset = 0 : i32}> : (memref<64x32xi32>) -> ()
However, as seen in the linked test, it seems like the EndOp is not necessary if you call next_bd, but I haven't fully investigated: https://github.com/Xilinx/mlir-aie/blob/main/test/python/dma_tasks.py
I think with some additional time I could clean this up. However, the python bindings I'm working on autogenerate the dma_task code, so the user will never see this anyway. At the moment I don't see the EndOp cleanup as an urgent task (but it should probably be done eventually).
However, as seen in the linked test, it seems like the EndOp is not necessary if you call next_bd
Because aie.next_bd is also a terminator.
next_bd is also a terminator, so that makes sense. I agree that this isn't important; I was just curious.
"dma_start_task must receive at least one DMAConfigureTaskForOp to free" | ||
) | ||
for dma_task in args: | ||
_orig_dma_start_task(dma_task) |
Maybe dma_start_tasks() (plural) would be clearer for this? Also, could we make this function take one argument which is a list? To me, multiple arguments that are all the same looks weird; and it blocks us from ever adding other attributes to this operation, since it would then be ambiguous whether an argument is a task or an attribute.
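A minimal sketch of the suggested alternative (the signature and the attributes parameter are hypothetical; _orig_dma_start_task is the builder shown in the diff above):

```python
# Hypothetical alternative shape: one explicit list of tasks, leaving room for
# future keyword attributes without ambiguity between tasks and attributes.
def dma_start_tasks(tasks, **attributes):
    for dma_task in tasks:
        _orig_dma_start_task(dma_task)  # same underlying builder as in the PR diff
```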
There is a tradition of not forcing a user to make a list for a single element elsewhere in the python bindings, e.g.:
mlir-aie/python/dialects/aie.py, line 441 in 774e0e6: if not isinstance(fifoIns, List):
mlir-aie/python/dialects/aiex.py, line 38 in 774e0e6: def dma_wait(*args: ObjectFifoCreateOp | str):
Since I hadn't made other functions plural (since they work for single elements), I did not make this function plural, for consistency.
I don't think adding attributes is a problem, since you can still easily add keyword arguments to this function.
I personally think the current implementation is ok, but if you feel strongly I can change the implementation and naming convention of this and other functions to be consistent in another way.
If there's precedent elsewhere, that's fine. It's only personal preference, so not important.
if sizes is None:
    sizes = [0] * 4
if strides is None:
    strides = [0] * 3 + [1]
Is this really necessary over just leaving it at None? As is, this leads to us having two defaults for sizes and strides: once in the C++ code and once here in the Python bindings. If someone changes the C++, this will get forgotten (or vice versa), leading to potential inconsistencies.
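A minimal sketch of the alternative this suggests, assuming a wrapper roughly shaped like the one in the diff (dma_bd_args and its return value are purely illustrative stand-ins, not the PR's API):

```python
from math import prod

# Hypothetical stand-in for the Python wrapper being discussed: pass None
# straight through so the lower (C++) layer stays the single source of truth
# for the defaults, and only derive a length when the user supplied sizes.
def dma_bd_args(sizes=None, strides=None, transfer_len=None):
    if transfer_len is None and sizes is not None:
        transfer_len = prod(s for s in sizes if s > 0)
    return {"sizes": sizes, "strides": strides, "transfer_len": transfer_len}

print(dma_bd_args())                      # all None: lower-level defaults apply
print(dma_bd_args(sizes=[1, 1, 32, 64]))  # transfer_len derived: 2048
```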
If we don't know the sizes, we can't calculate the length for the user. So I believe there should be defaults at the python level, unless we move length calculation lower.
This code was also copy/pasted from the npu_dma_memcpy_nd python wrapper. I think it's important for users to have consistent behavior between interfaces that do similar things, if possible.
If we don't know the sizes, we can't calculate the length for the user. So I believe there should be defaults at the python level, unless we move length calculation lower. This code was also copy/pasted from the npu_dma_memcpy_nd python wrapper. I think it's important for users to have consistent behavior between interfaces that do similar things, if possible.

I won't pretend I've ingested all the details here, but consistent interfaces are important. One way to get that is to put things at the C++ level so that IRON, normal mlir-aie, as well as other users (e.g. mlir-air) all see the same behavior.
I believe for dma_memcpy_nd we already have the length calculation at the lower level, so no, I don't think there's anything preventing us from doing it at the lower level.
python/dialects/aiex.py (Outdated)
if transfer_len is None:
    if len(sizes) >= 4:
        # For shim dma bd, highest dimension is repeat count which is not included in the length
This comment is wrong. The highest dimension is the wrap of the iteration step (i.e. the highest-level stride). The repeat count is specified at the task level, when a (chain of) BD(s) is submitted, and it applies to all BDs within a task.
Examples: If the highest-dimension size/wrap is 3, the highest-dimension stride (= iteration step) is 2, and the repeat count is 6, the highest-dimension offset added will be:
Task repetition:          0  1  2  3  4  5   (= repeat count)
Highest-dim offset added: 0  2  4  0  2  4   (wrap = 3, so the offset resets every 3 repetitions)
If the highest-dimension wrap matches the repeat count of the task, the iteration step is added on every repetition:
Task repetition:          0  1  2  3  4  5   (= repeat count)
Highest-dim offset added: 0  2  4  6  8  10  (wrap = 6)
In the other extreme, if the wrap is 1, the iteration step never gets added (the BD is repeated with no offset):
Task repetition:          0  1  2  3  4  5   (= repeat count)
Highest-dim offset added: 0  0  0  0  0  0   (wrap = 1)
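A tiny script (my own sketch, not from the PR) that reproduces the three examples above:

```python
# The highest-dimension offset advances by the iteration step on each task
# repetition and wraps around after `wrap` repetitions.
def highest_dim_offsets(repeat_count, wrap, iteration_step):
    return [(rep % wrap) * iteration_step for rep in range(repeat_count)]

print(highest_dim_offsets(6, 3, 2))  # [0, 2, 4, 0, 2, 4]
print(highest_dim_offsets(6, 6, 2))  # [0, 2, 4, 6, 8, 10]
print(highest_dim_offsets(6, 1, 2))  # [0, 0, 0, 0, 0, 0]
```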
It might be worth it to explain this concept more clearly somewhere in the documentation.
Yes, thank you for pointing this out! I really dislike that the highest dimension (repeat count + iteration stride) is handled differently between the dma task infrastructure and npu_dma_memcpy_nd. I was hoping we could get close to having a consistent interface; however, that might not make sense here.
I will wait to change the comment until we have a plan for how to present this idea to the user in a way that feels clear/consistent.
I agree that the interface could use some cleaning up. For dma_memcpy_nd it made some sense to include it with the dimensions, because there only ever was one BD in a task, so it behaved consistently. However, even for that, doing it that way restricts users to only the second example from above; 1 and 3 would not be possible, since dma_memcpy_nd forces the iteration wrap to be equal to the repeat count. For chained BDs the simplification breaks down completely, in my opinion.
@andrej I just pushed some commits to fix this issue. I believe the verifier wasn't allowing the special case for linear transfers that the hardware supports. I will go through more of your detailed comments tomorrow. There are still some failing tests to fix too.
@andrej Thanks much for the review and all the information!
Thanks for taking them into consideration. I hope I'm not getting on your nerves with the nitpicky comments; I'm just sharing my thoughts since I was asked to review. Probably none of my comments should be blocking, even if I feel strongly about some.
Of course! I am not annoyed (sorry if I seemed short or something!) and I'm glad for the comments - it's good to have another perspective on the Python code, because I'm not sure if many people have scrutinized some aspects of my previous PRs.
Implementation plans after discussion:
Left some comments on the programming guide
@andrej Thank you for the excellent feedback! Your suggestions have been incorporated.
This PR is an incremental step towards implementation of my IRON augmentations (found here: #1732).
The IRON extensions/augmentations generate runtime sequences using dma tasks. However, I noticed that the dma task infrastructure does not seem to work out of the box for all examples. Thus, enabling the programming examples to use dma tasks is a necessary preliminary step before I can merge my other work.
While having alternative versions of the programming examples will slow down the CI, I believe this is a temporary step until we are ready to deprecate examples using the current IRON syntax. I plan to replace the alt versions with Jupyter notebooks in a subsequent PR.