dma_task method for runtime sequence operations in programming examples #1919

Conversation
I am seeing some failures with the programming examples when using the dma task interface. There are several separate errors I suspect so far:
Nice, I'm happy to see this feature getting some attention. Apologies in advance if you run into some roadblocks ... I believe this is the first time somebody besides me is seriously using this syntax, so it is unfortunately likely that some troubleshooting will be necessary.
What verification errors are you getting for this? This should pass verification.
This may warrant some discussion. Making this fail was a deliberate choice on my side; I am against handling this as a special case. Here's my reasoning: linear transfers of size 1024+* (i.e. without data layout transformations) are still possible in the DMA task syntax: just leave off the dimensions and only specify the total transfer length. Or, alternatively, you can split up the dimensions. This keeps the mental model clear for users: if you supply dimensions, they will go into the registers that configure data layout transformations, and if you supply dimensions that can't be expressed in hardware, you get an error. Happy to hear other thoughts, but that's the reason I did things the way I did.
*: I'm trusting you on the 1024 number here as I don't know it off the top of my head. My argument should still apply even if the actual number is different.
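For concreteness, here is a small sketch (mine, not from the PR; the 1024 per-dimension limit is assumed, per the footnote above) of what splitting up the dimensions for a purely linear transfer could look like:

```python
# Illustrative only: a linear transfer longer than the assumed per-dimension
# limit (1024 elements) can be split into two dimensions whose product equals
# the total length, while still describing contiguous memory.
total_len = 2048
limit = 1024                         # assumed hardware per-dimension limit
sizes = [total_len // limit, limit]  # e.g. [2, 1024]
strides = [limit, 1]                 # contiguous: outer stride equals inner size
assert sizes[0] * sizes[1] == total_len
```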
Here is an example of repeat count. It worked for me for the matmul, but it is entirely possible that it's buggy. The iteration step would be defined by the highest-dim stride a couple of lines down. The highest dimension of the data layout transformations should dictate the ...
Looks good to me overall. I left some nitpicky comments.
sizes=[1, 1, K, M],
strides=[1, 1, 1, K],
)
EndOp()
Can you confirm if these EndOps are still necessary? I believe I've seen other places where the block terminators could be left off, and that would clean this up, but it may not be possible here.
Can confirm, having tried this in an example: EndOp is currently necessary.
Error without it is:
error: "-":21:9: block with no terminator, has "aie.dma_bd"(%arg0) <{dimensions = #aie<bd_dim_layout_array[<size = 1, stride = 1>, <size = 1, stride = 1>, <size = 32, stride = 1>, <size = 64, stride = 32>]>, len = 2048 : i32, offset = 0 : i32}> : (memref<64x32xi32>) -> ()
note: "-":21:9: see current operation: "aie.dma_bd"(%arg0) <{dimensions = #aie<bd_dim_layout_array[<size = 1, stride = 1>, <size = 1, stride = 1>, <size = 32, stride = 1>, <size = 64, stride = 32>]>, len = 2048 : i32, offset = 0 : i32}> : (memref<64x32xi32>) -> ()
However, as seen in the linked test, it seems like the EndOp is not necessary if you call next_bd, but I haven't fully investigated: https://github.com/Xilinx/mlir-aie/blob/main/test/python/dma_tasks.py
I think with some additional time I could clean this up. However, the python bindings I'm working on autogenerate the dma_task code, so the user will never see this anyway. At the moment I don't see the EndOp cleanup as an urgent task (but it should probably be done eventually).
However, as seen in the linked test, it seems like the EndOp is not necessary if you call next_bd
Because aie.next_bd is also a terminator.
next_bd is also a terminator, so that makes sense. I agree that this isn't important; I was just curious.
"dma_start_task must receive at least one DMAConfigureTaskForOp to free" | ||
) | ||
for dma_task in args: | ||
_orig_dma_start_task(dma_task) |
Maybe dma_start_tasks() (plural) would be clearer for this? Also, could we make this function take one argument which is a list? To me, multiple arguments that are all the same looks weird; and it blocks us from ever adding other attributes to this operation, since it would then be ambiguous whether an argument is a task or an attribute.
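A minimal sketch of the suggested alternative (the signature and the attributes parameter are hypothetical; _orig_dma_start_task is the builder shown in the diff above):

```python
# Hypothetical alternative shape: one explicit list of tasks, leaving room for
# future keyword attributes without ambiguity between tasks and attributes.
def dma_start_tasks(tasks, **attributes):
    for dma_task in tasks:
        _orig_dma_start_task(dma_task)  # same underlying builder as in the PR diff
```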
There is a tradition of not forcing a user to make a list for a single element elsewhere in the python bindings, e.g.:
mlir-aie/python/dialects/aie.py, line 441 in 774e0e6: if not isinstance(fifoIns, List):
mlir-aie/python/dialects/aiex.py, line 38 in 774e0e6: def dma_wait(*args: ObjectFifoCreateOp | str):
Since I hadn't made other functions plural (since they work for single elements), I did not make this function plural, for consistency.
I don't think adding attributes is a problem, since you can still easily add keyword arguments to this function.
I personally think the current implementation is ok, but if you feel strongly I can change the implementation and naming convention of this and other functions to be consistent in another way.
If there's precedent elsewhere, that's fine. It's only personal preference, so not important.
if sizes is None:
    sizes = [0] * 4
if strides is None:
    strides = [0] * 3 + [1]
Is this really necessary over just leaving it at None? As is, this leads to us having two defaults for sizes and strides: once in the C++ code and once here in the Python bindings. If someone changes the C++, this will get forgotten (or vice versa), leading to potential inconsistencies.
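A minimal sketch of the alternative this suggests, assuming a wrapper roughly shaped like the one in the diff (dma_bd_args and its return value are purely illustrative stand-ins, not the PR's API):

```python
from math import prod

# Hypothetical stand-in for the Python wrapper being discussed: pass None
# straight through so the lower (C++) layer stays the single source of truth
# for the defaults, and only derive a length when the user supplied sizes.
def dma_bd_args(sizes=None, strides=None, transfer_len=None):
    if transfer_len is None and sizes is not None:
        transfer_len = prod(s for s in sizes if s > 0)
    return {"sizes": sizes, "strides": strides, "transfer_len": transfer_len}

print(dma_bd_args())                      # all None: lower-level defaults apply
print(dma_bd_args(sizes=[1, 1, 32, 64]))  # transfer_len derived: 2048
```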
If we don't know the sizes, we can't calculate the length for the user. So I believe there should be defaults at the python level, unless we move length calculation lower.
This code was also copy/pasted from the npu_dma_memcpy_nd python wrapper. I think it's important for users to have consistent behavior between interfaces that do similar things, if possible.
If we don't know the sizes, we can't calculate the length for the user. So I believe there should be defaults at the python level, unless we move length calculation lower. This code was also copy/pasted from the npu_dma_memcpy_nd python wrapper. I think it's important for users to have consistent behavior between interfaces that do similar things, if possible.

I won't pretend I've ingested all the details here, but consistent interfaces are important. One way to get that is to put things at the C++ level so that IRON, normal mlir-aie, as well as other users (e.g. mlir-air) all see the same behavior.
I believe for dma_memcpy_nd we already have the length calculation at the lower level, so no, I don't think there's anything preventing us from doing it at the lower level.
python/dialects/aiex.py (Outdated)
if transfer_len is None:
    if len(sizes) >= 4:
        # For shim dma bd, highest dimension is repeat count which is not included in the length
This comment is wrong. The highest dimension is the wrap of the iteration step (i.e. the highest-level stride). The repeat count is specified at the task level, when a (chain of) BD(s) is submitted, and it applies to all BDs within a task.
Examples: If the highest-dimension size/wrap is 3, the highest-dimension stride (= iteration step) is 2, and the repeat count is 6, the highest-dimension offset added will be:
Task repetition:          0  1  2  3  4  5   (= repeat count)
Highest-dim offset added: 0  2  4  0  2  4   (wrap = 3, so the offset resets every 3 repetitions)
If the highest-dimension wrap matches the repeat count of the task, the iteration step is added on every repetition:
Task repetition:          0  1  2  3  4  5   (= repeat count)
Highest-dim offset added: 0  2  4  6  8  10  (wrap = 6)
In the other extreme, if the wrap is 1, the iteration step never gets added (the BD is repeated with no offset):
Task repetition:          0  1  2  3  4  5   (= repeat count)
Highest-dim offset added: 0  0  0  0  0  0   (wrap = 1)
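A tiny script (my own sketch, not from the PR) that reproduces the three examples above:

```python
# The highest-dimension offset advances by the iteration step on each task
# repetition and wraps around after `wrap` repetitions.
def highest_dim_offsets(repeat_count, wrap, iteration_step):
    return [(rep % wrap) * iteration_step for rep in range(repeat_count)]

print(highest_dim_offsets(6, 3, 2))  # [0, 2, 4, 0, 2, 4]
print(highest_dim_offsets(6, 6, 2))  # [0, 2, 4, 6, 8, 10]
print(highest_dim_offsets(6, 1, 2))  # [0, 0, 0, 0, 0, 0]
```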
It might be worth it to explain this concept more clearly somewhere in the documentation.
Yes, thank you for pointing this out! I really dislike that the highest dimension (repeat count + iteration stride) is handled differently between the dma task infrastructure and npu_dma_memcpy_nd. I was hoping we could get close to having a consistent interface; however, that might not make sense here.
I will wait to change the comment until we have a plan for how to present this idea to the user in a way that feels clear/consistent.
I agree that the interface could use some cleaning up. For dma_memcpy_nd it made some sense to include it with the dimensions, because there only ever was one BD in a task, so it behaved consistently. However, even for that, doing it that way restricts users to only the second example from above; 1 and 3 would not be possible, since dma_memcpy_nd forces the iteration wrap to be equal to the repeat count. For chained BDs the simplification breaks down completely, in my opinion.
@andrej I just pushed some commits to fix this issue. I believe the verifier wasn't allowing the special case for linear transfers that the hardware supports. I will go through more of your detailed comments tomorrow. There are still some failing tests to fix too.
@andrej Thanks much for the review and all the information!
Thanks for taking them into consideration. I hope I'm not getting on your nerves with the nitpicky comments; I'm just sharing my thoughts since I was asked to review. Probably none of my comments should be blocking, even if I feel strongly about some.
Of course! I am not annoyed (sorry if I seemed short or something!) and I'm glad for the comments - it's good to have another perspective on the Python code, because I'm not sure if many people have scrutinized some aspects of my previous PRs.
Implementation plans after discussion:
Left some comments on the programming guide
@andrej Thank you for the excellent feedback! Your suggestions have been incorporated.
This PR is an incremental step towards implementation of my IRON augmentations (found here: #1732).
The IRON extensions/augmentations generate runtime sequences using dma tasks. However, I noticed that the dma task infrastructure does not seem to work out of the box for all examples. Thus, enabling the programming examples to use dma tasks is a necessary preliminary step before I can merge my other work.
While having alternative versions of the programming examples will slow down the CI, I believe this is a temporary step until we are ready to deprecate examples using the current IRON syntax. I plan to replace the alt versions with Jupyter notebooks in a subsequent PR.