Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dma_task in programming examples #1919

Merged
merged 48 commits into from
Nov 22, 2024
Merged

dma_task in programming examples #1919

merged 48 commits into from
Nov 22, 2024

Conversation

hunhoffe
Copy link
Contributor

@hunhoffe hunhoffe commented Nov 13, 2024

This PR is an incremental step towards implementation of my IRON augmentations (found here: #1732).

The IRON extensions/augmentations generate runtime sequences using dma tasks. However, I noticed that the dma task infrastructure does not seem to work out of the box for all examples. Thus, enabling the programming examples to use dma tasks is a necessary preliminary step before I can merge my other work.

While having alternative versions of the program examples will slow down the CI, I believe this is a temporary step until we are ready to deprecate examples using the current IRON syntax. I plan to replace the alt versions with jupyter notebooks in a subsequent PR.

@hunhoffe hunhoffe changed the title Create alternative programming examples that use dma_task method for runtime sequence operations dma_task in programming examples Nov 13, 2024
@hunhoffe
Copy link
Contributor Author

hunhoffe commented Nov 13, 2024

I am seeing some failures with the programming examples when using the dma task interface. There are several separate errors I suspect so far:

  • Verification: Does not accept single element sizes, e.g., [1, 1, 1, 1], which is used in some examples of transfer a scalar value.
  • Special Case: Does not accept sizes of form [1, 1, 1, N] where N >= 1024; this is a special case for npu_dma_memcpy_nd so hopefully not too hard to port
  • Functionality: It looks like all examples with a repeat count/step in the upper dimension fail; there appears to be a repeat_count parameter in the call to dma_configure_task_for which I believe works; however, this is not sufficient for examples that also have an iteration_step value in the upper dimension of the strides.

@jgmelber jgmelber requested a review from andrej November 13, 2024 23:16
@andrej
Copy link
Collaborator

andrej commented Nov 14, 2024

Nice, I'm happy to see this feature getting some attention. Apologies in advance if you run into some roadblocks ... I believe this is the first time somebody besides me is seriously using this syntax., so it is unfortunately likely that some troubleshooting may be necessary.

Verification: Does not accept single element sizes, e.g., [1, 1, 1, 1], which is used in some examples of transfer a scalar value.

What are the verification errors you are getting for this? This should pass verification for i32. If it fails for smaller data types, that is correct, since the minimal transfer size is 4 bytes.

Special Case: Does not accept sizes of form [1, 1, 1, N] where N >= 1024; this is a special case for npu_dma_memcpy_nd so hopefully not too hard to port

This may warrant some discussion. Making this fail was a deliberate choice on my side. I am against handling this as a special case.

Here's my reasoning: If we were to support [1, 1, 1, 1025], like dma_memcpy_nd currently does, this gives the appearance that this configures a data layout transformation and that e.g. [2, 3, 4, 1025] would also be possible. However such a transformation is not in fact representable due to hardware limits. Therefore, especially for a low-level dialect like ours that is supposed to mirror the hardware, an overflow in the lowest level should yield an error i.m.o.

Linear transfers of size 1024+ (i.e. without data layout transformations) are still possible in the DMA task syntax: Just leave off the dimensions and only specify the total transfer length. Or, alternatively, you can split up the dimensions, e.g. sizes=[1,1,1,1024], strides=[1,1,1,1] is not possible but sizes=[1,1,2,512], strides=[1,1,512,1] should work and represents the same (linear) transfer (unnecessarily complicated since it's a linear transfer but you get the gist).

This keeps the mental model clear for users: If you supply dimensions, they will go into the registers that configure data layout transformations. If you supply dimensions that can't be expressed in hardware, you get an error. The current dma_memcpy_nd implementation violates this mental model and does not set the layout transformation registers if it detects the special [1,1,1,...] case.

Happy to hear other thoughts, but that's the reason I did things the way I did.

*: I'm trusting you on the 1024 number here as I don't know it off the top of my head. My argument still should apply even if the actual number is different.

Functionality: It looks like all examples with a repeat count/step in the upper dimension fail; there appears to be a repeat_count parameter in the call to dma_configure_task_for which I believe works; however, this is not sufficient for examples that also have an iteration_step value in the upper dimension of the strides.

Here is an example of repeat count. It worked for me for the matmul, but it is entirely possible that it's buggy. The iteration step would be defined by the highest-dim stride a couple lines down.

The highest dimension of the data layout transformations should dictate the iteration_step that you mentioned. Note however that the repeat count is 0 (zero) if a BD is to be executed exactly once, and 1 (one) if it is to be executed twice (i.e., repeated once), etc. The rationale why we're not using the highest dimension size as the repeat count is BD chaining: You can specify a chain of BDs, and the repeat count will apply to the entire chain, but I believe each BD within it can have a different iteration step (not 100% sure on this). This wasn't a consideration for dma_memcpy_nd because chaining is not a thing there.

Copy link
Collaborator

@andrej andrej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall. I left some nitpicky comments.

sizes=[1, 1, K, M],
strides=[1, 1, 1, K],
)
EndOp()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you confirm if these EndOps are still necessary? I believe I've seen other places where the block terminators could be left off and that would clean this up, but it may not be possible here.

Copy link
Contributor Author

@hunhoffe hunhoffe Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can confirm though trying this in an example: EndOp is currently necessary.

Error without it is:

error: "-":21:9: block with no terminator, has "aie.dma_bd"(%arg0) <{dimensions = #aie<bd_dim_layout_array[<size = 1, stride = 1>, <size = 1, stride = 1>, <size = 32, stride = 1>, <size = 64, stride = 32>]>, len = 2048 : i32, offset = 0 : i32}> : (memref<64x32xi32>) -> ()
 note: "-":21:9: see current operation: "aie.dma_bd"(%arg0) <{dimensions = #aie<bd_dim_layout_array[<size = 1, stride = 1>, <size = 1, stride = 1>, <size = 32, stride = 1>, <size = 64, stride = 32>]>, len = 2048 : i32, offset = 0 : i32}> : (memref<64x32xi32>) -> ()

However, seen in the linked test, it seems like the EndOp is not necessary if you call next_bd but I haven't fully investigated: https://github.com/Xilinx/mlir-aie/blob/main/test/python/dma_tasks.py

I think with some additional time, I could clean this up. However, the python bindings I'm working on autogenerate the dma_task code, so the user will never see this anyways. At the moment, I don't see the EndOp cleanup as an urgent task at this time (but it should probably be done eventually).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, seen in the linked test, it seems like the EndOp is not necessary if you call next_bd

Because aie.next_bd is also a terminator.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next_bd is also a terminator so that makes sense. I agree that this isn't important, just was curious.

"dma_start_task must receive at least one DMAConfigureTaskForOp to free"
)
for dma_task in args:
_orig_dma_start_task(dma_task)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe dma_start_tasks() (plural) would be clearer for this? Also, could we make this function take one argument which is a list? To me multiple arguments that are all the same looks weird; and it blocks us from ever adding other attributes to this operation, since then it will be ambiguous if an argument is a task or an attribute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a tradition of not forcing a user to make a list for a single element elsewhere in the python bindings (e.g., here:

if not isinstance(fifoIns, List):
). I believe the implementation I chose is not an uncommon way to implement this pattern in python. I've also used this syntax here:
def dma_wait(*args: ObjectFifoCreateOp | str):

Since I hadn't made other functions plural (since they work for single elements) I did not make this function plural for consistency.

I don't think adding attributes is a problem as you can easily add keyword arguments to this function still.

I personally think the current implementation is ok, but if you feel strongly I can change the implementation and naming convention of this and other functions to be consistent in another way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's precedent elsewhere that's fine. It's only personal preference so not important.

python/dialects/aiex.py Show resolved Hide resolved
if sizes is None:
sizes = [0] * 4
if strides is None:
strides = [0] * 3 + [1]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really necessary over just leaving it at None? As is, this leads to us having two defaults for sizes and strides, once in the C++ code and once here in the Python bindings. If someone changes the C++ this will get forgotten (or vice versa) leading to potential inconsistencies.

Copy link
Contributor Author

@hunhoffe hunhoffe Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't know the sizes, we can't calculate the length for the user. So I believe there should be defaults at the python level, unless we move length calculation lower.

This code was also copy/pasted from the npu_dma_memcpy_nd python wrapper. I think it's important for users to have consistent behavior between interfaces that do similar things, if possible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't know the sizes, we can't calculate the length for the user. So I believe there should be defaults at the python level, unless we move length calculation lower.

This code was also copy/pasted from the npu_dma_memcpy_nd python wrapper. I think it's important for users to have consistent behavior between interfaces that do similar things, if possible.

I won't pretend I've ingested all the details here but consistent interfaces are important. One way to get that is to put things at the C++ level so that iron, normal mlir-aie, as well as other users (e.g. mlir-air) all see the same behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrej @jgmelber are there any technical reasons you can think of why the length cannot be calculated at a lower level if the user does not specify a length?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe for dma_memcpy_nd we already have the length calculation at the lower level. So no I don't think there's anything preventing us from doing it at the lower level.

python/dialects/aiex.py Outdated Show resolved Hide resolved

if transfer_len is None:
if len(sizes) >= 4:
# For shim dma bd, highest dimension is repeat count which is not included in the length
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is wrong. The highest dimension is the wrap of the iteration step (=highest level stride). Repeat count is denoted at the task level, when a (chain of) BD(s) is submitted, and applies to all BDs within a task.

Examples: If the highest-dimension size/wrap is 3, the higest-dimension stride (=iteration step) is 2, and the repeat count is 6, the highest dimension offset added will be

Task repetition:             0   1   2   3   4   5  (=repeat count)
Highest-dim offset added:    0   2   4   0   2   4
                                         ^
                                      wrap (=3)

If the highest-dimension wrap matches the repeat count of the task, the iteration step will be added each repetition:

Task repetition:             0   1   2   3   4   5  (=repeat count)
Highest-dim offset added:    0   2   4   6   8   10 
                                                     ^
                                                 wrap (=6)

In the other extreme, if the wrap is 1, the iteration step never gets added (BD repeated with no offset).

Task repetition:             0   1   2   3   4   5  (=repeat count)
Highest-dim offset added:    0   0   0   0   0   0
                             ^   ^   ^   ^   ^   ^
                             w.  w.  w.  w.  w.  wrap(=1)

It might be worth it to explain this concept more clearly somewhere in the documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you for pointing this out! I really dislike that the highest dimension (repeat count + iteration stride) is handled differently dma task infrastructure and npu_dma_memcpy_nd. I was hoping we can get close to having a consistent interface. However, that might not make sense here.

I will wait to change the comment until we maybe have a plan for how to present this idea to the user in a way that feels clear/consistent.

Copy link
Collaborator

@andrej andrej Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the interface could use some cleaning up. For dma_memcpy_nd it made some sense to include it with the dimensions because there only ever was one BD in a task, so it behaved consistently. However, even for that, doing it that way restricts users to only be able to use the second example from above, 1 and 3 would not be possible since dma_memcpy_nd forces the iteration wrap to be equal to the repeat count. For chained BDs the simplification breaks down completely i.m.o.

@jgmelber
Copy link
Collaborator

Linear transfers of size 1024+ (i.e. without data layout transformations) are still possible in the DMA task syntax: Just leave off the dimensions and only specify the total transfer length.

@andrej I just pushed some commits to fix this issue. I believe the verifier wasn't allowing the special case for linear transfers that the hardware supports.

I will go through more of your detailed comments tomorrow. There are still some failing tests to fix too.

jgmelber and others added 3 commits November 13, 2024 22:14
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This reverts commit bbcde4c.
Copy link
Contributor

github-actions bot commented Nov 14, 2024

Coverage Report

Created: 2024-11-22 20:00

Click here for information about interpreting this report.

FilenameFunction CoverageLine CoverageRegion CoverageBranch Coverage
IR/AIEXDialect.cpp 100.00% 86.27% 89.14% 78.57%
Transforms/AIEDMATasksToNPU.cpp 86.36% 83.66% 87.50% 80.16%
Totals 94.55% 84.98% 88.51% 79.17%
Generated by llvm-cov -- llvm version 14.0.0

@hunhoffe
Copy link
Contributor Author

@andrej Thanks much for the review and all the information!

@andrej
Copy link
Collaborator

andrej commented Nov 14, 2024

@andrej Thanks much for the review and all the information!

Thanks for taking them into consideration. I hope I'm not getting on your nerves with the nitpicky comments, I'm just sharing my thoughts since I was asked to review. Probably none of my comments should be blocking, even if I feel strongly about some.

@hunhoffe
Copy link
Contributor Author

@andrej Thanks much for the review and all the information!

Thanks for taking them into consideration. I hope I'm not getting on your nerves with the nitpicky comments, I'm just sharing my thoughts since I was asked to review. Probably none of my comments should be blocking, even if I feel strongly about some.

Of course! I am not annoyed (sorry if I seemed short or something!) and I'm glad for the comments - it's good to have another perspective on the Python code, because I'm not sure if many people have scrutinized some aspects of my previous PRs.

@hunhoffe
Copy link
Contributor Author

hunhoffe commented Nov 14, 2024

Implementation plans after discussion:

@hunhoffe hunhoffe marked this pull request as ready for review November 22, 2024 02:35
@hunhoffe hunhoffe requested review from fifield and andrej November 22, 2024 02:35
Copy link
Collaborator

@andrej andrej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments on the programming guide

programming_guide/section-2/section-2g/README.md Outdated Show resolved Hide resolved
programming_guide/section-2/section-2g/README.md Outdated Show resolved Hide resolved
programming_guide/section-2/section-2g/README.md Outdated Show resolved Hide resolved
@hunhoffe
Copy link
Contributor Author

@andrej Thank you for the excellent feedback! Your suggestions have been incorporated.

@hunhoffe hunhoffe added this pull request to the merge queue Nov 22, 2024
Merged via the queue into main with commit 7419211 Nov 22, 2024
52 checks passed
@hunhoffe hunhoffe deleted the port-examples-dma-task branch November 22, 2024 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants