In the preceding sections, we looked at how we can describe data movement between tiles within the AIE-array. However, to do anything useful, we need to get data from outside the array, i.e., from the "host", into the AIE-array and back. On NPU devices, we can achieve this with the operations described in this section.
The operations described in this section must be placed in a separate `aie.runtime_sequence` operation. The arguments to this operation describe buffers that will be available on the host side; the body of the operation describes how those buffers are moved into the AIE-array and back. Section 3 contains an example.
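For orientation before the details below, here is a minimal sketch of such a runtime sequence, assuming a design that has already declared object FIFOs `of_in` and `of_out` with Shim Tile endpoints; the buffer size, data type, and decorator usage follow the style of the programming examples and are illustrative rather than a complete design:

```python
import numpy as np
from aie.dialects.aiex import npu_dma_memcpy_nd, dma_wait, runtime_sequence

N = 4096
tensor_ty = np.ndarray[(N,), np.dtype[np.int32]]

# Placed inside the device definition, after of_in/of_out have been created
@runtime_sequence(tensor_ty, tensor_ty)
def sequence(A, C):
    # A and C are handles to host-side buffers supplied at runtime
    npu_dma_memcpy_nd(of_in, 1, A, sizes=[1, 1, 1, N])   # host -> AIE-array
    npu_dma_memcpy_nd(of_out, 0, C, sizes=[1, 1, 1, N])  # AIE-array -> host
    dma_wait(of_out)  # block until the output transfer into C has completed
```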
In high-performance computing applications, efficiently managing data movement and synchronization is crucial. This guide provides a comprehensive overview of how to use the `npu_dma_memcpy_nd` and `dma_wait` operations to manage runtime data movement between host memory and the AIE-array (for example, in the Ryzen™ AI NPU).
The `npu_dma_memcpy_nd` function enables non-blocking, multi-dimensional data transfers between the AI Engine array and external (host) memory. This function is essential in developing real applications such as signal processing, machine learning, and video processing.
Function Signature and Parameters:
```python
npu_dma_memcpy_nd(metadata, bd_id, mem, offsets=None, sizes=None, strides=None)
```
- `metadata`: This is a reference to the object FIFO or the string name of an object FIFO that records a Shim Tile and one of its DMA channels allocated for the host-side memory transfer. In order to associate the memcpy operation with an object FIFO, this metadata string needs to match the object FIFO name string.
- `bd_id`: Identifier integer for the particular Buffer Descriptor control registers used for this memcpy. A buffer descriptor contains all information needed for a DMA transfer, as described in the parameters below.
- `mem`: Reference to a host buffer, given as an argument to the sequence function, that this transfer will read from or write to.
- `tap` (optional): A `TensorAccessPattern` is an alternative method of specifying `offset`/`sizes`/`strides` for determining an access pattern over the `mem` buffer.
- `offsets` (optional): Start points for data transfer in each dimension. There is a maximum of four offset dimensions.
- `sizes`: The extent of data to be transferred across each dimension. There is a maximum of four size dimensions.
- `strides` (optional): Interval steps between data points in each dimension, useful for striding-across and reshaping data.
The strides and sizes express data transformations analogously to those described in Section 2C.
Example Usage:
```python
npu_dma_memcpy_nd(of_in, 0, input_buffer, sizes=[1, 1, 1, 30])
```
The example above describes a linear transfer of 30 data elements, or 120 bytes, from the `input_buffer` in host memory into an object FIFO with matching metadata labeled `"of_in"`. The `sizes` dimensions are expressed right to left, where the right is dimension 0 and the left is dimension 3. Higher dimensions that are not used should be set to `1`.
For high-performance computing applications on AMD's AI Engine, mastering the `npu_dma_memcpy_nd` function for complex data movements is crucial. Here, we focus on using the `sizes`, `strides`, and `offsets` parameters to effectively manage intricate data transfers.
A common task such as tiling a 2D matrix can be implemented using the `npu_dma_memcpy_nd` operation. Here's a simplified example that demonstrates how to describe such a transfer.

Scenario: Tiling a 2D matrix of shape [100, 200] and data type `int16` into [20, 20] tiles, using the convention [row, col].
1. Configuration to transfer one tile:
```python
metadata = of_in
bd_id = 3
mem = matrix_memory  # Memory object for the matrix

# Sizes define the extent of the tile to copy
sizes = [1, 1, 20, 10]

# Strides set to '0' in the higher (unused) dimensions, '100' (length of a
# row in 4B or "i32s") in dimension 1, and '1' in dimension 0 (contiguous data)
strides = [0, 0, 100, 1]

# Offsets set to zero since we start from the beginning
offsets = [0, 0, 0, 0]

npu_dma_memcpy_nd(metadata, bd_id, mem, offsets, sizes, strides)
```
2. Configuration to tile the whole matrix:
```python
metadata = of_in
bd_id = 3
mem = matrix_memory  # Memory object for the matrix

# Sizes define the extent of the tile to copy.
# Dimension 0 is 10 to transfer 20 int16s for one row of the tile,
# Dimension 1 repeats that row transfer 20 times to complete a [20, 20] tile,
# Dimension 2 repeats that tile transfer 10 times along a row of tiles,
# Dimension 3 repeats the row-of-tiles transfer 5 times to complete the matrix.
sizes = [5, 10, 20, 10]

# Strides set to '2000' in dimension 3 for the next row of tiles below the
# last (200 x 20 x 2B / 4B), '10' in dimension 2 for the next tile to the
# 'right' of the last [20, 20] tile, '100' (length of a row in 4B or "i32s")
# in dimension 1, and '1' in dimension 0 (contiguous data)
strides = [2000, 10, 100, 1]

# Offsets set to zero since we start from the beginning
offsets = [0, 0, 0, 0]

npu_dma_memcpy_nd(metadata, bd_id, mem, offsets, sizes, strides)
```
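To convince yourself that these `sizes` and `strides` walk the whole matrix tile by tile, the access pattern can be emulated on the host with plain NumPy. This is only an illustrative check, not part of the IRON API; it models the matrix as a flat array of 4-byte words, the same granularity the strides above are expressed in:

```python
import numpy as np

matrix = np.arange(100 * 200, dtype=np.int16).reshape(100, 200)
flat_i32 = matrix.view(np.int32).reshape(-1)  # 100 x 100 = 10000 4-byte words

sizes = [5, 10, 20, 10]       # [dim3, dim2, dim1, dim0]
strides = [2000, 10, 100, 1]  # in 4-byte words

# Enumerate word addresses in the order the 4-D pattern visits them
order = [
    d3 * strides[0] + d2 * strides[1] + d1 * strides[2] + d0 * strides[3]
    for d3 in range(sizes[0])
    for d2 in range(sizes[1])
    for d1 in range(sizes[2])
    for d0 in range(sizes[3])
]

# Every 4-byte word of the matrix is visited exactly once...
assert sorted(order) == list(range(100 * 100))
# ...and the first 200 words are exactly the first [20, 20] int16 tile
first_tile = flat_i32[np.array(order[:200])].view(np.int16).reshape(20, 20)
np.testing.assert_array_equal(first_tile, matrix[:20, :20])
```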
Synchronization between DMA channels and the host is facilitated by the `dma_wait` operation, ensuring data consistency and proper execution order. The `dma_wait` operation waits until the BD associated with the ObjectFifo is complete, issuing a task-complete token.
Function Signature:
```python
dma_wait(metadata)
```
- `metadata`: The ObjectFifo python object or the name of the object FIFO associated with the DMA operation we will wait on.
Example Usage:
Waiting on DMAs associated with one object FIFO:
```python
# Waits for the output data to transfer from the output object FIFO to the host
dma_wait(of_out)
```
Waiting on DMAs associated with more than one object FIFO:
```python
dma_wait(of_in, of_out)
```
- **Sync to Reuse Buffer Descriptors**: Each `npu_dma_memcpy_nd` is assigned a `bd_id`. There is a maximum of `16` BDs available in each Shim Tile. It is "safe" to reuse BDs once all transfers are complete. This can be managed by synchronizing on the right transfers: account for the BDs that must have completed to move data into the array for a compute operation, and then sync on the BD that receives the data produced by that compute operation when it is written back to host memory (see the sketch after this list).
- **Note Non-blocking Transfers**: Overlap data transfers with computation by leveraging the non-blocking nature of `npu_dma_memcpy_nd`.
- **Minimize Synchronization Overhead**: Synchronize/wait judiciously to avoid excessive overhead that might degrade performance.
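For instance, a minimal sketch of this synchronization pattern, reusing the illustrative `tensor_ty`, `of_in`, and `of_out` from the sketch at the top of this section and assuming `int32` data processed in two chunks of `CHUNK` elements:

```python
CHUNK = 2048  # illustrative chunk size, in int32 elements (N = 2 * CHUNK)

@runtime_sequence(tensor_ty, tensor_ty)
def sequence(A, C):
    for chunk in range(2):
        off = chunk * CHUNK
        # Both transfers are programmed up front (non-blocking)
        npu_dma_memcpy_nd(of_in, 1, A, offsets=[0, 0, 0, off], sizes=[1, 1, 1, CHUNK])
        npu_dma_memcpy_nd(of_out, 0, C, offsets=[0, 0, 0, off], sizes=[1, 1, 1, CHUNK])
        # Waiting on the output is enough: this chunk's output can only complete
        # after its input has been consumed, so BDs 0 and 1 are safe to
        # reprogram on the next loop iteration.
        dma_wait(of_out)
```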
As an alternative to `npu_dma_memcpy_nd` and `dma_wait`, there is a series of operations around DMA tasks that can serve a similar purpose.
There are two advantages to using the DMA task operations over `npu_dma_memcpy_nd`:
- The user does not have to specify a BD number.
- DMA task operations are capable of chaining BD operations; however, this is an advanced use case beyond the scope of this guide.

All programming examples have an `*_alt.py` version that is written using DMA task operations.
Function Signature and Parameters:
```python
def shim_dma_single_bd_task(
    alloc,
    mem,
    tap: TensorAccessPattern | None = None,
    offset: int | None = None,
    sizes: MixedValues | None = None,
    strides: MixedValues | None = None,
    transfer_len: int | None = None,
    issue_token: bool = False,
)
```
- `alloc`: The `alloc` argument associates the DMA task with an ObjectFIFO. This argument is called `alloc` because the shim-side end of a data transfer (specifically a channel on a Shim Tile) is referenced through a so-called "shim DMA allocation". When an ObjectFIFO is created with a Shim Tile endpoint, an allocation with the same name as the ObjectFIFO is automatically generated.
- `mem`: Reference to a host buffer, given as an argument to the sequence function, that this transfer will read from or write to.
- `tap` (optional): A `TensorAccessPattern` is an alternative method of specifying `offset`/`sizes`/`strides` for determining an access pattern over the `mem` buffer.
- `offset` (optional): Starting point for the data transfer. Default value is `0`.
- `sizes`: The extent of data to be transferred across each dimension. There is a maximum of four size dimensions.
- `strides` (optional): Interval steps between data points in each dimension, useful for striding-across and reshaping data.
- `issue_token` (optional): If a token is issued, one may call `dma_await_task` on the returned task. Default is `False`.
The strides and sizes express data transformations analogously to those described in Section 2C.
Example Usage:
```python
out_task = shim_dma_single_bd_task(of_out, C, sizes=[1, 1, 1, N], issue_token=True)
```
The example above describes a linear transfer of `N` data elements between the `C` buffer in host memory and an object FIFO with matching metadata labeled `"of_out"`. The `sizes` dimensions are expressed right to left, where the right is dimension 0 and the left is dimension 3. Higher dimensions that are not used should be set to `1`.
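Note that `shim_dma_single_bd_task` only configures the transfer; the returned task must still be launched. In the `*_alt.py` examples this is done with `dma_start_task`, and a task created with `issue_token=True` can then be awaited with `dma_await_task` (described below). A minimal sketch, using the same illustrative names as above:

```python
out_task = shim_dma_single_bd_task(of_out, C, sizes=[1, 1, 1, N], issue_token=True)
dma_start_task(out_task)  # nothing is transferred until the task is started
dma_await_task(out_task)  # allowed because the task was created with issue_token=True
```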
Synchronization between DMA channels and the host is facilitated by the `dma_await_task` operation, ensuring data consistency and proper execution order. The `dma_await_task` operation waits until all the BDs associated with a task have completed.
Function Signature:
```python
def dma_await_task(*args: DMAConfigureTaskForOp)
```
- `args`: One or more `dma_task` objects, where a `dma_task` object is the value returned by `shim_dma_single_bd_task`.
Example Usage:
Waiting on task completion of one DMA task:
```python
# Waits for the output task to complete
dma_await_task(out_task)
```
Waiting on task completion of more than one DMA task:
```python
# Waits for the input task and then the output task to complete
dma_await_task(in_task, out_task)
```
`dma_await_task` can only be called on a task created with `issue_token=True`. If `issue_token=False` (the default), then `dma_free_task` should be called once the programmer knows that the task is complete. `dma_free_task` allows the compiler to reuse the BDs of a task without synchronization. Calling `dma_free_task(X)` before task `X` has completed will lead to a race condition and unpredictable behavior. Only use `dma_free_task(X)` in conjunction with some other means of synchronization. For example, you may issue `dma_free_task(X)` after a call to `dma_await_task(Y)` if you can reason that task `Y` can only complete after task `X` has completed.
Function Signature:
```python
def dma_free_task(*args: DMAConfigureTaskForOp)
```
- `args`: One or more `dma_task` objects, where a `dma_task` object is the value returned by `shim_dma_single_bd_task`.
Example Usage:
Release BDs belonging to DMAs associated with one task:
```python
# Allow the compiler to reuse the BDs of a task. Should only be called if the programmer is sure the task has completed.
dma_free_task(out_task)
```
Release BDs belonging to DMAs associated with more than one task:
```python
# Allow the compiler to reuse the BDs of more than one task. Should only be called if the programmer is sure all tasks have completed.
dma_free_task(in_task, out_task)
```
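Combining the two operations as described above (task, FIFO, and buffer names are illustrative): `in_task` carries no token, but once `out_task` has completed, the input it depends on must already have been consumed, so freeing `in_task` is safe.

```python
in_task = shim_dma_single_bd_task(of_in, A, sizes=[1, 1, 1, N])                      # no token
out_task = shim_dma_single_bd_task(of_out, C, sizes=[1, 1, 1, N], issue_token=True)

dma_start_task(in_task, out_task)
dma_await_task(out_task)  # out_task can only finish after in_task's data was consumed
dma_free_task(in_task)    # safe: the compiler may now reuse in_task's BD
```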
- **Await or Free to Reuse Buffer Descriptors**: While the exact buffer descriptor (BD) used for each operation is not visible to the user with the `dma_task` operations, there is still a finite number of them (a maximum of `16` per Shim Tile). Thus, it is important to use `dma_await_task` or `dma_free_task` before the BDs are exhausted so that they may be reused.
- **Note Non-blocking Transfers**: Overlap data transfers with computation by leveraging the non-blocking nature of `dma_start_task`.
- **Minimize Synchronization Overhead**: Synchronize/wait judiciously to avoid excessive overhead that might degrade performance.
The `npu_dma_memcpy_nd` and `dma_wait` operations are powerful tools for managing data transfers and synchronization with AI Engines in the Ryzen™ AI NPU. By understanding these functions and applying them effectively, developers can enhance the performance, efficiency, and accuracy of their high-performance computing applications.