TilerHelper and 2D Tiling Visualizations (#1870)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
hunhoffe and github-actions[bot] authored Nov 18, 2024
1 parent cc3f772 commit 3b5799b
Showing 61 changed files with 7,855 additions and 94 deletions.
4 changes: 4 additions & 0 deletions programming_examples/basic/dma_transpose/Makefile
@@ -44,5 +44,9 @@ endif
run: ${targetname}.exe build/final.xclbin
	${powershell} ./$< -x build/final.xclbin -i build/insts.txt -k MLIR_AIE --M ${M} --K ${K}

generate_access_map: ${srcdir}/aie2.py
	mkdir -p ${@D}
	python3 $< --generate-access-map ${M} ${K}

clean:
	rm -rf build _build inst ${targetname}.exe
15 changes: 14 additions & 1 deletion programming_examples/basic/dma_transpose/README.md
@@ -15,11 +15,24 @@ This reference design can be run on a Ryzen™ AI NPU.
In the [design](./aie2.py), a 2-D array in a row-major layout is read from external memory to `ComputeTile2` with a transposed layout,
by using an implicit copy via the compute tile's Data Movement Accelerator (DMA). The data is read from and written to external memory through the Shim tile (`col`, 0).

This data movement transformation can be visualized as a map that shows the order in which the data is streamed (here, in a transposed layout):
<p align="center">
<img
src="transpose_data.png">
<h3 align="center"> Visualization of the Transpose Data Transformation for M=64, K=32.
</h3>
</p>
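
For intuition, the streamed order can be reproduced with plain NumPy. This is an illustrative sketch (small hypothetical M and K, not part of the design): the innermost DMA dimension of size `M` strides by `K` through the row-major buffer, which walks one column at a time.

```python
import numpy as np

# Hypothetical small dimensions for illustration only.
M, K = 4, 3
A = np.arange(M * K, dtype=np.int32).reshape(M, K)  # row-major (M, K) source
flat = A.ravel()

# Emulate sizes=[1, 1, K, M], strides=[1, 1, 1, K]: outer loop over the K
# columns, inner loop over M elements, each stepping K through the flat buffer.
streamed = [flat[col + row * K] for col in range(K) for row in range(M)]

assert np.array_equal(np.array(streamed).reshape(K, M), A.T)  # transposed order
```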

The implicit copy is performed using the `object_fifo_link` operation that specifies how input data arriving via `of_in` should be sent further via `of_out` by specifically leveraging the compute tile's DMA. This operation and its functionality are described in more depth in [Section-2b](../../../programming_guide/section-2/section-2b/README.md/#object-fifo-link) of the programming guide.
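
For context, here is a condensed sketch of how such a link is declared; the tile handles and FIFO depths are assumed to match the design, and these statements live inside the design's `@device` body rather than standing alone:

```python
# Sketch only, assuming ShimTile, ComputeTile2, and tensor_ty as in aie2.py.
of_in = object_fifo("in", ShimTile, ComputeTile2, 2, tensor_ty)    # L3 -> compute tile
of_out = object_fifo("out", ComputeTile2, ShimTile, 2, tensor_ty)  # compute tile -> L3
object_fifo_link(of_in, of_out)  # forward of_in to of_out via the tile's DMA
```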


To compile and run the design for NPU:
```bash
make
make run
```

To generate a data visualization of the transpose (like the one shown above), run:
```bash
make generate_access_map
```
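
The Makefile target invokes `aie2.py --generate-access-map`, which delegates to the `TensorTile` helper added by this commit. A minimal sketch of doing the same directly (dimensions taken from the caption above):

```python
from aie.helpers.tensortiler import TensorTile

# Describe the transposed read of a row-major (M, K) tensor and render it.
M, K = 64, 32
tile = TensorTile((M, K), offset=0, sizes=[1, 1, K, M], strides=[1, 1, 1, K])
tile.visualize(
    show_arrows=True, plot_access_count=False, file_path="transpose_data.png"
)
```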
48 changes: 34 additions & 14 deletions programming_examples/basic/dma_transpose/aie2.py
@@ -5,27 +5,28 @@
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
#
# (c) Copyright 2024 Advanced Micro Devices, Inc. or its affiliates
import argparse
import numpy as np
import sys

from aie.dialects.aie import *
from aie.dialects.aiex import *
from aie.extras.context import mlir_mod_ctx
from aie.helpers.dialects.ext.scf import _for as range_
from aie.helpers.tensortiler import TensorTile

N = 4096
M = 64
K = 64

if len(sys.argv) == 3:
M = int(sys.argv[1])
K = int(sys.argv[2])
N = M * K
def my_passthrough(M, K, N, generate_access_map=False):
    tensor_ty = np.ndarray[(M, K), np.dtype[np.int32]]
    data_transform = TensorTile(
        (M, K), offset=0, sizes=[1, 1, K, M], strides=[1, 1, 1, K]
    )
    if generate_access_map:
        data_transform.visualize(
            show_arrows=True, plot_access_count=False, file_path="transpose_data.png"
        )
        return

tensor_ty = np.ndarray[(M, K), np.dtype[np.int32]]


def my_passthrough():
with mlir_mod_ctx() as ctx:

@device(AIEDevice.npu1_1col)
@@ -56,8 +57,7 @@ def sequence(A, B, C):
metadata=of_in,
bd_id=1,
mem=A,
sizes=[1, 1, K, M],
strides=[1, 1, 1, K],
tensor_tile=data_transform,
issue_token=True,
)
npu_dma_memcpy_nd(metadata=of_out, bd_id=0, mem=C, sizes=[1, 1, 1, N])
@@ -66,4 +66,24 @@ def sequence(A, B, C):
print(ctx.module)


my_passthrough()
if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("dims", help="M K", type=int, nargs="*", default=[64, 64])
    p.add_argument(
        "--generate-access-map",
        action="store_true",
        help="Produce a file showing data access order",
    )
    args = p.parse_args()

    if len(args.dims) != 2:
        print(
            "ERROR: Must provide either no dimensions or both M and K", file=sys.stderr
        )
        sys.exit(1)
    my_passthrough(
        M=args.dims[0],
        K=args.dims[1],
        N=args.dims[0] * args.dims[1],
        generate_access_map=args.generate_access_map,
    )
@@ -141,6 +141,30 @@ Of note is the `object_fifo_link()` operation. This operation establishes a conn

We assume our data are stored in **row-major format** in the host's memory. For processing on the AIE compute cores, we need to transform the data layouts such that the above-listed *sub-matrix tiles* are laid out contiguously in AIE compute core memory. Thankfully, AIE hardware has extensive support for transforming data at zero cost using the DMAs as it is received and sent. In the following, we explain how we make use of this hardware feature to transform our data.
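
As a mental model (a sketch of the descriptor semantics, not the exact hardware implementation), a four-dimensional `sizes`/`strides` pair streams elements of a flat buffer in the following loop order, outermost dimension first:

```python
def stream_order(offset, sizes, strides):
    """Yield flat-buffer indices in the order a 4-D DMA descriptor visits them."""
    for d0 in range(sizes[0]):
        for d1 in range(sizes[1]):
            for d2 in range(sizes[2]):
                for d3 in range(sizes[3]):
                    yield (offset + d0 * strides[0] + d1 * strides[1]
                           + d2 * strides[2] + d3 * strides[3])
```

Every `npu_dma_memcpy_nd` call below can be read through this lens; the `TensorTile` objects added by this commit capture exactly these parameters so the resulting access pattern can be plotted.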

#### Runtime Sequence Tiling and Data Layout Transformations Notebook

A notebook is included that visualizes the runtime sequence `npu_dma_memcpy_nd` operations used to transfer matrices A, B, and C.

To run the notebook:
* Start a jupyter server at the root directory of your clone of `mlir-aie`.
Make sure you use a terminal that has run the `utils/setup_env.sh` script
so that the correct environment variables are propagated to jupyter.
Below is an example of how to start a jupyter server:
```bash
python3 -m jupyter notebook --no-browser --port=8080
```
* In your browser, navigate to the URL (which includes a token) found in the output of the above command.
* Navigate to `programming_examples/basic/matrix_multiplication/whole_array`
* Double click `mat_mul_whole_array_visualization.ipynb` to start the notebook; choose the ipykernel called `ironenv`.
* You should now be good to go! Note that generating the animations in the notebook can take several minutes.

#### Run the Notebook as a Script
```bash
make clean
make run
```

##### Tiling to Vector Intrinsic Size

The `memA_fifos` and `memB_fifos` receive sub-matrices of size `m`&times;`k` and `k`&times;`n`, respectively. The FIFOs translate those matrices from a row-major format (or, alternatively, column-major for `B` if `b_col_maj` is set) into the `r`&times;`s`-sized and `s`&times;`t`-sized blocks required by the hardware's vector intrinsics before sending them into the compute cores' memory.
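
A rough NumPy analogue of this blocking (illustrative sizes only; the real `r`, `s`, and `t` depend on the datatype's vector intrinsic):

```python
import numpy as np

# Hypothetical tile and block sizes for illustration.
m, k, r, s = 8, 8, 4, 2
A = np.arange(m * k).reshape(m, k)  # row-major m x k sub-matrix

# Split into an (m//r) x (k//s) grid of r x s blocks; flattening `blocked`
# stores each r x s block contiguously, as the vector intrinsics require.
blocked = A.reshape(m // r, r, k // s, s).swapaxes(1, 2)
```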
119 changes: 94 additions & 25 deletions programming_examples/basic/matrix_multiplication/whole_array/aie2.py
@@ -14,6 +14,7 @@
from aie.dialects.aie import *
from aie.dialects.aiex import *
from aie.helpers.dialects.ext.scf import _for as range_
from aie.helpers.tensortiler import TensorTile, TensorTileSequence

dtype_map = {
"bf16": bfloat16,
@@ -47,9 +48,15 @@ def main():
default="i16",
)
argparser.add_argument("--trace_size", type=int, default=0)
argparser.add_argument(
"--generate-tiles",
action="store_true",
help="Generate TensorTiles, a Python object to represent each data transfer"
"of the input/output matrices. These objects can be used for visualization.",
)
args = argparser.parse_args()
with mlir_mod_ctx() as ctx:
my_matmul(
maybe_tiles = my_matmul(
args.M,
args.K,
args.N,
@@ -61,19 +68,33 @@
args.dtype_out,
args.b_col_maj,
args.trace_size,
args.generate_tiles,
)
# print(ctx.module.operation.verify())
print(ctx.module)

if args.generate_tiles:
return maybe_tiles


def ceildiv(a, b):
return (a + b - 1) // b


def my_matmul(
M, K, N, m, k, n, n_aie_cols, dtype_in_str, dtype_out_str, b_col_maj, trace_size
M,
K,
N,
m,
k,
n,
n_aie_cols,
dtype_in_str,
dtype_out_str,
b_col_maj,
trace_size,
generate_tiles=False,
):

n_aie_rows = 4
n_aie_cores = n_aie_rows * n_aie_cols

@@ -148,6 +169,12 @@ def my_matmul(
elif n_aie_cols == 4:
dev = AIEDevice.npu1_4col

# These will hold TensorTile objects that represent the runtime
# npu_dma_memcpy_nd operations of this design. They are only used if generate_tiles is True.
A_tensor_tiles = []
B_tensor_tiles = []
C_tensor_tiles = []

@device(dev)
def device_body():
A_l2_ty = np.ndarray[(m * k * n_A_tiles_per_shim,), np.dtype[dtype_in]]
@@ -375,13 +402,26 @@ def sequence(A, B, C):
C_row_offset = row_base * m * n_aie_rows * N
C_col_offset = col * n
C_offset = C_col_offset + C_row_offset
C_sizes = [tb_n_rows, N // n // n_aie_cols, m * n_aie_rows, n]
C_strides = [m * n_aie_rows * N, n * n_aie_cols, N, 1]
npu_dma_memcpy_nd(
metadata=C_l2l3_fifos[col],
bd_id=bd_id_base,
mem=C,
offsets=[0, 0, 0, C_offset],
sizes=[tb_n_rows, N // n // n_aie_cols, m * n_aie_rows, n],
strides=[m * n_aie_rows * N, n * n_aie_cols, N, 1],
sizes=C_sizes,
strides=C_strides,
)
# Use the calculated sizes/strides/offsets to record the data movement
# caused by the above call to npu_dma_memcpy_nd.
# This line does not change MLIR output at all.
C_tensor_tiles.append(
TensorTile(
(M, N),
offset=C_offset,
sizes=C_sizes,
strides=C_strides,
)
)

for tile_row in range(tb_n_rows):
@@ -411,18 +451,31 @@ def sequence(A, B, C):
col * n_A_tiles_per_shim * m * K
) # base address for the shim in this column
A_offset = A_block_offset + A_row_offset
A_sizes = [
N // n // n_aie_cols,
K // k,
m * n_A_tiles_per_shim,
k,
]
A_strides = [0, k, K, 1]
npu_dma_memcpy_nd(
metadata=A_l3l2_fifos[col],
bd_id=bd_id_base + 2 * tile_row + 1,
mem=A,
offsets=[0, 0, 0, A_offset],
sizes=[
N // n // n_aie_cols,
K // k,
m * n_A_tiles_per_shim,
k,
],
strides=[0, k, K, 1],
sizes=A_sizes,
strides=A_strides,
)
# Use the calculated sizes/strides/offsets to record the data movement
# caused by the above call to npu_dma_memcpy_nd.
# This line does not change MLIR output at all.
A_tensor_tiles.append(
TensorTile(
(M, K),
offset=A_offset,
sizes=A_sizes,
strides=A_strides,
)
)

# B input transfer:
@@ -444,29 +497,45 @@
# |0011 0011 |
# ----------------
B_col_offset = col * n if not b_col_maj else col * n * K
if not b_col_maj:
B_sizes = [N // n // n_aie_cols, K // k, k, n]
B_strides = [n * n_aie_cols, k * N, N, 1]
else:
B_sizes = [N // n // n_aie_cols, K // k, n, k]
B_strides = [n * n_aie_cols * K, k, K, 1]

npu_dma_memcpy_nd(
metadata=B_l3l2_fifos[col],
bd_id=bd_id_base + 2 * tile_row + 2,
mem=B,
offsets=[0, 0, 0, B_col_offset],
sizes=(
[N // n // n_aie_cols, K // k, k, n]
if not b_col_maj
else [N // n // n_aie_cols, K // k, n, k]
),
strides=(
[n * n_aie_cols, k * N, N, 1]
if not b_col_maj
else [n * n_aie_cols * K, k, K, 1]
),
sizes=B_sizes,
strides=B_strides,
)
# Use the calculated sizes/strides/offsets to record the data movement
# caused by the above call to npu_dma_memcpy_nd.
# This line does not change MLIR output at all.
B_tensor_tiles.append(
TensorTile(
(K, N),
offset=B_col_offset,
sizes=B_sizes,
strides=B_strides,
)
)
if tb > 0 or (tb == 0 and pingpong > 0):
dma_wait(*C_l2l3_fifos)
dma_wait(*C_l2l3_fifos)

if generate_tiles:
# If generate_tiles is True, return TensorTileSequences representing all the
# npu_dma_memcpy_nd runtime sequence operations per input/output tensor.
return (
TensorTileSequence.from_tiles(A_tensor_tiles),
TensorTileSequence.from_tiles(B_tensor_tiles),
TensorTileSequence.from_tiles(C_tensor_tiles),
)


if __name__ == "__main__":
    main()
else:
    print("Not meant to be imported")
    sys.exit(1)