TilerHelper and 2D Tiling Visualizations (#1870)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
hunhoffe and github-actions[bot] authored Nov 18, 2024
1 parent cc3f772 commit 3b5799b
Showing 61 changed files with 7,855 additions and 94 deletions.
4 changes: 4 additions & 0 deletions programming_examples/basic/dma_transpose/Makefile
@@ -44,5 +44,9 @@ endif
run: ${targetname}.exe build/final.xclbin
	${powershell} ./$< -x build/final.xclbin -i build/insts.txt -k MLIR_AIE --M ${M} --K ${K}

generate_access_map: ${srcdir}/aie2.py
	mkdir -p ${@D}
	python3 $< --generate-access-map ${M} ${K}

clean:
	rm -rf build _build inst ${targetname}.exe
15 changes: 14 additions & 1 deletion programming_examples/basic/dma_transpose/README.md
@@ -15,11 +15,24 @@ This reference design can be run on a Ryzen™ AI NPU.
In the [design](./aie2.py), a 2-D array in a row-major layout is read from external memory to `ComputeTile2` with a transposed layout,
by using an implicit copy via the compute tile's Data Movement Accelerator (DMA). The data is read from and written to external memory through the Shim tile (`col`, 0).

This data movement transformation can be visualized as a map that shows the order in which the data is streamed (here, in a transposed layout):
<p align="center">
<img
src="transpose_data.png">
<h3 align="center"> Visualization of the Transpose Data Transformation for M=64, K=32.
</h3>
</p>
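
For intuition, the streamed order can be reproduced with plain NumPy. This is an illustrative sketch (small hypothetical M and K, not part of the design): the innermost DMA dimension of size `M` strides by `K` through the row-major buffer, which walks one column at a time.

```python
import numpy as np

# Hypothetical small dimensions for illustration only.
M, K = 4, 3
A = np.arange(M * K, dtype=np.int32).reshape(M, K)  # row-major (M, K) source
flat = A.ravel()

# Emulate sizes=[1, 1, K, M], strides=[1, 1, 1, K]: outer loop over the K
# columns, inner loop over M elements, each stepping K through the flat buffer.
streamed = [flat[col + row * K] for col in range(K) for row in range(M)]

assert np.array_equal(np.array(streamed).reshape(K, M), A.T)  # transposed order
```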

The implicit copy is performed using the `object_fifo_link` operation that specifies how input data arriving via `of_in` should be sent further via `of_out` by specifically leveraging the compute tile's DMA. This operation and its functionality are described in more depth in [Section-2b](../../../programming_guide/section-2/section-2b/README.md/#object-fifo-link) of the programming guide.
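
For context, here is a condensed sketch of how such a link is declared; the tile handles and FIFO depths are assumed to match the design, and these statements live inside the design's `@device` body rather than standing alone:

```python
# Sketch only, assuming ShimTile, ComputeTile2, and tensor_ty as in aie2.py.
of_in = object_fifo("in", ShimTile, ComputeTile2, 2, tensor_ty)    # L3 -> compute tile
of_out = object_fifo("out", ComputeTile2, ShimTile, 2, tensor_ty)  # compute tile -> L3
object_fifo_link(of_in, of_out)  # forward of_in to of_out via the tile's DMA
```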


To compile and run the design for NPU:
```bash
make
make run
```

To generate a data visualization of the transpose (like the one shown above), run:
```bash
make generate_access_map
```
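
The Makefile target invokes `aie2.py --generate-access-map`, which delegates to the `TensorTile` helper added by this commit. A minimal sketch of doing the same directly (dimensions taken from the caption above):

```python
from aie.helpers.tensortiler import TensorTile

# Describe the transposed read of a row-major (M, K) tensor and render it.
M, K = 64, 32
tile = TensorTile((M, K), offset=0, sizes=[1, 1, K, M], strides=[1, 1, 1, K])
tile.visualize(
    show_arrows=True, plot_access_count=False, file_path="transpose_data.png"
)
```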
48 changes: 34 additions & 14 deletions programming_examples/basic/dma_transpose/aie2.py
@@ -5,27 +5,28 @@
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
#
# (c) Copyright 2024 Advanced Micro Devices, Inc. or its affiliates
import argparse
import numpy as np
import sys

from aie.dialects.aie import *
from aie.dialects.aiex import *
from aie.extras.context import mlir_mod_ctx
from aie.helpers.dialects.ext.scf import _for as range_
from aie.helpers.tensortiler import TensorTile

N = 4096
M = 64
K = 64

if len(sys.argv) == 3:
M = int(sys.argv[1])
K = int(sys.argv[2])
N = M * K
def my_passthrough(M, K, N, generate_access_map=False):
    tensor_ty = np.ndarray[(M, K), np.dtype[np.int32]]
    data_transform = TensorTile(
        (M, K), offset=0, sizes=[1, 1, K, M], strides=[1, 1, 1, K]
    )
    if generate_access_map:
        data_transform.visualize(
            show_arrows=True, plot_access_count=False, file_path="transpose_data.png"
        )
        return

tensor_ty = np.ndarray[(M, K), np.dtype[np.int32]]


def my_passthrough():
with mlir_mod_ctx() as ctx:

@device(AIEDevice.npu1_1col)
@@ -56,8 +57,7 @@ def sequence(A, B, C):
metadata=of_in,
bd_id=1,
mem=A,
sizes=[1, 1, K, M],
strides=[1, 1, 1, K],
tensor_tile=data_transform,
issue_token=True,
)
npu_dma_memcpy_nd(metadata=of_out, bd_id=0, mem=C, sizes=[1, 1, 1, N])
@@ -66,4 +66,24 @@ def sequence(A, B, C):
print(ctx.module)


my_passthrough()
if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("dims", help="M K", type=int, nargs="*", default=[64, 64])
    p.add_argument(
        "--generate-access-map",
        action="store_true",
        help="Produce a file showing data access order",
    )
    args = p.parse_args()

    if len(args.dims) != 2:
        print(
            "ERROR: Must provide either no dimensions or both M and K", file=sys.stderr
        )
        sys.exit(1)
    my_passthrough(
        M=args.dims[0],
        K=args.dims[1],
        N=args.dims[0] * args.dims[1],
        generate_access_map=args.generate_access_map,
    )
@@ -141,6 +141,30 @@ Of note is the `object_fifo_link()` operation. This operation establishes a conn

We assume our data are stored in **row-major format** in the host's memory. For processing on the AIE compute cores, we need to transform the data layouts such that the above-listed *sub-matrix tiles* are laid out contiguously in AIE compute core memory. Thankfully, AIE hardware has extensive support for transforming data at zero cost using the DMAs as it is received and sent. In the following, we explain how we make use of this hardware feature to transform our data.
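
As a mental model (a sketch of the descriptor semantics, not the exact hardware implementation), a four-dimensional `sizes`/`strides` pair streams elements of a flat buffer in the following loop order, outermost dimension first:

```python
def stream_order(offset, sizes, strides):
    """Yield flat-buffer indices in the order a 4-D DMA descriptor visits them."""
    for d0 in range(sizes[0]):
        for d1 in range(sizes[1]):
            for d2 in range(sizes[2]):
                for d3 in range(sizes[3]):
                    yield (offset + d0 * strides[0] + d1 * strides[1]
                           + d2 * strides[2] + d3 * strides[3])
```

Every `npu_dma_memcpy_nd` call below can be read through this lens; the `TensorTile` objects added by this commit capture exactly these parameters so the resulting access pattern can be plotted.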

#### Runtime Sequence Tiling and Data Layout Transformations Notebook

A notebook is included that visualizes the runtime sequence `npu_dma_memcpy_nd` operations used to transfer matrices A, B, and C.

To run the notebook:
* Start a jupyter server at the root directory of your clone of `mlir-aie`.
Make sure you use a terminal that has run the `utils/setup_env.sh` script
so that the correct environment variables are propagated to jupyter.
Below is an example of how to start a jupyter server:
```bash
python3 -m jupyter notebook --no-browser --port=8080
```
* In your browser, navigate to the URL (which includes a token) found in the output of the above command.
* Navigate to `programming_examples/basic/matrix_multiplication/whole_array`
* Double click `mat_mul_whole_array_visualization.ipynb` to start the notebook; choose the ipykernel called `ironenv`.
* You should now be good to go! Note that generating the animations in the notebook can take several minutes.

#### Run the Notebook as a Script
```bash
make clean
make run
```

##### Tiling to Vector Intrinsic Size

The `memA_fifos` and `memB_fifos` receive sub-matrices of size `m`&times;`k` and `k`&times;`n`, respectively. The FIFOs translate those matrices from a row-major format (or, alternatively, column-major for `B` if `b_col_maj` is set) into the `r`&times;`s`-sized and `s`&times;`t`-sized blocks required by the hardware's vector intrinsics before sending them into the compute cores' memory.
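
A rough NumPy analogue of this blocking (illustrative sizes only; the real `r`, `s`, and `t` depend on the datatype's vector intrinsic):

```python
import numpy as np

# Hypothetical tile and block sizes for illustration.
m, k, r, s = 8, 8, 4, 2
A = np.arange(m * k).reshape(m, k)  # row-major m x k sub-matrix

# Split into an (m//r) x (k//s) grid of r x s blocks; flattening `blocked`
# stores each r x s block contiguously, as the vector intrinsics require.
blocked = A.reshape(m // r, r, k // s, s).swapaxes(1, 2)
```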
119 changes: 94 additions & 25 deletions programming_examples/basic/matrix_multiplication/whole_array/aie2.py
@@ -14,6 +14,7 @@
from aie.dialects.aie import *
from aie.dialects.aiex import *
from aie.helpers.dialects.ext.scf import _for as range_
from aie.helpers.tensortiler import TensorTile, TensorTileSequence

dtype_map = {
"bf16": bfloat16,
@@ -47,9 +48,15 @@ def main():
default="i16",
)
argparser.add_argument("--trace_size", type=int, default=0)
argparser.add_argument(
"--generate-tiles",
action="store_true",
help="Generate TensorTiles, a Python object to represent each data transfer"
"of the input/output matrices. These objects can be used for visualization.",
)
args = argparser.parse_args()
with mlir_mod_ctx() as ctx:
my_matmul(
maybe_tiles = my_matmul(
args.M,
args.K,
args.N,
@@ -61,19 +68,33 @@
args.dtype_out,
args.b_col_maj,
args.trace_size,
args.generate_tiles,
)
# print(ctx.module.operation.verify())
print(ctx.module)

if args.generate_tiles:
return maybe_tiles


def ceildiv(a, b):
return (a + b - 1) // b


def my_matmul(
M, K, N, m, k, n, n_aie_cols, dtype_in_str, dtype_out_str, b_col_maj, trace_size
M,
K,
N,
m,
k,
n,
n_aie_cols,
dtype_in_str,
dtype_out_str,
b_col_maj,
trace_size,
generate_tiles=False,
):

n_aie_rows = 4
n_aie_cores = n_aie_rows * n_aie_cols

@@ -148,6 +169,12 @@ def my_matmul(
elif n_aie_cols == 4:
dev = AIEDevice.npu1_4col

# These will hold TensorTile objects that represent the runtime
# npu_dma_memcpy_nd operations of this design. They are only used if generate_tiles is True.
A_tensor_tiles = []
B_tensor_tiles = []
C_tensor_tiles = []

@device(dev)
def device_body():
A_l2_ty = np.ndarray[(m * k * n_A_tiles_per_shim,), np.dtype[dtype_in]]
@@ -375,13 +402,26 @@ def sequence(A, B, C):
C_row_offset = row_base * m * n_aie_rows * N
C_col_offset = col * n
C_offset = C_col_offset + C_row_offset
C_sizes = [tb_n_rows, N // n // n_aie_cols, m * n_aie_rows, n]
C_strides = [m * n_aie_rows * N, n * n_aie_cols, N, 1]
npu_dma_memcpy_nd(
metadata=C_l2l3_fifos[col],
bd_id=bd_id_base,
mem=C,
offsets=[0, 0, 0, C_offset],
sizes=[tb_n_rows, N // n // n_aie_cols, m * n_aie_rows, n],
strides=[m * n_aie_rows * N, n * n_aie_cols, N, 1],
sizes=C_sizes,
strides=C_strides,
)
# Use the calculated sizes/strides/offsets to record the data movement
# caused by the above call to npu_dma_memcpy_nd.
# This line does not change MLIR output at all.
C_tensor_tiles.append(
TensorTile(
(M, N),
offset=C_offset,
sizes=C_sizes,
strides=C_strides,
)
)

for tile_row in range(tb_n_rows):
@@ -411,18 +451,31 @@ def sequence(A, B, C):
col * n_A_tiles_per_shim * m * K
) # base address for the shim in this column
A_offset = A_block_offset + A_row_offset
A_sizes = [
N // n // n_aie_cols,
K // k,
m * n_A_tiles_per_shim,
k,
]
A_strides = [0, k, K, 1]
npu_dma_memcpy_nd(
metadata=A_l3l2_fifos[col],
bd_id=bd_id_base + 2 * tile_row + 1,
mem=A,
offsets=[0, 0, 0, A_offset],
sizes=[
N // n // n_aie_cols,
K // k,
m * n_A_tiles_per_shim,
k,
],
strides=[0, k, K, 1],
sizes=A_sizes,
strides=A_strides,
)
# Use the calculated sizes/strides/offsets to record the data movement
# caused by the above call to npu_dma_memcpy_nd.
# This line does not change MLIR output at all.
A_tensor_tiles.append(
TensorTile(
(M, K),
offset=A_offset,
sizes=A_sizes,
strides=A_strides,
)
)

# B input transfer:
@@ -444,29 +497,45 @@
# |0011 0011 |
# ----------------
B_col_offset = col * n if not b_col_maj else col * n * K
if not b_col_maj:
B_sizes = [N // n // n_aie_cols, K // k, k, n]
B_strides = [n * n_aie_cols, k * N, N, 1]
else:
B_sizes = [N // n // n_aie_cols, K // k, n, k]
B_strides = [n * n_aie_cols * K, k, K, 1]

npu_dma_memcpy_nd(
metadata=B_l3l2_fifos[col],
bd_id=bd_id_base + 2 * tile_row + 2,
mem=B,
offsets=[0, 0, 0, B_col_offset],
sizes=(
[N // n // n_aie_cols, K // k, k, n]
if not b_col_maj
else [N // n // n_aie_cols, K // k, n, k]
),
strides=(
[n * n_aie_cols, k * N, N, 1]
if not b_col_maj
else [n * n_aie_cols * K, k, K, 1]
),
sizes=B_sizes,
strides=B_strides,
)
# Use the calculated sizes/strides/offsets to record the data movement
# caused by the above call to npu_dma_memcpy_nd.
# This line does not change MLIR output at all.
B_tensor_tiles.append(
TensorTile(
(K, N),
offset=B_col_offset,
sizes=B_sizes,
strides=B_strides,
)
)
if tb > 0 or (tb == 0 and pingpong > 0):
dma_wait(*C_l2l3_fifos)
dma_wait(*C_l2l3_fifos)

if generate_tiles:
# If generate_tiles is True, return TensorTileSequences representing all the
# npu_dma_memcpy_nd runtime sequence operations per input/output tensor.
return (
TensorTileSequence.from_tiles(A_tensor_tiles),
TensorTileSequence.from_tiles(B_tensor_tiles),
TensorTileSequence.from_tiles(C_tensor_tiles),
)


if __name__ == "__main__":
    main()
else:
    print("Not meant to be imported")
    sys.exit(1)