From 81ff8e9202729b031ad438c439468e0e2f3a4d4a Mon Sep 17 00:00:00 2001
From: Joseph Melber
Date: Tue, 16 Apr 2024 23:26:43 -0600
Subject: [PATCH] [ASPLOS] Passthrough Kernel README and test.py (#1262)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Kristof Denolf
Co-authored-by: Jack Lo
Co-authored-by: Jack Lo <36210336+jackl-xilinx@users.noreply.github.com>
---
 .../basic/passthrough_kernel/Makefile  |  14 +-
 .../basic/passthrough_kernel/README.md | 100 +++++++++
 .../basic/passthrough_kernel/aie2.py   | 203 ++++++------------
 .../basic/passthrough_kernel/run.lit   |   1 +
 .../basic/passthrough_kernel/test.cpp  |  26 +--
 .../basic/passthrough_kernel/test.py   | 111 ++++++++++
 python/utils/test.py                   |   2 +-
 python/utils/xrt.py                    |   2 +-
 8 files changed, 310 insertions(+), 149 deletions(-)
 create mode 100644 programming_examples/basic/passthrough_kernel/README.md
 create mode 100644 programming_examples/basic/passthrough_kernel/test.py

diff --git a/programming_examples/basic/passthrough_kernel/Makefile b/programming_examples/basic/passthrough_kernel/Makefile
index 440068afe5..fbfc7580c4 100644
--- a/programming_examples/basic/passthrough_kernel/Makefile
+++ b/programming_examples/basic/passthrough_kernel/Makefile
@@ -43,7 +43,19 @@ else
 endif
 
 run: ${targetname}.exe build/final_${PASSTHROUGH_SIZE}.xclbin build/insts.txt
-	${powershell} ./$< -x build/final_${PASSTHROUGH_SIZE}.xclbin -i build/insts.txt -k MLIR_AIE
+	${powershell} ./$< -x build/final_${PASSTHROUGH_SIZE}.xclbin -i build/insts.txt -k MLIR_AIE
+
+run-g: ${targetname}.exe build/final_${PASSTHROUGH_SIZE}.xclbin build/insts.txt
+	${powershell} ./$< -x build/final_${PASSTHROUGH_SIZE}.xclbin -i build/insts.txt -k MLIR_AIE -t 8192
+
+run_py: build/final_${PASSTHROUGH_SIZE}.xclbin build/insts.txt
+	${powershell} python3 test.py -s ${PASSTHROUGH_SIZE} -x build/final_${PASSTHROUGH_SIZE}.xclbin -i build/insts.txt -k MLIR_AIE
+
+trace:
+	../../utils/parse_eventIR.py --filename trace.txt --mlir build/aie.mlir --colshift 1 > parse_eventIR_vs.json
+
+clean_trace:
+	rm -rf tmpTrace trace.txt
 
 clean:
 	rm -rf build _build ${targetname}.exe

diff --git a/programming_examples/basic/passthrough_kernel/README.md b/programming_examples/basic/passthrough_kernel/README.md
new file mode 100644
index 0000000000..ddbcf970dc
--- /dev/null
+++ b/programming_examples/basic/passthrough_kernel/README.md
@@ -0,0 +1,100 @@
+
+
+# Passthrough Kernel:
+
+This IRON design flow example, called "Passthrough Kernel", demonstrates the process of creating a simple AIE implementation for a vectorized memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length of `4096`. The kernel is configured to work on `1024`-element subvectors and is invoked multiple times to complete the full copy. The example consists of two primary design files, `aie2.py` and `passThrough.cc`, and a testbench, `test.cpp` or `test.py`.
+
+## Source Files Overview
+
+1. `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., an XCLBIN and insts.txt for the NPU in Ryzen AI).
+
+1. `passThrough.cc`: A C++ implementation of vectorized memcpy operations for AIE cores. Found [here](../../../aie_kernels/generic/passThrough.cc).
+
+1. `test.cpp`: This C++ code is a testbench for the Passthrough Kernel design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After execution, the program verifies the memcpy results and optionally outputs trace data.
+
+1. `test.py`: This Python code is a testbench for the Passthrough Kernel design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After execution, the script verifies the memcpy results and optionally outputs trace data.
+
+## Design Overview
+
+
+
+This simple example effectively passes data through a single compute tile in the NPU's AIE array. The design is described as shown in the figure to the right. The overall design flow is as follows:
+1. An object FIFO called "of_in" connects a Shim Tile to a Compute Tile, and another called "of_out" connects the Compute Tile back to the Shim Tile.
+1. The runtime data movement is expressed to read `4096` uint8_t values from host memory into the compute tile and write the `4096` values back to host memory.
+1. The compute tile acquires this input data in "object"-sized (`1024`) blocks from "of_in" and copies them to another output "object" it has acquired from "of_out". Note that a vectorized kernel running on the Compute Tile's AIE core copies the data from the input "object" to the output "object".
+1. After the vectorized copy is performed, the Compute Tile releases the "objects", allowing the DMAs (abstracted by the object FIFO) to transfer the data back to host memory and copy additional blocks into the Compute Tile, via "of_out" and "of_in" respectively.
+
+It is important to note that the Shim Tile and Compute Tile DMAs move data concurrently, and the Compute Tile's AIE core also processes data concurrently with the data movement. This is made possible by declaring a depth of `2`, for example in `object_fifo("in", ShimTile, ComputeTile2, 2, memRef_ty)`, to denote ping-pong buffers.
+
+## Design Component Details
+
+### AIE Array Structural Design
+
+This design performs a memcpy operation on a vector of input data. The AIE design is described in a Python module as follows:
+
+1. **Constants & Configuration:** The script defines input/output dimensions (`N`, `n`), buffer sizes in `lineWidthInBytes` and `lineWidthInInt32s`, and tracing support.
+
+1. **AIE Device Definition:** `@device` defines the target device. The `device_body` function contains the AIE array design definition.
+
+1. **Kernel Function Declarations:** `passThroughLine` is an external function imported from `passThrough.cc`.
+
+1. **Tile Definitions:** `ShimTile` handles data movement, and `ComputeTile2` performs the memcpy operation.
+
+1. **Object Fifos:** `of_in` and `of_out` are defined to facilitate communication between `ShimTile` and `ComputeTile2`.
+
+1. **Tracing Flow Setup (Optional):** A circuit-switched flow is set up for tracing information when enabled.
+
+1. **Core Definition:** The `core_body` function loops through sub-vectors of the input data, acquiring elements from `of_in`, processing them using `passThroughLine`, and outputting the result to `of_out`.
+
+1. **Data Movement Configuration:** The `sequence` function configures data movement and synchronization on the `ShimTile` for input and output buffer management.
+
+1. **Tracing Configuration (Optional):** Trace control, event groups, and buffer descriptors are set up in the `sequence` function when tracing is enabled.
+
+1. **Generate the design:** The `passthroughKernel()` function triggers the code generation process. The final print statement outputs the MLIR representation of the AIE array configuration.
+
+### AIE Core Kernel Code
+
+`passThrough.cc` contains a C++ implementation of a vectorized memcpy operation designed for AIE cores. It consists of two main sections:
+
+1. **Vectorized Copying:** The `passThrough_aie()` function processes multiple data elements simultaneously, taking advantage of AIE vector datapath capabilities to load, copy, and store data elements.
+
+1. **C-style Wrapper Functions:** `passThroughLine()` and `passThroughTile()` are two C-style wrapper functions that call the templated `passThrough_aie()` vectorized memcpy implementation from the AIE design implemented in `aie2.py`. The `passThroughLine()` and `passThroughTile()` functions are compiled for `uint8_t`, `int16_t`, or `int32_t` data, depending on the value of the `BIT_WIDTH` variable.
+
+## Usage
+
+### C++ Testbench
+
+To compile the design and C++ testbench:
+
+```
+make
+make build/passThroughKernel.exe
+```
+
+To run the design:
+
+```
+make run
+```
+
+### Python Testbench
+
+To compile the design and run the Python testbench:
+
+```
+make
+```
+
+To run the design:
+
+```
+python3 test.py -s 4096 -x build/final_4096.xclbin -i build/insts.txt -k MLIR_AIE
+```
diff --git a/programming_examples/basic/passthrough_kernel/aie2.py b/programming_examples/basic/passthrough_kernel/aie2.py
index b401f5801f..baec4415fa 100644
--- a/programming_examples/basic/passthrough_kernel/aie2.py
+++ b/programming_examples/basic/passthrough_kernel/aie2.py
@@ -12,6 +12,8 @@
 from aie.dialects.scf import *
 from aie.extras.context import mlir_mod_ctx
 
+import aie.utils.trace as trace_utils
+
 N = 1024
 
 if len(sys.argv) == 2:
@@ -26,145 +28,80 @@ def passthroughKernel():
-    with mlir_mod_ctx() as ctx:
-        @device(AIEDevice.ipu)
-        def device_body():
-            # define types
-            memRef_ty = T.memref(lineWidthInBytes, T.ui8())
+    @device(AIEDevice.ipu)
+    def device_body():
+        # define types
+        memRef_ty = T.memref(lineWidthInBytes, T.ui8())
 
-            # AIE Core Function declarations
-            passThroughLine = external_func(
-                "passThroughLine", inputs=[memRef_ty, memRef_ty, T.i32()]
-            )
+        # AIE Core Function declarations
+        passThroughLine = external_func(
+            "passThroughLine", inputs=[memRef_ty, memRef_ty, T.i32()]
+        )
+
+        # Tile declarations
+        ShimTile = tile(0, 0)
+        ComputeTile2 = tile(0, 2)
+
+        if enableTrace:
+            flow(ComputeTile2, WireBundle.Trace, 0, ShimTile, WireBundle.DMA, 1)
+
+        # AIE-array data movement with object fifos
+        of_in = object_fifo("in", ShimTile, ComputeTile2, 2, memRef_ty)
+        of_out = object_fifo("out", ComputeTile2, ShimTile, 2, memRef_ty)
+
+        # Set up compute tiles
+
+        # Compute tile 2
+        @core(ComputeTile2, "passThrough.cc.o")
+        def core_body():
+            for _ in for_(sys.maxsize):
+                elemOut = of_out.acquire(ObjectFifoPort.Produce, 1)
+                elemIn = of_in.acquire(ObjectFifoPort.Consume, 1)
+                call(passThroughLine, [elemIn, elemOut, lineWidthInBytes])
+                of_in.release(ObjectFifoPort.Consume, 1)
+                of_out.release(ObjectFifoPort.Produce, 1)
+                yield_([])
 
-            # Tile declarations
-            ShimTile = tile(0, 0)
-            ComputeTile2 = tile(0, 2)
+        # print(ctx.module.operation.verify())
 
+        tensorSize = N
+        tensorSizeInInt32s = tensorSize // 4
+        tensor_ty = T.memref(lineWidthInInt32s, T.i32())
+
+        compute_tile2_col, compute_tile2_row = 0, 2
+
+        @FuncOp.from_py_func(tensor_ty, tensor_ty, tensor_ty)
+        def sequence(inTensor, outTensor, notUsed):
             if enableTrace:
-                flow(ComputeTile2, "Trace", 0, ShimTile, "DMA", 1)
-
-            # AIE-array data movement with object fifos
-            of_in = object_fifo("in", ShimTile, ComputeTile2, 2, memRef_ty)
-            of_out = object_fifo("out", ComputeTile2, ShimTile, 2, memRef_ty)
-
-            # Set up compute tiles
-
-            # Compute tile 2
-            @core(ComputeTile2, "passThrough.cc.o")
-            def core_body():
-                for _ in for_(sys.maxsize):
-                    elemOut = of_out.acquire(ObjectFifoPort.Produce, 1)
-                    elemIn = of_in.acquire(ObjectFifoPort.Consume, 1)
-                    call(passThroughLine, [elemIn, elemOut, lineWidthInBytes])
-                    of_in.release(ObjectFifoPort.Consume, 1)
-                    of_out.release(ObjectFifoPort.Produce, 1)
-                    yield_([])
-
-            # print(ctx.module.operation.verify())
-
-            tensorSize = N
-            tensorSizeInInt32s = tensorSize // 4
-            tensor_ty = T.memref(lineWidthInInt32s, T.i32())
-
-            @FuncOp.from_py_func(tensor_ty, tensor_ty, tensor_ty)
-            def sequence(inTensor, outTensor, notUsed):
-                if enableTrace:
-                    # Trace output
-
-                    # Trace_Event0, Trace_Event1: Select which events to trace.
-                    # Note that the event buffers only appear to be transferred to DDR in
-                    # bursts of 256 bytes. If less than 256 bytes are written, you may not
-                    # see trace output, or only see it on the next iteration of your
-                    # kernel invocation, as the buffer gets filled up. Note that, even
-                    # though events are encoded as 4 byte words, it may take more than 64
-                    # events to fill the buffer to 256 bytes and cause a flush, since
-                    # multiple repeating events can be 'compressed' by the trace mechanism.
-                    # In order to always generate sufficient events, we add the "assert
-                    # TRUE" event to one slot, which fires every cycle, and thus fills our
-                    # buffer quickly.
-
-                    # Some events:
-                    # TRUE (0x01)
-                    # STREAM_STALL (0x18)
-                    # LOCK_STALL (0x1A)
-                    # EVENTS_CORE_INSTR_EVENT_1 (0x22)
-                    # EVENTS_CORE_INSTR_EVENT_0 (0x21)
-                    # INSTR_VECTOR (0x25) Core executes a vecotr MAC, ADD or compare instruction
-                    # INSTR_LOCK_ACQUIRE_REQ (0x2C) Core executes a lock acquire instruction
-                    # INSTR_LOCK_RELEASE_REQ (0x2D) Core executes a lock release instruction
-                    # EVENTS_CORE_PORT_RUNNING_1 (0x4F)
-                    # EVENTS_CORE_PORT_RUNNING_0 (0x4B)
-
-                    # Trace_Event0 (4 slots)
-                    IpuWrite32(0, 2, 0x340E0, 0x4B222125)
-                    # Trace_Event1 (4 slots)
-                    IpuWrite32(0, 2, 0x340E4, 0x2D2C1A4F)
-
-                    # Event slots as configured above:
-                    # 0: Kernel executes vector instruction
-                    # 1: Event 0 -- Kernel starts
-                    # 2: Event 1 -- Kernel done
-                    # 3: Port_Running_0
-                    # 4: Port_Running_1
-                    # 5: Lock Stall
-                    # 6: Lock Acquire Instr
-                    # 7: Lock Release Instr
-
-                    # Stream_Switch_Event_Port_Selection_0
-                    # This is necessary to capture the Port_Running_0 and Port_Running_1 events
-                    IpuWrite32(0, 2, 0x3FF00, 0x121)
-
-                    # Trace_Control0: Define trace start and stop triggers. Set start event TRUE.
-                    IpuWrite32(0, 2, 0x340D0, 0x10000)
-
-                    # Start trace copy out.
-                    IpuWriteBdShimTile(
-                        bd_id=3,
-                        buffer_length=traceSizeInBytes,
-                        buffer_offset=tensorSize,
-                        enable_packet=0,
-                        out_of_order_id=0,
-                        packet_id=0,
-                        packet_type=0,
-                        column=0,
-                        column_num=1,
-                        d0_stride=0,
-                        d0_wrap=0,
-                        d1_stride=0,
-                        d1_wrap=0,
-                        d2_stride=0,
-                        ddr_id=2,
-                        iteration_current=0,
-                        iteration_stride=0,
-                        iteration_wrap=0,
-                        lock_acq_enable=0,
-                        lock_acq_id=0,
-                        lock_acq_val=0,
-                        lock_rel_id=0,
-                        lock_rel_val=0,
-                        next_bd=0,
-                        use_next_bd=0,
-                        valid_bd=1,
-                    )
-                    IpuWrite32(0, 0, 0x1D20C, 0x3)
-
-                ipu_dma_memcpy_nd(
-                    metadata="in",
-                    bd_id=0,
-                    mem=inTensor,
-                    sizes=[1, 1, 1, tensorSizeInInt32s],
+                trace_utils.configure_simple_tracing_aie2(
+                    ComputeTile2,
+                    ShimTile,
+                    channel=1,
+                    bd_id=13,
+                    ddr_id=1,
+                    size=traceSizeInBytes,
+                    offset=tensorSize,
+                    start=0x1,
+                    stop=0x0,
+                    events=[0x4B, 0x22, 0x21, 0x25, 0x2D, 0x2C, 0x1A, 0x4F],
                 )
-                ipu_dma_memcpy_nd(
-                    metadata="out",
-                    bd_id=1,
-                    mem=outTensor,
-                    sizes=[1, 1, 1, tensorSizeInInt32s],
-                )
-                ipu_sync(column=0, row=0, direction=0, channel=0)
-        print(ctx.module)
+
+            ipu_dma_memcpy_nd(
+                metadata="in",
+                bd_id=0,
+                mem=inTensor,
+                sizes=[1, 1, 1, tensorSizeInInt32s],
+            )
+            ipu_dma_memcpy_nd(
+                metadata="out",
+                bd_id=1,
+                mem=outTensor,
+                sizes=[1, 1, 1, tensorSizeInInt32s],
+            )
+            ipu_sync(column=0, row=0, direction=0, channel=0)
 
-passthroughKernel()
+with mlir_mod_ctx() as ctx:
+    passthroughKernel()
+    print(ctx.module)
diff --git a/programming_examples/basic/passthrough_kernel/run.lit b/programming_examples/basic/passthrough_kernel/run.lit
index 4e81d99fde..7ff95e8494 100644
--- a/programming_examples/basic/passthrough_kernel/run.lit
+++ b/programming_examples/basic/passthrough_kernel/run.lit
@@ -8,4 +8,5 @@
 // RUN: %python aiecc.py --xbridge --aie-generate-cdo --aie-generate-ipu --no-compile-host --xclbin-name=aie.xclbin --ipu-insts-name=insts.txt ./aie.mlir
 // RUN: clang %S/test.cpp -o test.exe -std=c++11 -Wall -DPASSTHROUGH_SIZE=4096 -I%S/../../../runtime_lib/test_lib %S/../../../runtime_lib/test_lib/test_utils.cpp %xrt_flags -lrt -lstdc++ -lboost_program_options -lboost_filesystem
 // RUN: %run_on_ipu ./test.exe -x aie.xclbin -k MLIR_AIE -i insts.txt | FileCheck %s
+// RUN: %run_on_ipu %python %S/test.py -x aie.xclbin -i insts.txt -k MLIR_AIE -s 4096 | FileCheck %s
 // CHECK: PASS!
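
The runtime sequence above moves the `4096`-byte uint8_t tensor as `tensorSizeInInt32s = tensorSize // 4` 32-bit words (`sizes=[1, 1, 1, tensorSizeInInt32s]`). The following standalone sketch, which is not part of the patch, illustrates that byte-to-word packing with NumPy; the array names are purely illustrative.

```python
import numpy as np

tensorSize = 4096                     # bytes, as in aie2.py
tensorSizeInInt32s = tensorSize // 4  # 1024 32-bit words moved by the shim DMA

data_u8 = np.arange(tensorSize, dtype=np.uint8)  # hypothetical host buffer
data_i32 = data_u8.view(np.int32)                # the same bytes, viewed as 32-bit words

assert data_i32.size == tensorSizeInInt32s
assert np.array_equal(data_i32.view(np.uint8), data_u8)  # round-trips losslessly
```
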
diff --git a/programming_examples/basic/passthrough_kernel/test.cpp b/programming_examples/basic/passthrough_kernel/test.cpp
index 7b8779ca13..49e9564c18 100644
--- a/programming_examples/basic/passthrough_kernel/test.cpp
+++ b/programming_examples/basic/passthrough_kernel/test.cpp
@@ -30,24 +30,17 @@ int main(int argc, const char *argv[]) {
 
   // Program arguments parsing
   po::options_description desc("Allowed options");
-  desc.add_options()("help,h", "produce help message")(
-      "xclbin,x", po::value<std::string>()->required(),
-      "the input xclbin path")(
-      "kernel,k", po::value<std::string>()->required(),
-      "the kernel name in the XCLBIN (for instance PP_PRE_FD)")(
-      "verbosity,v", po::value<int>()->default_value(0),
-      "the verbosity of the output")(
-      "instr,i", po::value<std::string>()->required(),
-      "path of file containing userspace instructions to be sent to the LX6");
   po::variables_map vm;
+  test_utils::add_default_options(desc);
   test_utils::parse_options(argc, argv, desc, vm);
+  int verbosity = vm["verbosity"].as<int>();
+  int trace_size = vm["trace_sz"].as<int>();
 
   // Load instruction sequence
   std::vector<uint32_t> instr_v =
       test_utils::load_instr_sequence(vm["instr"].as<std::string>());
-  int verbosity = vm["verbosity"].as<int>();
 
   if (verbosity >= 1)
     std::cout << "Sequence instr count: " << instr_v.size() << "\n";
@@ -64,8 +57,9 @@ int main(int argc, const char *argv[]) {
                             XCL_BO_FLAGS_CACHEABLE, kernel.group_id(0));
   auto bo_inA = xrt::bo(device, PASSTHROUGH_SIZE * sizeof(DATATYPE),
                         XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(2));
-  auto bo_out = xrt::bo(device, PASSTHROUGH_SIZE * sizeof(DATATYPE),
-                        XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(3));
+  auto bo_out =
+      xrt::bo(device, PASSTHROUGH_SIZE * sizeof(DATATYPE) + trace_size,
+              XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(3));
 
   if (verbosity >= 1)
     std::cout << "Writing data into buffer objects.\n";
@@ -81,7 +75,7 @@ int main(int argc, const char *argv[]) {
 
   // Zero out buffer bo_out
   DATATYPE *bufOut = bo_out.map<DATATYPE *>();
-  memset(bufOut, 0, PASSTHROUGH_SIZE * sizeof(DATATYPE));
+  memset(bufOut, 0, PASSTHROUGH_SIZE * sizeof(DATATYPE) + trace_size);
 
   // sync host to device memories
   bo_instr.sync(XCL_BO_SYNC_BO_TO_DEVICE);
@@ -104,6 +98,12 @@ int main(int argc, const char *argv[]) {
       errors++;
     }
 
+  if (trace_size > 0) {
+    test_utils::write_out_trace(((char *)bufOut) +
+                                    (PASSTHROUGH_SIZE * sizeof(DATATYPE)),
+                                trace_size, vm["trace_file"].as<std::string>());
+  }
+
   // Print Pass/Fail result of our test
   if (!errors) {
     std::cout << std::endl << "PASS!" << std::endl << std::endl;
diff --git a/programming_examples/basic/passthrough_kernel/test.py b/programming_examples/basic/passthrough_kernel/test.py
new file mode 100644
index 0000000000..8d33eca0d9
--- /dev/null
+++ b/programming_examples/basic/passthrough_kernel/test.py
@@ -0,0 +1,111 @@
+# test.py -*- Python -*-
+#
+# Copyright (C) 2024, Advanced Micro Devices, Inc. All rights reserved.
+# SPDX-License-Identifier: MIT
+
+import numpy as np
+import pyxrt as xrt
+import sys
+import time
+
+from aie.dialects.aie import *
+from aie.dialects.aiex import *
+from aie.dialects.scf import *
+from aie.extras.context import mlir_mod_ctx
+from aie.extras.dialects.ext import memref, arith
+
+import aie.utils.test as test_utils
+
+
+def main(opts):
+
+    # Load instruction sequence
+    with open(opts.instr, "r") as f:
+        instr_text = f.read().split("\n")
+        instr_text = [l for l in instr_text if l != ""]
+        instr_v = np.array([int(i, 16) for i in instr_text], dtype=np.uint32)
+
+    # ------------------------------------------------------------
+    # Configure this to match your design's buffer size and type
+    # ------------------------------------------------------------
+    INOUT0_VOLUME = int(opts.size)  # Input only, opts.size x uint8_t in this example
+    INOUT1_VOLUME = int(opts.size)  # Output only, opts.size x uint8_t in this example
+
+    INOUT0_DATATYPE = np.uint8
+    INOUT1_DATATYPE = np.uint8
+
+    INOUT0_SIZE = INOUT0_VOLUME * INOUT0_DATATYPE().itemsize
+    INOUT1_SIZE = INOUT1_VOLUME * INOUT1_DATATYPE().itemsize
+
+    # ------------------------------------------------------
+    # Get device, load the xclbin & kernel and register them
+    # ------------------------------------------------------
+    (device, kernel) = test_utils.init_xrt_load_kernel(opts)
+
+    # ------------------------------------------------------
+    # Initialize input/output buffer sizes and sync them
+    # ------------------------------------------------------
+    bo_instr = xrt.bo(device, len(instr_v) * 4, xrt.bo.cacheable, kernel.group_id(0))
+    bo_inout0 = xrt.bo(device, INOUT0_SIZE, xrt.bo.host_only, kernel.group_id(2))
+    bo_inout1 = xrt.bo(device, INOUT1_SIZE, xrt.bo.host_only, kernel.group_id(3))
+
+    # Initialize instruction buffer
+    bo_instr.write(instr_v, 0)
+
+    # Initialize data buffers
+    inout0 = np.arange(1, INOUT0_VOLUME + 1, dtype=INOUT0_DATATYPE)
+    inout1 = np.zeros(INOUT1_VOLUME, dtype=INOUT1_DATATYPE)
+    bo_inout0.write(inout0, 0)
+    bo_inout1.write(inout1, 0)
+
+    # Sync buffers to update input buffer values
+    bo_instr.sync(xrt.xclBOSyncDirection.XCL_BO_SYNC_BO_TO_DEVICE)
+    bo_inout0.sync(xrt.xclBOSyncDirection.XCL_BO_SYNC_BO_TO_DEVICE)
+    bo_inout1.sync(xrt.xclBOSyncDirection.XCL_BO_SYNC_BO_TO_DEVICE)
+
+    # ------------------------------------------------------
+    # Initialize run configs
+    # ------------------------------------------------------
+    errors = 0
+
+    # ------------------------------------------------------
+    # Main run loop
+    # ------------------------------------------------------
+
+    # Run kernel
+    if opts.verbosity >= 1:
+        print("Running Kernel.")
+    h = kernel(bo_instr, len(instr_v), bo_inout0, bo_inout1)
+    h.wait()
+    bo_inout1.sync(xrt.xclBOSyncDirection.XCL_BO_SYNC_BO_FROM_DEVICE)
+
+    # Copy output results and verify they are correct
+    out_size = INOUT1_SIZE
+    output_buffer = bo_inout1.read(out_size, 0).view(INOUT1_DATATYPE)
+    if opts.verify:
+        if opts.verbosity >= 1:
+            print("Verifying results ...")
+        ref = np.arange(1, INOUT0_VOLUME + 1, dtype=INOUT0_DATATYPE)
+        e = np.equal(output_buffer, ref)
+        errors = errors + np.size(e) - np.count_nonzero(e)
+
+    # ------------------------------------------------------
+    # Print verification and timing results
+    # ------------------------------------------------------
+
+    if not errors:
+        print("\nPASS!\n")
+        exit(0)
+    else:
+        print("\nError count: ", errors)
+        print("\nFailed.\n")
+        exit(-1)
+
+
+if __name__ == "__main__":
+    p = test_utils.create_default_argparser()
+    p.add_argument(
"-s", "--size", required=True, dest="size", help="Passthrough kernel size" + ) + opts = p.parse_args(sys.argv[1:]) + main(opts) diff --git a/python/utils/test.py b/python/utils/test.py index 81c3e51e59..b909f77139 100644 --- a/python/utils/test.py +++ b/python/utils/test.py @@ -1,4 +1,4 @@ -# ml.py -*- Python -*- +# test.py -*- Python -*- # # This file is licensed under the Apache License v2.0 with LLVM Exceptions. # See https://llvm.org/LICENSE.txt for license information. diff --git a/python/utils/xrt.py b/python/utils/xrt.py index fa36ff096a..c5df0b66ca 100644 --- a/python/utils/xrt.py +++ b/python/utils/xrt.py @@ -1,4 +1,4 @@ -# ml.py -*- Python -*- +# xrt.py -*- Python -*- # # This file is licensed under the Apache License v2.0 with LLVM Exceptions. # See https://llvm.org/LICENSE.txt for license information.