Add hoist mechanism for specific TTIR ops, with placeholder analysis pass #1586

Merged: 15 commits merged into main on Dec 20, 2024

Conversation

@vwellsTT (Contributor) commented on Dec 12, 2024:

Goal: Integrate an end-to-end path to compile and execute specific ops, or sets of ops, on the CPU.

Context:

The entire task will be split into (tentatively) 7 PRs, as follows:

  1. Hoist specific ops into isolated funcs in a separate module
  2. Convert TTIR ops to linalg ops within the module of hoisted funcs
  3. Build a pipeline to lower linalg to llvm from existing conversion passes
  4. Translate LLVM Dialect into a dynamic library for packing into flatbuffer
  5. Generate helper functions so that we can call all of our hoisted funcs with a common signature
  6. Insert TTNN instructions to move operands to host before executing hoisted func, then back to device afterwards
  7. Update the ttir-to-ttnn and ttnn-to-flatbuffer pipelines to use the new passes, generate dylibs and embed them into the output flatbuffers, and update the runtime to consume dylibs from flatbuffers

This PR implements step 1 above. Here, we build a (placeholder) hoisting analysis plus a transform pass to mark specific ops to be hoisted, then pull the marked ops into separate functions in a new "cpu" module and replace each original op with a call to its hoisted function.

Example:

input:
    module {
      func.func @add(%arg0: tensor<32x32xbf16>, %arg1: tensor<32x32xbf16>) -> tensor<32x32xbf16> {
        %0 = tensor.empty() : tensor<32x32xbf16>
        %1 = "ttir.add"(%arg0, %arg1, %0) <{operandSegmentSizes = array<i32: 2, 1>}> : (tensor<32x32xbf16>, tensor<32x32xbf16>, tensor<32x32xbf16>) -> tensor<32x32xbf16> loc("add_op1")
        return %1 : tensor<32x32xbf16>
      }
    }
output:
    module {
      func.func @add(%arg0: tensor<32x32xbf16>, %arg1: tensor<32x32xbf16>) -> tensor<32x32xbf16> {
        %0 = tensor.empty() : tensor<32x32xbf16>
        %1 = call @hoisted_ttir.add_32x32xbf16_32x32xbf16_32x32xbf16_func(%arg0, %arg1, %0) : (tensor<32x32xbf16>, tensor<32x32xbf16>, tensor<32x32xbf16>) -> tensor<32x32xbf16>
        return %1 : tensor<32x32xbf16>
      }
      module @cpu_module attributes {ttir.cpu_module} {
        func.func @hoisted_ttir.add_32x32xbf16_32x32xbf16_32x32xbf16_func(%arg0: tensor<32x32xbf16>, %arg1: tensor<32x32xbf16>, %arg2: tensor<32x32xbf16>) -> tensor<32x32xbf16> attributes {arg_ranks = [2, 2, 2, 2]} {
          %0 = "ttir.add"(%arg0, %arg1, %arg2) <{operandSegmentSizes = array<i32: 2, 1>}> : (tensor<32x32xbf16>, tensor<32x32xbf16>, tensor<32x32xbf16>) -> tensor<32x32xbf16>
          return %0 : tensor<32x32xbf16>
        }
      }
      func.func private @hoisted_ttir.add_32x32xbf16_32x32xbf16_32x32xbf16_func(tensor<32x32xbf16>, tensor<32x32xbf16>, tensor<32x32xbf16>) -> tensor<32x32xbf16>
    }
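
For reference, a minimal C++ sketch of the core transform step described above. This is not the PR's actual implementation: generateHoistedFuncName is the helper discussed later in this review, and declaration/dedup handling is omitted.

    #include "llvm/ADT/STLExtras.h"
    #include "llvm/ADT/SmallString.h"
    #include "mlir/Dialect/Func/IR/FuncOps.h"
    #include "mlir/IR/Builders.h"
    #include "mlir/IR/IRMapping.h"

    // Hoist a single op into a function inside the nested CPU module and
    // replace the original op with a call to it (simplified sketch).
    static void hoistOp(mlir::Operation *op, mlir::ModuleOp cpuModule) {
      mlir::OpBuilder builder(op->getContext());
      auto funcType = builder.getFunctionType(op->getOperandTypes(),
                                              op->getResultTypes());
      llvm::SmallString<16> name = generateHoistedFuncName(op);

      // Create the hoisted function in the CPU module and clone the op into
      // it, remapping operands onto the new function's block arguments.
      builder.setInsertionPointToEnd(cpuModule.getBody());
      auto hoistedFunc =
          builder.create<mlir::func::FuncOp>(op->getLoc(), name, funcType);
      mlir::Block *entry = hoistedFunc.addEntryBlock();
      builder.setInsertionPointToStart(entry);
      mlir::IRMapping mapping;
      for (auto [operand, arg] :
           llvm::zip(op->getOperands(), entry->getArguments()))
        mapping.map(operand, arg);
      mlir::Operation *cloned = builder.clone(*op, mapping);
      builder.create<mlir::func::ReturnOp>(op->getLoc(), cloned->getResults());

      // Rewrite the original op as a call to the hoisted function.
      builder.setInsertionPoint(op);
      auto call = builder.create<mlir::func::CallOp>(op->getLoc(), hoistedFunc,
                                                     op->getOperands());
      op->replaceAllUsesWith(call.getResults());
      op->erase();
    }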

@nsmithtt (Contributor) left a comment:

@vwellsTT, this looks great, couple of things we should add:

  • Let's add more context to the PR description. Let's call out that this is part _ of ?? parts, with a few additional sentences describing that this is part of the CPU fallback path. It's probably worth having a top-level statement that outlines the whole plan broken into parts and then stating that this PR tackles part _ of the above plan.
  • We need a directed test proving that the hoist pass hoists correctly.

Resolved review threads: include/ttmlir/Dialect/TTIR/Transforms/Passes.td, lib/Dialect/TTIR/Transforms/HoistCPUOps.cpp
}
}

// If no CPU module exists, create one
Contributor commented:

What is the intention of splitting the IR into two different modules? Do we need this split? Currently, we have a very simple IR structure with one module and many functions, hence all our passes are working on a single module.

@vwellsTT (Contributor, Author) replied:

Well, the hoisted funcs will be lowered via a separate pipeline, so I don't think we want any of the existing passes running on them. @nsmithtt originally proposed splitting the CPU funcs into a new module, I think.

Contributor replied:

Yeah, the intention for splitting into a new module is that we can have a 1:1 mapping of module to device target type. E.g. the main module is tagged with a system desc, i.e. all the functions within that module are intended to run on the device/TTNN framework, while a CPU module is tagged with a CPU target triple, so organizationally it's a bit easier to read/discover that all functions inside that module are intended for the CPU.

On top of that, there are two other reasons:

  • The logic for lowering to LLVM / compiling to a dylib becomes much simpler, because we just hand over the CPU module op as the root node for conversion (see the sketch after this list); if CPU and TTNN entry-point functions lived freely together in the same module, we'd have to run a cleanup pass to remove the TTNN functions so that the LLVM lowering passes don't run on them.
  • As far as we can tell, IREE does it this way too.
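
A rough illustration (not code from this PR) of why this simplifies the dylib path: assuming the ttir.cpu_module attribute shown in the example output above, a later pipeline can grab the nested CPU module and hand it over as the conversion root, with no need to filter out TTNN functions first.

    #include "mlir/IR/BuiltinOps.h"

    // Find the nested module tagged for CPU lowering so it can be used
    // directly as the root op for the LLVM conversion pipeline.
    static mlir::ModuleOp findCPUModule(mlir::ModuleOp root) {
      mlir::ModuleOp cpuModule = nullptr;
      root.walk([&](mlir::ModuleOp nested) {
        if (nested.getOperation() != root.getOperation() &&
            nested->hasAttr("ttir.cpu_module"))
          cpuModule = nested;
      });
      return cpuModule;
    }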

@vwellsTT force-pushed the vwells/cpu_hoist branch 14 times, most recently from 43cb4bb to 929317f, on December 16, 2024 at 20:23
@@ -0,0 +1,37 @@
// RUN: ttmlir-opt --ttir-cpu-hoist-transform="hoist-locations=add_op1,add_op2,add_op3,add_op4" %s | tee /tmp/output.mlir | FileCheck %s
Contributor commented:

@nsmithtt @vwellsTT, would it make sense to use an optional attribute on ops (i.e. should_hoist) instead of loc? To me it's more expressive, and it simplifies the check for which ops need to be hoisted.

@vwellsTT (Contributor, Author) replied:

Good idea, I like that a lot more. Then I don't need a specific param => I can have a single analysis pass, vs. "manual" and "automatic" variants, etc. I will sync w/ Nick on this.
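
As a sketch of what the attribute-based analysis could look like (should_hoist is just the name floated in this thread, not a finalized attribute; this is not the PR's code):

    #include "llvm/ADT/SmallVector.h"
    #include "mlir/IR/BuiltinOps.h"

    // Collect ops marked for hoisting via a unit attribute instead of
    // matching on loc strings.
    static llvm::SmallVector<mlir::Operation *>
    collectOpsToHoist(mlir::ModuleOp module) {
      llvm::SmallVector<mlir::Operation *> opsToHoist;
      module.walk([&](mlir::Operation *op) {
        if (op->hasAttr("should_hoist"))
          opsToHoist.push_back(op);
      });
      return opsToHoist;
    }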

auto localFunc = llvm::dyn_cast_or_null<func::FuncOp>(
sourceModule.lookupSymbol(functionName));

// if we have not already emitted an equivalent function call, perform this
Contributor commented:

All comments should start with a capital letter; we have some recently introduced guidelines covering this. Please update comments throughout the change to adhere to the guidelines.


func.func @add(%arg0: tensor<32x32xbf16>, %arg1: tensor<32x32xbf16>) -> tensor<32x32xbf16> {
%0 = tensor.empty() : tensor<32x32xbf16>
// CHECK: %[[C:.*]] = call @hoisted_ttir.add_32x32xbf16_32x32xbf16_32x32xbf16_func
Contributor commented:

Replace [[C:.*]] with {{.*}} if the captured text isn't used afterwards.

}

// generate unique name base on operation type + argument tensors dims & types
static llvm::SmallString<16> generateHoistedFuncName(mlir::Operation *op) {
Contributor commented:

This might end up generating 2 equal names if the parameters are equal; should we append a random string to the fn name to "guarantee" uniqueness?

@vwellsTT (Contributor, Author) replied:

That's right, but I think that's desirable. Below, I check whether a func of that name already exists; if so, I skip emitting a new func and simply call the existing one, to avoid generating duplicate functions. My testcase exercises this logic as well (2 of the 4 calls have the same signature, so only 3 funcs are generated).
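
For illustration, a sketch of that dedup flow (createHoistedFunction and replaceOpWithCall are hypothetical stand-ins for the PR's actual code; generateHoistedFuncName and lookupSymbol mirror the snippets quoted in this review):

    // Hypothetical helpers standing in for the PR's actual code paths.
    mlir::func::FuncOp createHoistedFunction(mlir::ModuleOp cpuModule,
                                             mlir::Operation *op,
                                             llvm::StringRef name);
    void replaceOpWithCall(mlir::Operation *op, mlir::func::FuncOp callee);

    // Reuse an existing hoisted function when one with the same generated
    // name (same op type + operand shapes/types) is already in the module.
    static void hoistOrReuse(mlir::Operation *op, mlir::ModuleOp cpuModule) {
      llvm::SmallString<16> functionName = generateHoistedFuncName(op);
      auto hoistedFunc = llvm::dyn_cast_or_null<mlir::func::FuncOp>(
          cpuModule.lookupSymbol(functionName));
      if (!hoistedFunc) {
        // First time this signature is seen: emit a new hoisted function.
        hoistedFunc = createHoistedFunction(cpuModule, op, functionName);
      }
      // Either way, the original op becomes a call to the shared function.
      replaceOpWithCall(op, hoistedFunc);
    }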

@@ -0,0 +1,37 @@
// RUN: ttmlir-opt --ttir-cpu-hoist-transform="hoist-locations=add_op1,add_op2,add_op3,add_op4" %s | tee /tmp/output.mlir | FileCheck %s
Contributor commented:

remove tee

@vwellsTT force-pushed the vwells/cpu_hoist branch 3 times, most recently from 6ef45fd to 64011b2, on December 19, 2024 at 19:38
@vwellsTT merged commit 432f8d8 into main on Dec 20, 2024 (21 checks passed).
@vwellsTT deleted the vwells/cpu_hoist branch on December 20, 2024 at 17:01.