Add hoist mechanism for specific TTIR ops, with placeholder analysis pass #1586

Merged: 15 commits merged into main on Dec 20, 2024

Conversation

@vwellsTT (Contributor) commented on Dec 12, 2024:

Goal: Integrate an end-to-end path to compile and execute specific ops, or sets of ops, on the CPU.

Context:

The entire task will be split into (tentatively) 7 PRs, as follows:

  1. Hoist specific ops into isolated funcs in a separate module
  2. Convert TTIR ops to linalg ops within the module of hoisted funcs
  3. Build a pipeline to lower linalg to llvm from existing conversion passes
  4. Translate LLVM Dialect into a dynamic library for packing into flatbuffer
  5. Generate helper functions so that we can call all of our hoisted funcs with a common signature
  6. Insert TTNN instructions to move operands to host before executing hoisted func, then back to device afterwards
  7. Update the ttir-to-ttnn and ttnn-to-flatbuffer pipelines to use the new passes, generate dylibs and embed them into the output flatbuffers, and update the runtime to consume dylibs from flatbuffers

This PR implements step 1 above. Here, we build a (placeholder) hoisting analysis plus a transform pass to mark specific ops to be hoisted, then pull the marked ops into separate functions in a new "cpu" module and replace each original op with a call to its hoisted function.

Example:

input:
    module {
      func.func @add(%arg0: tensor<32x32xbf16>, %arg1: tensor<32x32xbf16>) -> tensor<32x32xbf16> {
        %0 = tensor.empty() : tensor<32x32xbf16>
        %1 = "ttir.add"(%arg0, %arg1, %0) <{operandSegmentSizes = array<i32: 2, 1>}> : (tensor<32x32xbf16>, tensor<32x32xbf16>, tensor<32x32xbf16>) -> tensor<32x32xbf16> loc("add_op1")
        return %1 : tensor<32x32xbf16>
      }
    }
output:
    module {
      func.func @add(%arg0: tensor<32x32xbf16>, %arg1: tensor<32x32xbf16>) -> tensor<32x32xbf16> {
        %0 = tensor.empty() : tensor<32x32xbf16>
        %1 = call @hoisted_ttir.add_32x32xbf16_32x32xbf16_32x32xbf16_func(%arg0, %arg1, %0) : (tensor<32x32xbf16>, tensor<32x32xbf16>, tensor<32x32xbf16>) -> tensor<32x32xbf16>
        return %1 : tensor<32x32xbf16>
      }
      module @cpu_module attributes {ttir.cpu_module} {
        func.func @hoisted_ttir.add_32x32xbf16_32x32xbf16_32x32xbf16_func(%arg0: tensor<32x32xbf16>, %arg1: tensor<32x32xbf16>, %arg2: tensor<32x32xbf16>) -> tensor<32x32xbf16> attributes {arg_ranks = [2, 2, 2, 2]} {
          %0 = "ttir.add"(%arg0, %arg1, %arg2) <{operandSegmentSizes = array<i32: 2, 1>}> : (tensor<32x32xbf16>, tensor<32x32xbf16>, tensor<32x32xbf16>) -> tensor<32x32xbf16>
          return %0 : tensor<32x32xbf16>
        }
      }
      func.func private @hoisted_ttir.add_32x32xbf16_32x32xbf16_32x32xbf16_func(tensor<32x32xbf16>, tensor<32x32xbf16>, tensor<32x32xbf16>) -> tensor<32x32xbf16>
    }
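
For reference, a minimal C++ sketch of the core transform step described above. This is not the PR's actual implementation: generateHoistedFuncName is the helper discussed later in this review, and declaration/dedup handling is omitted.

    #include "llvm/ADT/STLExtras.h"
    #include "llvm/ADT/SmallString.h"
    #include "mlir/Dialect/Func/IR/FuncOps.h"
    #include "mlir/IR/Builders.h"
    #include "mlir/IR/IRMapping.h"

    // Hoist a single op into a function inside the nested CPU module and
    // replace the original op with a call to it (simplified sketch).
    static void hoistOp(mlir::Operation *op, mlir::ModuleOp cpuModule) {
      mlir::OpBuilder builder(op->getContext());
      auto funcType = builder.getFunctionType(op->getOperandTypes(),
                                              op->getResultTypes());
      llvm::SmallString<16> name = generateHoistedFuncName(op);

      // Create the hoisted function in the CPU module and clone the op into
      // it, remapping operands onto the new function's block arguments.
      builder.setInsertionPointToEnd(cpuModule.getBody());
      auto hoistedFunc =
          builder.create<mlir::func::FuncOp>(op->getLoc(), name, funcType);
      mlir::Block *entry = hoistedFunc.addEntryBlock();
      builder.setInsertionPointToStart(entry);
      mlir::IRMapping mapping;
      for (auto [operand, arg] :
           llvm::zip(op->getOperands(), entry->getArguments()))
        mapping.map(operand, arg);
      mlir::Operation *cloned = builder.clone(*op, mapping);
      builder.create<mlir::func::ReturnOp>(op->getLoc(), cloned->getResults());

      // Rewrite the original op as a call to the hoisted function.
      builder.setInsertionPoint(op);
      auto call = builder.create<mlir::func::CallOp>(op->getLoc(), hoistedFunc,
                                                     op->getOperands());
      op->replaceAllUsesWith(call.getResults());
      op->erase();
    }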

@nsmithtt (Contributor) left a comment:

@vwellsTT, this looks great, couple of things we should add:

  • Let's add more context to the PR description. Let's call out that this is part _ of ?? parts, with a few additional sentences describing that this is part of the CPU fallback path. It's probably worth having a top-level statement that outlines the whole plan broken into parts and then stating that this PR tackles part _ of the above plan.
  • We need a directed test proving that the hoist pass hoists correctly.

Resolved review threads: include/ttmlir/Dialect/TTIR/Transforms/Passes.td, lib/Dialect/TTIR/Transforms/HoistCPUOps.cpp
}
}

// If no CPU module exists, create one
Contributor commented:

What is the intention of splitting the IR into two different modules? Do we need this split? Currently, we have a very simple IR structure with one module and many functions, hence all our passes are working on a single module.

@vwellsTT (Contributor, Author) replied:

Well, the hoisted funcs will be lowered via a separate pipeline, so I don't think we want any of the existing passes running on them. @nsmithtt originally proposed splitting the CPU funcs into a new module, I think.

Contributor replied:

Yeah, the intention for splitting into a new module is that we can have a 1:1 mapping of module to device target type. E.g. the main module is tagged with a system desc, i.e. all the functions within that module are intended to run on the device/TTNN framework, while a CPU module is tagged with a CPU target triple, so organizationally it's a bit easier to read/discover that all functions inside that module are intended for the CPU.

On top of that, there are two other reasons:

  • The logic for lowering to LLVM / compiling to a dylib becomes much simpler, because we just hand over the CPU module op as the root node for conversion (see the sketch after this list); if CPU and TTNN entry-point functions lived freely together in the same module, we'd have to run a cleanup pass to remove the TTNN functions so that the LLVM lowering passes don't run on them.
  • As far as we can tell, IREE does it this way too.
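
A rough illustration (not code from this PR) of why this simplifies the dylib path: assuming the ttir.cpu_module attribute shown in the example output above, a later pipeline can grab the nested CPU module and hand it over as the conversion root, with no need to filter out TTNN functions first.

    #include "mlir/IR/BuiltinOps.h"

    // Find the nested module tagged for CPU lowering so it can be used
    // directly as the root op for the LLVM conversion pipeline.
    static mlir::ModuleOp findCPUModule(mlir::ModuleOp root) {
      mlir::ModuleOp cpuModule = nullptr;
      root.walk([&](mlir::ModuleOp nested) {
        if (nested.getOperation() != root.getOperation() &&
            nested->hasAttr("ttir.cpu_module"))
          cpuModule = nested;
      });
      return cpuModule;
    }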

@vwellsTT force-pushed the vwells/cpu_hoist branch 14 times, most recently from 43cb4bb to 929317f, on December 16, 2024 at 20:23
@@ -0,0 +1,37 @@
// RUN: ttmlir-opt --ttir-cpu-hoist-transform="hoist-locations=add_op1,add_op2,add_op3,add_op4" %s | tee /tmp/output.mlir | FileCheck %s
Contributor commented:

@nsmithtt @vwellsTT, would it make sense to use an optional attribute on ops (i.e. should_hoist) instead of loc? To me it's more expressive, and it simplifies the check for which ops need to be hoisted.

@vwellsTT (Contributor, Author) replied:

Good idea, I like that a lot more. Then I don't need a specific param => I can have a single analysis pass, vs. "manual" and "automatic" variants, etc. I will sync w/ Nick on this.
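
As a sketch of what the attribute-based analysis could look like (should_hoist is just the name floated in this thread, not a finalized attribute; this is not the PR's code):

    #include "llvm/ADT/SmallVector.h"
    #include "mlir/IR/BuiltinOps.h"

    // Collect ops marked for hoisting via a unit attribute instead of
    // matching on loc strings.
    static llvm::SmallVector<mlir::Operation *>
    collectOpsToHoist(mlir::ModuleOp module) {
      llvm::SmallVector<mlir::Operation *> opsToHoist;
      module.walk([&](mlir::Operation *op) {
        if (op->hasAttr("should_hoist"))
          opsToHoist.push_back(op);
      });
      return opsToHoist;
    }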

auto localFunc = llvm::dyn_cast_or_null<func::FuncOp>(
sourceModule.lookupSymbol(functionName));

// if we have not already emitted an equivalent function call, perform this
Contributor commented:

All comments should start with a capital letter; we have some recently introduced guidelines covering this. Please update comments throughout the change to adhere to the guidelines.


func.func @add(%arg0: tensor<32x32xbf16>, %arg1: tensor<32x32xbf16>) -> tensor<32x32xbf16> {
%0 = tensor.empty() : tensor<32x32xbf16>
// CHECK: %[[C:.*]] = call @hoisted_ttir.add_32x32xbf16_32x32xbf16_32x32xbf16_func
Contributor commented:

Replace [[C:.*]] with {{.*}} if the captured text isn't used afterwards.

}

// generate unique name base on operation type + argument tensors dims & types
static llvm::SmallString<16> generateHoistedFuncName(mlir::Operation *op) {
Contributor commented:

This might end up generating 2 equal names if the parameters are equal; should we append a random string to the fn name to "guarantee" uniqueness?

@vwellsTT (Contributor, Author) replied:

That's right, but I think that's desirable. Below, I check whether a func of that name already exists; if so, I skip emitting a new func and simply call the existing one, to avoid generating duplicate functions. My testcase exercises this logic as well (2 of the 4 calls have the same signature, so only 3 funcs are generated).
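
For illustration, a sketch of that dedup flow (createHoistedFunction and replaceOpWithCall are hypothetical stand-ins for the PR's actual code; generateHoistedFuncName and lookupSymbol mirror the snippets quoted in this review):

    // Hypothetical helpers standing in for the PR's actual code paths.
    mlir::func::FuncOp createHoistedFunction(mlir::ModuleOp cpuModule,
                                             mlir::Operation *op,
                                             llvm::StringRef name);
    void replaceOpWithCall(mlir::Operation *op, mlir::func::FuncOp callee);

    // Reuse an existing hoisted function when one with the same generated
    // name (same op type + operand shapes/types) is already in the module.
    static void hoistOrReuse(mlir::Operation *op, mlir::ModuleOp cpuModule) {
      llvm::SmallString<16> functionName = generateHoistedFuncName(op);
      auto hoistedFunc = llvm::dyn_cast_or_null<mlir::func::FuncOp>(
          cpuModule.lookupSymbol(functionName));
      if (!hoistedFunc) {
        // First time this signature is seen: emit a new hoisted function.
        hoistedFunc = createHoistedFunction(cpuModule, op, functionName);
      }
      // Either way, the original op becomes a call to the shared function.
      replaceOpWithCall(op, hoistedFunc);
    }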

@@ -0,0 +1,37 @@
// RUN: ttmlir-opt --ttir-cpu-hoist-transform="hoist-locations=add_op1,add_op2,add_op3,add_op4" %s | tee /tmp/output.mlir | FileCheck %s
Contributor commented:

remove tee

@vwellsTT force-pushed the vwells/cpu_hoist branch 3 times, most recently from 6ef45fd to 64011b2, on December 19, 2024 at 19:38
@vwellsTT merged commit 432f8d8 into main on Dec 20, 2024 (21 checks passed).
@vwellsTT deleted the vwells/cpu_hoist branch on December 20, 2024 at 17:01.