[Direct Metal] Insert scaler CB in reduce kernels #1153

Merged: 5 commits, Nov 18, 2024


@rpavlovicTT rpavlovicTT commented Nov 4, 2024

fixes #781

When lowering a reduce op, the compiler adds an extra CB that represents the scaler CB required by the LLK. The scaler CB is filled with a constant chosen according to the type of reduce operation.
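
As an illustration of what such a constant can look like, the value 1065369472 used in the kernel example in this description equals 0x3F803F80, which is the bfloat16 bit pattern for 1.0 (0x3F80) packed into both halves of a 32-bit word. The helper names below are hypothetical and not part of this PR, and truncating float32 to bfloat16 (rather than rounding) is an assumption made for simplicity. This is only a sketch of how a packed scaler word could be derived on the host:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: take the upper 16 bits of a float32 as its bfloat16
// encoding. Truncation (no rounding) is an assumption of this sketch.
inline uint16_t float_to_bf16_bits(float value) {
    uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));  // well-defined type punning
    return static_cast<uint16_t>(bits >> 16);
}

// Hypothetical helper: place the same bfloat16 value in both halves of a
// 32-bit word, so one uint32_t store writes two adjacent bf16 elements.
inline uint32_t pack_bf16_pair(float value) {
    const uint32_t half = float_to_bf16_bits(value);
    return (half << 16) | half;
}
```

With these helpers, `pack_bf16_pair(1.0f)` yields the same integer that appears as `v5` in the generated kernel. Which scaler value the compiler picks for each reduce type is not spelled out here, so no mapping is assumed.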

The idea is to leave a preallocated region at the top of L1 memory for miscellaneous kernel use and to place the scaler CB inside that region. TTIR ops should have no knowledge of the scaler CB; it should appear only in the TTKernel dialect, as an additional argument when lowering reduce operations.

The scaler CB is populated at kernel runtime. This should be done in the NOC thread, since that thread can read zeros in a burst.

For example, loading the scaler CB in a data movement kernel looks like this:

  ::tt::CB v3 = ::tt::CB::c_in1;
  ::tt::CB v4 = v3;
  int32_t v5 = 1065369472;
  int32_t v6 = 1;
  cb_reserve_back(v4, v6);
  int32_t v7 = MEM_ZEROS_BASE;
  int64_t v8 = get_noc_addr(v7);
  int32_t v9 = MEM_ZEROS_SIZE;
  noc_async_read_one_packet_set_state(v8, v9);
  int32_t v10 = get_write_ptr(v4);
  volatile tt_l1_ptr uint32_t* v11 = reinterpret_cast<volatile tt_l1_ptr uint32_t*>(v10);
  int32_t v12 = 0;
  int32_t v13 = 2048;
  int32_t v14 = v13 / v9;
  int32_t v15 = 1;
  int32_t v16;
  v16 = v10;
  for (int32_t v17 = v12; v17 < v14; v17 += v15) {
    int32_t v18 = v16;
    noc_async_read_one_packet_with_state(v8, v18);
    int32_t v19 = v18 + v9;
    v16 = v19;
  }
  int32_t v20 = v16;
  noc_async_read_barrier();
  int32_t v21 = 4;
  for (int32_t v22 = v12; v22 < v21; v22 += v15) {
    int32_t v23 = 128;
    int32_t v24 = v22 * v23;
    int32_t v25 = 8;
    for (int32_t v26 = v12; v26 < v25; v26 += v15) {
      int32_t v27 = v24 + v26;
      volatile tt_l1_ptr uint32_t v28 = (volatile tt_l1_ptr uint32_t) v5;
      v11[v27] = v28;
    };
  }
  cb_push_back(v4, v6);
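
The generated kernel above can be read as two phases: zero the whole 2048-byte tile with NOC reads from `MEM_ZEROS_BASE`, then write the packed scaler into the first 8 words of each of 4 blocks that are 128 words apart. The host-side sketch below mirrors only the resulting memory layout so it can be checked without hardware. The constant and function names are hypothetical, and the reading of the 4 blocks as 16x16 bfloat16 faces of a 32x32 tile (with 8 words being one 16-element face row) is an assumption rather than something stated in this PR:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sizes taken from the literals in the generated kernel.
constexpr std::size_t kTileBytes = 2048;                           // v13
constexpr std::size_t kTileWords = kTileBytes / sizeof(uint32_t);  // 512
constexpr std::size_t kNumFaces = 4;                               // v21
constexpr std::size_t kWordsPerFace = 128;                         // v23
constexpr std::size_t kScalerWordsPerFace = 8;                     // v25

using Tile = std::array<uint32_t, kTileWords>;

// Hypothetical host-side model of the fill pattern. Value-initializing the
// array stands in for the burst of NOC reads from the zeros region.
inline Tile make_scaler_tile(uint32_t packed_scaler) {
    Tile tile{};  // all words start at zero
    for (std::size_t face = 0; face < kNumFaces; ++face) {
        const std::size_t base = face * kWordsPerFace;
        for (std::size_t word = 0; word < kScalerWordsPerFace; ++word) {
            tile[base + word] = packed_scaler;  // same index as v11[v27]
        }
    }
    return tile;
}
```

Calling `make_scaler_tile(1065369472u)` gives a tile in which 32 of the 512 words hold the scaler and the rest stay zero, which matches the index arithmetic `v22 * 128 + v26` in the kernel.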


@nsmithtt nsmithtt left a comment


Did a first pass; I will try to review the rest of TTIRToTTMetal.cpp tonight or tomorrow.

Internal CBs that are not associated with tensor operands have their
address assigned locally.
@rpavlovicTT rpavlovicTT force-pushed the rpavlovic/reduce_scaler_cb branch from c82bc83 to d0eeefc Compare November 15, 2024 14:35

@nsmithtt nsmithtt left a comment


Thank you!

@rpavlovicTT rpavlovicTT merged commit 9aa7adc into main Nov 18, 2024
18 checks passed
Development

Successfully merging this pull request may close these issues.

Load scaler CB to reduce op
3 participants