[Direct Metal] Insert scaler CB in reduce kernels #1153

rpavlovicTT · 2024-11-04T16:29:42Z

fixes #781

When lowering a reduce op, the compiler adds an extra CB that represents the scaler CB required by the LLK. The scaler CB is filled with a constant chosen according to the type of reduce operation.

The idea is to leave a preallocated region at the top of L1 memory for miscellaneous use by kernels, and to place the scaler CB inside that region. TTIR ops should have no knowledge of the scaler CB; it should appear only in the TTKernel dialect, as an additional argument introduced when lowering reduce operations.
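A minimal sketch of the reserved-region idea, assuming that "top" means the highest L1 addresses. The sizes and the names `kL1Size`, `kMiscRegionSize`, and `MiscRegionAllocator` are hypothetical and are not taken from this PR; the sketch only shows how an internal CB such as the scaler CB could get its address from a fixed region instead of from the tensor allocator.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical numbers for illustration only; not the values used by the compiler.
constexpr uint32_t kL1Size = 1024 * 1024;        // assumed L1 capacity in bytes
constexpr uint32_t kMiscRegionSize = 16 * 1024;  // assumed size of the reserved region
constexpr uint32_t kMiscRegionBase = kL1Size - kMiscRegionSize;

// Bump allocator that hands out addresses inside the reserved region.
class MiscRegionAllocator {
public:
  // Returns the address of a block of `size` bytes aligned to `align`
  // (a power of two), or std::nullopt if the region cannot hold it.
  std::optional<uint32_t> allocate(uint32_t size, uint32_t align = 32) {
    const uint32_t start = (next_ + align - 1) & ~(align - 1);
    if (start > kL1Size || size > kL1Size - start) {
      return std::nullopt;
    }
    next_ = start + size;
    return start;
  }

private:
  uint32_t next_ = kMiscRegionBase;  // next free byte in the region
};
```

With these assumed sizes, a first 2048-byte request would land at `kMiscRegionBase`, and a request larger than the remaining space would be rejected rather than spilling into tensor memory.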

The scaler CB is populated at kernel runtime. This should be done in the NOC thread, since that thread can read zeros in a burst.

For example, loading the scaler CB in a data movement kernel looks like this:

  ::tt::CB v3 = ::tt::CB::c_in1;
  ::tt::CB v4 = v3;
  int32_t v5 = 1065369472;
  int32_t v6 = 1;
  cb_reserve_back(v4, v6);
  int32_t v7 = MEM_ZEROS_BASE;
  int64_t v8 = get_noc_addr(v7);
  int32_t v9 = MEM_ZEROS_SIZE;
  noc_async_read_one_packet_set_state(v8, v9);
  int32_t v10 = get_write_ptr(v4);
  volatile tt_l1_ptr uint32_t* v11 = reinterpret_cast<volatile tt_l1_ptr uint32_t*>(v10);
  int32_t v12 = 0;
  int32_t v13 = 2048;
  int32_t v14 = v13 / v9;
  int32_t v15 = 1;
  int32_t v16;
  v16 = v10;
  for (int32_t v17 = v12; v17 < v14; v17 += v15) {
    int32_t v18 = v16;
    noc_async_read_one_packet_with_state(v8, v18);
    int32_t v19 = v18 + v9;
    v16 = v19;
  }
  int32_t v20 = v16;
  noc_async_read_barrier();
  int32_t v21 = 4;
  for (int32_t v22 = v12; v22 < v21; v22 += v15) {
    int32_t v23 = 128;
    int32_t v24 = v22 * v23;
    int32_t v25 = 8;
    for (int32_t v26 = v12; v26 < v25; v26 += v15) {
      int32_t v27 = v24 + v26;
      volatile tt_l1_ptr uint32_t v28 = (volatile tt_l1_ptr uint32_t) v5;
      v11[v27] = v28;
    };
  }
  cb_push_back(v4, v6);
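The constants in the generated kernel can be checked on the host. The sketch below is a plain C++ model, not kernel code: it assumes the 2048-byte tile holds bfloat16 values, that `1065369472` (`0x3F803F80`) is the bfloat16 encoding of 1.0 duplicated in both halves of a 32-bit word, and that the two loops write the first 8 words of each of 4 blocks of 128 words. The names `packBf16Pair` and `makeScalerTile` are hypothetical helpers introduced for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Truncates a float to bfloat16 (its upper 16 bits) and duplicates the
// result in both halves of a 32-bit word.
inline uint32_t packBf16Pair(float value) {
  uint32_t bits = 0;
  std::memcpy(&bits, &value, sizeof(bits));
  const uint32_t bf16 = bits >> 16;
  return (bf16 << 16) | bf16;
}

constexpr uint32_t kTileBytes = 2048;            // v13 in the kernel
constexpr uint32_t kTileWords = kTileBytes / 4;  // 512 32-bit words
constexpr uint32_t kBlocks = 4;                  // v21: outer loop bound
constexpr uint32_t kWordsPerBlock = 128;         // v23: stride between blocks
constexpr uint32_t kWordsWritten = 8;            // v25: inner loop bound

// Host-side model of the tile contents after the kernel has run.
inline std::array<uint32_t, kTileWords> makeScalerTile(float scaler) {
  // Value-initialisation zeroes the tile; this stands in for the NOC zero reads.
  std::array<uint32_t, kTileWords> tile{};
  const uint32_t packed = packBf16Pair(scaler);
  for (uint32_t block = 0; block < kBlocks; ++block) {
    for (uint32_t word = 0; word < kWordsWritten; ++word) {
      tile[block * kWordsPerBlock + word] = packed;
    }
  }
  return tile;
}
```

Calling `makeScalerTile(1.0f)` reproduces the pattern written by the kernel above: words 0 to 7 of each 128-word block hold `0x3F803F80`, and every other word stays zero.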

lib/Conversion/TTIRToTTMetal/TTIRToTTMetal.cpp

runtime/lib/ttmetal/command_queue.cpp

nsmithtt

Did a first pass, I will try to review the rest of TTIRToTTMetal.cpp tonight / tomorrow.

lib/Target/TTMetal/TTMetalToFlatbuffer.cpp

include/ttmlir/Dialect/TTKernel/IR/TTKernelOpsTypes.td

lib/Dialect/TT/IR/TTOpsTypes.cpp

Internal CBs that are not associated with tensor operands have their address assigned locally.

nsmithtt

Thank you!

rpavlovicTT requested review from jnie-TT, kmabeeTT, AleksKnezevic, pilkicTT, sdjordjevicTT, nsmithtt, svuckovicTT, mtopalovicTT, mrakitaTT, nobradovictt, jserbedzijaTT and rjakovljevicTT as code owners November 4, 2024 16:29

nsmithtt reviewed Nov 4, 2024

nsmithtt reviewed Nov 5, 2024

nsmithtt reviewed Nov 14, 2024

rpavlovicTT added 4 commits November 15, 2024 14:33

Small fix after LLVM uplift

f27a5c7

Fix for empty regions of dispatch op

163d087

Address comments last round

d0eeefc

rpavlovicTT force-pushed the rpavlovic/reduce_scaler_cb branch from c82bc83 to d0eeefc Compare November 15, 2024 14:35

Fix clang tidy error

d1d343a

nsmithtt approved these changes Nov 15, 2024

kmabeeTT approved these changes Nov 18, 2024

rpavlovicTT merged commit 9aa7adc into main Nov 18, 2024
18 checks passed
