Initial CCL V2 infra push - add cmd interpreter and reduce scatter, all-gather
Without going too deep into the weeds, there were numerous reasons why CCLs needed to be fundamentally rewritten; to summarize some of them:

- Writing CCLs was not scalable from a development-effort standpoint
- Even within a single op (e.g. all-gather) we need to be able to support many topologies (ring, line, mesh, tree, tree of mesh, etc.) and use cases (BW bound, latency bound, high reliability vs lower reliability with potentially better perf)
- CCLs need to be fusable with just about any op without it being a Herculean effort
- New concepts like "async tensor" need to be supported to account for performance artifacts like (dispatch) skew between chips and to effectively hide the latency of various operations
- (minor) Support the new fabric projects with CCLs

### Initial test coverage

- Gtests that provide basic coverage for the CCL command interpreter running on the transitionary EDM fabric (both in persistent and non-persistent modes)
- Gtests for reduce scatter and all-gather
- Basic all-gather pytests

Future work will expand test coverage.

### What's changed

Lots to discuss here:

- What the command interpreter is
- How it works
- How we build ops with it
- What's new with these CCLs

The bulk of this information is or will be included in a much larger doc that will be circulated more widely in the coming weeks, so a summary is provided below (if you want more details before the doc is available, ask and I will point you to what's in progress).

A new "command interpreter" kernel is provided which executes various command types. Some commands map nearly directly to the low-level noc API, while others map to higher-level operations.

High-level operation example:
- Stream tensor slice (from: CB/addrgen) (to: raw addr, CB, (fabric) addrgen)

Low-level commands:
- Wait for semaphore value
- Send semaphore update
- Raw read/write

These commands are specifiable on host, and there is a whole optimization story for performance, but to give the general idea, here is the primary functional code needed for all-gather as an example (code reorganized for the purposes of this PR description - not 1:1 with `all_gather_async_program.cpp`):

```
// Create a "reader kernel" command stream
std::vector<ttnn::ccl::cmd::CclHostLowLevelWorkerCommand> reader_cmd_stream;
reader_cmd_stream.push_back(
    ttnn::ccl::cmd::uops::read_tensor_slice_to_cb(input_worker_slice_v2, src0_cb_index));

// Create a "writer kernel" command stream
std::vector<ttnn::ccl::cmd::CclHostLowLevelWorkerCommand> writer_cmd_stream;
// 1. Do an mcast of the tensor slice to all the destinations
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(
    output_worker_slice_v2, src0_cb_index, mcast_dest_args));

// Really, for all-gather, that's basically it - the rest of the code is to choose core placement
// and get info (like which core(s) are fabric endpoints to connect to fabric, etc.)

// Now pass the commands to the kernel(s)
ttnn::ccl::worker_detail::generate_multi_input_command_stream_kernel_rt_args(
    program,
    worker_sender_reader_kernel_id,
    ...,
    reader_cmd_stream,
    std::nullopt,
    std::nullopt,
    std::nullopt);
ttnn::ccl::worker_detail::generate_multi_input_command_stream_kernel_rt_args(
    program,
    worker_sender_writer_kernel_id,
    ...,
    writer_cmd_stream,
    std::nullopt,
    {forward_fabric_connection},
    {backward_fabric_connection});
```

With the above, operations such as fusion become far simpler (in some cases, trivial).
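To illustrate how the low-level commands compose with the higher-level tensor-slice commands in a single stream, here is a minimal sketch of a writer command stream that gates the slice write on a semaphore and then notifies a downstream consumer. This is illustrative only: `fabric_write_cb_to_tensor_slice` is taken from the example above, while the semaphore uop names, their signatures, and the semaphore variables are assumptions made for the sketch and may not match the actual `ttnn::ccl::cmd::uops` factories.

```
// Sketch only - the semaphore uop names/signatures and semaphore handles below are
// assumed for illustration, not taken verbatim from this PR.
std::vector<ttnn::ccl::cmd::CclHostLowLevelWorkerCommand> writer_cmd_stream;

// Low-level command: block until an upstream producer bumps our semaphore
// (hypothetical factory name/signature).
writer_cmd_stream.push_back(
    ttnn::ccl::cmd::uops::local_semaphore_wait(in_ready_semaphore, /*target_value=*/1));

// High-level command: stream the tensor slice from the CB out over fabric
// (same uop as in the all-gather example above).
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(
    output_worker_slice_v2, src0_cb_index, mcast_dest_args));

// Low-level command: signal a downstream consumer that the slice has landed
// (hypothetical factory name/signature).
writer_cmd_stream.push_back(
    ttnn::ccl::cmd::uops::local_chip_semaphore_inc(out_ready_semaphore, /*increment=*/1));
```

The resulting stream would be handed to the command interpreter kernel through the same `generate_multi_input_command_stream_kernel_rt_args` call shown in the all-gather example.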
For example, consider fusing an all-reduce with a split-qkv-heads operation (note that the output side of all-reduce is basically an all-gather in an optimized ring implementation). The basic fusion operation is to first identify the split/slice boundaries of split-qkv (these could potentially be obtained from the op directly), propagate those cut lines to all of the tensor slices of the producer (like the tensor slices in the commands shown above), and then simply split those slices and set the correct output tensor for each accordingly.

Note that many commands can be added to each given command stream - all-gather is just very simple. Reduce scatter is an example of one that is more complicated.

### Expanding to other operations

Here are some simple examples.

#### Send/receive

Take the all-gather as an example, and rather than specifying an mcast on the tensor write command:

```
writer_cmd_stream.push_back(ttnn::ccl::cmd::uops::fabric_write_cb_to_tensor_slice(
    output_worker_slice_v2, src0_cb_index, mcast_dest_args));
```

you would unicast it to the desired destination (replace `mcast_dest_args`).

If running in synchronous tensor mode, add a command interpreter kernel at the destination chip with a wait_val command to wait on a sem inc, and append a seminc to the sender command stream.

#### Broadcast

Invoke the all-gather above, but from just one chip. If running in synchronous tensor mode, add a command interpreter kernel at all the destination chips with a wait_val command to wait on a sem inc, and append a fabric multicast seminc to the sender command stream.

#### Reduce

- Build a tree on the cluster
- Each producer chip unicast-sends to the next node toward the root of the tree and sends a sync signal downstream
- If not a leaf, perform a partial reduction of your received data with your local data and forward the result to the next node toward the root
  - Add a wait_val before accepting your input data
- The root node can do any number of reductions to reduce the incoming data streams (making sure to first sync on any input stream before consuming it)

We do something similar to the above for reduce scatter.

### Note on APIs

These APIs are expected to be refined over time. In the meantime, I have introduced the named "micro-ops" as commands to grant us some flexibility in changing the underlying command encodings (both on host and device). This will let us optimize and improve the "IR" over time without requiring constant op implementation updates.

---------

Co-authored-by: Jack Cai <[email protected]>