Update documentation for adding new ttnn operation #9841

Merged: 1 commit, Jul 2, 2024
182 changes: 53 additions & 129 deletions docs/source/ttnn/ttnn/adding_new_ttnn_operation.rst
@@ -7,174 +7,98 @@ Adding New ttnn Operation
Not all operations may be functional on all Tenstorrent hardware (Grayskull,
Wormhole, or others).

C++ Implementation
------------------


Add `tt_eager/tt_dnn/op_library/<new_operation>/<new_operation>.hpp`:

.. code-block:: cpp

    #pragma once

    #include <optional>

    #include "tensor/tensor.hpp"
    #include "tt_dnn/op_library/operation.hpp"

    namespace tt {
    namespace tt_metal {

    struct <NewOperation> {
        bool some_arg;

        // These methods are needed if the operation takes in input tensors and produces output tensors
        void validate(const std::vector<Tensor> &input_tensors) const;
        std::vector<Shape> compute_output_shapes(const std::vector<Tensor> &input_tensors) const;
        std::vector<Tensor> create_output_tensors(const std::vector<Tensor> &input_tensors) const;
        operation::ProgramWithCallbacks create_program(const std::vector<Tensor>& input_tensors, std::vector<Tensor> &output_tensors) const;

        // This is needed until we get C++20
        static constexpr auto attribute_names = std::forward_as_tuple("some_arg");
        const auto attribute_values() const {
            return std::forward_as_tuple(this->some_arg);
        }
    };

    Tensor <new_operation>(const Tensor &input_tensor, bool some_arg);

    } // namespace tt_metal
    } // namespace tt
What is a ttnn operation?
-------------------------

A ttnn operation is a function that takes in one or more input tensors and produces one or more output tensors. It is implemented in C++ and can be called from Python.
Review comment (Member): function but not really :D


.. note::
What steps are needed to add ttnn operation in C++?
---------------------------------------------------
1. (Optional) Implement the device operation in C++. A device operation is a struct that specifies how to create output tensors and a program to run on the device. If the ttnn operation is composed of other ttnn operations, you can skip this step.
2. Implement the ttnn operation in C++ and register it using `ttnn::register_operation` (a minimal sketch follows below).
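For orientation, here is a minimal sketch of what step 2 can look like for a composite operation (one that skips step 1 and simply calls other, already-registered ttnn operations). All names and the exact `register_operation` signature below are illustrative assumptions, not code from this PR; see the `examples/example` files referenced later for the real pattern.

.. code-block:: cpp

    // Hypothetical sketch -- names and the registration API shape are assumptions.
    #include "ttnn/decorators.hpp"

    namespace ttnn::operations::composite_example {

    struct CompositeExample {
        // A composite operation just chains other registered ttnn operations,
        // so no device operation or program factory is needed.
        static Tensor invoke(const Tensor& input_tensor) {
            return ttnn::relu(ttnn::exp(input_tensor));
        }
    };

    }  // namespace ttnn::operations::composite_example

    namespace ttnn {

    // Registration is what lets the infrastructure instrument every call
    // (validation, tracing, program caching, ...).
    constexpr auto composite_example =
        ttnn::register_operation<operations::composite_example::CompositeExample>("ttnn::composite_example");

    }  // namespace ttnn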

    If you need optional input tensors or would like to pass in optional output tensors, refer to :doc:`Operations </ttnn/dependencies/tt_lib>` for how to write ops that use them.
What steps are needed to add ttnn operation in Python?
------------------------------------------------------
1. Take an existing registered C++ operation and add a Python binding for it using `ttnn::bind_registered_operation`.
2. In Python, decorate the operation using `ttnn.register_operation`. (This step will be deprecated in the future.)


Add `tt_eager/tt_dnn/op_library/<new_operation>/<new_operation>.cpp`:

.. code-block:: cpp

#include "tt_metal/host_api.hpp"
#include "tt_dnn/op_library/run_operation.hpp"

namespace tt {
namespace tt_metal {


void <NewOperation>::validate(const std::vector<Tensor> &input_tensors) const {
...
}

std::vector<Shape> <NewOperation>::compute_output_shapes(const std::vector<Tensor> &input_tensors) const {
std::vector<Shape> output_shapes = ...;
return output_shapes;
}

std::vector<Tensor> create_output_tensors(const std::vector<Tensor> &input_tensors) const {
std::vector<Tensor> output_tensors = ...;
return output_tensors;
}

operation::ProgramWithCallbacks create_program(const std::vector<Tensor>& input_tensors, std::vector<Tensor> &output_tensors) const {
Program program = ...;
return operation::ProgramWithCallbacks{program};
}

};
C++ Implementation
------------------

    Tensor <new_operation>(const Tensor &input_tensor, bool some_arg) {
        std::vector<Tensor> input_tensors = {input_tensor};
        std::vector<Tensor> output_tensors = operation::run(DeviceOperation(<NewOperation>{some_arg}), input_tensors);
        return output_tensors[0];
    }
Step 1: Implement device operation (Optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    } // namespace tt_metal
    } // namespace tt
In order to add a new device operation, follow the directory structure shown below:

`ttnn/cpp/ttnn/operations/<category>/<operation_name>/device/<operation_name>_device_operation.hpp`
Review comment (Member): I am a little concerned about the category part. People often mess up categorization, especially when it is not 100% clear. I am OK to keep it like that, just want to raise a question for discussion.

`ttnn/cpp/ttnn/operations/<category>/<operation_name>/device/<operation_name>_device_operation.cpp`
Review comment (Member): `_device_operation` is what we called `<operation_name>_op.hpp/.cpp` before?

`ttnn/cpp/ttnn/operations/<category>/<operation_name>/device/<program_factory_0>_program_factory.cpp`

Add pybindings
--------------
.. note::
    Add as many program factories as needed.
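The note above means a device operation can ship several program factories (for example single-core and multi-core variants) and select between them at runtime. A rough, hypothetical skeleton of what such a header contains (the member names are inferred from the `CachedProgram`/program-factory machinery changed later in this PR, not copied from the example files):

.. code-block:: cpp

    // Hypothetical skeleton of <operation_name>_device_operation.hpp.
    struct ExampleDeviceOperation {
        struct operation_attributes_t { bool some_arg; };
        struct tensor_args_t { Tensor input_tensor; };
        using tensor_return_value_t = Tensor;

        // One struct per program factory. Each factory creates a program and
        // shares state between create() and override_runtime_arguments()
        // through CachedProgram<shared_variables_t>.
        struct SingleCore {
            struct shared_variables_t { KernelHandle kernel_id; };
            using cached_program_t = ttnn::device_operation::CachedProgram<shared_variables_t>;

            static cached_program_t create(
                const operation_attributes_t&, const tensor_args_t&, tensor_return_value_t&);
            static void override_runtime_arguments(
                cached_program_t&, const operation_attributes_t&, const tensor_args_t&, tensor_return_value_t&);
        };
        using program_factory_t = std::variant<SingleCore>;

        // Picks the program factory to use for the given attributes and inputs.
        static program_factory_t select_program_factory(
            const operation_attributes_t&, const tensor_args_t&);
    };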

In `tt_eager/tt_lib/csrc/tt_lib_bindings_tensor.cpp`, add the following lines
A concrete example of a device operation can be found in `ttnn/cpp/ttnn/operations/examples/example/device`

.. code-block:: cpp
`ttnn/cpp/ttnn/operations/examples/example/device/example_device_operation.hpp`:

    m_tensor.def("<new_operation>", &<new_operation>, py::arg("input_tensor").noconvert(), py::arg("some_arg").noconvert(), R"doc(
        <NewOperation> runs new operation on input tensor.
.. literalinclude:: examples/example/device/example_device_operation.hpp

        .. csv-table::
            :header: "Argument", "Description", "Data type", "Valid range", "Required"
`ttnn/cpp/ttnn/operations/examples/example/device/example_device_operation.cpp`:

"input_tensor", "Input tensor", "Tensor", "Tensor of shape [W0, Z0, Y0, X0]", "Yes"
"some_arg", "Some arg", "bool", "Some arg to do some stuff in new operation", "Yes"
)doc");
.. literalinclude:: examples/example/device/example_device_operation.cpp

`ttnn/cpp/ttnn/operations/examples/example/device/single_core_program_factory.cpp`:

.. literalinclude:: examples/example/device/single_core_program_factory.cpp

Adding a unit test
------------------
`ttnn/cpp/ttnn/operations/examples/example/device/multi_core_program_factory.cpp`:

Add `tests/ttnn/unit_tests/ttl/test_<new_operation>.py`:
.. literalinclude:: examples/example/device/multi_core_program_factory.cpp

.. code-block:: python

    import pytest
    import torch
    import ttnn
Step 2: Implement the operation in C++
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    from tests.ttnn.utils_for_testing import assert_with_pcc
In order to add a new operation, add the following file:
Review comment (Member): Let's add a brief description of this file. It basically provides a structure and registers the operation. The cool thing about the structure is that it allows multiple overloads to be provided. The cool thing about registering is that it allows calls to an operation to be instrumented with goodies like "...list...".


@pytest.mark.parametrize("height", [32])
@pytest.mark.parametrize("width", [32])
def test_<new_operation>(device, height, width):
torch.manual_seed(0)
`ttnn/cpp/ttnn/operations/<category>/<operation_name>/<operation_name>.hpp`

        torch_input_tensor = torch.rand(1, 1, height, width)
        torch_output_tensor = torch.exp(torch_input_tensor)
A concrete example:

        input_tensor = ttnn.from_torch(torch_input_tensor, device=device)
        output_tensor = ttnn.experimental.tensor.<new_operation>(input_tensor)
`ttnn/cpp/ttnn/operations/examples/example/example.hpp`:

        output_tensor = ttnn.to_torch(output_tensor)
.. literalinclude:: examples/example/example.hpp
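As the review comment above suggests, this header both defines the operation struct and registers it: the struct can expose multiple overloads, and registration lets the infrastructure instrument calls to the operation. A hedged sketch of that shape (names are illustrative, not taken from `example.hpp`):

.. code-block:: cpp

    // Hypothetical sketch; the real contents are in the literalinclude above.
    namespace ttnn::operations::examples {

    struct ExampleOperation {
        // Multiple overloads can be provided; calls dispatch to whichever
        // overload matches the arguments.
        static Tensor invoke(const Tensor& input_tensor);
        static Tensor invoke(const Tensor& input_tensor, const MemoryConfig& memory_config);
    };

    }  // namespace ttnn::operations::examples

    namespace ttnn {
    constexpr auto example =
        ttnn::register_operation<operations::examples::ExampleOperation>("ttnn::example");
    }  // namespace ttnn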

        assert_with_pcc(torch_output_tensor, output_tensor)

Python Implementation
---------------------

Step 1: Add Python binding
~~~~~~~~~~~~~~~~~~~~~~~~~~

Adding a sweep test
-------------------
In order to add a Python binding for the operation, follow the directory structure shown below:

Add `tests/ttnn/sweep_tests/sweeps/ttl_<new_operation>.py`:
`ttnn/python/ttnn/operations/<category>/<operation_name>/<operation_name>_pybind.hpp`
`ttnn/python/ttnn/operations/<category>/<category>_pybind.hpp`
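A hedged sketch of what the `<operation_name>_pybind.hpp` file tends to contain, using the `ttnn::bind_registered_operation` helper named in the steps above (the include path, the `pybind_arguments_t` descriptor, and the argument names are assumptions):

.. code-block:: cpp

    // Hypothetical sketch; see example_pybind.hpp below for the real file.
    #include <pybind11/pybind11.h>

    #include "ttnn/cpp/pybind11/decorators.hpp"  // assumed location of bind_registered_operation

    namespace ttnn::operations::examples {

    void py_module(pybind11::module& module) {
        ttnn::bind_registered_operation(
            module,
            ttnn::example,  // the operation registered on the C++ side
            R"doc(example(input_tensor: ttnn.Tensor) -> ttnn.Tensor)doc",
            ttnn::pybind_arguments_t{pybind11::arg("input_tensor")});
    }

    }  // namespace ttnn::operations::examples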

.. code-block:: python
A concrete example:
Review comment (Member): Maybe let's add some notes about consistency? Not sure if here or in general. We want ops to be consistent, e.g. keyword-only `memory_config = None`; keyword-only `queue_id` in Python and an overload in C++; keyword-only optional output tensor(s); keyword-only `dtype`. Specific names, specific defaults, EXCEPT when not possible and something else is needed.

Reply (Contributor Author): Will do in the next PR.


    from typing import Optional, Tuple
    import torch
    import ttnn
    from tests.ttnn.utils_for_testing import check_with_pcc
`ttnn/python/ttnn/operations/examples/example/example_pybind.hpp`:

.. literalinclude:: examples/example/example_pybind.hpp

    parameters = {
        "height": [384, 1024],
        "width": [1024, 4096],
    }
`ttnn/python/ttnn/operations/examples/examples_pybind.hpp`:

.. literalinclude:: examples/example/example_pybind.hpp

    def run(
        height,
        width,
        *,
        device,
    ) -> Tuple[bool, Optional[str]]:
Finally, call the module defined in `examples/example/example_pybind.hpp` wherever you want it to be added.
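The `ttnn/cpp/pybind11/operations/__init__.hpp` diff later in this PR shows exactly this final step:

.. code-block:: cpp

    // From ttnn/cpp/pybind11/operations/__init__.hpp in this PR:
    auto m_example = module.def_submodule("example", "example operation");
    examples::py_module(m_example);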

        torch_input_tensor = torch.rand(1, 1, height, width)
        torch_output_tensor = torch.exp(torch_input_tensor)

        input_tensor = ttnn.from_torch(torch_input_tensor, device=device)
        output_tensor = ttnn.experimental.tensor.<new_operation>(input_tensor)

        output_tensor = ttnn.to_torch(output_tensor)
Step 2: Register the operation in Python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

        return check_with_pcc(torch_output_tensor, output_tensor)
TODO: Add the description of how to register the operation in Python.
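Until that description lands, a hedged sketch of what step 2 from the overview above ("decorate the operation using `ttnn.register_operation`") can look like; the decorator name comes from this page, but its arguments and the `ttnn._ttnn` module path are assumptions:

.. code-block:: python

    import ttnn

    # Hypothetical sketch: wrap the C++-bound operation so the Python-side
    # decorator can add validation hooks, golden-function comparison, etc.
    example = ttnn.register_operation()(ttnn._ttnn.operations.examples.example)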
1 change: 1 addition & 0 deletions docs/source/ttnn/ttnn/examples
4 changes: 1 addition & 3 deletions tt_eager/tt_dnn/op_library/composite/composite_ops.hpp
@@ -12,10 +12,8 @@
#include "tt_dnn/op_library/eltwise_unary/eltwise_unary_op.hpp"
#include "tt_metal/common/constants.hpp"
#include "ttnn/cpp/ttnn/operations/creation.hpp"


#include "ttnn/operations/eltwise/binary/device/binary_op.hpp"
#include "ttnn/operations/data_movement/pad/pad.hpp"
#include "ttnn/operations/eltwise/binary/device/binary_device_operation.hpp"

namespace tt {

8 changes: 7 additions & 1 deletion tt_metal/tools/profiler/op_profiler.hpp
@@ -385,7 +385,13 @@ inline std::string op_meta_data_serialized_json(

j["optional_input_tensors"] = std::vector<json>{};

auto perfModel = operation_t::create_op_performance_model(operation_attributes, tensor_args, tensor_return_value);
Review comment (Member): Do we want to write some more info on "how to perf model"?

auto perfModel = [&]() {
    if constexpr (requires { operation_t::create_op_performance_model; }) {
        return operation_t::create_op_performance_model(operation_attributes, tensor_args, tensor_return_value);
    } else {
        return operation::OpPerformanceModel{};
    }
}();
j["performance_model"]["compute_ns"] = perfModel.get_compute_ns();
j["performance_model"]["ideal_ns"] = perfModel.get_ideal_ns();
j["performance_model"]["bandwidth_ns"] = perfModel.get_bandwidth_ns();
5 changes: 4 additions & 1 deletion ttnn/CMakeLists.txt
@@ -5,12 +5,15 @@ set(TTNN_SRCS
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/conv2d.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/matmul.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/reduction/argmax/device/argmax_op.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/eltwise/binary/device/binary_op.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/eltwise/binary/device/binary_device_operation.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/eltwise/binary/device/broadcast_height_and_width_multi_core_program_factory.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/eltwise/binary/device/broadcast_height_multi_core_program_factory.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/eltwise/binary/device/broadcast_width_multi_core_program_factory.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/eltwise/binary/device/element_wise_multi_core_program_factory.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/eltwise/unary/device/unary_op.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/examples/example/device/example_device_operation.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/examples/example/device/single_core_program_factory.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/examples/example/device/multi_core_program_factory.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/cpp/ttnn/operations/reduction/topk/device/topk_op.cpp
)

8 changes: 6 additions & 2 deletions ttnn/cpp/pybind11/operations/__init__.hpp
@@ -9,6 +9,7 @@

#include "pybind11/operations/ccl.hpp"
#include "pybind11/operations/conv2d.hpp"
#include "pybind11/operations/copy.hpp"
#include "pybind11/operations/core.hpp"
#include "pybind11/operations/creation.hpp"
#include "pybind11/operations/embedding.hpp"
@@ -17,12 +18,12 @@
#include "pybind11/operations/maxpool2d.hpp"
#include "pybind11/operations/normalization.hpp"
#include "pybind11/operations/pool.hpp"
#include "pybind11/operations/copy.hpp"
#include "pybind11/operations/ternary.hpp"
#include "pybind11/operations/transformer.hpp"

#include "ttnn/operations/eltwise/binary/binary_pybind.hpp"
#include "ttnn/operations/eltwise/binary_backward/binary_backward_pybind.hpp"
#include "ttnn/operations/eltwise/unary/unary_pybind.hpp"
#include "ttnn/operations/examples/examples_pybind.hpp"
#include "ttnn/operations/reduction/reduction_pybind.hpp"
#include "ttnn/operations/eltwise/binary_backward/binary_backward_pybind.hpp"
#include "ttnn/operations/data_movement/data_movement_pybind.hpp"
@@ -35,6 +36,9 @@ namespace ttnn {
namespace operations {

void py_module(py::module& module) {
    auto m_example = module.def_submodule("example", "example operation");
    examples::py_module(m_example);

    auto m_unary = module.def_submodule("unary", "unary operations");
    unary::py_module(m_unary);

16 changes: 8 additions & 8 deletions ttnn/cpp/ttnn/device_operation.hpp
@@ -20,14 +20,14 @@ namespace ttnn {

namespace device_operation {

template <typename program_attributes_t>
template <typename shared_variables_t>
struct CachedProgram {
    tt::tt_metal::Program program;
    // Cached program needs to share program_attributes between create and override_runtime_arguments functions
    program_attributes_t program_attributes;
    // Cached program needs to share shared_variables between create and override_runtime_arguments functions
    shared_variables_t shared_variables;

    CachedProgram(tt::tt_metal::Program&& program, program_attributes_t&& program_attributes) :
        program{std::move(program)}, program_attributes{program_attributes} {}
    CachedProgram(tt::tt_metal::Program&& program, shared_variables_t&& shared_variables) :
        program{std::move(program)}, shared_variables{shared_variables} {}
};

struct CachedProgramFactory {
@@ -38,8 +38,8 @@ struct CachedProgramFactory {
    // program_factory_index is used to map a runtime value to a program factory type that is being used
    std::size_t program_factory_index;

    template <typename program_attributes_t>
    CachedProgramFactory(CachedProgram<program_attributes_t>&& cached_program, std::size_t program_factory_index) :
    template <typename shared_variables_t>
    CachedProgramFactory(CachedProgram<shared_variables_t>&& cached_program, std::size_t program_factory_index) :
        cached_program{std::move(cached_program)}, program_factory_index{program_factory_index} {}
};

@@ -91,7 +91,7 @@ concept DeviceOperationWithCustomProgramCacheConcept = DeviceOperationConcept<de
template <typename... Ts>
[[nodiscard]] std::variant<Ts...> constexpr map_index_to_variant(std::size_t i, std::variant<Ts...>) {
    assert(i < sizeof...(Ts));
    static constexpr std::variant<Ts...> table[] = { Ts{ }... };
    static constexpr std::variant<Ts...> table[] = {Ts{}...};
    return table[i];
}
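`map_index_to_variant` builds a table with one default-constructed alternative per type in the variant, so a runtime index (such as the cached `program_factory_index`) can be mapped back to the corresponding program-factory alternative. A standalone sketch of the idea (the function here is deliberately non-constexpr so the `static` table is valid before C++23; the factory types are illustrative):

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <variant>

    struct SingleCore {};
    struct MultiCore {};

    template <typename... Ts>
    [[nodiscard]] std::variant<Ts...> map_index_to_variant(std::size_t i, std::variant<Ts...>) {
        assert(i < sizeof...(Ts));
        // One default-constructed alternative per type in the pack.
        static constexpr std::variant<Ts...> table[] = {Ts{}...};
        return table[i];
    }

    int main() {
        using program_factory_t = std::variant<SingleCore, MultiCore>;
        auto factory = map_index_to_variant(1, program_factory_t{});
        assert(std::holds_alternative<MultiCore>(factory));
    }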
