Create Infrastructure to exactly calculate L1 Memory Usage for Conv2D #15088 #15455
Clang-Tidy found issue(s) with the introduced code (1/1)
. (I was looking at the wrong diff)
.
uint32_t input_channels_alignment = 32;
bool deallocate_activation = false;
bool reallocate_halo_output = false;
uint32_t act_block_h_override = 0;  // This argument is ignored when shard_layout == WIDTH_SHARDED.
Since this config is user-facing, I think it would be nicer to organize the comments per parameter, something like:
/* this arg does that
*/
dtype arg = default_val;
Fixed it. Is this okay?
}

conv_op_l1_usage conv2d::estimate_L1_usage(
This is an exact calculation, right? Perhaps rename it to calculate_L1_usage?
Yes. I have renamed it.
device->get_memory_allocation_statistics(tt::tt_metal::BufferType::L1).total_allocated_bytes;
auto actual_cb_size = program_with_cbs.program.get_cb_memory_size();

auto [calc_output_size, calc_CB_size] = estimate_L1_usage(
Will this cause an e2e perf degradation?
The function itself just does a handful of calculations, so it should take very little time. It's invoked when the program object is created, in operation::launch_op. AFAIK, that should only be called the first time, and subsequent calls should hit the cache.
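To illustrate the caching argument in the reply above, here is a minimal sketch of a program cache that runs the L1 calculation only on a cache miss. It is not the tt-metal implementation: `ProgramCache`, `CachedProgram`, `get_or_create`, and the string key are all hypothetical names chosen for this example.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical cached program; it only records the L1 figure computed at build time.
struct CachedProgram {
    std::uint64_t calculated_l1_bytes;
};

// Hypothetical program cache keyed by a string describing the op configuration.
class ProgramCache {
public:
    // Returns the cached program for `key`. The L1 calculation runs only on a miss,
    // so repeated calls with the same key do not pay for it again.
    const CachedProgram& get_or_create(const std::string& key,
                                       const std::function<std::uint64_t()>& calculate_l1) {
        auto it = programs_.find(key);
        if (it == programs_.end()) {
            ++build_count_;
            it = programs_.emplace(key, CachedProgram{calculate_l1()}).first;
        }
        return it->second;
    }

    // Number of times a program was actually built (cache misses).
    int build_count() const { return build_count_; }

private:
    std::unordered_map<std::string, CachedProgram> programs_;
    int build_count_ = 0;
};
```

Under this assumption, the calculation cost is paid once per distinct configuration rather than once per call.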
@sankarmanoj-tt please hold off on merging until @pavlejosipovic also approves.
Unblocking. I am still not clear why existing tracing functionality can’t be leveraged, but I assume you did the research.
Ticket
#15088
Problem description
Accurate ahead-of-time calculation of L1 usage is a prerequisite for many features and optimizations.
What's changed
Implemented code to calculate L1 usage and cross-verify it against the actual L1 usage. The L1 usage of an op consists of the output tensor and the circular buffers.
The L1 usage of the output tensor can be obtained by comparing the global L1 memory allocation stats before and after the op.
The L1 usage of the circular buffers is obtained from the program object. A new method was added to the program object to return the total memory used by circular buffers before memory is allocated.
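The cross-verification described above could be sketched roughly as follows. This is a simplified, self-contained illustration and not the code in this PR: `MockL1Allocator`, `MockProgram`, `CbSpec`, `calculate_l1_usage`, and `verify_l1_usage` are hypothetical stand-ins, and the size formulas (elements times bytes per element, pages times page size) are assumptions made for the example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the device's global L1 allocation statistics.
struct MockL1Allocator {
    std::uint64_t total_allocated_bytes = 0;
    void allocate(std::uint64_t bytes) { total_allocated_bytes += bytes; }
};

// Hypothetical description of one circular buffer (CB).
struct CbSpec {
    std::uint64_t num_pages;
    std::uint64_t page_size_bytes;
};

// Hypothetical program object that can report its total CB memory
// before any L1 memory has actually been allocated.
struct MockProgram {
    std::vector<std::uint64_t> cb_sizes_bytes;
    std::uint64_t cb_memory_size() const {
        std::uint64_t total = 0;
        for (std::uint64_t size : cb_sizes_bytes) total += size;
        return total;
    }
};

struct L1Usage {
    std::uint64_t output_bytes;
    std::uint64_t cb_bytes;
};

// Ahead-of-time calculation from the op parameters alone (assumed formulas).
L1Usage calculate_l1_usage(std::uint64_t output_elements,
                           std::uint64_t bytes_per_element,
                           const std::vector<CbSpec>& cbs) {
    L1Usage usage{output_elements * bytes_per_element, 0};
    for (const CbSpec& cb : cbs) usage.cb_bytes += cb.num_pages * cb.page_size_bytes;
    return usage;
}

// Runs a toy "op" and checks the calculated figures against the measured ones.
bool verify_l1_usage(MockL1Allocator& allocator,
                     std::uint64_t output_elements,
                     std::uint64_t bytes_per_element,
                     const std::vector<CbSpec>& cbs) {
    const L1Usage calculated = calculate_l1_usage(output_elements, bytes_per_element, cbs);

    // CB usage: read from the program object, before allocation.
    MockProgram program;
    for (const CbSpec& cb : cbs) program.cb_sizes_bytes.push_back(cb.num_pages * cb.page_size_bytes);
    const std::uint64_t actual_cb_bytes = program.cb_memory_size();

    // Output usage: difference in allocation stats before and after the op.
    const std::uint64_t before = allocator.total_allocated_bytes;
    allocator.allocate(output_elements * bytes_per_element);  // the toy op allocates its output
    const std::uint64_t actual_output_bytes = allocator.total_allocated_bytes - before;

    return calculated.output_bytes == actual_output_bytes && calculated.cb_bytes == actual_cb_bytes;
}
```

Taking the before/after difference means allocations that already existed in L1 do not affect the comparison for the output tensor.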
Checklist