#9837: Assign workers after performing ref count cleanup in async mode #9944

tt-asaigal · 2024-07-03T23:52:15Z

Ticket

Problem description

A segfault occurs when generating the rotation matrix cache for LLMs. It was never seen on CI because the CI runners already have the cache created, so the generation path does not run there.

What's changed

Reference count management for tensors was incorrect in async mode. Reassigning a device tensor to a host tensor was not handled correctly:

device_tensor = device_tensor.cpu()

After this assignment, the reference count for the original device tensor was never decremented.

The fix: in the tensor copy and move assignment operators, assign the worker vector only after the ref count has been decremented.
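The ordering issue can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the tt-metal implementation: `ToyTensor`, `assign_buggy`, `assign_fixed`, and `demo` are hypothetical names, and the assumption that ref count cleanup consults the worker list (and is skipped when that list is empty, as for a host tensor) is one plausible reading of the description above.

```python
# Toy model only: every name here is hypothetical, not the tt-metal API.
class ToyTensor:
    def __init__(self, workers, storage):
        self.workers = list(workers)  # empty list stands in for a host tensor
        self.storage = storage        # shared dict holding a "refs" counter
        self.storage["refs"] += 1

    def _release(self):
        # Assumption: cleanup only runs when the tensor has workers.
        if self.workers:
            self.storage["refs"] -= 1

    def assign_buggy(self, other):
        # Workers are overwritten first, so cleanup sees the host
        # tensor's empty worker list and skips the decrement.
        self.workers = list(other.workers)
        self._release()
        self.storage = other.storage
        self.storage["refs"] += 1

    def assign_fixed(self, other):
        # Cleanup runs against the old workers, then workers are assigned.
        self._release()
        self.workers = list(other.workers)
        self.storage = other.storage
        self.storage["refs"] += 1


def demo(fixed):
    """Return the device storage ref count after reassigning to a host tensor."""
    device_storage = {"refs": 0}
    host_storage = {"refs": 0}
    device_tensor = ToyTensor(["dev0"], device_storage)
    host_tensor = ToyTensor([], host_storage)
    if fixed:
        device_tensor.assign_fixed(host_tensor)
    else:
        device_tensor.assign_buggy(host_tensor)
    return device_storage["refs"]


print(demo(fixed=False))  # 1: the device storage reference is leaked
print(demo(fixed=True))   # 0: the device storage reference is released
```

In this model the only difference between the two methods is whether the old worker list is still in place when cleanup runs, which mirrors the reordering described in this PR.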

Checklist

Post commit CI passes
Model regression CI testing passes (if applicable)
New/Existing tests provide coverage for changes

- This handles cases where a device tensor is reassigned to a host tensor
- Exposed during model cache generation, which uses the following pattern: device_tensor = device_tensor.cpu()

tt-asaigal requested review from arakhmati, eyonland, cfjchu and xanderchin as code owners July 3, 2024 23:52

arakhmati approved these changes Jul 4, 2024


#9837: Assign workers after performing ref count cleanup in async mode

112ea79


tt-asaigal force-pushed the asaigal/cache_gen_segfault_rebased branch from 4558673 to 112ea79 July 4, 2024 16:46

tt-asaigal merged commit d0ac529 into main Jul 4, 2024
5 checks passed

tt-asaigal deleted the asaigal/cache_gen_segfault_rebased branch July 4, 2024 16:47
