
Refactor nms into TorchVision variant. #6814

Merged
merged 7 commits into master from ysiraichi/refactor-into-torchvision-nms on Apr 1, 2024

Conversation

ysiraichi
Collaborator

This PR refactors the existing nms lowering implementation, changing the relevant parts so that it complies with TorchVision semantics. In summary:

  • Implement a new nms lowering, based on the old implementation
    • More comments
    • TorchVision semantics
  • Adapt the TorchVision tests for nms
  • Move the lowering implementation to torch_xla/csrc/xla_lower_util.cpp
    • Alongside the other lowering implementations
  • Register the new kernel as the XLA dispatch for torchvision::nms
    • Makes it possible to call the kernel by calling torchvision.ops.nms(...) directly (see the sketch below)
  • Remove the old implementation
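For context, a minimal usage sketch of what the new dispatch registration enables (illustrative only, not code from this PR), assuming an environment with torch, torchvision, and torch_xla available:

import torch
import torchvision
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# Boxes in (x1, y1, x2, y2) format, plus one confidence score per box.
boxes = torch.tensor(
    [[0.0, 0.0, 10.0, 10.0],
     [1.0, 1.0, 11.0, 11.0],
     [20.0, 20.0, 30.0, 30.0]],
    device=device)
scores = torch.tensor([0.9, 0.8, 0.7], device=device)

# With the torchvision::nms kernel registered for XLA, this call dispatches to
# the new lowering instead of requiring a CPU fallback.
keep = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
print(keep.cpu())  # indices of the boxes that survive suppression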

cc @miladm @JackCaoG

@ysiraichi ysiraichi requested a review from JackCaoG March 23, 2024 23:39
@ysiraichi
Collaborator Author

I'm having a bit of trouble identifying the cause of the CI failure. Basically, if we set XLA_USE_EAGER_DEBUG_MODE=1, the test TestNMS::test_nms_ref starts failing with the following error:

RuntimeError: Bad StatusOr access: INVALID_ARGUMENT: Executable expected parameter 0 of size 4000
but got buffer with incompatible size 4004

After looking into where this error was coming from, I found out that it was due to PyTorch/XLA trying to figure out the actual size of the output.

My guess is that, since I call xla::SetDimensionSize in the lowering, one of the dimensions is marked as dynamic, which makes XLATensorImpl::sym_sizes_ be populated with a c10::SymInt. When we try to retrieve the actual size (triggered by a to_functional_tensor(t) call on the result of nms), PyTorch/XLA figures it out by actually running the computation.


Given all this information, I'm not sure where the "buffer with incompatible size 4004" is coming from. test_nms_ref does create a tensor of size 4000 (1000x4 elements), but where the extra 4 comes from is unclear to me.

Another thing I'm puzzled about is why this test fails only when XLA_USE_EAGER_DEBUG_MODE is set. Why wouldn't a normal test run trigger the same error?

@JackCaoG Have you ever seen an error like this? I believe other lowerings also call this same function (e.g. BuildMaskedSelect), so it doesn't seem to be a problem with xla::SetDimensionSize itself.
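For comparison, a small sketch of what I mean (illustrative only; the exact behavior under eager debug mode is the open question): masked_select also produces a data-dependent output size on XLA, since its lowering uses the same mechanism.

import torch
import torch_xla.core.xla_model as xm

t = torch.randn(8, device=xm.xla_device())
# The number of selected elements depends on the data, so the output has a
# dynamic dimension, just like the result of the new nms lowering.
out = torch.masked_select(t, t > 0)
print(out.cpu())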

@JackCaoG
Collaborator

XLA_USE_EAGER_DEBUG_MODE is a debug flag that tries to execute the value of each node upon construction. You can think of it as calling mark_step every time an IR node is created.
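In other words (a paraphrase of the above, not an exact description of the flag's implementation), the effect is roughly:

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(4, 4, device=device)

# Normal lazy mode: ops accumulate in the IR graph and are compiled/executed
# together at the next sync point.
y = x + 1
z = y * 2
xm.mark_step()

# Under XLA_USE_EAGER_DEBUG_MODE=1, each op is materialized as soon as its IR
# node is created, roughly as if mark_step() ran after every line above.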

@ysiraichi
Collaborator Author

Right. So, what is puzzling is why it would fail only when XLA_USE_EAGER_DEBUG_MODE is set. Do you have any guesses?

@JackCaoG
Collaborator

Not really, but feel free to skip the test when XLA_USE_EAGER_DEBUG_MODE is enabled. nms is a rarely used op that isn't even an ATen op, and XLA_USE_EAGER_DEBUG_MODE is a debug-only flag, so I think that is OK.
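A minimal sketch of such a skip (the condition shown here is an assumption; the test file ends up using a @skipOnEagerDebug decorator, visible in the test snippet quoted later in this conversation):

import os
import unittest

# Hypothetical helper: skip a test whenever the debug-only flag is enabled.
skipOnEagerDebug = unittest.skipIf(
    os.environ.get("XLA_USE_EAGER_DEBUG_MODE", "0") == "1",
    "skipped under XLA_USE_EAGER_DEBUG_MODE")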

@ysiraichi ysiraichi force-pushed the ysiraichi/refactor-into-torchvision-nms branch from a4e90c8 to be21831 on March 26, 2024 19:53
const xla::Shape& boxes_shape = ShapeHelper::ShapeOfXlaOp(boxes);
XLA_CHECK_EQ(boxes_shape.rank(), 2);
XLA_CHECK_EQ(boxes_shape.dimensions(1), COORDINATES);
int64_t num_boxes = boxes_shape.dimensions(0);
Collaborator

I wonder whether the implementation here is a copy (with some modifications) of the deleted nms_op.cpp, or a brand-new implementation.

Collaborator Author

I would say that it is a new implementation, heavily based on the previous one. The problems I saw with the previous implementation were:

  • Almost no comments: it was hard to tell what some parts of the code were doing
  • Copied from another source: it was copied from an old version of TensorFlow
  • Different signature and semantics: it returned the best output_size box indices, even though some of them might have been suppressed

I believe this implementation checks all of the above-mentioned boxes. I added plenty of comments, plus a few changes that resulted in more maintainable code (IMHO).
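For reference, a pure-PyTorch sketch of the greedy algorithm with TorchVision semantics (indices returned in decreasing score order, and only for boxes that were actually kept). This is an illustrative restatement of the semantics, not the XLA lowering itself:

import torch

def nms_reference(boxes, scores, iou_threshold):
    # boxes: [N, 4] in (x1, y1, x2, y2) format; scores: [N].
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # IoU of the highest-scoring remaining box with all the others.
        lt = torch.maximum(boxes[i, :2], boxes[rest, :2])
        rb = torch.minimum(boxes[i, 2:], boxes[rest, 2:])
        inter = (rb - lt).clamp(min=0).prod(dim=1)
        area_i = (boxes[i, 2:] - boxes[i, :2]).prod()
        area_rest = (boxes[rest, 2:] - boxes[rest, :2]).prod(dim=1)
        iou = inter / (area_i + area_rest - inter)
        # Suppress boxes that overlap the kept box too much; keep the rest.
        order = rest[iou <= iou_threshold]
    return torch.tensor(keep, dtype=torch.int64)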

@ysiraichi
Collaborator Author

@JackCaoG @vanbasten23 I think this PR is ready. Could you take a look at it when you have some time?

init_values, "BoxSelectionLoop", builder));
}

xla::XlaOp BuildNms(xla::XlaOp boxes, xla::XlaOp scores,
Collaborator

Is there any resource that you referenced when coming up with the implementation?

Collaborator Author

I based this implementation on two other implementations:

  • The old PyTorch/XLA nms implementation
  • The TorchVision CUDA nms implementation

@ysiraichi ysiraichi merged commit 6cf9b91 into master Apr 1, 2024
18 checks passed
return boxes, scores

@skipOnEagerDebug
def test_nms_ref(self):
Collaborator

@ysiraichi can you please add a dynamic shape input test scenario for nms? Because this op is dynamic, it has been falling back to CPU. Of course, there is a lot of interest in bringing the op to the XLA device, though this requires correct handling of dynamism.

Collaborator

For more clarity: the number of boxes is set to 1000 at the moment; we want that number to be dynamic.

Collaborator

To be clear, the output of this test appears to be dynamic, though the input number of boxes can also be dynamic for nms. This test currently covers one dynamism scenario (i.e., output dynamism).
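One way such a dynamic-input scenario might look (a sketch under my own assumptions about how input dynamism would be introduced, not a committed test): derive a data-dependent number of boxes first, then feed them to nms.

import torch
import torchvision
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# Build 1000 well-formed boxes (x2 >= x1, y2 >= y1) and random scores.
xy1 = torch.rand(1000, 2, device=device) * 100
wh = torch.rand(1000, 2, device=device) * 10
boxes = torch.cat([xy1, xy1 + wh], dim=1)
scores = torch.rand(1000, device=device)

# Filter by score: how many boxes survive is data-dependent, so the input to
# nms now has a dynamic first dimension (in addition to the dynamic output).
mask = scores > 0.5
dyn_boxes = boxes[mask]
dyn_scores = scores[mask]

keep = torchvision.ops.nms(dyn_boxes, dyn_scores, iou_threshold=0.5)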
