perf: Python binding inference performance improvement #426

Open · wants to merge 10 commits into base: main

Conversation

@kthui (Contributor) commented on Jan 11, 2025

What does the PR do?

Refactor infer() and async_infer() APIs to handle memory allocation and callbacks internally in C++, and only expose the basic interface for the Python iterator to fetch responses.
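For orientation, here is a rough sketch of the caller-facing flow this change targets; the option, model name, and input values below are illustrative placeholders and reflect my understanding of the in-process Python API rather than code from this PR:

```python
import tritonserver

# Illustrative setup; adjust the repository path and model name to your environment.
options = tritonserver.Options(model_repository="/path/to/models")
server = tritonserver.Server(options).start()

# infer() still hands the caller a response iterator; the difference after this PR
# is that memory allocation and response callbacks happen inside the C++ binding
# rather than in Python.
for response in server.model("identity").infer(inputs={"INPUT0": [[1, 2, 3, 4]]}):
    print(response.outputs)
```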

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

triton-inference-server/server#7949

Where should the reviewer start?

Start with tritonserver_pybind.cc for the interface change, then move on to _model.py to see how the Python iterator interacts with the interface. Finally, look at _request.py and _response.py to see how they support the Python iterator.

For testing, start with test_binding.py and test_api.py, and then _tensor.py for the DLPack limitation regarding bytes.

Test plan:

The existing L0_python_api test is sufficient to catch any regression from this performance improvement. It has been modified to exercise the new interface.

  • CI Pipeline ID: 22617754

Caveats:

Users are no longer able to specify a custom:

  • request release callback
  • response allocator
  • response callback

Currently, only CPU memory output is supported at the binding level, so GPU memory output will involve an extra D2H copy at the backend and an H2D copy at the frontend. This will be resolved as a follow-up.

The test_stop failure will have to be triaged and fixed as a follow-up.

Background

N/A

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

* fix request tensor lifecycle

* remove prints for benchmarking

* remove one h2h response copy

* add allocator delete

* schedule future.set_result to be called in event loop
@kthui added the "PR: perf (A code change that improves performance)" label on Jan 11, 2025
@kthui self-assigned this on Jan 11, 2025
* Use tensor object in output

* Enable infer() to use the improved binding

* Pass the C pointer directly to Python

* Move output device if differ from request setting

* Copyright and pre-commit
@kthui force-pushed the jacky-py-res-callback branch from 300b0b6 to 131078a on January 11, 2025 02:52
@kthui marked this pull request as ready for review on January 15, 2025 19:43
@@ -137,6 +137,7 @@ def test_memory_fallback_to_cpu(self, server_options):

tritonserver.default_memory_allocators[tritonserver.MemoryType.GPU] = allocator
Contributor:

should this be removed - or changed in some way to indicate the allocator is internal ....

Contributor Author:

yes, updated the test to indicate the allocator is internal, and it will always use CPU memory regardless of the backend memory preference.

@@ -164,6 +165,7 @@ def test_memory_allocator_exception(self, server_options):
):
pass

@pytest.mark.skip(reason="Skipping test, infer no longer use allocator")
Contributor:

we should keep this but refactor - if user requests output memory type gpu and that is not supported by the internal allocator - we would still want to raise an exception during inference

Contributor Author:

yes, we should raise an exception if the output memory type specified on the request is not supported, but currently the binding does not accept a requested output memory type, so I think we can skip this test for now and add proper testing after adding support for allocating GPU memory.

Contributor:

hmm, but if we fail because say cupy is not available? Can we still make sure the right error gets propagated?

Contributor Author:

yes, the "test_unsupported_memory_type" is repurposed for testing moving outputs to unsupported memory type

@@ -418,6 +420,9 @@ def test_ready(self, server_options):
server = tritonserver.Server(server_options).start()
assert server.ready()

@pytest.mark.skip(
reason="Skipping test, some request/response object may not be released which may cause server stop to fail"
Contributor:

Can we use xfail instead of skip?

@rmccorm4 - what do you think - as we had spent time on fixing this - how much an issue is this?

Contributor Author:

yes, switched to xfail.
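For reference, a minimal sketch of the marker that replaces the skip; the reason string is paraphrased from the discussion above:

```python
import pytest


@pytest.mark.xfail(
    reason="some request/response object may not be released, which may cause server stop to fail"
)
def test_stop():
    ...
```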

server.infer_async(request, trace)

# [FIXME] WAR due to trace lifecycle is tied to response in Triton core,
# trace reference should drop on response send..
res = response_queue.get(block=True, timeout=10)
future = concurrent.futures.Future()
request.get_next_response(future)
Contributor:

question - why does get_next_response not return a future, instead of taking one in?

Contributor Author:

This workflow is mainly centered around the asyncio future, which requires a running loop to be present when it is created, for example:

loop = asyncio.get_running_loop()
future = loop.create_future()

Calling asyncio.get_running_loop() outside the async method in which the future will be awaited may end up not finding the loop, or finding the wrong one, so I think it is more robust to have the AsyncIterator create the future and then pass it into the binding.

The Iterator future simply follows the async routine for a more similar interface, but in the future I think we can improve it by having the async get_next_response() return an awaitable and the normal one simply block until there is a response.
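A small, self-contained sketch of the pattern described above; fake_binding is a hypothetical stand-in for the C++ get_next_response() binding, not part of this PR:

```python
import asyncio
import threading


def fake_binding(future):
    """Stand-in for the binding: a worker thread delivers the response later."""
    def deliver():
        # A result must be set on the future's own loop thread.
        future.get_loop().call_soon_threadsafe(future.set_result, "response")
    threading.Thread(target=deliver).start()


async def get_next_response():
    # Create the future while the awaiting loop is running, then hand it to the
    # binding, which sets the result once a response arrives.
    future = asyncio.get_running_loop().create_future()
    fake_binding(future)
    return await future


print(asyncio.run(get_next_response()))  # -> "response"
```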

) = out
ctypes_buffer = ctypes.create_string_buffer(byte_size)
ctypes.memmove(ctypes_buffer, out_buffer, byte_size)
numpy_buffer = numpy.frombuffer(ctypes_buffer, dtype=numpy.byte)
Contributor:

who owns the memory in this case -

does it become owned by numpy or ctypes_buffer ... just for my understanding

Contributor Author:

The out_buffer is owned by the res object. The ctypes_buffer is a copy of the out_buffer, and ctypes_buffer owns that copy itself.

The numpy_buffer behavior is not entirely clear from the docs, but from my experiments it references the ctypes_buffer: the numpy_buffer is marked as read-only and cannot interface with DLPack without a copy.
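A small sketch illustrating the ownership described above; the source array stands in for out_buffer, which the real code reads from the response:

```python
import ctypes
import numpy

source = (ctypes.c_ubyte * 4)(1, 2, 3, 4)        # stand-in for out_buffer (owned by res)

ctypes_buffer = ctypes.create_string_buffer(4)   # owns its own 4 bytes
ctypes.memmove(ctypes_buffer, source, 4)         # explicit copy out of the response buffer

numpy_buffer = numpy.frombuffer(ctypes_buffer, dtype=numpy.byte)  # a view, not a copy

ctypes_buffer[0] = b"\x09"
assert numpy_buffer[0] == 9                      # the numpy array references ctypes_buffer
```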

@@ -31,6 +31,7 @@
#include <triton/core/tritonserver.h>

#include <iostream>
#include <mutex>
Contributor:

oh no, one of my least favorite things to think about :)

Contributor Author:

😅

* Indicate allocator is internal on test_memory_fallback_to_cpu()

* Remove test_memory_allocator_exception() as infer no longer use custom allocators

* Update reason for skipping test_unsupported_memory_type()

* Mark test_stop() with xfail instead of skip
const char* tensor_name, size_t* byte_size,
TRITONSERVER_MemoryType* memory_type, int64_t* memory_type_id)
{
*memory_type = TRITONSERVER_MEMORY_CPU;
Contributor:

can we add a todo to support GPU?

Contributor Author:

yes, it is added

TRITONSERVER_MemoryType* actual_memory_type,
int64_t* actual_memory_type_id)
{
*buffer = malloc(byte_size * sizeof(uint8_t));
Contributor:

are there tritonserver apis we should be using for memory allocation instead of directly calling malloc?

*actual_memory_type = TRITONSERVER_MEMORY_CPU;
*actual_memory_type_id = 0;
// The response allocator needs to be kept alive until the allocated memory
// is released via the release function.
Contributor:

is this true? - question: why not an allocator singleton for cpu - and reuse it for all requests?

Contributor Author (@kthui, Jan 17, 2025):

yes, the core will need the allocator opaque object to find the release function that was set. The allocator opaque object is basically this class, which stores the release function pointer.

yes, updated to use singleton allocator object for all instances of the request wrapper.

int64_t memory_type_id)
{
free(buffer);
// Release ownership of the response allocator.
Contributor:

thinking a singleton could be simpler and avoid alloc / dealloc of the allocator object itself - if possible.

Contributor Author:

yes, updated to use a singleton response allocator.

// Release ownership of the response allocator.
std::unique_ptr<std::shared_ptr<struct TRITONSERVER_ResponseAllocator>>
allocator_ptr(reinterpret_cast<
std::shared_ptr<struct TRITONSERVER_ResponseAllocator>*>(
Contributor:

actually - what do we use the buffer_userp for?

Contributor Author:

it was used to store the shared pointer to the response allocator object and was passed to the release function, so the reference count could be increased/decreased along with malloc/free; that ensured the allocator was destructed only after the last allocated response memory was freed.

but it is no longer necessary with the singleton allocator.

if (response_future_.get() != nullptr) {
throw AlreadyExistsError("cannot call GetNextResponse concurrently");
}
response_future_.reset(new py::object(py_future));
Contributor:

not sure I follow the logic here - why do we create a new object based on the passed in object? why not a reference on the original object?

second: how does the swap below work?

Contributor Author (@kthui, Jan 17, 2025):

I tried holding a raw pointer to the original object, but the pointer could no longer be dereferenced (segmentation fault) after GetNextResponse() returned, so the only safe way of holding on to the future object after GetNextResponse() returns is to increment the reference count on the future object, which is achieved by "copying" it into a new object.

The reason for using a unique pointer is to have a mechanism for checking whether a future object is pending to be set with the next response without accessing any Python variable, to avoid holding the GIL for that simple check. It could be achieved with a simple bool flag, but I think it is safer to have one variable instead of two, to avoid setting one and forgetting the other.

{
py::gil_scoped_acquire gil;
std::unique_ptr<py::object> response_future_local(nullptr);
response_future_.swap(response_future_local);
Contributor:

not sure I follow this - we create a new py object from null, swap it with the stored one (releasing it?) - then set result on the nullptr based object?...

Or ... I guess swap works the opposite way :)

we swap the null ptr for the one stored in the future ....

Contributor Author:

The guarantee here that GetNextResponse() will not be called and receive a new future object (vs. this future object) is that this future object is not yet set with a result. So as soon as this future object is set with a result, we need to be ready for the next future object from GetNextResponse() that may take its place. Thus, the solution here is to move the current future object from the class member into a local variable, making the class member ready for the next future object before the current one is set with a result.

py_response.second);
if (py::hasattr(py_future, "get_loop")) {
py_future.attr("get_loop")().attr("call_soon_threadsafe")(
py_future.attr("set_result"), std::move(py_res));
Contributor:

neat! So this support async and non async?

Contributor Author:

yes
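For reference, a Python rendering of the dispatch shown in the C++ snippet above; set_result_threadsafe is an illustrative name, not a function in this PR:

```python
import concurrent.futures


def set_result_threadsafe(py_future, result):
    if hasattr(py_future, "get_loop"):
        # asyncio.Future exposes get_loop(); its result must be set on the
        # event loop's own thread, so hop onto the loop first.
        py_future.get_loop().call_soon_threadsafe(py_future.set_result, result)
    else:
        # concurrent.futures.Future is safe to complete from any thread.
        py_future.set_result(result)


sync_future = concurrent.futures.Future()
set_result_threadsafe(sync_future, "response")
assert sync_future.result() == "response"
```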

inference_request.response_queue,
raise_on_error,
)
self._server.infer_async(request)
Contributor:

how do we pass in the requested memory type to the C++ side?

Contributor Author:

I think we can add a new binding function to the request wrapper that allows setting a requested output memory type on the request wrapper object; the response allocation function can then read the requested memory type from the request wrapper object and allocate the correct memory type.


def infer(
self,
inference_request: Optional[InferenceRequest] = None,
raise_on_error: bool = True,
**kwargs: Unpack[InferenceRequest],
) -> ResponseIterator:
) -> Iterable:
Contributor:

why not use the explicit types?

Is it because of forward declaration?

If so we can use quotes, or import from typing future to allow for forward declaration,

"ResponseIterator"

Contributor Author:

it is reverted to use the explicit type ResponseIterator.
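A minimal sketch of the two forward-reference options mentioned above; the class bodies are placeholders that only mirror the real names:

```python
from __future__ import annotations  # lets annotations reference names defined later


class Model:
    # Alternatively, quote the name: -> "ResponseIterator"
    def infer(self) -> ResponseIterator:
        return ResponseIterator()


class ResponseIterator:
    def __iter__(self):
        return iter(())
```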

inference_request.output_memory_allocator,
inference_request.output_memory_type,
).create_tritonserver_response_allocator()
class AsyncResponseIterator:
Contributor:

I see this is a nested class - is that needed?

Contributor Author:

it is not needed; the iterators are moved to _response.py.

from tritonserver._api import _model

if TYPE_CHECKING:
from tritonserver._api._model import Model

from tritonserver._api._allocators import MemoryBuffer
Contributor:

we could rename this as MemoryBuffer and /or put it into Tensor.py as its used there as well and no longer needed in allocators ....

Contributor Author:

yes, the file is renamed to _memorybuffer.py, and the allocator apis are removed as custom allocators are no longer used with the infer api.

tensor = Tensor(data_type, shape, memory_buffer)

if output_memory_type is not None:
tensor = tensor.to_device(output_memory_type)
Contributor:

do we need to check for error here - or will propagate?

Should check original behavior in terms of fallback behavior -

Contributor Author:

the error will be set in the response.

it is updated to raise InvalidArgumentError similar to the original behavior.

# DLPack does not support bytes type.
original_data_type = self.data_type
original_shape = self.shape
if self.data_type == DataType.BYTES:
Contributor:

question - would this remain after we enable device memory on c++ side?

or is temporary - if temporary - we can add a note on when it can be removed?

Contributor Author:

I think this can be permanent, because we (may?) want to support moving bytes tensors between host and device(s), and this change adds that functionality.
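To make the limitation concrete, here is a small sketch of moving a BYTES element through a uint8 view; it assumes Triton's length-prefixed BYTES serialization and is illustrative rather than code from _tensor.py:

```python
import numpy

payload = b"hello"

# DLPack has no bytes/string dtype, so the element travels as a uint8 buffer:
# a 4-byte length prefix followed by the raw bytes (assumed serialization).
serialized = numpy.concatenate([
    numpy.frombuffer(numpy.uint32(len(payload)).tobytes(), dtype=numpy.uint8),
    numpy.frombuffer(payload, dtype=numpy.uint8),
])

# On the receiving side, the original DataType.BYTES content is recovered
# from the uint8 view by re-reading the length prefix.
length = int(serialized[:4].view(numpy.uint32)[0])
assert serialized[4:4 + length].tobytes() == payload
```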

@nnshah1 (Contributor) left a comment:

Nice work!

Will want secondary review as well.

For some of the questions may be easier to sync offline.

kthui added 2 commits January 17, 2025 00:52
* Add TODO to support GPU memory

* Use singleton response allocator

* Move response iterators to _response.py

* Rename _allocators.py to _memorybuffer.py and cleanup

* Cleanup response_queue on request

* Raise exception when unable to move output tensor to requested memory type
* Remove allocator related tests

* Add test for output memory copy to unsupported type will raise exception
Labels
PR: perf A code change that improves performance