perf: Python binding inference performance improvement #426
base: main
Conversation
* fix request tensor lifecycle
* remove prints for benchmarking
* remove one h2h response copy
* add allocator delete
* schedule future.set_result to be called in event loop
* Use tensor object in output
* Enable infer() to use the improved binding
* Pass the C pointer directly to Python
* Move output device if differ from request setting
* Copyright and pre-commit
300b0b6 to 131078a
* Fix py_future object lifecycle
* Fix request released after complete final
python/test/test_api.py (outdated)
@@ -137,6 +137,7 @@ def test_memory_fallback_to_cpu(self, server_options):
        tritonserver.default_memory_allocators[tritonserver.MemoryType.GPU] = allocator
Should this be removed, or changed in some way to indicate the allocator is internal?
Yes, updated the test to indicate the allocator is internal and that it will always use CPU memory regardless of the backend memory preference.
python/test/test_api.py (outdated)
@@ -164,6 +165,7 @@ def test_memory_allocator_exception(self, server_options):
        ):
            pass

    @pytest.mark.skip(reason="Skipping test, infer no longer use allocator")
We should keep this but refactor it: if the user requests output memory type GPU and that is not supported by the internal allocator, we would still want to raise an exception during inference.
Yes, we should raise an exception if the output memory type specified on the request is not supported, but currently the bindings do not accept a requested output memory type, so I think we can skip this test for now and add proper testing after adding support for allocating GPU memory.
Hmm, but what if we fail because, say, cupy is not available? Can we still make sure the right error gets propagated?
yes, the "test_unsupported_memory_type" is repurposed for testing moving outputs to unsupported memory type
python/test/test_api.py (outdated)
@@ -418,6 +420,9 @@ def test_ready(self, server_options):
        server = tritonserver.Server(server_options).start()
        assert server.ready()

    @pytest.mark.skip(
        reason="Skipping test, some request/response object may not be released which may cause server stop to fail"
Can we use xfail instead of skip?
@rmccorm4 - what do you think? Given that we had spent time on fixing this, how much of an issue is this?
Yes, switched to xfail.
server.infer_async(request, trace)

# [FIXME] WAR due to trace lifecycle is tied to response in Triton core,
# trace reference should drop on response send..
res = response_queue.get(block=True, timeout=10)
future = concurrent.futures.Future()
request.get_next_response(future)
Question: why does get_next_response not return a future, instead of taking one in?
This workflow is mainly centered around the asyncio future, which requires the presence of a running loop when it is created, for example:
loop = asyncio.get_running_loop()
future = loop.create_future()
Calling asyncio.get_running_loop() outside the async method where the future will be awaited may end up not finding the loop, or finding the wrong loop, so I think it is more robust to have the AsyncIterator create the future and then pass it into the binding.
The Iterator future simply follows the async routine for a more alike interface, but in the future I think we can improve it by having the async get_next_response() return an awaitable, and the normal one simply block until there is a response.
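For concreteness, a runnable sketch of the pattern described above; the AsyncResponseIterator shape, the end-of-stream marker, and FakeRequest are assumptions for illustration, with only get_next_response(future) taken from the binding interface discussed here:

```python
import asyncio
import threading


class FakeRequest:
    """Stand-in for the bound request wrapper; the real binding completes
    the future from a C++ callback via call_soon_threadsafe."""

    def __init__(self, responses):
        self._responses = list(responses)

    def get_next_response(self, future):
        response = self._responses.pop(0) if self._responses else None
        loop = future.get_loop()
        # Simulate completion from a non-event-loop thread, like the binding does.
        threading.Thread(
            target=loop.call_soon_threadsafe, args=(future.set_result, response)
        ).start()


class AsyncResponseIterator:
    """Sketch: the iterator creates the future inside the running loop, so it
    is bound to the loop that will await it, then hands it to the binding."""

    def __init__(self, request):
        self._request = request

    def __aiter__(self):
        return self

    async def __anext__(self):
        loop = asyncio.get_running_loop()  # guaranteed to exist here
        future = loop.create_future()      # tied to the awaiting loop
        self._request.get_next_response(future)
        response = await future
        if response is None:               # assumed end-of-stream marker
            raise StopAsyncIteration
        return response


async def main():
    async for response in AsyncResponseIterator(FakeRequest(["r0", "r1"])):
        print(response)


asyncio.run(main())
```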
) = out
ctypes_buffer = ctypes.create_string_buffer(byte_size)
ctypes.memmove(ctypes_buffer, out_buffer, byte_size)
numpy_buffer = numpy.frombuffer(ctypes_buffer, dtype=numpy.byte)
Who owns the memory in this case - does it become owned by numpy or by ctypes_buffer? Just for my understanding.
The out_buffer is owned by the res object. The ctypes_buffer is a copy of the out_buffer, and the ctypes_buffer owns that copy itself.
The numpy_buffer is not entirely clear from the docs, but from my experiment it references the ctypes_buffer, as the numpy_buffer is marked as read-only and cannot interface with DLPack without a copy.
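A self-contained sketch of the ownership chain described above; `source` stands in for the raw response buffer that the res object owns in the real binding:

```python
import ctypes
import numpy

# Stand-in for the raw response buffer owned by `res` in the binding.
source = b"example response bytes"
byte_size = len(source)
out_buffer = ctypes.create_string_buffer(source, byte_size)

# Copy the bytes into a buffer owned purely by Python, so it remains
# valid after the response object is released.
ctypes_buffer = ctypes.create_string_buffer(byte_size)
ctypes.memmove(ctypes_buffer, out_buffer, byte_size)

# numpy.frombuffer does not copy: the array holds a reference to
# ctypes_buffer, keeping it alive for as long as the array exists.
numpy_buffer = numpy.frombuffer(ctypes_buffer, dtype=numpy.byte)

ctypes_buffer[0] = b"X"
print(numpy_buffer[:1].tobytes())  # b'X' - the change is visible, so no copy was made
```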
@@ -31,6 +31,7 @@
#include <triton/core/tritonserver.h>

#include <iostream>
#include <mutex>
oh no, one of my least favorite things to think about :)
😅
* Indicate allocator is internal on test_memory_fallback_to_cpu()
* Remove test_memory_allocator_exception() as infer no longer use custom allocators
* Update reason for skipping test_unsupported_memory_type()
* Mark test_stop() with xfail instead of skip
const char* tensor_name, size_t* byte_size,
TRITONSERVER_MemoryType* memory_type, int64_t* memory_type_id)
{
  *memory_type = TRITONSERVER_MEMORY_CPU;
Can we add a TODO to support GPU?
Yes, it is added.
TRITONSERVER_MemoryType* actual_memory_type,
int64_t* actual_memory_type_id)
{
  *buffer = malloc(byte_size * sizeof(uint8_t));
Are there tritonserver APIs we should be using for memory allocation instead of calling malloc directly?
I don't think there is one, because there are multiple calls to malloc in the core code base for different purposes, e.g.
https://github.com/triton-inference-server/core/blob/r24.12/src/backend_memory_manager.cc#L98
https://github.com/triton-inference-server/core/blob/r24.12/src/backend_model_instance.cc#L73
https://github.com/triton-inference-server/core/blob/r24.12/src/cache_manager.cc#L95
There is one on the server repo, but it is out of reach from the core binding:
https://github.com/triton-inference-server/server/blob/r24.12/src/memory_alloc.cc#L96
*actual_memory_type = TRITONSERVER_MEMORY_CPU;
*actual_memory_type_id = 0;
// The response allocator needs to be kept alive until the allocated memory
// is released via the release function.
Is this true? Question: why not an allocator singleton for CPU, and reuse it for all requests?
Yes, the core will need the allocator opaque object for finding the set release function. The allocator opaque object is basically this class, which stores the release function pointer.
Yes, updated to use a singleton allocator object for all instances of the request wrapper.
int64_t memory_type_id)
{
  free(buffer);
  // Release ownership of the response allocator.
Thinking a singleton could be simpler and avoid alloc/dealloc of the allocator object itself, if possible.
Yes, updated to use a singleton response allocator.
// Release ownership of the response allocator.
std::unique_ptr<std::shared_ptr<struct TRITONSERVER_ResponseAllocator>>
    allocator_ptr(reinterpret_cast<
        std::shared_ptr<struct TRITONSERVER_ResponseAllocator>*>(
Actually, what do we use the buffer_userp for?
It was used to store the shared pointer to the response allocator object, which is passed to the release function so the reference count can be incremented on malloc and decremented on free; that lets the allocator be destructed only after the last allocated response buffer is freed.
But it is no longer necessary with the singleton allocator.
if (response_future_.get() != nullptr) {
  throw AlreadyExistsError("cannot call GetNextResponse concurrently");
}
response_future_.reset(new py::object(py_future));
Not sure I follow the logic here - why do we create a new object based on the passed-in object? Why not a reference on the original object?
Second: how does the swap below work?
I tried holding a raw pointer to the original object, but the pointer can no longer be dereferenced (segmentation fault) after GetNextResponse() returns, so the only safe way of holding on to the future object after GetNextResponse() returns is to increment the reference count on the future object, which is achieved by "copying" it into a new py::object.
The reason for using a unique pointer is to have a mechanism for checking whether there is a future object pending to be set with the next response without accessing any Python variable, to avoid holding the GIL for that simple check. It could be achieved with a simple bool flag, but I think it is safer to have one variable than two, to avoid setting one variable but forgetting about the other.
{
  py::gil_scoped_acquire gil;
  std::unique_ptr<py::object> response_future_local(nullptr);
  response_future_.swap(response_future_local);
Not sure I follow this - we create a new py object from null, swap it with the stored one (releasing it?), then set the result on the nullptr-based object?
Or... I guess swap works the opposite way :) We swap the null ptr for the one stored in the future...
The guarantee here that GetNextResponse() will not be called and receive a new future object (vs. this future object) is that this future object is not yet set with a result. So as soon as this future object is set with a result, we need to be ready for the next future object from GetNextResponse() that may take the place of this one. Thus, the solution here is to move the current future object from the class member into a local variable, making the class member ready for the next future object before the current future object is set with a result.
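A Python analogue of the ordering described above, with hypothetical names; the real code does this in C++ with a unique_ptr<py::object> under the GIL, but the move-out-before-set_result idea is the same:

```python
import threading


class ResponseSlot:
    """Hypothetical analogue: the pending future is moved out of the shared
    slot *before* its result is set, so get_next_response() can register the
    next future as soon as the current one completes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = None  # at most one future waiting for a response

    def get_next_response(self, future):
        with self._lock:
            if self._pending is not None:
                raise RuntimeError("cannot call get_next_response concurrently")
            self._pending = future

    def deliver(self, response):
        with self._lock:
            pending, self._pending = self._pending, None  # move out first
        if pending is not None:
            pending.set_result(response)  # the slot is already free for the next call
```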
    py_response.second);
if (py::hasattr(py_future, "get_loop")) {
  py_future.attr("get_loop")().attr("call_soon_threadsafe")(
      py_future.attr("set_result"), std::move(py_res));
Neat! So this supports async and non-async?
Yes.
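For reference, a small runnable Python sketch of the dispatch the C++ snippet above performs, assuming only that asyncio futures expose get_loop() while concurrent.futures futures do not:

```python
import asyncio
import concurrent.futures
import threading


def set_result_threadsafe(future, result):
    """Asyncio futures must be completed on their own event loop;
    concurrent.futures futures can be completed from any thread."""
    if hasattr(future, "get_loop"):
        # asyncio.Future: schedule set_result on the loop that owns it.
        future.get_loop().call_soon_threadsafe(future.set_result, result)
    else:
        # concurrent.futures.Future: thread-safe to set directly.
        future.set_result(result)


async def async_demo():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    threading.Thread(target=set_result_threadsafe, args=(fut, "async result")).start()
    print(await fut)


def sync_demo():
    fut = concurrent.futures.Future()
    threading.Thread(target=set_result_threadsafe, args=(fut, "sync result")).start()
    print(fut.result(timeout=5))


asyncio.run(async_demo())
sync_demo()
```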
    inference_request.response_queue,
    raise_on_error,
)
self._server.infer_async(request)
How do we pass in the requested memory type to the C++ side?
I think we can add a new binding function that allows setting a requested output memory type on the request wrapper object; the response allocation function can then read the requested memory type from the request wrapper object and allocate the correct memory type.
python/tritonserver/_api/_model.py (outdated)

def infer(
    self,
    inference_request: Optional[InferenceRequest] = None,
    raise_on_error: bool = True,
    **kwargs: Unpack[InferenceRequest],
-) -> ResponseIterator:
+) -> Iterable:
Why not use the explicit types? Is it because of forward declaration? If so, we can use quotes ("ResponseIterator"), or a __future__ annotations import, to allow for forward declaration.
It is reverted to use the explicit type ResponseIterator.
python/tritonserver/_api/_model.py (outdated)
    inference_request.output_memory_allocator,
    inference_request.output_memory_type,
).create_tritonserver_response_allocator()
class AsyncResponseIterator:
I see this is a nested class - is that needed?
It is not needed; the iterators are moved to _response.py.
from tritonserver._api import _model

if TYPE_CHECKING:
    from tritonserver._api._model import Model

from tritonserver._api._allocators import MemoryBuffer
We could rename this as MemoryBuffer and/or put it into Tensor.py, as it is used there as well and no longer needed in allocators...
Yes, the file is renamed to _memorybuffer.py, and the allocator APIs are removed, as custom allocators are no longer used with the infer API.
tensor = Tensor(data_type, shape, memory_buffer)

if output_memory_type is not None:
    tensor = tensor.to_device(output_memory_type)
Do we need to check for an error here, or will it propagate? We should check the original behavior in terms of fallback behavior.
The error will be set in the response. It is updated to raise InvalidArgumentError, similar to the original behavior.
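A minimal sketch of the described behavior, assuming the Tensor.to_device and InvalidArgumentError names used in this thread; the actual exception wrapping in the PR may differ:

```python
import tritonserver


def move_output_to_requested_memory(tensor, output_memory_type):
    """Try to move an output tensor to the memory type requested on the
    inference request, raising InvalidArgumentError if the move fails
    (for example, GPU requested but cupy is not installed)."""
    if output_memory_type is None:
        return tensor
    try:
        return tensor.to_device(output_memory_type)
    except Exception as error:
        raise tritonserver.InvalidArgumentError(
            f"failed to move output tensor to {output_memory_type}: {error}"
        ) from error
```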
# DLPack does not support bytes type.
original_data_type = self.data_type
original_shape = self.shape
if self.data_type == DataType.BYTES:
Question: would this remain after we enable device memory on the C++ side, or is it temporary? If temporary, we can add a note on when it can be removed.
I think this can be permanent, because we (may?) want to support moving bytes tensors between host and device(s), and this adds that functionality.
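A rough illustration of the reinterpret-move-restore idea the diff hints at, using a fixed-width NumPy bytes dtype; the real Tensor implementation and Triton's BYTES serialization differ from this standalone example:

```python
import numpy


def move_bytes_array_via_uint8(bytes_array, move_fn):
    """DLPack has no BYTES/string dtype, so reinterpret the serialized
    buffer as uint8 before moving it, then restore dtype and shape.
    `move_fn` stands in for a host/device transfer (e.g. a DLPack round
    trip); here it just receives and returns a 1-D uint8 array."""
    original_dtype = bytes_array.dtype
    original_shape = bytes_array.shape
    flat_uint8 = bytes_array.view(numpy.uint8).reshape(-1)  # reinterpret, no copy
    moved = move_fn(flat_uint8)                             # transferable via DLPack
    return moved.view(original_dtype).reshape(original_shape)


# Example: a fixed-width bytes array round-trips unchanged.
arr = numpy.array([b"abc", b"de"], dtype="S3")
out = move_bytes_array_via_uint8(arr, lambda x: x.copy())
print(out)  # [b'abc' b'de']
```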
Nice work!
Will want a secondary review as well.
For some of the questions it may be easier to sync offline.
* Add TODO to support GPU memory
* Use singleton response allocator
* Move response iterators to _response.py
* Rename _allocators.py to _memorybuffer.py and cleanup
* Cleanup response_queue on request
* Raise exception when unable to move output tensor to requested memory type
* Remove allocator related tests
* Add test for output memory copy to unsupported type will raise exception
What does the PR do?
Refactor the infer() and async_infer() APIs to handle memory allocation and callbacks internally in C++, and only expose the basic interface for the Python iterator to fetch responses.
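For orientation, a minimal usage sketch of the refactored API from the caller's point of view; the model repository path and the "identity" model name are placeholders, and the exact call signatures should be checked against the tritonserver package rather than taken from this sketch:

```python
import numpy
import tritonserver

# Start the in-process server and look up a model (path and name are placeholders).
server = tritonserver.Server(model_repository="/path/to/models").start(wait_until_ready=True)
model = server.model("identity")

# infer() returns an iterator of responses; output memory allocation is now
# handled inside the C++ binding, so no custom allocator is passed here.
for response in model.infer(inputs={"INPUT0": numpy.array([[1, 2, 3]], dtype=numpy.int32)}):
    for name, tensor in response.outputs.items():
        print(name, numpy.from_dlpack(tensor))

server.stop()
```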
Checklist
<commit_type>: <Title>
Commit Type: check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
triton-inference-server/server#7949
Where should the reviewer start?
Start with tritonserver_pybind.cc for the interface change, then move on to _model.py for how the Python iterator interacts with the interface. Finally, move on to _request.py and _response.py for how they support the Python iterator.
For testing, start with test_binding.py and test_api.py, and then _tensor.py for the DLPack limitation regarding bytes.
Test plan:
The existing L0_python_api is sufficient for catching any regression from this performance improvement. It is modified to test against the new interface.
Caveats:
Users are no longer able to specify custom allocators.
Currently, only CPU memory output is supported at the binding level, so GPU memory output will involve an extra D2H copy at the backend and an H2D copy at the frontend. This will be resolved as a follow-up.
The test_stop failure will have to be triaged and fixed as a follow-up.
Background
N/A
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
N/A