perf: Python binding inference performance improvement #426

Open · wants to merge 10 commits into base: main

Conversation

@kthui (Contributor) commented on Jan 11, 2025

What does the PR do?

Refactor infer() and async_infer() APIs to handle memory allocation and callbacks internally in C++, and only expose the basic interface for the Python iterator to fetch responses.
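For orientation, here is a rough sketch of the caller-facing flow this change targets; the option, model name, and input values below are illustrative placeholders and reflect my understanding of the in-process Python API rather than code from this PR:

```python
import tritonserver

# Illustrative setup; adjust the repository path and model name to your environment.
options = tritonserver.Options(model_repository="/path/to/models")
server = tritonserver.Server(options).start()

# infer() still hands the caller a response iterator; the difference after this PR
# is that memory allocation and response callbacks happen inside the C++ binding
# rather than in Python.
for response in server.model("identity").infer(inputs={"INPUT0": [[1, 2, 3, 4]]}):
    print(response.outputs)
```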

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

triton-inference-server/server#7949

Where should the reviewer start?

Start with tritonserver_pybind.cc for the interface change, then move on to _model.py to see how the Python iterator interacts with the interface. Finally, look at _request.py and _response.py to see how they support the Python iterator.

For testing, start with test_binding.py and test_api.py, and then _tensor.py for the DLPack limitation regarding bytes.

Test plan:

The existing L0_python_api test is sufficient to catch any regression from this performance improvement. It has been modified to exercise the new interface.

  • CI Pipeline ID: 22617754

Caveats:

Users are no longer able to specify a custom:

  • request release callback
  • response allocator
  • response callback

Currently, only CPU memory output is supported at the binding level, so GPU memory output will involve an extra D2H copy at the backend and an H2D copy at the frontend. This will be resolved as a follow-up.

The test_stop failure will have to be triaged and fixed as a follow-up.

Background

N/A

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

* fix request tensor lifecycle

* remove prints for benchmarking

* remove one h2h response copy

* add allocator delete

* schedule future.set_result to be called in event loop
@kthui added the "PR: perf (A code change that improves performance)" label on Jan 11, 2025
@kthui self-assigned this on Jan 11, 2025
* Use tensor object in output

* Enable infer() to use the improved binding

* Pass the C pointer directly to Python

* Move output device if differ from request setting

* Copyright and pre-commit
@kthui force-pushed the jacky-py-res-callback branch from 300b0b6 to 131078a on January 11, 2025 02:52
@kthui marked this pull request as ready for review on January 15, 2025 19:43
@@ -137,6 +137,7 @@ def test_memory_fallback_to_cpu(self, server_options):

tritonserver.default_memory_allocators[tritonserver.MemoryType.GPU] = allocator
Contributor:

should this be removed - or changed in some way to indicate the allocator is internal ....

Contributor Author:

yes, updated the test to indicate the allocator is internal, and it will always use CPU memory regardless of the backend memory preference.

@@ -164,6 +165,7 @@ def test_memory_allocator_exception(self, server_options):
):
pass

@pytest.mark.skip(reason="Skipping test, infer no longer use allocator")
Contributor:

we should keep this but refactor - if user requests output memory type gpu and that is not supported by the internal allocator - we would still want to raise an exception during inference

Contributor Author:

yes, we should raise an exception if the output memory type specified on the request is not supported, but currently the binding does not accept a requested output memory type, so I think we can skip this test for now and add proper testing after adding support for allocating GPU memory.

Contributor:

hmm, but if we fail because say cupy is not available? Can we still make sure the right error gets propagated?

Contributor Author:

yes, the "test_unsupported_memory_type" is repurposed for testing moving outputs to unsupported memory type

@@ -418,6 +420,9 @@ def test_ready(self, server_options):
server = tritonserver.Server(server_options).start()
assert server.ready()

@pytest.mark.skip(
reason="Skipping test, some request/response object may not be released which may cause server stop to fail"
Contributor:

Can we use xfail instead of skip?

@rmccorm4 - what do you think - as we had spent time on fixing this - how much an issue is this?

Contributor Author:

yes, switched to xfail.
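For reference, a minimal sketch of the marker that replaces the skip; the reason string is paraphrased from the discussion above:

```python
import pytest


@pytest.mark.xfail(
    reason="some request/response object may not be released, which may cause server stop to fail"
)
def test_stop():
    ...
```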

server.infer_async(request, trace)

# [FIXME] WAR due to trace lifecycle is tied to response in Triton core,
# trace reference should drop on response send..
res = response_queue.get(block=True, timeout=10)
future = concurrent.futures.Future()
request.get_next_response(future)
Contributor:

question - why does get_next_response not return a future, instead of taking one in?

Contributor Author:

This workflow is mainly centered around the asyncio future, which requires a running loop to be present when it is created, for example:

loop = asyncio.get_running_loop()
future = loop.create_future()

Calling asyncio.get_running_loop() outside the async method in which the future will be awaited may end up not finding the loop, or finding the wrong one, so I think it is more robust to have the AsyncIterator create the future and then pass it into the binding.

The Iterator future simply follows the async routine for a more similar interface, but in the future I think we can improve it by having the async get_next_response() return an awaitable and the normal one simply block until there is a response.
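A small, self-contained sketch of the pattern described above; fake_binding is a hypothetical stand-in for the C++ get_next_response() binding, not part of this PR:

```python
import asyncio
import threading


def fake_binding(future):
    """Stand-in for the binding: a worker thread delivers the response later."""
    def deliver():
        # A result must be set on the future's own loop thread.
        future.get_loop().call_soon_threadsafe(future.set_result, "response")
    threading.Thread(target=deliver).start()


async def get_next_response():
    # Create the future while the awaiting loop is running, then hand it to the
    # binding, which sets the result once a response arrives.
    future = asyncio.get_running_loop().create_future()
    fake_binding(future)
    return await future


print(asyncio.run(get_next_response()))  # -> "response"
```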

) = out
ctypes_buffer = ctypes.create_string_buffer(byte_size)
ctypes.memmove(ctypes_buffer, out_buffer, byte_size)
numpy_buffer = numpy.frombuffer(ctypes_buffer, dtype=numpy.byte)
Contributor:

who owns the memory in this case -

does it become owned by numpy or ctypes_buffer ... just for my understanding

Contributor Author:

The out_buffer is owned by the res object. The ctypes_buffer is a copy of the out_buffer, and ctypes_buffer owns that copy itself.

The numpy_buffer behavior is not entirely clear from the docs, but from my experiments it references the ctypes_buffer: the numpy_buffer is marked as read-only and cannot interface with DLPack without a copy.
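A small sketch illustrating the ownership described above; the source array stands in for out_buffer, which the real code reads from the response:

```python
import ctypes
import numpy

source = (ctypes.c_ubyte * 4)(1, 2, 3, 4)        # stand-in for out_buffer (owned by res)

ctypes_buffer = ctypes.create_string_buffer(4)   # owns its own 4 bytes
ctypes.memmove(ctypes_buffer, source, 4)         # explicit copy out of the response buffer

numpy_buffer = numpy.frombuffer(ctypes_buffer, dtype=numpy.byte)  # a view, not a copy

ctypes_buffer[0] = b"\x09"
assert numpy_buffer[0] == 9                      # the numpy array references ctypes_buffer
```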

@@ -31,6 +31,7 @@
#include <triton/core/tritonserver.h>

#include <iostream>
#include <mutex>
Contributor:

oh no, one of my least favorite things to think about :)

Contributor Author:

😅

* Indicate allocator is internal on test_memory_fallback_to_cpu()

* Remove test_memory_allocator_exception() as infer no longer use custom allocators

* Update reason for skipping test_unsupported_memory_type()

* Mark test_stop() with xfail instead of skip
const char* tensor_name, size_t* byte_size,
TRITONSERVER_MemoryType* memory_type, int64_t* memory_type_id)
{
*memory_type = TRITONSERVER_MEMORY_CPU;
Contributor:

can we add a todo to support GPU?

Contributor Author:

yes, it is added

TRITONSERVER_MemoryType* actual_memory_type,
int64_t* actual_memory_type_id)
{
*buffer = malloc(byte_size * sizeof(uint8_t));
Contributor:

are there tritonserver apis we should be using for memory allocation instead of directly calling malloc?

*actual_memory_type = TRITONSERVER_MEMORY_CPU;
*actual_memory_type_id = 0;
// The response allocator needs to be kept alive until the allocated memory
// is released via the release function.
Contributor:

is this true? - question: why not an allocator singleton for cpu - and reuse it for all requests?

Contributor Author (@kthui, Jan 17, 2025):

yes, the core will need the allocator opaque object to find the release function that was set. The allocator opaque object is basically this class, which stores the release function pointer.

yes, updated to use singleton allocator object for all instances of the request wrapper.

int64_t memory_type_id)
{
free(buffer);
// Release ownership of the response allocator.
Contributor:

thinking a singleton could be simpler and avoid alloc / dealloc of the allocator object itself - if possible.

Contributor Author:

yes, updated to use a singleton response allocator.

// Release ownership of the response allocator.
std::unique_ptr<std::shared_ptr<struct TRITONSERVER_ResponseAllocator>>
allocator_ptr(reinterpret_cast<
std::shared_ptr<struct TRITONSERVER_ResponseAllocator>*>(
Contributor:

actually - what do we use the buffer_userp for?

Contributor Author:

it was used to store the shared pointer to the response allocator object and was passed to the release function, so the reference count could be increased/decreased along with malloc/free; that ensured the allocator was destructed only after the last allocated response memory was freed.

but it is no longer necessary with the singleton allocator.

if (response_future_.get() != nullptr) {
throw AlreadyExistsError("cannot call GetNextResponse concurrently");
}
response_future_.reset(new py::object(py_future));
Contributor:

not sure I follow the logic here - why do we create a new object based on the passed in object? why not a reference on the original object?

second: how does the swap below work?

Contributor Author (@kthui, Jan 17, 2025):

I tried holding a raw pointer to the original object, but the pointer could no longer be dereferenced (segmentation fault) after GetNextResponse() returned, so the only safe way of holding on to the future object after GetNextResponse() returns is to increment the reference count on the future object, which is achieved by "copying" it into a new object.

The reason for using a unique pointer is to have a mechanism for checking whether a future object is pending to be set with the next response without accessing any Python variable, to avoid holding the GIL for that simple check. It could be achieved with a simple bool flag, but I think it is safer to have one variable instead of two, to avoid setting one and forgetting the other.

{
py::gil_scoped_acquire gil;
std::unique_ptr<py::object> response_future_local(nullptr);
response_future_.swap(response_future_local);
Contributor:

not sure I follow this - we create a new py object from null, swap it with the stored one (releasing it?) - then set result on the nullptr based object?...

Or ... I guess swap works the opposite way :)

we swap the null ptr for the one stored in the future ....

Contributor Author:

The guarantee here that GetNextResponse() will not be called and receive a new future object (vs. this future object) is that this future object is not yet set with a result. So as soon as this future object is set with a result, we need to be ready for the next future object from GetNextResponse() that may take its place. Thus, the solution here is to move the current future object from the class member into a local variable, making the class member ready for the next future object before the current one is set with a result.

py_response.second);
if (py::hasattr(py_future, "get_loop")) {
py_future.attr("get_loop")().attr("call_soon_threadsafe")(
py_future.attr("set_result"), std::move(py_res));
Contributor:

neat! So this support async and non async?

Contributor Author:

yes
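For reference, a Python rendering of the dispatch shown in the C++ snippet above; set_result_threadsafe is an illustrative name, not a function in this PR:

```python
import concurrent.futures


def set_result_threadsafe(py_future, result):
    if hasattr(py_future, "get_loop"):
        # asyncio.Future exposes get_loop(); its result must be set on the
        # event loop's own thread, so hop onto the loop first.
        py_future.get_loop().call_soon_threadsafe(py_future.set_result, result)
    else:
        # concurrent.futures.Future is safe to complete from any thread.
        py_future.set_result(result)


sync_future = concurrent.futures.Future()
set_result_threadsafe(sync_future, "response")
assert sync_future.result() == "response"
```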

inference_request.response_queue,
raise_on_error,
)
self._server.infer_async(request)
Contributor:

how do we pass in the requested memory type to the C++ side?

Contributor Author:

I think we can add a new binding function to the request wrapper that allows setting a requested output memory type on the request wrapper object; the response allocation function can then read the requested memory type from the request wrapper object and allocate the correct memory type.


def infer(
self,
inference_request: Optional[InferenceRequest] = None,
raise_on_error: bool = True,
**kwargs: Unpack[InferenceRequest],
) -> ResponseIterator:
) -> Iterable:
Contributor:

why not use the explicit types?

Is it because of forward declaration?

If so we can use quotes, or import from typing future to allow for forward declaration,

"ResponseIterator"

Contributor Author:

it is reverted to use the explicit type ResponseIterator.
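A minimal sketch of the two forward-reference options mentioned above; the class bodies are placeholders that only mirror the real names:

```python
from __future__ import annotations  # lets annotations reference names defined later


class Model:
    # Alternatively, quote the name: -> "ResponseIterator"
    def infer(self) -> ResponseIterator:
        return ResponseIterator()


class ResponseIterator:
    def __iter__(self):
        return iter(())
```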

inference_request.output_memory_allocator,
inference_request.output_memory_type,
).create_tritonserver_response_allocator()
class AsyncResponseIterator:
Contributor:

I see this is a nested class - is that needed?

Contributor Author:

it is not needed; the iterators are moved to _response.py.

from tritonserver._api import _model

if TYPE_CHECKING:
from tritonserver._api._model import Model

from tritonserver._api._allocators import MemoryBuffer
Contributor:

we could rename this as MemoryBuffer and /or put it into Tensor.py as its used there as well and no longer needed in allocators ....

Contributor Author:

yes, the file is renamed to _memorybuffer.py, and the allocator apis are removed as custom allocators are no longer used with the infer api.

tensor = Tensor(data_type, shape, memory_buffer)

if output_memory_type is not None:
tensor = tensor.to_device(output_memory_type)
Contributor:

do we need to check for error here - or will propagate?

Should check original behavior in terms of fallback behavior -

Contributor Author:

the error will be set in the response.

it is updated to raise InvalidArgumentError similar to the original behavior.

# DLPack does not support bytes type.
original_data_type = self.data_type
original_shape = self.shape
if self.data_type == DataType.BYTES:
Contributor:

question - would this remain after we enable device memory on c++ side?

or is temporary - if temporary - we can add a note on when it can be removed?

Contributor Author:

I think this can be permanent, because we (may?) want to support moving bytes tensors between host and device(s), and this change adds that functionality.
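To make the limitation concrete, here is a small sketch of moving a BYTES element through a uint8 view; it assumes Triton's length-prefixed BYTES serialization and is illustrative rather than code from _tensor.py:

```python
import numpy

payload = b"hello"

# DLPack has no bytes/string dtype, so the element travels as a uint8 buffer:
# a 4-byte length prefix followed by the raw bytes (assumed serialization).
serialized = numpy.concatenate([
    numpy.frombuffer(numpy.uint32(len(payload)).tobytes(), dtype=numpy.uint8),
    numpy.frombuffer(payload, dtype=numpy.uint8),
])

# On the receiving side, the original DataType.BYTES content is recovered
# from the uint8 view by re-reading the length prefix.
length = int(serialized[:4].view(numpy.uint32)[0])
assert serialized[4:4 + length].tobytes() == payload
```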

@nnshah1 (Contributor) left a comment:

Nice work!

Will want secondary review as well.

For some of the questions may be easier to sync offline.

kthui added 2 commits January 17, 2025 00:52
* Add TODO to support GPU memory

* Use singleton response allocator

* Move response iterators to _response.py

* Rename _allocators.py to _memorybuffer.py and cleanup

* Cleanup response_queue on request

* Raise exception when unable to move output tensor to requested memory type
* Remove allocator related tests

* Add test for output memory copy to unsupported type will raise exception
Labels
PR: perf A code change that improves performance