[V1] VLM preprocessor hashing #11020
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 545a40a to 3554439
/ready
Overall LGTM. Just some comments for style. Also please
- Revert unnecessary changes in the example code.
- Add some unit tests.
vllm/v1/engine/processor.py (Outdated)
```python
mm_hashes = self.mm_hasher.hash(decoder_inputs.multi_modal_data) \
    if self.mm_hasher is not None else None

mm_inputs, mm_hashes = self.mm_input_mapper_client.process_inputs(
    decoder_inputs.multi_modal_data, mm_hashes,
    decoder_inputs.mm_processor_kwargs)
```
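For context, here is a minimal, hedged sketch of what an MM hasher along these lines could look like. The class name `MMHasher`, the per-item serialization, and the choice of SHA-256 are illustrative assumptions, not the actual vLLM implementation:

```python
import hashlib
from typing import Any


class MMHasher:
    """Sketch: compute one stable content hash per multi-modal item."""

    def hash(self, multi_modal_data: dict[str, Any]) -> list[str]:
        hashes = []
        for modality, items in multi_modal_data.items():
            if not isinstance(items, list):
                items = [items]
            for item in items:
                h = hashlib.sha256()
                # Mix in the modality so identical bytes under different
                # modalities do not collide.
                h.update(modality.encode("utf-8"))
                h.update(self._item_bytes(item))
                hashes.append(h.hexdigest())
        return hashes

    @staticmethod
    def _item_bytes(item: Any) -> bytes:
        # Assumption: items are raw bytes or array-like objects (e.g.
        # NumPy arrays) exposing tobytes(); anything else is rejected.
        if isinstance(item, bytes):
            return item
        if hasattr(item, "tobytes"):
            return item.tobytes()
        raise TypeError(f"cannot hash multi-modal item of type {type(item)}")
```

With a hasher like this, two requests carrying byte-identical images map to the same key, which is what makes the mirrored caching described later in this PR possible.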
Probably need some assertions to make sure mm_hasher and mm_input_mapper_client are not None.
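A hedged sketch of how that suggestion could be applied to the snippet above, assuming caching is enabled on this path so both components are required (attribute names follow the diff):

```python
# Sketch only: fail fast if the caching path is taken without its
# dependencies, instead of silently falling back to mm_hashes = None.
assert self.mm_hasher is not None, "mm_hasher must be initialized"
assert self.mm_input_mapper_client is not None, \
    "mm_input_mapper_client must be initialized"

mm_hashes = self.mm_hasher.hash(decoder_inputs.multi_modal_data)
mm_inputs, mm_hashes = self.mm_input_mapper_client.process_inputs(
    decoder_inputs.multi_modal_data, mm_hashes,
    decoder_inputs.mm_processor_kwargs)
```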
@comaniac thanks for the quick review! I actually realized that I need to push the hash and the cache into _process_multimodal(..) due to the recent merging of the preprocessor for llava; that way the first preprocessor is cached too. Will send changes tomorrow.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 0e26569 to f4466ba
Force-pushed 0c69c14 to 7304434
@comaniac the code is ready for re-review. In the end, I did not push the hasher/cacher into _process_multimodal(..) since it is also used to compute the MM placeholders, and there is no simple way to skip it on a hash hit. We can look into this as a follow-up.
LGTM. Just some nits
Are these changes compatible with v0? If not, we should differentiate them.
Yeah, it should not matter whether it is v0 or v1, since it simply controls the images used.
Force-pushed 185655c to 6a7b37a
Force-pushed 9ec0c15 to 431f5d4
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: Alexander Matveev <[email protected]>
Force-pushed 431f5d4 to adda9d4
This PR adds MM preprocessor hashing and caching for V1.
The idea of MM preprocessor caching is based on having a client and a server, where the client executes in the frontend process (=P0) and the server in the core process (=P1).
The caching for both client and server is mirrored/similar, and this allows us to avoid the serialization of "mm_inputs" (like pixel values) between client (=P0) and server (=P1) processes.
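To make the mirrored design concrete, here is a minimal sketch under assumed names (`MirroredLRUCache`, `client_send`, and `server_receive` are illustrative, not vLLM's actual classes). Both processes run the same eviction policy keyed by the content hash, so after the first transfer the client sends only the hash and the server recovers the preprocessed inputs from its local copy:

```python
from collections import OrderedDict
from typing import Any, Optional


class MirroredLRUCache:
    """Sketch: identical LRU instances on client (P0) and server (P1)."""

    def __init__(self, capacity: int = 256):
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)


def client_send(cache: MirroredLRUCache, mm_hash: str, mm_input: Any) -> dict:
    # On a hit, skip serializing the (large) mm_input; the hash is enough.
    if cache.get(mm_hash) is not None:
        return {"hash": mm_hash, "payload": None}
    # The client only needs membership, not the tensor itself.
    cache.put(mm_hash, True)
    return {"hash": mm_hash, "payload": mm_input}


def server_receive(cache: MirroredLRUCache, msg: dict) -> Any:
    if msg["payload"] is None:
        cached = cache.get(msg["hash"])
        # Safe because both sides apply the same capacity and access order.
        assert cached is not None, "mirrored caches diverged"
        return cached
    cache.put(msg["hash"], msg["payload"])
    return msg["payload"]
```

The key design point is that the hash-only message is only correct because both caches see the same sequence of keys with the same capacity, so an entry evicted on one side is also evicted on the other.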
Currently, MM preprocessor caching is disabled by default, since we have not yet finished the performance analysis. We still need to enable encoder caching and prefix caching end-to-end.
Follow-up PRs: