[LoRA] Change lora_tokenizers capacity #10796
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: Xin Yang <[email protected]>
Thanks for your contribution, I will look at this PR tomorrow or the day after tomorrow.
Thank you very much for your contribution. Could we pass max_loras as a keyword argument? This way the overall changes would be minimal.
@jeejeelee Thanks for reviewing. There's already **tokenizer_config (https://github.com/vllm-project/vllm/blob/main/vllm/transformers_utils/tokenizer_group/tokenizer_group.py#L18), do you mean adding max_loras to **tokenizer_config?
Yeah
OK |
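For illustration, here is a minimal sketch of the keyword-argument route discussed above. It is not the actual vLLM TokenizerGroup; ToyTokenizerGroup and its attributes are made-up placeholders. The point is that max_loras can ride along in the existing **tokenizer_config catch-all and be popped out before the remaining kwargs are forwarded to the Hugging Face tokenizer, so the constructor signature does not change.

```python
# Illustrative sketch only -- not the actual vLLM TokenizerGroup. It shows
# how max_loras can be passed through the existing **tokenizer_config
# kwargs without changing the constructor signature.
from typing import Optional


class ToyTokenizerGroup:
    def __init__(self, tokenizer_id: str, enable_lora: bool,
                 max_num_seqs: int, max_input_length: Optional[int],
                 **tokenizer_config):
        # Pop max_loras so it is not forwarded as a Hugging Face
        # tokenizer kwarg; None means the caller did not set a limit.
        self.max_loras: Optional[int] = tokenizer_config.pop("max_loras", None)
        self.max_num_seqs = max_num_seqs
        self.tokenizer_config = tokenizer_config  # remaining HF kwargs


# The caller adds one keyword; everything else stays the same.
group = ToyTokenizerGroup("some-base-model", enable_lora=True,
                          max_num_seqs=32, max_input_length=None,
                          max_loras=50)
print(group.max_loras)         # 50
print(group.tokenizer_config)  # {}
```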
Force-pushed from 1053c54 to 2f362f9
Signed-off-by: Xin Yang <[email protected]>
@jeejeelee I have updated the code, please review.
Force-pushed from 5defaac to d9994b0
Signed-off-by: Xin Yang <[email protected]>
Overall LGTM, thanks for your contribution and patience.
Currently, the lora_tokenizers capacity is max_num_seqs. This is fine when max_loras is smaller than max_num_seqs.
But if max_loras is greater than max_num_seqs, lora_tokenizers should have capacity=max_loras, just as the active_adapters LoRALRUCache has capacity=max_loras. Otherwise, items in active_adapters are not evicted while items in lora_tokenizers are, so tokenizers get evicted unnecessarily and performance degrades.
In our benchmark, we tested max_loras=50 and max_num_seqs=32, with the 50 adapters invoked in an evenly distributed pattern. We found this caused a 50% degradation in P90 latency compared to the case where tokenizers are not evicted.
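As a toy reproduction of the mismatch described above (hypothetical code, not taken from vLLM): two small LRU caches are fed the same evenly distributed stream of 50 LoRA IDs, one sized like the adapter cache (max_loras) and one sized like the current tokenizer cache (max_num_seqs). The larger cache never evicts, while the smaller one evicts on essentially every request once it is warm.

```python
# Toy demonstration of the eviction mismatch (not vLLM code).
from collections import OrderedDict


class TinyLRU:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()
        self.evictions = 0

    def get_or_put(self, key):
        if key in self.data:
            self.data.move_to_end(key)      # mark as most recently used
            return
        if len(self.data) >= self.capacity:
            self.data.popitem(last=False)   # evict least recently used
            self.evictions += 1
        self.data[key] = object()           # stand-in for a tokenizer/adapter


max_loras, max_num_seqs = 50, 32
adapters = TinyLRU(capacity=max_loras)       # like active_adapters
tokenizers = TinyLRU(capacity=max_num_seqs)  # like lora_tokenizers today

# 50 adapters invoked round-robin, as in the benchmark above.
for step in range(10 * max_loras):
    lora_id = step % max_loras
    adapters.get_or_put(lora_id)
    tokenizers.get_or_put(lora_id)

print(adapters.evictions)    # 0   -> adapters stay cached
print(tokenizers.evictions)  # >0  -> tokenizers churn on nearly every request
```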
So this PR changes the lora_tokenizers capacity to max(max_loras, max_num_seqs).
Another option is to add a separate max_lora_tokenizers setting, just as max_loras and max_cpu_loras control the GPU cache and CPU cache respectively, but this might be overkill.
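For clarity, here is the sizing rule proposed above written as a standalone sketch. The helper function is hypothetical (the actual change computes the capacity where the lora_tokenizers cache is created), but the formula matches the description: max(max_loras, max_num_seqs), falling back to max_num_seqs when no LoRA limit is set.

```python
from typing import Optional


def lora_tokenizer_capacity(max_num_seqs: int,
                            max_loras: Optional[int]) -> int:
    """Hypothetical helper illustrating the proposed sizing rule."""
    if not max_loras:                    # LoRA disabled or limit not set
        return max_num_seqs
    return max(max_loras, max_num_seqs)  # never smaller than either limit


print(lora_tokenizer_capacity(max_num_seqs=32, max_loras=50))    # 50
print(lora_tokenizer_capacity(max_num_seqs=256, max_loras=8))    # 256
print(lora_tokenizer_capacity(max_num_seqs=32, max_loras=None))  # 32
```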