Sampler Throughput Optimization #192
Conversation
…24-Feb/optimize-sampler
)
_test_max_tokens(staging_engine)
_test_ignore_eos(staging_engine)
# TODO (@sunggg): There is something stateful.
Interestingly, `_test_logprobs` does not work when `_test_stop` is enabled; it seems like `_test_stop` has some statefulness that affects the next tests. I see the following test starts after generating one extra token at the beginning, and I don't understand where that token comes from at the moment. @yelite, can you take a look?
May I ask if the order of the unit tests affects those results? For example, if we move `test_stop` after `test_logprobs`, would that work?
Yeah, if I change the order, it works. So I suspect there is something strange going on with `stop`.
I can confirm that `_test_stop` results in extra tokens after the sequence is finished due to a stop word. The stop word detection happens in the main process, and the main process then sends a cancellation to the worker. Somehow the worker and the main process get out of sync. I am looking into the fix.
It turns out that it's not really a bug in the InferenceEngine, although such an out-of-sync state is undesirable from an implementation perspective. What really causes trouble is that different tests send requests with the same request ids. Here `_test_stop` sends request id 2, and `_test_logprobs` also uses request id 2. A quick fix for this test failure is to make sure request ids are unique. I will fix the underlying out-of-sync issue in the engine refactoring.
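As a sketch of that quick fix, each test could derive its request ids from a UUID instead of hard-coding small integers. The helper below is hypothetical (not code from this PR), and it assumes the engine accepts string request ids:

```python
import uuid

def unique_request_id(prefix: str = "test") -> str:
    # Hypothetical helper: give every test request a globally unique id so
    # that state tied to an id in one test (e.g. a pending cancellation for
    # request id 2 in _test_stop) cannot leak into a later test that
    # happens to reuse the same id (e.g. _test_logprobs).
    return f"{prefix}-{uuid.uuid4().hex}"

# e.g. pass request_id=unique_request_id("stop") instead of request_id=2
```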
def _test_penalties():
    # TODO(vvchernov): Add test for repetition penalty
@vvchernov, would you add one after this PR? I couldn't find much info about repetition penalty in the OpenAI spec.

Probably better to make that a separate PR.
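For reference, a minimal vectorized sketch of the CTRL/HF-style repetition penalty such a test could exercise; the function name and tensor layout are assumptions, not this repo's API:

```python
import torch

def apply_repetition_penalty(
    logits: torch.Tensor,     # [batch, vocab] next-token logits
    seen_mask: torch.Tensor,  # [batch, vocab] bool, True for already-generated tokens
    penalty: torch.Tensor,    # [batch, 1] penalty per request (> 1.0 discourages repeats)
) -> torch.Tensor:
    # CTRL-style rule: for tokens that already appeared, divide positive
    # logits by the penalty and multiply negative logits by it.
    penalized = torch.where(logits > 0, logits / penalty, logits * penalty)
    return torch.where(seen_mask, penalized, logits)
```

A test could then assert that, with a penalty greater than 1, the probability of previously seen tokens strictly decreases compared with the unpenalized distribution.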
I'm not sure that such a large refactor of logprobs is the right call.
Since performance with logprobs is still bad, could we compute logprobs asynchronously once the next token is found and handed off to the next decode step?
@masahi, yeah, I thought about it, but it was tightly coupled with this refactoring, so it was tricky to find a clean cut 😢
I think what I did is mostly a pure refactoring without any functional change. This detokenization is the only exception.
Yeah, I also thought about that, and it would be an interesting optimization. I think there could be more opportunity for CPUs as well. Actually, one interesting observation is that most of the degradation is from post-processing when preparing the final response. If I change this line to
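As a rough sketch of the async idea on GPU (an illustration only, not code in this PR; the function and stream names are made up), the log-softmax for logprobs could be launched on a side CUDA stream so it overlaps with the next decode step:

```python
import torch

_logprob_stream = torch.cuda.Stream()

def sample_and_schedule_logprobs(logits: torch.Tensor):
    # logits: [batch, vocab] on GPU; greedy sampling here is just for illustration.
    next_tokens = torch.argmax(logits, dim=-1, keepdim=True)

    # Launch the logprob computation on a side stream so the main stream can
    # proceed to the next decode step without waiting for it.
    _logprob_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(_logprob_stream):
        logprobs = torch.log_softmax(logits.float(), dim=-1)
        token_logprobs = torch.gather(logprobs, -1, next_tokens)

    # Whoever builds the final response must wait on _logprob_stream
    # (e.g. torch.cuda.current_stream().wait_stream(_logprob_stream))
    # before reading token_logprobs back to the CPU.
    return next_tokens, token_logprobs
```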
@dataclass
class SamplingTensors:
    mask_random: torch.Tensor
@masahi we might want to look into https://github.com/patrick-kidger/jaxtyping or another project to also get more checking on the tensors. I think it's relatively valuable, especially as we have more and more people working on this code.
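For illustration, a small sketch of what jaxtyping shape/dtype annotations could look like on torch tensors, assuming a recent jaxtyping plus beartype for runtime checking; the field names echo this file, but the exact shapes are assumptions:

```python
import torch
from dataclasses import dataclass
from beartype import beartype
from jaxtyping import Bool, Float, jaxtyped

@dataclass
class SamplingTensorsChecked:
    # The annotations document the expected dtype and shape of each field.
    mask_random: Bool[torch.Tensor, "batch"]
    logit_bias: Float[torch.Tensor, "batch vocab"]

@jaxtyped(typechecker=beartype)
def adjust_logits_checked(
    logits: Float[torch.Tensor, "batch vocab"],
    tensors: SamplingTensorsChecked,
) -> Float[torch.Tensor, "batch vocab"]:
    # Argument and return shapes (and the consistency of the "batch"/"vocab"
    # axes) are checked at call time for the annotated tensors.
    return logits + tensors.logit_bias
```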
Lots of small nits and requests from me. Overall LGTM; thanks for taking the first stab at this, @sunggg. Looking forward to landing it.
Overview

This PR aims to land the following:

- `sampler.py`, which manages the data structures and key functions for the sampling computation. For example, `SamplingMetadata` manages `top_p`, `top_k`, `logit_bias`, etc. in torch tensors so that they can be applied in a vectorized fashion. Also, an `adjust_logits` function has been introduced as a single point for the various logit manipulations, such as `temperature`, `presence_penalty`, etc., and the upcoming `logit_processor` for JSON mode (see the sketch after this list).
- `test_sampler.py` to verify the logit manipulation directly.
- `logprob` logits, and use of `detokenize_incrementally` for the tokens in `TopLogprobs` for contextual detokenization.
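A minimal sketch of the vectorized idea behind `adjust_logits` (the actual function in `sampler.py` handles more parameters; the tensor layout here is an assumption):

```python
import torch

def adjust_logits_sketch(
    logits: torch.Tensor,       # [batch, vocab] raw model outputs for the whole batch
    temperature: torch.Tensor,  # [batch, 1] per-request temperature
    logit_bias: torch.Tensor,   # [batch, vocab] additive bias, zeros where unused
) -> torch.Tensor:
    # All requests in the batch are adjusted in one pass with tensor ops,
    # instead of looping over requests in Python.
    logits = logits + logit_bias
    logits = logits / torch.clamp(temperature, min=1e-6)
    return logits
```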
Performance

Targeting Mistral fp16 on 2xH100, the following scenarios are measured and compared against the current HEAD:

- `--apply-all-sampling-params`: includes the overhead for `top_p`/`top_k`, presence/frequency/repetition penalties, and logit bias.
- `--logprob`: includes the overhead for `logprob`.
Updates

- PR v0: initial version
- PR v1: vectorize further by reducing index shuffling
Engine Throughput (req/s)

Command for D0: `/opt/bin/cuda-reserve.py --num-gpu 2 python3 serve/benchmarks/benchmark_throughput.py --local-id mixtral-8x7b-instruct-v0.1-q0f16-presharded-2gpu --max-num-batched-tokens 16000 --dataset /opt/models/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --greedy-sampling-ratio 0.0 --num-prompts 1000`

Measured configurations: `--apply-all-sampling-params`, `--logprob`.
Latency (s)

Command for D0: `/opt/bin/cuda-reserve.py --num-gpu 2 python3 serve/benchmarks/benchmark_latency.py --local-id mixtral-8x7b-instruct-v0.1-q0f16-presharded-2gpu --max-num-batched-tokens 16000`

Measured configurations: `--apply-all-sampling-params`, `--logprob`.
TODOs
Note
Although there is no obvious quality impact in my local testing, please review carefully since this changes the way the computation is performed.
cc. @yelite @masahi @vvchernov @zxybazh