[JSON Mode] Constrained Sampling #175
Conversation
Thank you for the PR, @vegaluisjose!
It looks good to me overall. Can you also update the benchmark scripts and report the performance with and without this feature?
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/benchmarks/benchmark_latency.py
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/benchmarks/benchmark_throughput.py
You can add a flag for constrained sampling and update the post-processing:
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/benchmarks/utils.py#L5
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/benchmarks/utils.py#L38
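For context, here is a hedged sketch of the kind of change being suggested for the shared benchmark utils: a flag for constrained sampling plus a JSON schema threaded into the sampling settings. The helper names (add_sampling_flags, postproc_sampling_args) and the exact dictionary layout are assumptions, not the repository's code; only the --apply-json-mode flag name and the schema shape are taken from the benchmark logs later in this thread.

import argparse

# Hypothetical schema used only for the benchmark run; the logs below show a
# schema of this shape being passed through sampling_setting.
EXAMPLE_SCHEMA = {
    "properties": {"answer": {"title": "Answer", "type": "string"}},
    "required": ["answer"],
    "title": "Output",
    "type": "object",
}

def add_sampling_flags(parser: argparse.ArgumentParser) -> None:
    # Flag name mirrors the apply_json_mode field visible in the benchmark logs.
    parser.add_argument(
        "--apply-json-mode",
        action="store_true",
        help="Constrain sampling to a JSON schema during the benchmark run.",
    )

def postproc_sampling_args(args: argparse.Namespace) -> None:
    # Attach the schema only when JSON mode is requested.
    args.sampling_setting = {
        "json_schema": EXAMPLE_SCHEMA if args.apply_json_mode else None,
    }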
    Snow.model_validate(json.loads(out_text))
else:
    SnowList.model_validate(json.loads(out_text))
Can we add a test case for n>1 and mark it as skipped, since it is not supported yet?
By skipping, do you mean commenting it out, or the pytest way?
@pytest.mark.skip(reason="no way of currently testing this")
def test_the_unknown():
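For context, a minimal sketch of how the unsupported n>1 case might be kept in the suite but skipped; the test name and body here are hypothetical, not the PR's code.

import pytest

@pytest.mark.skip(reason="n>1 (parallel sampling) is not supported with JSON mode yet")
def test_json_mode_n_greater_than_one():
    # Would send a request with num_sequences > 1 and validate each returned
    # choice against the Snow / SnowList schema once support lands.
    ...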
The pytest way. Thank you for adding this!
Also, can we check if the local server works with this change?
Yes, it works. Do we need to add
(hexagon) lvega@crusoe-p4d:~/hexagon$ curl http://127.0.0.1:8000/v1/chat/completions -i -H "Content-Type: application/json" -d "@test.json"
HTTP/1.1 200 OK
date: Fri, 26 Jan 2024 22:28:07 GMT
server: uvicorn
content-length: 802
content-type: application/json
{"id":"cmpl-ba2d083fbc854e22b601861b3f05d543","object":"chat.completion","created":1706308088,"model":"test","choices":[{"index":0,"message":{"role":"assistant","content":" The capital of France is Paris. Paris is one of the most populous cities in Europe, and is known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. It is also famous for its fashion, art, and culinary scenes. Paris is located in the north-central part of France, on the banks of the Seine River. It is the political, cultural, and economic center of France, and is home to a number of important government institutions, including the French Parliament and the presidential palace."},"finish_reason":"stop"}],"usage":{"prompt_tokens":16,"total_tokens":136,"completion_tokens":120}} {
"model" : "test",
"stream": false,
"temperature" : 0,
"messages":[
{
"role" : "user",
"content" : "what is the capital of France?"
}
]
}
I was able to test latency (I am not sure how to do throughput since it requires a dataset). Here are the results.

Regular mode:
(hexagon) lvega@crusoe-p4d:~/hexagon/mlc-llm$ /opt/bin/cuda-reserve.py --num-gpus 2 python3 serve/benchmarks/benchmark_latency.py --local-id Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu
Namespace(local_id='Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu', artifact_path='dist', use_sync_engine=False, num_sequences_to_sample=1, max_num_batched_tokens=4096, min_decode_steps=32, max_decode_steps=56, debug_logging=False, seed=0, num_input_tokens=128, num_output_tokens=128, temperature=0.5, apply_penalties=False, apply_logit_bias=False, apply_top_p_top_k=Fa
lse, apply_json_mode=False, apply_all_sampling_params=False, model_artifact_path=PosixPath('dist/Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu'), use_staging_engine=True, sampling_setting={'ignore_eos': True, 'logit_bias': None, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'top_p': 1.0, 'top_k': -1, 'json_schema': None})
2024-01-26 22:35:21 [info ] StagingInferenceEngine.start [mlc_serve.engine.staging_engine] func_name=start lineno=88 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3042834
2024-01-26 22:35:24 [info ] Loading parameters from dist/Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu. [mlc_serve.model.tvm_model] func_name=get_tvm_model lineno=67 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3043006
2024-01-26 22:35:48 [info ] Running memory profiling. [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=457 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3043006
2024-01-26 22:35:50 [info ] Using 26034 cache blocks. [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=479 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3043006
2024-01-26 22:35:50 [info ] Allocated KV cache blocks. [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=501 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3043006
2024-01-26 22:35:50 [info ] Model is initalized. [mlc_serve.engine.staging_engine_worker] func_name=run_generation_loop_worker lineno=358 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine_worker.py process=3043006
2024-01-26 22:35:50 [warning ] `debug_options.prompt_token_ids` is provided. This will be used directly and the prompts will be ignored if provided. [mlc_serve.engine.base] func_name=__post_init__ lineno=138 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/base.py process=3042834
2024-01-26 22:35:50 [info ] StagingInferenceEngine.add [mlc_serve.engine.staging_engine] func_name=add lineno=105 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3042834 requests=[Request(request_id='1', messages=None, num_sequences=1, best_of=1, sampling_params=SamplingParams(presence_penalty=0.0, frequency_penalty=0.0, rep
etition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, logit_bias=None, appeared_tokens_freq={}, logit_bias_index=None, logit_bias_value=None, json_schema=None, logits_processor=None), stopping_criteria=StoppingCriteria(max_tokens=128, stop_sequences=None), debug_options=DebugOptions(ignore_eos=True, prompt=None, prompt_token_ids=[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), validate_tokens=None,
contextvars={})]
2024-01-26 22:35:53 [warning ] `debug_options.prompt_token_ids` is provided. This will be used directly and the prompts will be ignored if provided. [mlc_serve.engine.base] func_name=__post_init__ lineno=138 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/base.py process=3042834
2024-01-26 22:35:53 [info ] StagingInferenceEngine.add [mlc_serve.engine.staging_engine] func_name=add lineno=105 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3042834 requests=[Request(request_id='2', messages=None, num_sequences=1, best_of=1, sampling_params=SamplingParams(presence_penalty=0.0, frequency_penalty=0.0, rep
etition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, logit_bias=None, appeared_tokens_freq={}, logit_bias_index=None, logit_bias_value=None, json_schema=None, logits_processor=None), stopping_criteria=StoppingCriteria(max_tokens=128, stop_sequences=None), debug_options=DebugOptions(ignore_eos=True, prompt=None, prompt_token_ids=[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), validate_tokens=None,
contextvars={})]
User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 39.603 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.043 ms (55.424 tok/s)
* End-to-end latency: 2.331 s

JSON mode:
(hexagon) lvega@crusoe-p4d:~/hexagon/mlc-llm$ /opt/bin/cuda-reserve.py --num-gpus 2 python3 serve/benchmarks/benchmark_latency.py --local-id Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu --apply-json-mode
Namespace(local_id='Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu', artifact_path='dist', use_sync_engine=False, num_sequences_to_sample=1, max_num_batched_tokens=4096, min_decode_steps=32, max_decode_steps=56, debug_logging=False, seed=0, num_input_tokens=128, num_output_tokens=128, temperature=0.5, apply_penalties=False, apply_logit_bias=False, apply_top_p_top_k=Fa
lse, apply_json_mode=True, apply_all_sampling_params=False, model_artifact_path=PosixPath('dist/Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu'), use_staging_engine=True, sampling_setting={'ignore_eos': True, 'logit_bias': None, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'top_p': 1.0, 'top_k': -1, 'json_schema': {'properties': {'answ
er': {'title': 'Answer', 'type': 'string'}}, 'required': ['answer'], 'title': 'Output', 'type': 'object'}})
2024-01-26 22:36:39 [info ] StagingInferenceEngine.start [mlc_serve.engine.staging_engine] func_name=start lineno=88 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3044746
2024-01-26 22:36:42 [info ] Loading parameters from dist/Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu. [mlc_serve.model.tvm_model] func_name=get_tvm_model lineno=67 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3045035
2024-01-26 22:37:05 [info ] Running memory profiling. [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=457 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3045035
2024-01-26 22:37:07 [info ] Using 26034 cache blocks. [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=479 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3045035
2024-01-26 22:37:07 [info ] Allocated KV cache blocks. [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=501 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3045035
2024-01-26 22:37:07 [info ] Model is initalized. [mlc_serve.engine.staging_engine_worker] func_name=run_generation_loop_worker lineno=358 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine_worker.py process=3045035
2024-01-26 22:37:07 [warning ] `debug_options.prompt_token_ids` is provided. This will be used directly and the prompts will be ignored if provided. [mlc_serve.engine.base] func_name=__post_init__ lineno=138 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/base.py process=3044746
2024-01-26 22:37:07 [info ] StagingInferenceEngine.add [mlc_serve.engine.staging_engine] func_name=add lineno=105 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3044746 requests=[Request(request_id='1', messages=None, num_sequences=1, best_of=1, sampling_params=SamplingParams(presence_penalty=0.0, frequency_penalty=0.0, rep
etition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, logit_bias=None, appeared_tokens_freq={}, logit_bias_index=None, logit_bias_value=None, json_schema={'properties': {'answer': {'title': 'Answer', 'type': 'string'}}, 'required': ['answer'], 'title': 'Output', 'type': 'object'}, logits_processor=None), stopping_criteria=StoppingCriteria(max_tokens=128, stop_seque
nces=None), debug_options=DebugOptions(ignore_eos=True, prompt=None, prompt_token_ids=[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), validate_tokens=None, contextvars={})]
2024-01-26 22:37:16 [warning ] `debug_options.prompt_token_ids` is provided. This will be used directly and the prompts will be ignored if provided. [mlc_serve.engine.base] func_name=__post_init__ lineno=138 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/base.py process=3044746
2024-01-26 22:37:16 [info ] StagingInferenceEngine.add [mlc_serve.engine.staging_engine] func_name=add lineno=105 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3044746 requests=[Request(request_id='2', messages=None, num_sequences=1, best_of=1, sampling_params=SamplingParams(presence_penalty=0.0, frequency_penalty=0.0, rep
etition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, logit_bias=None, appeared_tokens_freq={}, logit_bias_index=None, logit_bias_value=None, json_schema={'properties': {'answer': {'title': 'Answer', 'type': 'string'}}, 'required': ['answer'], 'title': 'Output', 'type': 'object'}, logits_processor=None), stopping_criteria=StoppingCriteria(max_tokens=128, stop_seque
nces=None), debug_options=DebugOptions(ignore_eos=True, prompt=None, prompt_token_ids=[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), validate_tokens=None, contextvars={})]
User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 476.846 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.241 ms (54.820 tok/s)
* End-to-end latency: 2.794 s
Yes, we should match the OpenAI spec. ref.
Thanks for the benchmarking. Since the change is located in the common path, can we also measure the latency/throughput before this PR? Just to confirm its impact is marginal. For throughput, you can download the dataset with
I see, I will add it then. In terms of the spec, OpenAI only supports
Got it, I just did the tests for both latency and throughput (I see quite a hit on TTFT and throughput when JSON mode is on).

### Latency ###
# Default
User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 39.603 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.043 ms (55.424 tok/s)
* End-to-end latency: 2.331 s
# JSON mode
User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 476.846 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.241 ms (54.820 tok/s)
* End-to-end latency: 2.794 s
### Throughput ###
# Default
Engine Throughput: 11.23 requests/s, 4298.74 tokens/s
# JSON mode
Engine Throughput: 1.41 requests/s, 539.09 tokens/s

These are the numbers for the current branch, before this PR:

# Latency
User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 36.088 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.049 ms (55.404 tok/s)
* End-to-end latency: 2.328 s
# Throughput
Engine Throughput: 11.41 requests/s, 4364.51 tokens/s
Unfortunately, a drop in throughput from 11 req/s to 1 req/s is not sustainable in production. But my suspicion is that the only way throughput can be that different, while end-to-end times are that similar, is that the OLLM/Python/sampler layer is blocking things and starving the generation loop, which is likely fixable. @vegaluisjose, can you do a
If it's in our PyTorch code, then perhaps an
Hi @binarybana. I synced with @vegaluisjose and learned that the functionality is verified and that performance for non-JSON-mode requests won't be affected. I think I figured out why the throughput drops so significantly: it is mainly because of this sequential loop.
https://github.com/octoml/mlc-llm/pull/175/files#diff-2ac58f8a6d96b2cb84b3e875bfca806011d767c2f0c1c95deaa371aed9ba6c01R335
This first attempt exactly mimics vLLM's implementation, and @vegaluisjose confirmed that vLLM sees the same amount of degradation.
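To make that bottleneck concrete, here is a hedged, self-contained sketch (not the PR's actual code; the masking function, shapes, and allowed-token sets are assumptions) of why a per-sequence Python loop that constrains logits one row at a time serializes work that the rest of the batched engine would otherwise overlap with GPU execution:

import torch

def constrain_row(logits_row: torch.Tensor, allowed_ids: torch.Tensor) -> torch.Tensor:
    # Keep only the tokens a (hypothetical) JSON-schema state machine allows
    # for this sequence; everything else gets -inf so it cannot be sampled.
    masked = torch.full_like(logits_row, float("-inf"))
    masked[allowed_ids] = logits_row[allowed_ids]
    return masked

batch_size, vocab_size = 64, 32000
logits = torch.randn(batch_size, vocab_size)
# Hypothetical per-sequence allowed-token sets; a real FSM would derive these
# from the schema and the tokens generated so far.
allowed = [torch.randint(0, vocab_size, (100,)) for _ in range(batch_size)]

# The sequential pattern: every request in the batch is processed one after
# another on the host, so the generation loop waits on this Python loop.
constrained = torch.stack([constrain_row(logits[i], allowed[i]) for i in range(batch_size)])

One common mitigation is to precompute the per-sequence masks and apply them with a single batched tensor operation, though whether that is feasible here depends on how the JSON-schema state machine is implemented.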
Is this going to be merged any time soon? I have a PR, #181, that's waiting for other high-priority PRs to be merged first.
@masahi, yes, it is ready for review now. It has been waiting for #192 to get merged; hopefully we can get it merged by EOD. @vegaluisjose, it is rebased now, can you check your tests? All unit tests and benchmark tests pass for me.