
[JSON Mode] Constrained Sampling #175

Merged
merged 21 commits into batch-serving on Feb 8, 2024
Conversation

@vegaluisjose

No description provided.

@sunggg
Member

sunggg left a comment

serve/mlc_serve/model/tvm_model.py (outdated review thread, resolved)
    Snow.model_validate(json.loads(out_text))
else:
    SnowList.model_validate(json.loads(out_text))
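For context, the snippet above validates the generated text against pydantic models defined in the test script. A minimal sketch of what such models might look like (the field names below are illustrative assumptions, not the actual test definitions):

from typing import List

from pydantic import BaseModel


class Snow(BaseModel):
    # Hypothetical fields; the real test schema may differ.
    city: str
    inches: float


class SnowList(BaseModel):
    snow: List[Snow]

With pydantic v2, Snow.model_validate(json.loads(out_text)) raises a ValidationError if the constrained output does not conform, which is what the assertion above relies on.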

Member

Can we add a test case for n>1 and mark it as skipped, since it is not supported yet?

Author

By skipping, do you mean commenting it out, or the pytest way?

@pytest.mark.skip(reason="no way of currently testing this")
def test_the_unknown():
    ...

Member

The pytest way. Thank you for adding this!

serve/mlc_serve/engine/model_module.py (outdated review thread, resolved)
@sunggg
Member

sunggg commented Jan 26, 2024

Also, can we check whether the local server works with this change?
python3 -m mlc_serve --local-id xxxx launches the server, and you can test it with a curl request.
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/api/handler.py#L89

@vegaluisjose
Author

Also, can we check whether the local server works with this change? python3 -m mlc_serve --local-id xxxx launches the server, and you can test it with a curl request. https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/api/handler.py#L89

Yes, it works. Do we need to add response_format for json support here as well?

(hexagon) lvega@crusoe-p4d:~/hexagon$ curl http://127.0.0.1:8000/v1/chat/completions -i -H "Content-Type: application/json" -d "@test.json"
HTTP/1.1 200 OK
date: Fri, 26 Jan 2024 22:28:07 GMT
server: uvicorn
content-length: 802
content-type: application/json

{"id":"cmpl-ba2d083fbc854e22b601861b3f05d543","object":"chat.completion","created":1706308088,"model":"test","choices":[{"index":0,"message":{"role":"assistant","content":" The capital of France is Paris. Paris is one of the most populous cities in Europe, and is known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. It is also famous for its fashion, art, and culinary scenes. Paris is located in the north-central part of France, on the banks of the Seine River. It is the political, cultural, and economic center of France, and is home to a number of important government institutions, including the French Parliament and the presidential palace."},"finish_reason":"stop"}],"usage":{"prompt_tokens":16,"total_tokens":136,"completion_tokens":120}}
The contents of test.json used in the request above:

{
  "model" : "test",
  "stream": false,
  "temperature" : 0,
  "messages":[
      {
          "role" : "user",
          "content" : "what is the capital of France?"
      }
  ]
}

@vegaluisjose
Author

Thank you for the PR, @vegaluisjose! It looks good to me overall, can you also update the benchmark scripts and report the performance with/without this feature? https://github.com/octoml/mlc-llm/blob/batch-serving/serve/benchmarks/benchmark_latency.py https://github.com/octoml/mlc-llm/blob/batch-serving/serve/benchmarks/benchmark_throughput.py

You can add a flag for constrained sampling and update the post processing https://github.com/octoml/mlc-llm/blob/batch-serving/serve/benchmarks/utils.py#L5 https://github.com/octoml/mlc-llm/blob/batch-serving/serve/benchmarks/utils.py#L38
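To illustrate the quoted suggestion, here is a minimal sketch of how the benchmark flag and post-processing could be wired up in serve/benchmarks/utils.py (the helper names and the Output model are hypothetical; the --apply-json-mode flag and the json_schema key mirror the benchmark logs below):

import argparse

from pydantic import BaseModel


class Output(BaseModel):
    answer: str


def add_sampling_flags(parser: argparse.ArgumentParser) -> None:
    # Hypothetical helper: expose constrained sampling as a benchmark flag.
    parser.add_argument("--apply-json-mode", action="store_true")


def postprocess_sampling_args(args: argparse.Namespace) -> None:
    # Hypothetical post-processing: attach a JSON schema only when requested,
    # so the default (non-JSON) benchmark path is unchanged.
    args.sampling_setting["json_schema"] = (
        Output.model_json_schema() if args.apply_json_mode else None
    )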

I was able to test latency (I am not sure how to test throughput, since it requires a dataset). Here are the results:

Regular mode

(hexagon) lvega@crusoe-p4d:~/hexagon/mlc-llm$ /opt/bin/cuda-reserve.py --num-gpus 2 python3 serve/benchmarks/benchmark_latency.py --local-id Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu
Namespace(local_id='Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu', artifact_path='dist', use_sync_engine=False, num_sequences_to_sample=1, max_num_batched_tokens=4096, min_decode_steps=32, max_decode_steps=56, debug_logging=False, seed=0, num_input_tokens=128, num_output_tokens=128, temperature=0.5, apply_penalties=False, apply_logit_bias=False, apply_top_p_top_k=Fa
lse, apply_json_mode=False, apply_all_sampling_params=False, model_artifact_path=PosixPath('dist/Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu'), use_staging_engine=True, sampling_setting={'ignore_eos': True, 'logit_bias': None, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'top_p': 1.0, 'top_k': -1, 'json_schema': None})
2024-01-26 22:35:21 [info     ] StagingInferenceEngine.start   [mlc_serve.engine.staging_engine] func_name=start lineno=88 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3042834
2024-01-26 22:35:24 [info     ] Loading parameters from dist/Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu. [mlc_serve.model.tvm_model] func_name=get_tvm_model lineno=67 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3043006
2024-01-26 22:35:48 [info     ] Running memory profiling.      [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=457 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3043006
2024-01-26 22:35:50 [info     ] Using 26034 cache blocks.      [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=479 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3043006
2024-01-26 22:35:50 [info     ] Allocated KV cache blocks.     [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=501 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3043006
2024-01-26 22:35:50 [info     ] Model is initalized.           [mlc_serve.engine.staging_engine_worker] func_name=run_generation_loop_worker lineno=358 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine_worker.py process=3043006
2024-01-26 22:35:50 [warning  ] `debug_options.prompt_token_ids` is provided. This will be used directly and the prompts will be ignored if provided. [mlc_serve.engine.base] func_name=__post_init__ lineno=138 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/base.py process=3042834
2024-01-26 22:35:50 [info     ] StagingInferenceEngine.add     [mlc_serve.engine.staging_engine] func_name=add lineno=105 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3042834 requests=[Request(request_id='1', messages=None, num_sequences=1, best_of=1, sampling_params=SamplingParams(presence_penalty=0.0, frequency_penalty=0.0, rep
etition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, logit_bias=None, appeared_tokens_freq={}, logit_bias_index=None, logit_bias_value=None, json_schema=None, logits_processor=None), stopping_criteria=StoppingCriteria(max_tokens=128, stop_sequences=None), debug_options=DebugOptions(ignore_eos=True, prompt=None, prompt_token_ids=[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), validate_tokens=None,
 contextvars={})]
2024-01-26 22:35:53 [warning  ] `debug_options.prompt_token_ids` is provided. This will be used directly and the prompts will be ignored if provided. [mlc_serve.engine.base] func_name=__post_init__ lineno=138 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/base.py process=3042834
2024-01-26 22:35:53 [info     ] StagingInferenceEngine.add     [mlc_serve.engine.staging_engine] func_name=add lineno=105 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3042834 requests=[Request(request_id='2', messages=None, num_sequences=1, best_of=1, sampling_params=SamplingParams(presence_penalty=0.0, frequency_penalty=0.0, rep
etition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, logit_bias=None, appeared_tokens_freq={}, logit_bias_index=None, logit_bias_value=None, json_schema=None, logits_processor=None), stopping_criteria=StoppingCriteria(max_tokens=128, stop_sequences=None), debug_options=DebugOptions(ignore_eos=True, prompt=None, prompt_token_ids=[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), validate_tokens=None,
 contextvars={})]
User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 39.603 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.043 ms (55.424 tok/s)
* End-to-end latency: 2.331 s

JSON mode

(hexagon) lvega@crusoe-p4d:~/hexagon/mlc-llm$ /opt/bin/cuda-reserve.py --num-gpus 2 python3 serve/benchmarks/benchmark_latency.py --local-id Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu --apply-json-mode
Namespace(local_id='Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu', artifact_path='dist', use_sync_engine=False, num_sequences_to_sample=1, max_num_batched_tokens=4096, min_decode_steps=32, max_decode_steps=56, debug_logging=False, seed=0, num_input_tokens=128, num_output_tokens=128, temperature=0.5, apply_penalties=False, apply_logit_bias=False, apply_top_p_top_k=Fa
lse, apply_json_mode=True, apply_all_sampling_params=False, model_artifact_path=PosixPath('dist/Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu'), use_staging_engine=True, sampling_setting={'ignore_eos': True, 'logit_bias': None, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'top_p': 1.0, 'top_k': -1, 'json_schema': {'properties': {'answ
er': {'title': 'Answer', 'type': 'string'}}, 'required': ['answer'], 'title': 'Output', 'type': 'object'}})
2024-01-26 22:36:39 [info     ] StagingInferenceEngine.start   [mlc_serve.engine.staging_engine] func_name=start lineno=88 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3044746
2024-01-26 22:36:42 [info     ] Loading parameters from dist/Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu. [mlc_serve.model.tvm_model] func_name=get_tvm_model lineno=67 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3045035
2024-01-26 22:37:05 [info     ] Running memory profiling.      [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=457 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3045035
2024-01-26 22:37:07 [info     ] Using 26034 cache blocks.      [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=479 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3045035
2024-01-26 22:37:07 [info     ] Allocated KV cache blocks.     [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=501 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3045035
2024-01-26 22:37:07 [info     ] Model is initalized.           [mlc_serve.engine.staging_engine_worker] func_name=run_generation_loop_worker lineno=358 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine_worker.py process=3045035
2024-01-26 22:37:07 [warning  ] `debug_options.prompt_token_ids` is provided. This will be used directly and the prompts will be ignored if provided. [mlc_serve.engine.base] func_name=__post_init__ lineno=138 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/base.py process=3044746
2024-01-26 22:37:07 [info     ] StagingInferenceEngine.add     [mlc_serve.engine.staging_engine] func_name=add lineno=105 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3044746 requests=[Request(request_id='1', messages=None, num_sequences=1, best_of=1, sampling_params=SamplingParams(presence_penalty=0.0, frequency_penalty=0.0, rep
etition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, logit_bias=None, appeared_tokens_freq={}, logit_bias_index=None, logit_bias_value=None, json_schema={'properties': {'answer': {'title': 'Answer', 'type': 'string'}}, 'required': ['answer'], 'title': 'Output', 'type': 'object'}, logits_processor=None), stopping_criteria=StoppingCriteria(max_tokens=128, stop_seque
nces=None), debug_options=DebugOptions(ignore_eos=True, prompt=None, prompt_token_ids=[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), validate_tokens=None, contextvars={})]
2024-01-26 22:37:16 [warning  ] `debug_options.prompt_token_ids` is provided. This will be used directly and the prompts will be ignored if provided. [mlc_serve.engine.base] func_name=__post_init__ lineno=138 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/base.py process=3044746
2024-01-26 22:37:16 [info     ] StagingInferenceEngine.add     [mlc_serve.engine.staging_engine] func_name=add lineno=105 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3044746 requests=[Request(request_id='2', messages=None, num_sequences=1, best_of=1, sampling_params=SamplingParams(presence_penalty=0.0, frequency_penalty=0.0, rep
etition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, logit_bias=None, appeared_tokens_freq={}, logit_bias_index=None, logit_bias_value=None, json_schema={'properties': {'answer': {'title': 'Answer', 'type': 'string'}}, 'required': ['answer'], 'title': 'Output', 'type': 'object'}, logits_processor=None), stopping_criteria=StoppingCriteria(max_tokens=128, stop_seque
nces=None), debug_options=DebugOptions(ignore_eos=True, prompt=None, prompt_token_ids=[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), validate_tokens=None, contextvars={})]
User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 476.846 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.241 ms (54.820 tok/s)
* End-to-end latency: 2.794 s

@sunggg
Member

sunggg commented Jan 29, 2024

Do we need to add response_format for json support here as well?

Yes, we should match the OpenAI spec (ref.).
I'm not familiar with its usage, though; how do we pass the actual class object (e.g., class France or class Snow in your test script)?

@sunggg
Member

sunggg commented Jan 29, 2024

Thanks for the benchmarking. Since the change is located in the common path, can we also measure the latency/throughput before this PR? Just to confirm that its impact is marginal.

For throughput, you can download the dataset with wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

@vegaluisjose
Author

Do we need to add response_format for json support here as well?

Yes, we should match the OpenAI spec (ref.). I'm not familiar with its usage, though; how do we pass the actual class object (e.g., class France or class Snow in your test script)?

I see, I will add it then. In terms of the spec, OpenAI only supports response_format(type="json_object"), but Together, Anyscale, and Fireworks use response_format(type="json_object", schema={...}), and we will follow that convention.
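As a rough sketch of that request shape (the response_format fields follow the type plus schema convention described above; the Output model and its answer field are illustrative assumptions):

import json

from pydantic import BaseModel


class Output(BaseModel):
    answer: str


# Hypothetical chat-completions request body; the server-side handling of
# response_format is what this discussion proposes to add.
request_body = {
    "model": "test",
    "stream": False,
    "temperature": 0,
    "messages": [{"role": "user", "content": "what is the capital of France?"}],
    "response_format": {
        "type": "json_object",
        "schema": Output.model_json_schema(),
    },
}

print(json.dumps(request_body, indent=2))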

@vegaluisjose
Author

vegaluisjose commented Jan 29, 2024

User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 476.846 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.241 ms (54.820 tok/s)
* End-to-end latency: 2.794 s

Got it, I just ran the tests for both latency and throughput (there is quite a hit on TTFT and throughput when JSON mode is on):

### Latency ###

# Default
User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 39.603 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.043 ms (55.424 tok/s)
* End-to-end latency: 2.331 s

# JSON mode
User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 476.846 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.241 ms (54.820 tok/s)
* End-to-end latency: 2.794 s

### Throughput ###

# Default
Engine Throughput: 11.23 requests/s, 4298.74 tokens/s

# JSON mode
Engine Throughput: 1.41 requests/s, 539.09 tokens/s

These are the numbers for the current batch-serving branch, at this commit:

# Latency

User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 36.088 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.049 ms (55.404 tok/s)
* End-to-end latency: 2.328 s

# Throughput

Engine Throughput: 11.41 requests/s, 4364.51 tokens/s

@binarybana
Member

Unfortunately, a drop in throughput from 11 req/s to 1 req/s is not sustainable in production.

But my suspicion is that the only way throughput can be that different while end-to-end times are that similar is that the OLLM/Python/sampler layer is blocking and starving the generation loop, which is likely fixable.

@vegaluisjose, can you do a py-spy record with and without JSON mode to see where the time is being spent?

If it's in our PyTorch code, then perhaps an asyncio.to_thread might be enough (since PyTorch calls release the GIL), but if it's in Guidance, then we might need to get creative, since that looks like all Python.
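For illustration, a minimal sketch of the asyncio.to_thread idea, assuming a blocking constrained-sampling step inside an async engine loop (both function names here are hypothetical):

import asyncio


def apply_json_constraints(logits, processors):
    # Hypothetical blocking step: run the Python-side logits processors
    # (e.g. JSON-mode masking) for the current batch.
    for processor in processors:
        logits = processor(logits)
    return logits


async def sample_step(logits, processors):
    # Offload the blocking work to a worker thread so the event loop that
    # drives the generation loop is not starved while the masking runs.
    return await asyncio.to_thread(apply_json_constraints, logits, processors)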

Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Jan 30, 2024
@vegaluisjose vegaluisjose marked this pull request as draft January 30, 2024 16:52
@sunggg
Member

sunggg commented Feb 1, 2024

Hi, @binarybana. I synced with @vegaluisjose and learned that the functionality is verified and that performance for non-JSON-mode requests won't be affected.

And I think I figured out why the throughput drops so significantly: it is mainly because of this sequential loop. https://github.com/octoml/mlc-llm/pull/175/files#diff-2ac58f8a6d96b2cb84b3e875bfca806011d767c2f0c1c95deaa371aed9ba6c01R335
Because we visit each request in the batch one by one, throughput suffers badly, even though single-request latency with no load is not affected.
The benchmark script we used tests an extremely high-concurrency scenario (1,000 VUs), so the degradation looked bad. I expect the degradation won't be this severe with fewer VUs.

This first attempt exactly mimics vLLM's implementation, and @vegaluisjose confirmed that vLLM sees the same amount of degradation.
Since it looks good for a first attempt, in my opinion, I'd like to get this merged and follow up.
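To make the bottleneck concrete, here is a hypothetical simplification of the kind of per-request loop being described (not the actual tvm_model.py code): the JSON-mode mask is applied one request at a time on the Python side, which serializes work across the whole batch.

import torch


def apply_logits_processors(logits: torch.Tensor, requests) -> torch.Tensor:
    # Hypothetical sketch: each request that carries a logits_processor
    # (i.e. has a json_schema) masks its own row of the batched logits,
    # one request at a time, before sampling.
    for i, request in enumerate(requests):
        processor = request.sampling_params.logits_processor
        if processor is not None:
            logits[i] = processor(logits[i])
    return logits

A follow-up could skip the loop entirely when no request in the batch carries a json_schema, or batch the masking across requests, which is consistent with the observation above that non-JSON-mode requests are unaffected.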

@masahi
Member

masahi commented Feb 8, 2024

Is this going to be merged any time soon? I have a PR, #181, that is waiting for other high-priority PRs to be merged first.

@sunggg
Member

sunggg commented Feb 8, 2024

@masahi, yes, it is ready for review now. It has been waiting for #192 to get merged. Hopefully, we can get it merged by EOD.

@vegaluisjose, it is rebased now; can you check your tests? All unit tests and benchmark tests pass for me.

@sunggg sunggg marked this pull request as ready for review February 8, 2024 15:36
@vegaluisjose
Author

Alright, I just tested and it passes my tests, 38 out of 38, on both Mistral and Mixtral. @masahi @sunggg

@sunggg sunggg merged commit 1c7e7f0 into batch-serving Feb 8, 2024
1 check passed