
[JSON Mode] Constrained Sampling #175

Merged
merged 21 commits into batch-serving on Feb 8, 2024
Conversation

@vegaluisjose

No description provided.

@sunggg
Member

sunggg left a comment

serve/mlc_serve/model/tvm_model.py (outdated review thread, resolved)
    Snow.model_validate(json.loads(out_text))
else:
    SnowList.model_validate(json.loads(out_text))
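For context, the snippet above validates the generated text against pydantic models defined in the test script. A minimal sketch of what such models might look like (the field names below are illustrative assumptions, not the actual test definitions):

from typing import List

from pydantic import BaseModel


class Snow(BaseModel):
    # Hypothetical fields; the real test schema may differ.
    city: str
    inches: float


class SnowList(BaseModel):
    snow: List[Snow]

With pydantic v2, Snow.model_validate(json.loads(out_text)) raises a ValidationError if the constrained output does not conform, which is what the assertion above relies on.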

Member

Can we add a test case for n>1 and mark it as skipped, since it is not supported yet?

Author

By skipping, do you mean commenting it out, or the pytest way?

@pytest.mark.skip(reason="no way of currently testing this")
def test_the_unknown():
    ...

Member

The pytest way. Thank you for adding this!

serve/mlc_serve/engine/model_module.py (outdated review thread, resolved)
@sunggg
Member

sunggg commented Jan 26, 2024

Also, can we check whether the local server works with this change?
python3 -m mlc_serve --local-id xxxx launches the server, and you can test it with a curl request.
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/api/handler.py#L89

@vegaluisjose
Author

Also, can we check whether the local server works with this change? python3 -m mlc_serve --local-id xxxx launches the server, and you can test it with a curl request. https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/api/handler.py#L89

Yes, it works. Do we need to add response_format for json support here as well?

(hexagon) lvega@crusoe-p4d:~/hexagon$ curl http://127.0.0.1:8000/v1/chat/completions -i -H "Content-Type: application/json" -d "@test.json"
HTTP/1.1 200 OK
date: Fri, 26 Jan 2024 22:28:07 GMT
server: uvicorn
content-length: 802
content-type: application/json

{"id":"cmpl-ba2d083fbc854e22b601861b3f05d543","object":"chat.completion","created":1706308088,"model":"test","choices":[{"index":0,"message":{"role":"assistant","content":" The capital of France is Paris. Paris is one of the most populous cities in Europe, and is known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. It is also famous for its fashion, art, and culinary scenes. Paris is located in the north-central part of France, on the banks of the Seine River. It is the political, cultural, and economic center of France, and is home to a number of important government institutions, including the French Parliament and the presidential palace."},"finish_reason":"stop"}],"usage":{"prompt_tokens":16,"total_tokens":136,"completion_tokens":120}}
The contents of test.json used in the request above:

{
  "model" : "test",
  "stream": false,
  "temperature" : 0,
  "messages":[
      {
          "role" : "user",
          "content" : "what is the capital of France?"
      }
  ]
}

@vegaluisjose
Author

Thank you for the PR, @vegaluisjose! It looks good to me overall, can you also update the benchmark scripts and report the performance with/without this feature? https://github.com/octoml/mlc-llm/blob/batch-serving/serve/benchmarks/benchmark_latency.py https://github.com/octoml/mlc-llm/blob/batch-serving/serve/benchmarks/benchmark_throughput.py

You can add a flag for constrained sampling and update the post processing https://github.com/octoml/mlc-llm/blob/batch-serving/serve/benchmarks/utils.py#L5 https://github.com/octoml/mlc-llm/blob/batch-serving/serve/benchmarks/utils.py#L38
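To illustrate the quoted suggestion, here is a minimal sketch of how the benchmark flag and post-processing could be wired up in serve/benchmarks/utils.py (the helper names and the Output model are hypothetical; the --apply-json-mode flag and the json_schema key mirror the benchmark logs below):

import argparse

from pydantic import BaseModel


class Output(BaseModel):
    answer: str


def add_sampling_flags(parser: argparse.ArgumentParser) -> None:
    # Hypothetical helper: expose constrained sampling as a benchmark flag.
    parser.add_argument("--apply-json-mode", action="store_true")


def postprocess_sampling_args(args: argparse.Namespace) -> None:
    # Hypothetical post-processing: attach a JSON schema only when requested,
    # so the default (non-JSON) benchmark path is unchanged.
    args.sampling_setting["json_schema"] = (
        Output.model_json_schema() if args.apply_json_mode else None
    )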

I was able to test latency (I am not sure how to test throughput, since it requires a dataset). Here are the results:

Regular mode

(hexagon) lvega@crusoe-p4d:~/hexagon/mlc-llm$ /opt/bin/cuda-reserve.py --num-gpus 2 python3 serve/benchmarks/benchmark_latency.py --local-id Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu
Namespace(local_id='Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu', artifact_path='dist', use_sync_engine=False, num_sequences_to_sample=1, max_num_batched_tokens=4096, min_decode_steps=32, max_decode_steps=56, debug_logging=False, seed=0, num_input_tokens=128, num_output_tokens=128, temperature=0.5, apply_penalties=False, apply_logit_bias=False, apply_top_p_top_k=Fa
lse, apply_json_mode=False, apply_all_sampling_params=False, model_artifact_path=PosixPath('dist/Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu'), use_staging_engine=True, sampling_setting={'ignore_eos': True, 'logit_bias': None, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'top_p': 1.0, 'top_k': -1, 'json_schema': None})
2024-01-26 22:35:21 [info     ] StagingInferenceEngine.start   [mlc_serve.engine.staging_engine] func_name=start lineno=88 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3042834
2024-01-26 22:35:24 [info     ] Loading parameters from dist/Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu. [mlc_serve.model.tvm_model] func_name=get_tvm_model lineno=67 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3043006
2024-01-26 22:35:48 [info     ] Running memory profiling.      [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=457 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3043006
2024-01-26 22:35:50 [info     ] Using 26034 cache blocks.      [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=479 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3043006
2024-01-26 22:35:50 [info     ] Allocated KV cache blocks.     [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=501 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3043006
2024-01-26 22:35:50 [info     ] Model is initalized.           [mlc_serve.engine.staging_engine_worker] func_name=run_generation_loop_worker lineno=358 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine_worker.py process=3043006
2024-01-26 22:35:50 [warning  ] `debug_options.prompt_token_ids` is provided. This will be used directly and the prompts will be ignored if provided. [mlc_serve.engine.base] func_name=__post_init__ lineno=138 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/base.py process=3042834
2024-01-26 22:35:50 [info     ] StagingInferenceEngine.add     [mlc_serve.engine.staging_engine] func_name=add lineno=105 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3042834 requests=[Request(request_id='1', messages=None, num_sequences=1, best_of=1, sampling_params=SamplingParams(presence_penalty=0.0, frequency_penalty=0.0, rep
etition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, logit_bias=None, appeared_tokens_freq={}, logit_bias_index=None, logit_bias_value=None, json_schema=None, logits_processor=None), stopping_criteria=StoppingCriteria(max_tokens=128, stop_sequences=None), debug_options=DebugOptions(ignore_eos=True, prompt=None, prompt_token_ids=[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), validate_tokens=None,
 contextvars={})]
2024-01-26 22:35:53 [warning  ] `debug_options.prompt_token_ids` is provided. This will be used directly and the prompts will be ignored if provided. [mlc_serve.engine.base] func_name=__post_init__ lineno=138 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/base.py process=3042834
2024-01-26 22:35:53 [info     ] StagingInferenceEngine.add     [mlc_serve.engine.staging_engine] func_name=add lineno=105 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3042834 requests=[Request(request_id='2', messages=None, num_sequences=1, best_of=1, sampling_params=SamplingParams(presence_penalty=0.0, frequency_penalty=0.0, rep
etition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, logit_bias=None, appeared_tokens_freq={}, logit_bias_index=None, logit_bias_value=None, json_schema=None, logits_processor=None), stopping_criteria=StoppingCriteria(max_tokens=128, stop_sequences=None), debug_options=DebugOptions(ignore_eos=True, prompt=None, prompt_token_ids=[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), validate_tokens=None,
 contextvars={})]
User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 39.603 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.043 ms (55.424 tok/s)
* End-to-end latency: 2.331 s

JSON mode

(hexagon) lvega@crusoe-p4d:~/hexagon/mlc-llm$ /opt/bin/cuda-reserve.py --num-gpus 2 python3 serve/benchmarks/benchmark_latency.py --local-id Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu --apply-json-mode
Namespace(local_id='Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu', artifact_path='dist', use_sync_engine=False, num_sequences_to_sample=1, max_num_batched_tokens=4096, min_decode_steps=32, max_decode_steps=56, debug_logging=False, seed=0, num_input_tokens=128, num_output_tokens=128, temperature=0.5, apply_penalties=False, apply_logit_bias=False, apply_top_p_top_k=Fa
lse, apply_json_mode=True, apply_all_sampling_params=False, model_artifact_path=PosixPath('dist/Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu'), use_staging_engine=True, sampling_setting={'ignore_eos': True, 'logit_bias': None, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'top_p': 1.0, 'top_k': -1, 'json_schema': {'properties': {'answ
er': {'title': 'Answer', 'type': 'string'}}, 'required': ['answer'], 'title': 'Output', 'type': 'object'}})
2024-01-26 22:36:39 [info     ] StagingInferenceEngine.start   [mlc_serve.engine.staging_engine] func_name=start lineno=88 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3044746
2024-01-26 22:36:42 [info     ] Loading parameters from dist/Mixtral-8x7B-Instruct-v0.1-q0f16-presharded-2gpu. [mlc_serve.model.tvm_model] func_name=get_tvm_model lineno=67 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3045035
2024-01-26 22:37:05 [info     ] Running memory profiling.      [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=457 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3045035
2024-01-26 22:37:07 [info     ] Using 26034 cache blocks.      [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=479 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3045035
2024-01-26 22:37:07 [info     ] Allocated KV cache blocks.     [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=501 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/model/tvm_model.py process=3045035
2024-01-26 22:37:07 [info     ] Model is initalized.           [mlc_serve.engine.staging_engine_worker] func_name=run_generation_loop_worker lineno=358 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine_worker.py process=3045035
2024-01-26 22:37:07 [warning  ] `debug_options.prompt_token_ids` is provided. This will be used directly and the prompts will be ignored if provided. [mlc_serve.engine.base] func_name=__post_init__ lineno=138 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/base.py process=3044746
2024-01-26 22:37:07 [info     ] StagingInferenceEngine.add     [mlc_serve.engine.staging_engine] func_name=add lineno=105 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3044746 requests=[Request(request_id='1', messages=None, num_sequences=1, best_of=1, sampling_params=SamplingParams(presence_penalty=0.0, frequency_penalty=0.0, rep
etition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, logit_bias=None, appeared_tokens_freq={}, logit_bias_index=None, logit_bias_value=None, json_schema={'properties': {'answer': {'title': 'Answer', 'type': 'string'}}, 'required': ['answer'], 'title': 'Output', 'type': 'object'}, logits_processor=None), stopping_criteria=StoppingCriteria(max_tokens=128, stop_seque
nces=None), debug_options=DebugOptions(ignore_eos=True, prompt=None, prompt_token_ids=[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), validate_tokens=None, contextvars={})]
2024-01-26 22:37:16 [warning  ] `debug_options.prompt_token_ids` is provided. This will be used directly and the prompts will be ignored if provided. [mlc_serve.engine.base] func_name=__post_init__ lineno=138 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/base.py process=3044746
2024-01-26 22:37:16 [info     ] StagingInferenceEngine.add     [mlc_serve.engine.staging_engine] func_name=add lineno=105 pathname=/home/lvega/hexagon/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=3044746 requests=[Request(request_id='2', messages=None, num_sequences=1, best_of=1, sampling_params=SamplingParams(presence_penalty=0.0, frequency_penalty=0.0, rep
etition_penalty=1.0, temperature=0.5, top_p=1.0, top_k=-1, logit_bias=None, appeared_tokens_freq={}, logit_bias_index=None, logit_bias_value=None, json_schema={'properties': {'answer': {'title': 'Answer', 'type': 'string'}}, 'required': ['answer'], 'title': 'Output', 'type': 'object'}, logits_processor=None), stopping_criteria=StoppingCriteria(max_tokens=128, stop_seque
nces=None), debug_options=DebugOptions(ignore_eos=True, prompt=None, prompt_token_ids=[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), validate_tokens=None, contextvars={})]
User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 476.846 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.241 ms (54.820 tok/s)
* End-to-end latency: 2.794 s

@sunggg
Member

sunggg commented Jan 29, 2024

Do we need to add response_format for json support here as well?

Yes, we should match the OpenAI spec (ref.).
I'm not familiar with its usage, though; how do we pass the actual class object (e.g., class France or class Snow in your test script)?

@sunggg
Member

sunggg commented Jan 29, 2024

Thanks for the benchmarking. Since the change is located in the common path, can we also measure the latency/throughput before this PR? Just to confirm that its impact is marginal.

For throughput, you can download the dataset with wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

@vegaluisjose
Author

Do we need to add response_format for json support here as well?

Yes, we should match the OpenAI spec (ref.). I'm not familiar with its usage, though; how do we pass the actual class object (e.g., class France or class Snow in your test script)?

I see, I will add it then. In terms of the spec, OpenAI only supports response_format(type="json_object"), but Together, Anyscale, and Fireworks use response_format(type="json_object", schema={...}), and we will follow that convention.
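As a rough sketch of that request shape (the response_format fields follow the type plus schema convention described above; the Output model and its answer field are illustrative assumptions):

import json

from pydantic import BaseModel


class Output(BaseModel):
    answer: str


# Hypothetical chat-completions request body; the server-side handling of
# response_format is what this discussion proposes to add.
request_body = {
    "model": "test",
    "stream": False,
    "temperature": 0,
    "messages": [{"role": "user", "content": "what is the capital of France?"}],
    "response_format": {
        "type": "json_object",
        "schema": Output.model_json_schema(),
    },
}

print(json.dumps(request_body, indent=2))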

@vegaluisjose
Author

vegaluisjose commented Jan 29, 2024

User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 476.846 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.241 ms (54.820 tok/s)
* End-to-end latency: 2.794 s

Got it, I just ran the tests for both latency and throughput (there is quite a hit on TTFT and throughput when JSON mode is on):

### Latency ###

# Default
User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 39.603 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.043 ms (55.424 tok/s)
* End-to-end latency: 2.331 s

# JSON mode
User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 476.846 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.241 ms (54.820 tok/s)
* End-to-end latency: 2.794 s

### Throughput ###

# Default
Engine Throughput: 11.23 requests/s, 4298.74 tokens/s

# JSON mode
Engine Throughput: 1.41 requests/s, 539.09 tokens/s

These are the numbers for the current batch-serving branch, at this commit:

# Latency

User side metrics
* number of input tokens: 128, number of output tokens: 128
* Time To First Token (TTFT): 36.088 ms
* Inter-Subsequent-Token-Latency (ISTL): 18.049 ms (55.404 tok/s)
* End-to-end latency: 2.328 s

# Throughput

Engine Throughput: 11.41 requests/s, 4364.51 tokens/s

@binarybana
Member

Unfortunately, a drop in throughput from 11 req/s to 1 req/s is not sustainable in production.

But my suspicion is that the only way throughput can be that different while end-to-end times are that similar is that the OLLM/Python/sampler layer is blocking and starving the generation loop, which is likely fixable.

@vegaluisjose, can you do a py-spy record with and without JSON mode to see where the time is being spent?

If it's in our PyTorch code, then perhaps an asyncio.to_thread might be enough (since PyTorch calls release the GIL), but if it's in Guidance, then we might need to get creative, since that looks like all Python.
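For illustration, a minimal sketch of the asyncio.to_thread idea, assuming a blocking constrained-sampling step inside an async engine loop (both function names here are hypothetical):

import asyncio


def apply_json_constraints(logits, processors):
    # Hypothetical blocking step: run the Python-side logits processors
    # (e.g. JSON-mode masking) for the current batch.
    for processor in processors:
        logits = processor(logits)
    return logits


async def sample_step(logits, processors):
    # Offload the blocking work to a worker thread so the event loop that
    # drives the generation loop is not starved while the masking runs.
    return await asyncio.to_thread(apply_json_constraints, logits, processors)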

Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Jan 30, 2024
@vegaluisjose vegaluisjose marked this pull request as draft January 30, 2024 16:52
@sunggg
Member

sunggg commented Feb 1, 2024

Hi, @binarybana. I synced with @vegaluisjose and learned that the functionality is verified and that performance for non-JSON-mode requests won't be affected.

And I think I figured out why the throughput drops so significantly: it is mainly because of this sequential loop. https://github.com/octoml/mlc-llm/pull/175/files#diff-2ac58f8a6d96b2cb84b3e875bfca806011d767c2f0c1c95deaa371aed9ba6c01R335
Because we visit each request in the batch one by one, throughput suffers badly, even though single-request latency with no load is not affected.
The benchmark script we used tests an extremely high-concurrency scenario (1,000 VUs), so the degradation looked bad. I expect the degradation won't be this severe with fewer VUs.

This first attempt exactly mimics vLLM's implementation, and @vegaluisjose confirmed that vLLM sees the same amount of degradation.
Since it looks good for a first attempt, in my opinion, I'd like to get this merged and follow up.
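To make the bottleneck concrete, here is a hypothetical simplification of the kind of per-request loop being described (not the actual tvm_model.py code): the JSON-mode mask is applied one request at a time on the Python side, which serializes work across the whole batch.

import torch


def apply_logits_processors(logits: torch.Tensor, requests) -> torch.Tensor:
    # Hypothetical sketch: each request that carries a logits_processor
    # (i.e. has a json_schema) masks its own row of the batched logits,
    # one request at a time, before sampling.
    for i, request in enumerate(requests):
        processor = request.sampling_params.logits_processor
        if processor is not None:
            logits[i] = processor(logits[i])
    return logits

A follow-up could skip the loop entirely when no request in the batch carries a json_schema, or batch the masking across requests, which is consistent with the observation above that non-JSON-mode requests are unaffected.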

@masahi
Member

masahi commented Feb 8, 2024

Is this going to be merged any time soon? I have a PR, #181, that is waiting for other high-priority PRs to be merged first.

@sunggg
Member

sunggg commented Feb 8, 2024

@masahi, yes, it is ready for review now. It has been waiting for #192 to get merged. Hopefully, we can get it merged by EOD.

@vegaluisjose, it is rebased now; can you check your tests? All unit tests and benchmark tests pass for me.

@sunggg sunggg marked this pull request as ready for review February 8, 2024 15:36
@vegaluisjose
Author

Alright, I just tested and it passes my tests, 38 out of 38, on both Mistral and Mixtral. @masahi @sunggg

@sunggg sunggg merged commit 1c7e7f0 into batch-serving Feb 8, 2024
1 check passed