Correct small typos in vllm_mixtral.py #563

Merged · 2 commits · Jan 26, 2024
06_gpu_and_ml/vllm_mixtral.py (2 changes: 1 addition & 1 deletion)
@@ -7,7 +7,7 @@
 # walks through setting up an environment that works with `vLLM ` for basic inference.
 #
 # We are running the [Mixtral 8x7B Instruct](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model here, which is a mixture-of-experts model finetuned for conversation.
-# You can expect 3 minute second cold starts
+# You can expect 3 minute cold starts.
 # For a single request, the throughput is about 11 tokens/second, but there are upcoming `vLLM` optimizations to improve this.
 # The larger the batch of prompts, the higher the throughput (up to about 300 tokens/second).
 # For example, with the 60 prompts below, we can produce 30k tokens in 100 seconds.
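For context, the throughput figures in the comment above refer to batched offline generation: submitting many prompts at once lets vLLM batch them together, which is how the example moves from roughly 11 tokens/second for a single request toward about 300 tokens/second. The sketch below is not part of this PR's diff; it only illustrates how such a batch might be submitted with vLLM's offline `LLM.generate` API. The prompt texts, sampling parameters, and `tensor_parallel_size` value are assumptions for illustration; only the model name comes from the file being edited.

```python
# Minimal sketch of batched offline inference with vLLM (illustrative, not from this PR).
from vllm import LLM, SamplingParams

# Hypothetical batch of prompts; the real example in vllm_mixtral.py uses ~60 prompts.
prompts = [f"Explain concept #{i} in one short paragraph." for i in range(60)]

# Sampling settings are assumptions, not the values used in the repo.
sampling_params = SamplingParams(temperature=0.75, max_tokens=512)

# Mixtral 8x7B does not fit on a single typical GPU; tensor_parallel_size=2 is an assumption.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
)

# Submitting the whole batch at once lets vLLM's batching drive throughput
# well above the single-request rate quoted in the comment.
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text[:80])
```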