This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Upstream sync 2024 07 01 #350

Merged
merged 113 commits into from Jul 3, 2024
Commits
113 commits
54348ea
[Distributed] Add send and recv helpers (#5719)
andoorve Jun 23, 2024
5ae4430
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requi…
Isotr0py Jun 24, 2024
1601d82
[doc][faq] add warning to download models for every nodes (#5783)
youkaichao Jun 24, 2024
20faeb6
[Doc] Add "Suggest edit" button to doc pages (#5789)
mgoin Jun 24, 2024
62ecd68
[Doc] Add Phi-3-medium to list of supported models (#5788)
mgoin Jun 24, 2024
ca8bc83
[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args…
CatherineSue Jun 24, 2024
91b2d1d
[ci] Remove aws template (#5757)
khluu Jun 25, 2024
21450bc
[Doc] Add notice about breaking changes to VLMs (#5818)
DarkLight1337 Jun 25, 2024
1d55e23
[Speculative Decoding] Support draft model on different tensor-parall…
wooyeonlee0 Jun 25, 2024
980c10b
[Misc] Remove useless code in cpu_worker (#5824)
DamonFool Jun 25, 2024
3b261da
[Core] Add fault tolerance for `RayTokenizerGroupPool` (#5748)
Yard1 Jun 25, 2024
8d6c12f
[doc][distributed] add both gloo and nccl tests (#5834)
youkaichao Jun 25, 2024
c3bc8c6
[CI/Build] Add unit testing for FlexibleArgumentParser (#5798)
mgoin Jun 25, 2024
a9e34b9
[Misc] Update `w4a16` `compressed-tensors` support to include `w8a16`…
dsikka Jun 25, 2024
21f69d1
[Hardware][TPU] Refactor TPU backend (#5831)
WoosukKwon Jun 25, 2024
ece7c7f
resolved
mawong-amd Jun 25, 2024
e6935bd
[Hardware][TPU] Raise errors for unsupported sampling params (#5850)
WoosukKwon Jun 25, 2024
81a21d2
[CI/Build] Add E2E tests for MLPSpeculator (#5791)
tdoublep Jun 26, 2024
f9775e9
[Bugfix] Fix assertion in NeuronExecutor (#5841)
aws-patlange Jun 26, 2024
fb41934
[Core] Refactor Worker and ModelRunner to consolidate control plane c…
stephanie-wang Jun 26, 2024
ce9da79
[Misc][Doc] Add Example of using OpenAI Server with VLM (#5832)
ywang96 Jun 26, 2024
cb364ef
[bugfix][distributed] fix shm broadcast when the queue size is full (…
youkaichao Jun 26, 2024
9744700
[Bugfix] Fix embedding to support 2D inputs (#5829)
WoosukKwon Jun 26, 2024
2f7eba7
[Bugfix][TPU] Fix KV cache size calculation (#5860)
WoosukKwon Jun 26, 2024
74952fd
[CI/Build] Refactor image test assets (#5821)
DarkLight1337 Jun 26, 2024
1d1929b
[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560)
ProExpertProg Jun 26, 2024
5095252
[Frontend] Add tokenize/detokenize endpoints (#5054)
sasha0552 Jun 26, 2024
1653293
[Hardware][TPU] Support parallel sampling & Swapping (#5855)
WoosukKwon Jun 26, 2024
e423b2c
[Bugfix][TPU] Fix CPU cache allocation (#5869)
WoosukKwon Jun 26, 2024
698f968
Support CPU inference with VSX PowerPC ISA (#5652)
ChipKerchner Jun 26, 2024
182cdaa
[doc] update usage of env var to avoid conflict (#5873)
youkaichao Jun 26, 2024
750539c
[Misc] Add example for LLaVA-NeXT (#5879)
ywang96 Jun 27, 2024
7823612
[BugFix] Fix cuda graph for MLPSpeculator (#5875)
njhill Jun 27, 2024
0844ba8
[Doc] Add note about context length in Phi-3-Vision example (#5887)
DarkLight1337 Jun 27, 2024
5855a8e
[VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted prop…
xwjiang2010 Jun 27, 2024
2102a46
[Model] Add base class for LoRA-supported models (#5018)
DarkLight1337 Jun 27, 2024
f483510
[Bugfix] Fix img_sizes Parsing in Phi3-Vision (#5888)
ywang96 Jun 27, 2024
684c441
[CI/Build] [1/3] Reorganize entrypoints tests (#5526)
DarkLight1337 Jun 27, 2024
dcb8246
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896)
DarkLight1337 Jun 27, 2024
db62aa3
[doc][misc] add note for Kubernetes users (#5916)
youkaichao Jun 27, 2024
0c7ef70
[BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` (#5…
njhill Jun 27, 2024
6e594ee
[BugFix] Fix `min_tokens` behaviour for multiple eos tokens (#5849)
njhill Jun 27, 2024
81ddde3
[CI/Build] Fix Args for `_get_logits_warper` in Sampler Test (#5922)
ywang96 Jun 27, 2024
c1d4964
[Model] Add Gemma 2 (#5908)
WoosukKwon Jun 27, 2024
209a147
[core][misc] remove logical block (#5882)
youkaichao Jun 27, 2024
4d5e0b9
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (#5932)
divakar-amd Jun 27, 2024
5f1316e
[Hardware][TPU] Optimize KV cache swapping (#5878)
WoosukKwon Jun 28, 2024
74bf88f
[VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast prope…
xwjiang2010 Jun 28, 2024
f177c04
[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU…
Isotr0py Jun 28, 2024
70af85d
[Core] Registry for processing model inputs (#5214)
DarkLight1337 Jun 28, 2024
fd59ff4
Unmark fused_moe config json file as executable (#5960)
tlrmchlsmth Jun 28, 2024
2e67191
[Hardware][Intel] OpenVINO vLLM backend (#5379)
ilya-lavrenov Jun 28, 2024
0d4c0c6
[Bugfix] Better error message for MLPSpeculator when `num_speculative…
tdoublep Jun 28, 2024
1ce7d18
[CI/Build] [2/3] Reorganize entrypoints tests (#5904)
DarkLight1337 Jun 28, 2024
4b9894c
[Distributed] Make it clear that % should not be in tensor dict keys.…
xwjiang2010 Jun 28, 2024
6664f2a
[Spec Decode] Introduce DraftModelRunner (#5799)
comaniac Jun 28, 2024
42cdb40
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931)
tlrmchlsmth Jun 28, 2024
7c1515e
[ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Sim…
robertgshaw2-neuralmagic Jun 28, 2024
9598197
[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP…
robertgshaw2-neuralmagic Jun 28, 2024
3441c30
Support Deepseek-V2 (#4650)
zwd003 Jun 28, 2024
a5ef790
[Bugfix] Only add `Attention.kv_scale` if kv cache quantization is en…
mgoin Jun 28, 2024
ccd94db
Unmark more files as executable (#5962)
tlrmchlsmth Jun 28, 2024
f49047a
[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadEr…
robertgshaw2-neuralmagic Jun 28, 2024
eeb9d99
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for …
LiuXiaoxuanPKU Jun 28, 2024
026b28e
[Bugfix][TPU] Fix TPU sampler output (#5978)
WoosukKwon Jun 29, 2024
f281c2e
[Bugfix][TPU] Fix pad slot id (#5977)
WoosukKwon Jun 29, 2024
b89416e
[Bugfix] fix missing last itl in openai completions benchmark (#5926)
mcalman Jun 29, 2024
acf1f76
[Misc] Extend vLLM Metrics logging API (#5925)
SolitaryThinker Jun 29, 2024
b9acdae
[Kernel] Add punica dimensions for Granite 3b and 8b (#5930)
joerunde Jun 29, 2024
aa72bdc
[Bugfix] Fix precisions in Gemma 1 (#5913)
WoosukKwon Jun 29, 2024
33fecd4
[Misc] Update Phi-3-Vision Example (#5981)
ywang96 Jun 29, 2024
00f60d2
[Bugfix] Support `eos_token_id` from `config.json` (#5954)
DarkLight1337 Jun 29, 2024
270105d
[Core] Optimize `SequenceStatus.is_finished` by switching to IntEnum …
Yard1 Jun 29, 2024
b22f1be
[Kernel] Raise an exception in MoE kernel if the batch size is larger…
comaniac Jun 29, 2024
aa49ffe
[ CI/Build ] Added E2E Test For Compressed Tensors (#5839)
robertgshaw2-neuralmagic Jun 29, 2024
b481fe3
[CI/Build] Add TP test for vision models (#5892)
DarkLight1337 Jun 29, 2024
47407b7
[ CI/Build ] LM Eval Harness Based CI Testing (#5838)
robertgshaw2-neuralmagic Jun 29, 2024
3d215cc
[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix…
mawong-amd Jun 29, 2024
d0b7111
[CI/Build] Temporarily Remove Phi3-Vision from TP Test (#5989)
ywang96 Jun 30, 2024
445b0d3
[CI/Build] Reuse code for checking output consistency (#5988)
DarkLight1337 Jun 30, 2024
cea9f6b
[CI/Build] [3/3] Reorganize entrypoints tests (#5966)
DarkLight1337 Jun 30, 2024
4f7381a
[ci][distributed] fix device count call
youkaichao Jun 30, 2024
3ceed36
[Frontend]: Support base64 embedding (#5935)
llmpros Jun 30, 2024
51f3e3f
[Lora] Use safetensor keys instead of adapter_config.json to find une…
rkooo567 Jun 30, 2024
9c74b00
[ CI ] Temporarily Disable Large LM-Eval Tests (#6005)
robertgshaw2-neuralmagic Jun 30, 2024
4153e58
[Misc] Fix `get_min_capability` (#5971)
dsikka Jun 30, 2024
27a711a
[ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify …
robertgshaw2-neuralmagic Jun 30, 2024
53655b2
format
robertgshaw2-neuralmagic Jul 1, 2024
07abe05
isort
robertgshaw2-neuralmagic Jul 1, 2024
0f0fec4
format
robertgshaw2-neuralmagic Jul 1, 2024
1cc7c46
updated skipping
robertgshaw2-neuralmagic Jul 1, 2024
a699814
added skipping to compressed-tensors
robertgshaw2-neuralmagic Jul 1, 2024
9a4be7f
updated
robertgshaw2-neuralmagic Jul 1, 2024
b4eec34
format
robertgshaw2-neuralmagic Jul 1, 2024
08dedd5
[misc][cuda] use nvml to avoid accidentally cuda initialization (#6007)
youkaichao Jul 1, 2024
dac4bb3
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into…
sroy745 Jul 1, 2024
87a4288
[ CI ] Re-enable Large Model LM Eval (#6031)
robertgshaw2-neuralmagic Jul 1, 2024
81e1c3e
[doc][misc] remove deprecated api server in doc (#6037)
youkaichao Jul 1, 2024
2c3c43b
[Misc] update benchmark backend for scalellm (#6018)
zhyncs Jul 1, 2024
cf4e758
[doc][misc] further lower visibility of simple api server (#6041)
youkaichao Jul 1, 2024
9c7608c
[Bugfix] Use RayActorError for older versions of Ray in RayTokenizer…
Yard1 Jul 1, 2024
1b7245f
[Bugfix] adding chunking mechanism to fused_moe to handle large input…
avshalomman Jul 1, 2024
fa05042
add FAQ doc under 'serving' (#5946)
llmpros Jul 1, 2024
484a2e3
[Bugfix][Doc] Fix Doc Formatting (#6048)
ywang96 Jul 1, 2024
99f1474
Update conftest.py
robertgshaw2-neuralmagic Jul 2, 2024
afb93b9
format
robertgshaw2-neuralmagic Jul 2, 2024
fcb4dd3
make _ImageAssets type hint py3.8 compatible
derekk-nm Jul 2, 2024
ceaf019
fixed sampler
robertgshaw2-neuralmagic Jul 2, 2024
655389d
switched to compressed tensors instrad of sparseml
robertgshaw2-neuralmagic Jul 2, 2024
206af82
formatted
robertgshaw2-neuralmagic Jul 2, 2024
cd2aa72
fix sampler test again
robertgshaw2-neuralmagic Jul 3, 2024
f43cb06
Merge branch 'main' into upstream-sync-2024-07-01
robertgshaw2-neuralmagic Jul 3, 2024
7a45bfa
Remove errant `__commit__` definition
dbarbuzzi Jul 3, 2024
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-70B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.892
- name: "exact_match,flexible-extract"
value: 0.892
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-FP8.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.756
- name: "exact_match,flexible-extract"
value: 0.752
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5 -t 1
model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.756
- name: "exact_match,flexible-extract"
value: 0.752
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Mixtral-8x7B-Instruct-v0.1.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5 -t 4
model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.616
- name: "exact_match,flexible-extract"
value: 0.632
limit: 250
num_fewshot: 5
2 changes: 2 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-large.txt
@@ -0,0 +1,2 @@
Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
2 changes: 2 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -0,0 +1,2 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8.yaml
46 changes: 46 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -0,0 +1,46 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10

usage() {
echo
echo "Runs lm eval harness on GSM8k using huggingface transformers."
echo "This pathway is intended to be used to create baselines for "
echo "our automated nm-test-accuracy workflow"
echo
echo "usage: ${0} <options>"
echo
echo " -m - huggingface stub or local directory of the model"
echo " -b - batch size to run the evaluation at"
echo " -l - limit number of samples to run"
echo " -f - number of fewshot samples to use"
echo
}

while getopts "m:b:l:f:" OPT; do
case ${OPT} in
m )
MODEL="$OPTARG"
;;
b )
BATCH_SIZE="$OPTARG"
;;
l )
LIMIT="$OPTARG"
;;
f )
FEWSHOT="$OPTARG"
;;
\? )
usage
exit 1
;;
esac
done

lm_eval --model hf \
--model_args pretrained=$MODEL,parallelize=True \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
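
For reference, a baseline for one of the small-model configs above could be reproduced with this script roughly as follows; this is a sketch rather than part of the diff, with the flags taken from the comment at the top of Meta-Llama-3-8B-Instruct.yaml and the pinned lm-eval-harness install taken from the header of this script:

# Sketch: reproduce the HF transformers baseline recorded in
# configs/Meta-Llama-3-8B-Instruct.yaml (adjust -m/-b/-l/-f for other configs).
pip install "git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10"
bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh \
  -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5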
51 changes: 51 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -0,0 +1,51 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on GSM for vllm.
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.2

usage() {
echo
echo "Runs lm eval harness on GSM8k using huggingface transformers."
echo "This pathway is intended to be used to create baselines for "
echo "our automated nm-test-accuracy workflow"
echo
echo "usage: ${0} <options>"
echo
echo " -m - huggingface stub or local directory of the model"
echo " -b - batch size to run the evaluation at"
echo " -l - limit number of samples to run"
echo " -f - number of fewshot samples to use"
echo " -t - tensor parallel size to run at"
echo
}

while getopts "m:b:l:f:t:" OPT; do
case ${OPT} in
m )
MODEL="$OPTARG"
;;
b )
BATCH_SIZE="$OPTARG"
;;
l )
LIMIT="$OPTARG"
;;
f )
FEWSHOT="$OPTARG"
;;
t )
TP_SIZE="$OPTARG"
;;
\? )
usage
exit 1
;;
esac
done

lm_eval --model vllm \
--model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
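
A hedged sketch of how the vLLM baseline path might be invoked for the FP8 checkpoint: the model and the -b/-l/-f/-t values come from the comment in Meta-Llama-3-8B-Instruct-FP8.yaml, and routing it through this script (rather than the HF baseline script) follows the header note that HF does not support fp8.

# Sketch: compute a vLLM baseline for the FP8 model, which the HF path does not support.
pip install lm-eval==0.4.2
bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh \
  -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1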
59 changes: 59 additions & 0 deletions .buildkite/lm-eval-harness/run-tests.sh
@@ -0,0 +1,59 @@
#!/bin/bash

usage() {
echo
echo "Runs lm eval harness on GSM8k using vllm and compares to "
echo "precomputed baseline (measured by HF transformers.)"
echo
echo "usage: ${0} <options>"
echo
echo " -c - path to the test data config (e.g. configs/small-models.txt)"
echo " -t - tensor parallel size"
echo
}

SUCCESS=0

while getopts "c:t:" OPT; do
case ${OPT} in
c )
CONFIG="$OPTARG"
;;
t )
TP_SIZE="$OPTARG"
;;
\? )
usage
exit 1
;;
esac
done

# Parse list of configs.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < $CONFIG

for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
do
LOCAL_SUCCESS=0

echo "=== RUNNING MODEL: $MODEL_CONFIG WITH TP SIZE: $TP_SIZE==="

export LM_EVAL_TEST_DATA_FILE=$PWD/configs/${MODEL_CONFIG}
export LM_EVAL_TP_SIZE=$TP_SIZE
pytest -s test_lm_eval_correctness.py || LOCAL_SUCCESS=$?

if [[ $LOCAL_SUCCESS == 0 ]]; then
echo "=== PASSED MODEL: ${MODEL_CONFIG} ==="
else
echo "=== FAILED MODEL: ${MODEL_CONFIG} ==="
fi

SUCCESS=$((SUCCESS + LOCAL_SUCCESS))

done

if [ "${SUCCESS}" -eq "0" ]; then
exit 0
else
exit 1
fi
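
The new Buildkite steps added later in this diff drive this script; a local run of the small-model suite would look roughly like the following (a sketch based on the LM Eval Small Models step in .buildkite/test-pipeline.yaml):

# Sketch: run the small-model correctness suite the same way the CI step does.
cd .buildkite/lm-eval-harness
pip install lm-eval
export VLLM_WORKER_MULTIPROC_METHOD=spawn
bash ./run-tests.sh -c configs/models-small.txt -t 1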
54 changes: 54 additions & 0 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -0,0 +1,54 @@
"""
LM eval harness on model to compare vs HF baseline computed offline.
Configs are found in configs/$MODEL.yaml

* export LM_EVAL_TEST_DATA_FILE=configs/Meta-Llama-3-70B-Instruct.yaml
* export LM_EVAL_TP_SIZE=4
* pytest -s test_lm_eval_correctness.py
"""

import os
from pathlib import Path

import lm_eval
import numpy
import yaml

RTOL = 0.02
TEST_DATA_FILE = os.environ.get(
"LM_EVAL_TEST_DATA_FILE",
".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")

TP_SIZE = os.environ.get("LM_EVAL_TP_SIZE", 1)


def launch_lm_eval(eval_config):
model_args = f"pretrained={eval_config['model_name']}," \
f"tensor_parallel_size={TP_SIZE}"

results = lm_eval.simple_evaluate(
model="vllm",
model_args=model_args,
tasks=[task["name"] for task in eval_config["tasks"]],
num_fewshot=eval_config["num_fewshot"],
limit=eval_config["limit"],
batch_size="auto")

return results


def test_lm_eval_correctness():
eval_config = yaml.safe_load(
Path(TEST_DATA_FILE).read_text(encoding="utf-8"))

# Launch eval requests.
results = launch_lm_eval(eval_config)

# Confirm scores match ground truth.
for task in eval_config["tasks"]:
for metric in task["metrics"]:
ground_truth = metric["value"]
measured_value = results["results"][task["name"]][metric["name"]]
print(f'{task["name"]} | {metric["name"]}: '
f'ground_truth={ground_truth} | measured={measured_value}')
assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
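
A single config can also be exercised directly through pytest, following the environment variables documented in the module docstring; the 8B config and TP size 1 below are illustrative choices, not mandated by the diff.

# Sketch: run one config against its precomputed baseline.
cd .buildkite/lm-eval-harness
export LM_EVAL_TEST_DATA_FILE=$PWD/configs/Meta-Llama-3-8B-Instruct.yaml
export LM_EVAL_TP_SIZE=1
pytest -s test_lm_eval_correctness.py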
14 changes: 14 additions & 0 deletions .buildkite/run-openvino-test.sh
@@ -0,0 +1,14 @@
# This script builds the OpenVINO docker image and runs the offline inference inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -ex

# Try building the docker image
docker build -t openvino-test -f Dockerfile.openvino .

# Setup cleanup
remove_docker_container() { docker rm -f openvino-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/vllm/examples/offline_inference.py
52 changes: 37 additions & 15 deletions .buildkite/test-pipeline.yaml
@@ -1,7 +1,10 @@
# In this file, you can add more tests to run either by adding a new step or
# adding a new command to an existing step. See different options here for examples.
# This script will be feed into Jinja template in `test-template-aws.j2` to generate
# the final pipeline yaml file.

# This script will be feed into Jinja template in `test-template-aws.j2` at
# https://github.com/vllm-project/buildkite-ci/blob/main/scripts/test-template-aws.j2
# to generate the final pipeline yaml file.


steps:
- label: Regression Test
@@ -24,7 +27,9 @@ steps:

- label: Core Test
mirror_hardwares: [amd]
command: pytest -v -s core
commands:
- pytest -v -s core
- pytest -v -s distributed/test_parallel_state.py

- label: Distributed Comm Ops Test
#mirror_hardwares: [amd]
@@ -39,19 +44,21 @@ steps:
working_dir: "/vllm-workspace/tests"
num_gpus: 2
commands:
# FIXIT: find out which code initialize cuda before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- bash ../.buildkite/download-images.sh
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=llava-hf/llava-1.5-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_multimodal_broadcast.py
- TEST_DIST_MODEL=microsoft/Phi-3-vision-128k-instruct DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_multimodal_broadcast.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
- pytest -v -s spec_decode/e2e/test_integration_dist.py
- TEST_DIST_MODEL=llava-hf/llava-1.5-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_multimodal_broadcast.py
- TEST_DIST_MODEL=microsoft/Phi-3-vision-128k-instruct DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_multimodal_broadcast.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s distributed/test_utils.py

@@ -60,14 +67,12 @@ steps:
working_dir: "/vllm-workspace/tests"
num_gpus: 4
commands:
# FIXIT: find out which code initialize cuda before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s distributed/test_pynccl.py
# We want to test that models which use 2 GPUs work with 4 GPUs, which is why we duplicate them here.
# See https://github.com/vllm-project/vllm/pull/5473#issuecomment-2166601837 for context.
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py

- label: Engine Test
mirror_hardwares: [amd]
@@ -77,8 +82,8 @@ steps:
mirror_hardwares: [amd]

commands:
- pytest -v -s entrypoints -m llm
- pytest -v -s entrypoints -m openai
- pytest -v -s entrypoints/llm
- pytest -v -s entrypoints/openai

- label: Examples Test
working_dir: "/vllm-workspace/examples"
@@ -186,6 +191,22 @@ steps:
- pip install aiohttp
- bash run-benchmarks.sh

- label: LM Eval Small Models
working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
commands:
- pip install lm-eval
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- bash ./run-tests.sh -c configs/models-small.txt -t 1

- label: LM Eval Large Models
gpu: a100
num_gpus: 4
working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
commands:
- pip install lm-eval
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- bash ./run-tests.sh -c configs/models-large.txt -t 4

- label: Documentation Build
working_dir: "/vllm-workspace/test_docs/docs"
no_gpu: True
@@ -197,11 +218,12 @@ steps:
gpu: a100
num_gpus: 4
commands:
# FIXIT: find out which code initialize cuda before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
# NOTE: don't test llama model here, it seems hf implementation is buggy
# see https://github.com/vllm-project/vllm/pull/5689 for details
- pytest -v -s distributed/test_custom_all_reduce.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.0.5/flashinfer-0.0.5+cu121torch2.3-cp310-cp310-linux_x86_64.whl
- VLLM_ATTENTION_BACKEND=FLASHINFER TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- VLLM_ATTENTION_BACKEND=FLASHINFER TEST_DIST_MODEL=meta-llama/Meta-Llama-3-8B DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- pytest -v -s -x lora/test_mixtral.py