marcella-found/neuralmagic-vllm: A high-throughput and memory-efficient inference and serving engine for LLMs

Neural Magic vLLM

A fork of vLLM that adds support for running sparse models.

To Run

Clone and install magic_wand:

git clone https://github.com/neuralmagic/magic_wand.git
cd magic_wand
export TORCH_CUDA_ARCH_LIST=8.6  # set to your GPU's compute capability, e.g. 8.0 for A100
pip install -e .
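The value `8.6` above matches GPUs of compute capability 8.6 (for example A10 or RTX 3090); an A100 reports 8.0. A minimal sketch for finding the right value on your machine, assuming a CUDA-enabled PyTorch install. The helper names `arch_list_entry` and `local_arch_list` are hypothetical.

```python
"""Sketch: derive a TORCH_CUDA_ARCH_LIST entry for the local GPU."""


def arch_list_entry(capability: tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as a string such as "8.6"."""
    major, minor = capability
    return f"{major}.{minor}"


def local_arch_list() -> str:
    """Return the entry for CUDA device 0 (assumes PyTorch built with CUDA)."""
    import torch  # imported lazily so arch_list_entry() works without PyTorch

    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device is visible to PyTorch")
    # get_device_capability returns a (major, minor) tuple of ints
    return arch_list_entry(torch.cuda.get_device_capability(0))
```

Calling `print(local_arch_list())` before the `export` step gives the value to use in place of `8.6`.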

Then install this vLLM fork from the repository root:

cd ../
pip install -e .
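Before running the sample, it can help to confirm that both editable installs are visible in the active environment. A minimal standard-library sketch; the import names `vllm` and `magic_wand` are assumptions based on the repository names, and `missing_packages` is a hypothetical helper.

```python
import importlib.util

# Assumed import names for this fork and its magic_wand dependency.
REQUIRED = ["vllm", "magic_wand"]


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found on sys.path."""
    # find_spec returns None for an unknown top-level module instead of raising
    return [name for name in names if importlib.util.find_spec(name) is None]
```

`missing_packages(REQUIRED)` returns an empty list when both installs succeeded; any name it returns points at the step to repeat.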

Run Sample

Run a 50% sparse model:

from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/Llama-2-7b-pruned50-retrained",
    sparsity="sparse_w16a16",   # If omitted, the model is loaded as dense
    enforce_eager=True,         # CUDA graphs are not supported yet
    dtype="float16",
    tensor_parallel_size=1,
    max_model_len=1024,
)

# temperature=0 selects greedy decoding
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
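Since omitting `sparsity` loads the same checkpoint as dense, one way to compare the two modes is to build the `LLM` arguments in one place. A minimal sketch; the helper names `llm_kwargs` and `load_model` are hypothetical, and the other arguments are copied from the sample.

```python
from typing import Any

MODEL_ID = "nm-testing/Llama-2-7b-pruned50-retrained"


def llm_kwargs(sparse: bool) -> dict[str, Any]:
    """Build LLM(...) keyword arguments; leaving out `sparsity` loads the model dense."""
    kwargs: dict[str, Any] = {
        # Sparse mode does not work with CUDA graphs yet; kept for both modes
        # so that dense and sparse runs stay comparable.
        "enforce_eager": True,
        "dtype": "float16",
        "tensor_parallel_size": 1,
        "max_model_len": 1024,
    }
    if sparse:
        kwargs["sparsity"] = "sparse_w16a16"
    return kwargs


def load_model(sparse: bool):
    """Construct the model; requires this vLLM fork and a CUDA GPU."""
    from vllm import LLM  # imported lazily so llm_kwargs() works without vLLM

    return LLM(MODEL_ID, **llm_kwargs(sparse))
```

By default vLLM reserves most of the GPU's memory for each `LLM` instance, so it is safer to load the dense and sparse variants in separate processes rather than side by side.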
