v0.28.0
Key Features
Check out our latest Large Model Inference Containers.
LMI container
- Provided general performance optimizations.
- Added text embedding support.
- Our text embedding solution is 5% faster than the HF TEI solution.
- The Multi-LoRA feature now supports Llama3 and AWS models.
TensorRT-LLM container
- Upgraded to TensorRT-LLM 0.9.0.
- Added AWQ and FP8 support for Llama3 models on G6/P5 machines.
- The default max_new_tokens is now 16384 (see the request example below).
- Fixed critical memory leaks during long runs.
- Fixed model hanging issues.
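Clients that depend on a specific generation length can set max_new_tokens per request instead of relying on the default. A minimal sketch, assuming the usual inputs/parameters payload shape; the endpoint URL is an assumption for your deployment:

```python
# Per-request override of max_new_tokens (endpoint URL is an assumption).
import requests

resp = requests.post(
    "http://localhost:8080/invocations",
    json={
        "inputs": "Summarize TensorRT-LLM in one sentence.",
        "parameters": {"max_new_tokens": 256},  # override the 16384 default
    },
)
print(resp.json())
```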
Transformers NeuronX container
- Upgraded to Transformers NeuronX 2.18.2
DeepSpeed container (deprecated)
The DeepSpeed container is now deprecated. If you are not using the DeepSpeed engine, you can switch to the 0.28.0-lmi container and continue as before.
New Model Support
- LMI container
- Arctic, DBRX, Falcon 2, Command-R, InternLM2, Phi-3, Qwen2MoE, StableLM, StarCoder2, Xverse, and Jais
- TensorRT-LLM container
- Gemma
CX Usability Enhancements/Changes
- Model loading CX:
- The SERVING_LOAD_MODELS env var is deprecated; use HF_MODEL_ID instead.
- Inference CX:
- Input/Output schema changes:
- Speculative decoding in streaming mode now returns multiple JSON Lines tokens at each generation step.
- Standardized the output formatter signature:
- We reduced the number of parameters of output_formatter by introducing the RequestOutput class (see the sketch after this list).
- RequestOutput contains all input information, such as the text, token_ids, and parameters, as well as output information, such as output tokens, log probabilities, and other details like the finish reason. Check this doc to learn more.
- Introduced prompt details in the details field of the response for the vLLM and lmi-dist rolling batch options. These prompt details contain the prompt token_ids with their corresponding text and log probabilities. Check this doc to learn more.
- New error handling mechanism:
- Improved our error handling of container responses for rolling batch. Check this doc to learn more.
- New CX capabilities:
- We introduced the OPTION_TGI_COMPAT env var, which enables you to get the same response format as TGI.
- We also now support the SSE text/event-stream data format (see the streaming example after this list).
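To make the new output formatter signature concrete, here is a minimal custom formatter sketch. RequestOutput and the information it carries are described above, but the specific attribute names used below (input.input_text, best_sequence, and so on) are assumptions for illustration only; consult the linked schema doc for the authoritative API.

```python
# Hypothetical sketch of a custom output formatter using the new
# RequestOutput-based signature. Attribute names are assumptions for
# illustration only; see the input/output schema docs for the real API.
import json


def custom_output_formatter(request_output) -> str:
    """Turn a RequestOutput into one JSON Lines chunk per generation step."""
    # Input side: the prompt text and sampling parameters are assumed to be
    # carried on the request object.
    prompt = request_output.input.input_text            # assumed attribute
    params = request_output.input.parameters            # assumed attribute

    # Output side: the tokens generated so far plus the finish reason.
    sequence = request_output.best_sequence              # assumed attribute
    tokens = [
        {"id": t.id, "text": t.text, "log_prob": t.log_prob}
        for t in sequence.tokens                          # assumed attribute
    ]
    result = {
        "prompt": prompt,
        "parameters": params,
        "tokens": tokens,
        "finish_reason": sequence.finish_reason,          # assumed attribute
    }
    return json.dumps(result) + "\n"
```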
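For the new SSE support, a client can request text/event-stream and read events as they arrive. The sketch below uses the standard requests library; the endpoint path, payload shape, and streaming flag follow common LMI invocation conventions and should be treated as assumptions for your deployment.

```python
# Minimal SSE streaming client sketch (endpoint URL, payload shape, and the
# "stream" flag are assumptions; adjust to your deployment and schema docs).
import json
import requests

url = "http://localhost:8080/invocations"  # assumed local endpoint
payload = {
    "inputs": "What is Deep Java Library?",
    "parameters": {"max_new_tokens": 128},
    "stream": True,
}
headers = {"Accept": "text/event-stream"}

with requests.post(url, json=payload, headers=headers, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE frames arrive as "data: {...}" lines separated by blank lines.
        if line and line.startswith("data:"):
            event = json.loads(line[len("data:"):].strip())
            print(event)
```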
Breaking Changes
- Inference CX for rolling batch:
- The token ID changed from a list to an integer in the rolling batch response (see the client-side sketch after this list).
- Error handling: "finish_reason: error" is now returned during rolling batch inference.
- The DeepSpeed container has been deprecated; its functionality is now generally available in the LMI container.
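For the token ID breaking change, a defensive client can accept both the old list form and the new integer form while migrating. A rough sketch, assuming the streamed response carries token objects with an "id" field (the field name is an assumption; see the schema docs):

```python
# Sketch of client-side handling for the token id type change.
# The "id" field name is an assumption for illustration only.
def extract_token_id(token_obj: dict) -> int:
    token_id = token_obj.get("id")
    # Older responses carried a list of ids; newer responses carry one int.
    if isinstance(token_id, list):
        return token_id[0] if token_id else -1
    return token_id
```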
Known Issues
- TRTLLM periodically crashes during model compilation on A100 GPUs.
- TRTLLM AWQ quantization currently crashes due to an internal error.
Enhancements
- [serving] Avoid unnecessary copy plugin jars. by @frankfliu in #1793
- [serving] Refactor rolling batch detection logic by @frankfliu in #1781
- [serving] Uses Utils.openUrl() to download files by @frankfliu in #1810
- [docker] Install onnxruntime in docker by @frankfliu in #1790
- [lmi] infer entry point for lmi models in LmiConfigRecommender by @siddvenk in #1779
- [trt][lmi] always set mpi mode for trt container by @siddvenk in #1803
- [awscurl] Support TGI response with coral streaming by @frankfliu in #1769
- [awscurl] Fix output for server sent event content-type by @frankfliu in #1784
- [tnx] installing tiktoken and blobfile for TNX by @lanking520 in #1762
- [tnx] upgrade neuron sdk to 2.18.1 by @tosterberg in #1765
- [tnx] add torchvision for resnet tests by @tosterberg in #1768
- [tnx] adding additional neuron config options by @tosterberg in #1777
- [tnx] add vllm to container by @tosterberg in #1786
- update lora test to tp=max and variable prompt lengths by @rohithkrn in #1766
- [Neo] Add more logging to Neo Neuron partition script by @a-ys in #1802
- Migrate to pydantic v2 by @sindhuvahinis in #1764
- replace peft with open source version by @lanking520 in #1813
- use corretto java by @lanking520 in #1818
- [Java][DLC] try with CA cert update by @lanking520 in #1820
- upgrade trtllm to 0.9.0 by @lanking520 in #1795
- Remove libnccl-dev installation in trtllm container by @nskool in #1822
- [tnx] version bump Neuron SDK and Optimum by @tosterberg in #1826
- [serving] Support set arguments via env var by @frankfliu in #1829
- Creates initial benchmark suite by @zachgk in #1831
- Resolve protected_namespaces warning for pydantic by @sindhuvahinis in #1834
- [chat] Remove unused parameters by @xyang16 in #1835
- Updated s3url gpt4all lora adapter_config.json by @sindhuvahinis in #1836
- Refactor parse_input for handlers by @sindhuvahinis in #1788
- [awscurl] Adds seed and test duration by @frankfliu in #1844
- [djl-bench] Refactor benchmark for arbitrary inputs by @frankfliu in #1845
- [djl-bench] Uses Shape.parseShapes() by @frankfliu in #1849
- [Neo] Add JumpStart Integration to SM Neo Neuron AOT compilation flow by @a-ys in #1854
- update container for LMI by @lanking520 in #1814
- [tnx] Default to generation.config generation settings by @tosterberg in #1894
- [tnx] Read-only Neuron cache workaround for SM by @a-ys in #1879
- [tnx] add greedy speculative decoding by @tosterberg in #1902
- [awscurl] Adds extra parameters to dataset by @frankfliu in #1862
- [awscurl] allow write to json file by @lanking520 in #1859
- [DLC] update deps and wheel by @lanking520 in #1860
- Stop the model server when model download fails by @sindhuvahinis in #1842
- [vllm, lmi-dist] Support input log_probs by @sindhuvahinis in #1861
- Use s3 models for llama gptq by @rohithkrn in #1872
- [serving] Avoid using tab in logging by @frankfliu in #1871
- Add placeholder workflow by @ydm-amazon in #1874
- [T5] TRTLLM python repo model check by @sindhuvahinis in #1870
- add TGI compat feature for rollingbatch by @lanking520 in #1866
- Add better token handling under hazardous condition by @lanking520 in #1875
- Update TRT-LLM Container to build with Triton 24.04 by @nskool in #1878
- Remove docker daemon restarts from p4d workflow by @nskool in #1882
- [RollingBatch] allow appending for token by @lanking520 in #1885
- [RollingBatch] remove pending requests by @lanking520 in #1884
- [awscurl] Allows override stream parameter for dataset by @frankfliu in #1895
- allow TGI compat to work with output token ids by @lanking520 in #1900
- Add max num tokens workflow by @ydm-amazon in #1899
- Convert huggingface model to onnx by @xyang16 in #1888
- Improve the artifact saving by @ydm-amazon in #1904
- [TRTLLM] add gemma model support by @lanking520 in #1906
- [serving] Avoid tab in logging by @frankfliu in #1910
- Updates default management port by @zachgk in #1907
- [LMI] not sharing container info by @lanking520 in #1915
- update default for TRTLLM with max_new_tokens by @lanking520 in #1922
- Refactor tgi compat and removed token_id as list by @sindhuvahinis in #1924
- [docker] Updates aarch64 docker to use PyTorch 2.2.2 by @frankfliu in #1926
- Add request output dataclasses by @sindhuvahinis in #1921
- [trtllm] translate tgi stop to stop_sequences by @sindhuvahinis in #1930
- Add chat output formatter unit tests by @sindhuvahinis in #1932
- Move request and output_formatter out of rolling_batch.py by @sindhuvahinis in #1931
- Switches default management port by @zachgk in #1936
- [tnx] version bumps for the container by @tosterberg in #1939
- [awscurl] Adds logs for duration by @frankfliu in #1943
- Refactor output_formatter and include all parameters by @sindhuvahinis in #1941
- add open orca conversion by @lanking520 in #1942
- [dep] Update flash-attn dependency build by @maaquib in #1934
- [onnx] Search for onnx file with folder name by @xyang16 in #1950
- [tnx] add new optimum safetensors model loading and partition support by @tosterberg in #1953
- download model workflow by @lanking520 in #1954
- Add multi-token support by @ymwangg in #1889
- [cache] Update DDBLocal to 2.4.0 by @frankfliu in #1955
- [tnx] add on device generation params support by @tosterberg in #1782
- add profiling utils to check object leakage by @lanking520 in #1947
- Update shm size for trtllm test by @nskool in #1956
- [tnx] add vllm rolling batch implementation by @tosterberg in #1951
- [dep] Upgrades dependencies for lmi-dist engine by @maaquib in #1949
- [vLLM/LMI-Dist]change the default logprobs to 1 by @lanking520 in #1965
- Add vllm and lmi-dist lora awq tests by @rohithkrn in #1937
- Upgrade onnxruntime version to 1.18.0 by @xyang16 in #1967
- [python] Avoid nested logging exception by @frankfliu in #1968
- [tnx] refactor neuron testing by @tosterberg in #1971
- add a failed log check for jsonlines by @lanking520 in #1970
- [0.4.2] update the list of supported model by @lanking520 in #1901
- [vllm 0.4.2] prompt log probs changes by @sindhuvahinis in #1946
- Creates a python Session Manager by @zachgk in #1945
- [dep] Bump up transformers version to support falcon2 by @maaquib in #1972
- [dep] Update optimum version to resolve conflict with transformers by @maaquib in #1973
- [docker] Install onnxruntime custom built wheel by @xyang16 in #1974
- skip for code llama by @lanking520 in #1980
- [dep] Update optimum to latest version by @maaquib in #1981
- Add tensorrt_llm libs to LD_LIBRARY_PATH by @nskool in #1988
- [vLLM] support future arctic model FP8 quantization by @lanking520 in #1993
- [docker] Remove onnxruntime cpu jar in lmi docker by @xyang16 in #2016
Bug Fixes
- [lmi] always configure lmi model instead of just when engine is missing by @siddvenk in #1727
- [tnx] update default neuron rolling batch for correct mpi mode config by @tosterberg in #1789
- [serving] Fixes HF_MODEL_ID conflict with djl:// URL issue by @frankfliu in #1772
- [python] Exclude kv_cache*.pt from release by @frankfliu in #1771
- [wlm] Fixed engine detection logic by @frankfliu in #1828
- [tnx] fix CI for running on 2.18.1 by @tosterberg in #1830
- [IB] fixing bugs and update awscurl by @lanking520 in #1843
- [tnx] fix chat completions by @tosterberg in #1848
- fix the comments in awscurl by @lanking520 in #1852
- [fix] mpi mode check in trtllm by @sindhuvahinis in #1929
- [fix] Fix for neuron rolling batch CI error by @tosterberg in #1938
- [fix] Fix arguments null pointer by @xyang16 in #1985
- [fix] Seq2Seq model fix for lmi-dist by @maaquib in #1991
- [fix] Update vllm config path for sm by @maaquib in #2006
- fix typo in finish_reason default. by @sindhuvahinis in #1994
- [serving] Fixes logging invoke convention by @frankfliu in #1864
- add some fixes and fine print the error message by @lanking520 in #1979
- [serving] Fixes java21 compile error by @frankfliu in #1863
- fix t5 tokenizer and prompt token failures by @rohithkrn in #1966
- Fix noncritical json loading bug by @ydm-amazon in #1908
- [BugFix] Fix bug when invalid adapters are passed by @rohithkrn in #1920
- Fix Download workflow by @lanking520 in #1959
- fix default values in request io by @sindhuvahinis in #1975
- [UX] Fix a few code to get closer to TGI UX by @lanking520 in #1925
- fix bug on p4d machine closure by @lanking520 in #1940
- [tnx] fix optimum 0.0.22 integration by @tosterberg in #1952
Documentation
- [doc] Add chat completions input output schema doc by @xyang16 in #1760
- [doc][fix] fix formatting of list for mkdocs, add link to conceptual … by @siddvenk in #1763
- [doc] Fix chat completions input output schema doc by @xyang16 in #1778
- [0.27.0-DLC] update user guide for TRTLLM by @lanking520 in #1780
- Updates DJL version to 0.28.0 by @xyang16 in #1775
- removing paddlepaddle and tflite reference by @lanking520 in #1805
- [docs][lmi] update backend selection guide to recommend lmi-dist over… by @siddvenk in #1819
- [docs] Updates model server configuration document by @frankfliu in #1855
- [Doc] Update LoRA docs by @zachgk in #1898
- [docs] fix wrong hf token env var in doc by @siddvenk in #1957
- [docs][lmi] update input/output schema docs with error response details by @siddvenk in #1962
CI/CD
- [CI] add trtllm workflow pipeline by @lanking520 in #1794
- [CI][TRTLLM] setting up permission to retrieve from repo by @lanking520 in #1797
- [ci][vllm] reduce max seq len for gemma vllm by @siddvenk in #1774
- [CI][DeepSpeed Deprecation] Batch 1: deepspeed and Unmerged LoRA test removal by @lanking520 in #1796
- [DeepSpeed] batch 2 removal of deepspeed by @lanking520 in #1806
- [CI] Creates nightly benchmark by @zachgk in #1787
- [CI] fix CI in LMI container build by @lanking520 in #1808
- fix docker build by @lanking520 in #1811
- [CI] Disable rolling batch for hf acc tests by @sindhuvahinis in #1809
- Refactor properties error test cases to parameterized by @sindhuvahinis in #1770
- [wlm] Avoid using gated model for testing by @frankfliu in #1791
- set user-agent for hf requests by @siddvenk in #1800
- add fixes for huggingface accelerate CI by @lanking520 in #1804
- [serving] Fixes nightly benchmark for onnx GPU by @frankfliu in #1812
- [fix] inf2 container fix torchvision install by @tosterberg in #1815
- [tnx][ci] add resnet test w/o requirements.txt by @tosterberg in #1817
- [CI][IB] fix tests by @lanking520 in #1823
- [CI][IB] fail instance benchmark if a single command failed by @lanking520 in #1824
- fix test profiles by @lanking520 in #1832
- Creates container test runner by @zachgk in #1671
- Add back LCNC test for gptneox in trtllm by @nskool in #1838
- [tnx] add mixtral test to neuron ci by @tosterberg in #1837
- [CI] Fix test LLM client by @zachgk in #1841
- remove open llama lmi test by @ydm-amazon in #1840
- [CI] update the falcon model for TRTLLM by @lanking520 in #1853
- [CI] Fixes too many TOKENIZER in nightly bench llama by @zachgk in #1858
- [CI] restart docker service if general container kill doesn't work by @lanking520 in #1867
- [ci] Upgrades gradle to 8.5 by @frankfliu in #1868
- [CI] Update p4d CI workflow by @nskool in #1877
- [ci] Updates dependencies version by @frankfliu in #1881
- [CI] remove redundant pipeline and adding retry by @lanking520 in #1887
- [CI] update benchmark tokenizer by @lanking520 in #1896
- [CI] fix tokenizers by @lanking520 in #1903
- [IB] support benchmark matrix by @lanking520 in #1913
- [IB] fix with master branch if come from external repo by @lanking520 in #1917
- [prometheus] Avoid missing class warning by @frankfliu in #1916
- [IB] checkin other branch for benchmark by @lanking520 in #1918
- [Benchmark] dataset preparation by @lanking520 in #1891
- [IB] increase benchmark time out time by @lanking520 in #1919
- [IB] fix a bug in IB by @lanking520 in #1923
- [CI] improve lcnc by @lanking520 in #1928
- [ci] Add text embedding integration test by @xyang16 in #1909
- [ci] Updates gradle to 8.7 by @frankfliu in #1960
- [ci] Remove remote code and use latest transformers for falcon by @maaquib in #1987
- [CI] update model list by @lanking520 in #1961
- [CI] track for code not 200 errored response by @lanking520 in #1963
- [fix] inf2 pipeline fix for non-compiled models by @tosterberg in #1912
- [fix] inf2 pipeline fix for adapter unsupported by @tosterberg in #1927
- [fix] Revert extra parameters and added unit tests cases by @sindhuvahinis in #1948
- emergency fix by @lanking520 in #1876
- Fix P4D docker issues by @lanking520 in #1869
- fix sagemaker test suite by @siddvenk in #1944
New Contributors
Full Changelog: v0.27.0...v0.28.0