v0.28.0
Key Features
Check out our latest Large Model Inference Containers.
LMI container
- Provided general performance optimizations.
- Added text embedding support.
- Our text embedding solution is 5% faster than the HF TEI solution.
- The Multi-LoRA feature now supports Llama3 and AWS models.
TensorRT-LLM container
- Upgraded to TensorRT-LLM 0.9.0.
- Added AWQ and FP8 support for Llama3 models on G6/P5 machines.
- The default max_new_tokens is now 16384 (see the request example below).
- Fixed critical memory leaks during long runs.
- Fixed model hanging issues.
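Clients that depend on a specific generation length can set max_new_tokens per request instead of relying on the default. A minimal sketch, assuming the usual inputs/parameters payload shape; the endpoint URL is an assumption for your deployment:

```python
# Per-request override of max_new_tokens (endpoint URL is an assumption).
import requests

resp = requests.post(
    "http://localhost:8080/invocations",
    json={
        "inputs": "Summarize TensorRT-LLM in one sentence.",
        "parameters": {"max_new_tokens": 256},  # override the 16384 default
    },
)
print(resp.json())
```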
Transformers NeuronX container
- Upgraded to Transformers NeuronX 2.18.2
DeepSpeed container (deprecated)
The DeepSpeed container is now deprecated. If you are not using the DeepSpeed engine, you can switch to the 0.28.0-lmi container and continue as before.
New Model Support
- LMI container
- Arctic, DBRX, Falcon 2, Command-R, InternLM2, Phi-3, Qwen2MoE, StableLM, StarCoder2, Xverse, and Jais
- TensorRT-LLM container
- Gemma
CX Usability Enhancements/Changes
- Model loading CX:
- The SERVING_LOAD_MODELS env var is deprecated; use HF_MODEL_ID instead.
- Inference CX:
- Input/Output schema changes:
- Speculative decoding in streaming mode now returns multiple JSON Lines tokens at each generation step.
- Standardized the output formatter signature:
- We reduced the number of parameters of output_formatter by introducing the RequestOutput class (see the sketch after this list).
- RequestOutput contains all input information, such as the text, token_ids, and parameters, as well as output information, such as output tokens, log probabilities, and other details like the finish reason. Check this doc to learn more.
- Introduced prompt details in the details field of the response for the vLLM and lmi-dist rolling batch options. These prompt details contain the prompt token_ids with their corresponding text and log probabilities. Check this doc to learn more.
- New error handling mechanism:
- Improved our error handling of container responses for rolling batch. Check this doc to learn more.
- New CX capabilities:
- We introduced the OPTION_TGI_COMPAT env var, which enables you to get the same response format as TGI.
- We also now support the SSE text/event-stream data format (see the streaming example after this list).
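To make the new output formatter signature concrete, here is a minimal custom formatter sketch. RequestOutput and the information it carries are described above, but the specific attribute names used below (input.input_text, best_sequence, and so on) are assumptions for illustration only; consult the linked schema doc for the authoritative API.

```python
# Hypothetical sketch of a custom output formatter using the new
# RequestOutput-based signature. Attribute names are assumptions for
# illustration only; see the input/output schema docs for the real API.
import json


def custom_output_formatter(request_output) -> str:
    """Turn a RequestOutput into one JSON Lines chunk per generation step."""
    # Input side: the prompt text and sampling parameters are assumed to be
    # carried on the request object.
    prompt = request_output.input.input_text            # assumed attribute
    params = request_output.input.parameters            # assumed attribute

    # Output side: the tokens generated so far plus the finish reason.
    sequence = request_output.best_sequence              # assumed attribute
    tokens = [
        {"id": t.id, "text": t.text, "log_prob": t.log_prob}
        for t in sequence.tokens                          # assumed attribute
    ]
    result = {
        "prompt": prompt,
        "parameters": params,
        "tokens": tokens,
        "finish_reason": sequence.finish_reason,          # assumed attribute
    }
    return json.dumps(result) + "\n"
```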
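For the new SSE support, a client can request text/event-stream and read events as they arrive. The sketch below uses the standard requests library; the endpoint path, payload shape, and streaming flag follow common LMI invocation conventions and should be treated as assumptions for your deployment.

```python
# Minimal SSE streaming client sketch (endpoint URL, payload shape, and the
# "stream" flag are assumptions; adjust to your deployment and schema docs).
import json
import requests

url = "http://localhost:8080/invocations"  # assumed local endpoint
payload = {
    "inputs": "What is Deep Java Library?",
    "parameters": {"max_new_tokens": 128},
    "stream": True,
}
headers = {"Accept": "text/event-stream"}

with requests.post(url, json=payload, headers=headers, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE frames arrive as "data: {...}" lines separated by blank lines.
        if line and line.startswith("data:"):
            event = json.loads(line[len("data:"):].strip())
            print(event)
```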
Breaking Changes
- Inference CX for rolling batch:
- The token ID changed from a list to an integer in the rolling batch response (see the client-side sketch after this list).
- Error handling: "finish_reason: error" is now returned during rolling batch inference.
- The DeepSpeed container has been deprecated; its functionality is now generally available in the LMI container.
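For the token ID breaking change, a defensive client can accept both the old list form and the new integer form while migrating. A rough sketch, assuming the streamed response carries token objects with an "id" field (the field name is an assumption; see the schema docs):

```python
# Sketch of client-side handling for the token id type change.
# The "id" field name is an assumption for illustration only.
def extract_token_id(token_obj: dict) -> int:
    token_id = token_obj.get("id")
    # Older responses carried a list of ids; newer responses carry one int.
    if isinstance(token_id, list):
        return token_id[0] if token_id else -1
    return token_id
```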
Known Issues
- TRTLLM periodically crashes during model compilation on A100 GPUs.
- TRTLLM AWQ quantization currently crashes due to an internal error.
Enhancements
- [serving] Avoid unnecessary copy plugin jars. by @frankfliu in #1793
- [serving] Refactor rolling batch detection logic by @frankfliu in #1781
- [serving] Uses Utils.openUrl() to download files by @frankfliu in #1810
- [docker] Install onnxruntime in docker by @frankfliu in #1790
- [lmi] infer entry point for lmi models in LmiConfigRecommender by @siddvenk in #1779
- [trt][lmi] always set mpi mode for trt container by @siddvenk in #1803
- [awscurl] Support TGI response with coral streaming by @frankfliu in #1769
- [awscurl] Fix output for server sent event content-type by @frankfliu in #1784
- [tnx] installing tiktoken and blobfile for TNX by @lanking520 in #1762
- [tnx] upgrade neuron sdk to 2.18.1 by @tosterberg in #1765
- [tnx] add torchvision for resnet tests by @tosterberg in #1768
- [tnx] adding additional neuron config options by @tosterberg in #1777
- [tnx] add vllm to container by @tosterberg in #1786
- update lora test to tp=max and variable prompt lengths by @rohithkrn in #1766
- [Neo] Add more logging to Neo Neuron partition script by @a-ys in #1802
- Migrate to pydantic v2 by @sindhuvahinis in #1764
- replace peft with open source version by @lanking520 in #1813
- use corretto java by @lanking520 in #1818
- [Java][DLC] try with CA cert update by @lanking520 in #1820
- upgrade trtllm to 0.9.0 by @lanking520 in #1795
- Remove libnccl-dev installation in trtllm container by @nskool in #1822
- [tnx] version bump Neuron SDK and Optimum by @tosterberg in #1826
- [serving] Support set arguments via env var by @frankfliu in #1829
- Creates initial benchmark suite by @zachgk in #1831
- Resolve protected_namespaces warning for pydantic by @sindhuvahinis in #1834
- [chat] Remove unused parameters by @xyang16 in #1835
- Updated s3url gpt4all lora adapter_config.json by @sindhuvahinis in #1836
- Refactor parse_input for handlers by @sindhuvahinis in #1788
- [awscurl] Adds seed and test duration by @frankfliu in #1844
- [djl-bench] Refactor benchmark for arbitrary inputs by @frankfliu in #1845
- [djl-bench] Uses Shape.parseShapes() by @frankfliu in #1849
- [Neo] Add JumpStart Integration to SM Neo Neuron AOT compilation flow by @a-ys in #1854
- update container for LMI by @lanking520 in #1814
- [tnx] Default to generation.config generation settings by @tosterberg in #1894
- [tnx] Read-only Neuron cache workaround for SM by @a-ys in #1879
- [tnx] add greedy speculative decoding by @tosterberg in #1902
- [awscurl] Adds extra parameters to dataset by @frankfliu in #1862
- [awscurl] allow write to json file by @lanking520 in #1859
- [DLC] update deps and wheel by @lanking520 in #1860
- Stop the model server when model download fails by @sindhuvahinis in #1842
- [vllm, lmi-dist] Support input log_probs by @sindhuvahinis in #1861
- Use s3 models for llama gptq by @rohithkrn in #1872
- [serving] Avoid using tab in logging by @frankfliu in #1871
- Add placeholder workflow by @ydm-amazon in #1874
- [T5] TRTLLM python repo model check by @sindhuvahinis in #1870
- add TGI compat feature for rollingbatch by @lanking520 in #1866
- Add better token handling under hazardous condition by @lanking520 in #1875
- Update TRT-LLM Container to build with Triton 24.04 by @nskool in #1878
- Remove docker daemon restarts from p4d workflow by @nskool in #1882
- [RollingBatch] allow appending for token by @lanking520 in #1885
- [RollingBatch] remove pending requests by @lanking520 in #1884
- [awscurl] Allows override stream parameter for dataset by @frankfliu in #1895
- allow TGI compat to work with output token ids by @lanking520 in #1900
- Add max num tokens workflow by @ydm-amazon in #1899
- Convert huggingface model to onnx by @xyang16 in #1888
- Improve the artifact saving by @ydm-amazon in #1904
- [TRTLLM] add gemma model support by @lanking520 in #1906
- [serving] Avoid tab in logging by @frankfliu in #1910
- Updates default management port by @zachgk in #1907
- [LMI] not sharing container info by @lanking520 in #1915
- update default for TRTLLM with max_new_tokens by @lanking520 in #1922
- Refactor tgi compat and removed token_id as list by @sindhuvahinis in #1924
- [docker] Updates aarch64 docker to use PyTorch 2.2.2 by @frankfliu in #1926
- Add request output dataclasses by @sindhuvahinis in #1921
- [trtllm] translate tgi stop to stop_sequences by @sindhuvahinis in #1930
- Add chat output formatter unit tests by @sindhuvahinis in #1932
- Move request and output_formatter out of rolling_batch.py by @sindhuvahinis in #1931
- Switches default management port by @zachgk in #1936
- [tnx] version bumps for the container by @tosterberg in #1939
- [awscurl] Adds logs for duration by @frankfliu in #1943
- Refactor output_formatter and include all parameters by @sindhuvahinis in #1941
- add open orca conversion by @lanking520 in #1942
- [dep] Update flash-attn dependency build by @maaquib in #1934
- [onnx] Search for onnx file with folder name by @xyang16 in #1950
- [tnx] add new optimum safetensors model loading and partition support by @tosterberg in #1953
- download model workflow by @lanking520 in #1954
- Add multi-token support by @ymwangg in #1889
- [cache] Update DDBLocal to 2.4.0 by @frankfliu in #1955
- [tnx] add on device generation params support by @tosterberg in #1782
- add profiling utils to check object leakage by @lanking520 in #1947
- Update shm size for trtllm test by @nskool in #1956
- [tnx] add vllm rolling batch implementation by @tosterberg in #1951
- [dep] Upgrades dependencies for lmi-dist engine by @maaquib in #1949
- [vLLM/LMI-Dist]change the default logprobs to 1 by @lanking520 in #1965
- Add vllm and lmi-dist lora awq tests by @rohithkrn in #1937
- Upgrade onnxruntime version to 1.18.0 by @xyang16 in #1967
- [python] Avoid nested logging exception by @frankfliu in #1968
- [tnx] refactor neuron testing by @tosterberg in #1971
- add a failed log check for jsonlines by @lanking520 in #1970
- [0.4.2] update the list of supported model by @lanking520 in #1901
- [vllm 0.4.2] prompt log probs changes by @sindhuvahinis in #1946
- Creates a python Session Manager by @zachgk in #1945
- [dep] Bump up transformers version to support falcon2 by @maaquib in #1972
- [dep] Update optimum version to resolve conflict with transformers by @maaquib in #1973
- [docker] Install onnxruntime custom built wheel by @xyang16 in #1974
- skip for code llama by @lanking520 in #1980
- [dep] Update optimum to latest version by @maaquib in #1981
- Add tensorrt_llm libs to LD_LIBRARY_PATH by @nskool in #1988
- [vLLM] support future arctic model FP8 quantization by @lanking520 in #1993
- [docker] Remove onnxruntime cpu jar in lmi docker by @xyang16 in #2016
Bug Fixes
- [lmi] always configure lmi model instead of just when engine is missing by @siddvenk in #1727
- [tnx] update default neuron rolling batch for correct mpi mode config by @tosterberg in #1789
- [serving] Fixes HF_MODEL_ID conflict with djl:// URL issue by @frankfliu in #1772
- [python] Exclude kv_cache*.pt from release by @frankfliu in #1771
- [wlm] Fixed engine detection logic by @frankfliu in #1828
- [tnx] fix CI for running on 2.18.1 by @tosterberg in #1830
- [IB] fixing bugs and update awscurl by @lanking520 in #1843
- [tnx] fix chat completions by @tosterberg in #1848
- fix the comments in awscurl by @lanking520 in #1852
- [fix] mpi mode check in trtllm by @sindhuvahinis in #1929
- [fix] Fix for neuron rolling batch CI error by @tosterberg in #1938
- [fix] Fix arguments null pointer by @xyang16 in #1985
- [fix] Seq2Seq model fix for lmi-dist by @maaquib in #1991
- [fix] Update vllm config path for sm by @maaquib in #2006
- fix typo in finish_reason default. by @sindhuvahinis in #1994
- [serving] Fixes logging invoke convention by @frankfliu in #1864
- add some fixes and fine print the error message by @lanking520 in #1979
- [serving] Fixes java21 compile error by @frankfliu in #1863
- fix t5 tokenizer and prompt token failures by @rohithkrn in #1966
- Fix noncritical json loading bug by @ydm-amazon in #1908
- [BugFix] Fix bug when invalid adapters are passed by @rohithkrn in #1920
- Fix Download workflow by @lanking520 in #1959
- fix default values in request io by @sindhuvahinis in #1975
- [UX] Fix a few code to get closer to TGI UX by @lanking520 in #1925
- fix bug on p4d machine closure by @lanking520 in #1940
- [tnx] fix optimum 0.0.22 integration by @tosterberg in #1952
Documentation
- [doc] Add chat completions input output schema doc by @xyang16 in #1760
- [doc][fix] fix formatting of list for mkdocs, add link to conceptual … by @siddvenk in #1763
- [doc] Fix chat completions input output schema doc by @xyang16 in #1778
- [0.27.0-DLC] update user guide for TRTLLM by @lanking520 in #1780
- Updates DJL version to 0.28.0 by @xyang16 in #1775
- removing paddlepaddle and tflite reference by @lanking520 in #1805
- [docs][lmi] update backend selection guide to recommend lmi-dist over… by @siddvenk in #1819
- [docs] Updates model server configuration document by @frankfliu in #1855
- [Doc] Update LoRA docs by @zachgk in #1898
- [docs] fix wrong hf token env var in doc by @siddvenk in #1957
- [docs][lmi] update input/output schema docs with error response details by @siddvenk in #1962
CI/CD
- [CI] add trtllm workflow pipeline by @lanking520 in #1794
- [CI][TRTLLM] setting up permission to retrieve from repo by @lanking520 in #1797
- [ci][vllm] reduce max seq len for gemma vllm by @siddvenk in #1774
- [CI][DeepSpeed Deprecation] Batch 1: deepspeed and Unmerged LoRA test removal by @lanking520 in #1796
- [DeepSpeed] batch 2 removal of deepspeed by @lanking520 in #1806
- [CI] Creates nightly benchmark by @zachgk in #1787
- [CI] fix CI in LMI container build by @lanking520 in #1808
- fix docker build by @lanking520 in #1811
- [CI] Disable rolling batch for hf acc tests by @sindhuvahinis in #1809
- Refactor properties error test cases to parameterized by @sindhuvahinis in #1770
- [wlm] Avoid using gated model for testing by @frankfliu in #1791
- set user-agent for hf requests by @siddvenk in #1800
- add fixes for huggingface accelerate CI by @lanking520 in #1804
- [serving] Fixes nightly benchmark for onnx GPU by @frankfliu in #1812
- [fix] inf2 container fix torchvision install by @tosterberg in #1815
- [tnx][ci] add resnet test w/o requirements.txt by @tosterberg in #1817
- [CI][IB] fix tests by @lanking520 in #1823
- [CI][IB] fail instance benchmark if a single command failed by @lanking520 in #1824
- fix test profiles by @lanking520 in #1832
- Creates container test runner by @zachgk in #1671
- Add back LCNC test for gptneox in trtllm by @nskool in #1838
- [tnx] add mixtral test to neuron ci by @tosterberg in #1837
- [CI] Fix test LLM client by @zachgk in #1841
- remove open llama lmi test by @ydm-amazon in #1840
- [CI] update the falcon model for TRTLLM by @lanking520 in #1853
- [CI] Fixes too many TOKENIZER in nightly bench llama by @zachgk in #1858
- [CI] restart docker service if general container kill doesn't work by @lanking520 in #1867
- [ci] Upgrades gradle to 8.5 by @frankfliu in #1868
- [CI] Update p4d CI workflow by @nskool in #1877
- [ci] Updates dependencies version by @frankfliu in #1881
- [CI] remove redundant pipeline and adding retry by @lanking520 in #1887
- [CI] update benchmark tokenizer by @lanking520 in #1896
- [CI] fix tokenizers by @lanking520 in #1903
- [IB] support benchmark matrix by @lanking520 in #1913
- [IB] fix with master branch if come from external repo by @lanking520 in #1917
- [prometheus] Avoid missing class warning by @frankfliu in #1916
- [IB] checkin other branch for benchmark by @lanking520 in #1918
- [Benchmark] dataset preparation by @lanking520 in #1891
- [IB] increase benchmark time out time by @lanking520 in #1919
- [IB] fix a bug in IB by @lanking520 in #1923
- [CI] improve lcnc by @lanking520 in #1928
- [ci] Add text embedding integration test by @xyang16 in #1909
- [ci] Updates gradle to 8.7 by @frankfliu in #1960
- [ci] Remove remote code and use latest transformers for falcon by @maaquib in #1987
- [CI] update model list by @lanking520 in #1961
- [CI] track for code not 200 errored response by @lanking520 in #1963
- [fix] inf2 pipeline fix for non-compiled models by @tosterberg in #1912
- [fix] inf2 pipeline fix for adapter unsupported by @tosterberg in #1927
- [fix] Revert extra parameters and added unit tests cases by @sindhuvahinis in #1948
- emergency fix by @lanking520 in #1876
- Fix P4D docker issues by @lanking520 in #1869
- fix sagemaker test suite by @siddvenk in #1944
New Contributors
Full Changelog: v0.27.0...v0.28.0