
v0.28.0

Released by @tosterberg on 19 Jun, 21:16

Key Features

Check out our latest Large Model Inference Containers.

LMI container

  • General performance optimizations.
  • Added text embedding support (see the example sketch after this list).
    • Our text embedding solution is 5% faster than the Hugging Face TEI solution.
  • The Multi-LoRA feature now supports Llama3 and AWS models.

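For illustration, the minimal sketch below sends a text embedding request to a locally running LMI container. It assumes djl-serving is listening on localhost:8080 with a model registered as my-embedding-model; the payload and response shapes shown are assumptions for illustration, so consult the LMI text embedding documentation for the authoritative schema.

```python
# Minimal sketch: query a text embedding model served by the LMI container.
# Assumptions: djl-serving listens on localhost:8080 and the model is
# registered as "my-embedding-model"; the {"inputs": ...} payload and the
# response shape are illustrative, not the authoritative schema.
import requests

resp = requests.post(
    "http://localhost:8080/predictions/my-embedding-model",
    json={"inputs": ["What is Deep Java Library?"]},
    timeout=60,
)
resp.raise_for_status()
embeddings = resp.json()
print(f"received {len(embeddings)} embedding(s)")
```
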
TensorRT-LLM container

  • Upgraded to TensorRT-LLM 0.9.0
  • Added AWQ and FP8 support for Llama3 models on G6/P5 machines
  • The default max_new_tokens is now 16384
  • Fixed critical memory leaks during long-running inference.
  • Fixed model hanging issues.

Transformers NeuronX container

  • Upgraded to Transformers NeuronX 2.18.2

DeepSpeed container (deprecated)

The DeepSpeed container is now deprecated. If you are not using the DeepSpeed engine, the 0.28.0-lmi container provides everything you need, and you can continue using it as before.

New Model Support

  • LMI container
    • Arctic, DBRX, Falcon 2, Command-R, InternLM2, Phi-3, Qwen2MoE, StableLM, StarCoder2, Xverse, and Jais
  • TensorRT-LLM container
    • Gemma

CX Usability Enhancements/Changes

  • Model loading CX:
    • The SERVING_LOAD_MODELS environment variable is deprecated; use HF_MODEL_ID instead (see the deployment sketch after this list).
  • Inference CX:
    • Input/Output schema changes:
      • Speculative decoding is now supported in streaming mode and returns multiple JSON Lines tokens at each generation step.
      • Standardized the output formatter signature (see the formatter sketch after this list):
        • We reduced the number of parameters passed to output_formatter by introducing the RequestOutput class.
        • RequestOutput contains all input information, such as the text, token_ids, and parameters, as well as output information, such as the output tokens, log probabilities, and other details like the finish reason. Check this doc to learn more.
        • Introduced prompt details in the details of the response for the vLLM and lmi-dist rolling batch options. These prompt details contain the prompt token_ids along with their corresponding text and log probabilities. Check this doc to learn more.
    • New error handling mechanism:
      • Improved error handling of container responses for rolling batch. Check this doc to learn more.
    • New CX capability:
      • We introduced the OPTION_TGI_COMPAT environment variable, which enables you to get the same response format as TGI (see the deployment sketch after this list).
      • We also now support the SSE text/event-stream data format.
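
As a hedged sketch of the model loading and TGI-compatibility options above, the example below deploys the 0.28.0 LMI container on SageMaker, specifying the model through HF_MODEL_ID and enabling TGI-style responses through OPTION_TGI_COMPAT. The image URI, IAM role, model id, and instance type are placeholders; check the LMI configuration documentation for the exact values for your account and region.

```python
# Sketch: deploy the 0.28.0 LMI container on SageMaker, loading the model via
# HF_MODEL_ID (replacing the deprecated SERVING_LOAD_MODELS) and enabling
# TGI-compatible responses via OPTION_TGI_COMPAT.
# The image URI, IAM role, model id, and instance type are placeholders.
from sagemaker.model import Model

model = Model(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/djl-inference:0.28.0-lmi",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",  # model to load at startup
        "OPTION_TGI_COMPAT": "true",                            # TGI-style responses (value assumed)
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)
```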
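
The standardized output formatter now takes a single RequestOutput argument instead of the previous multi-parameter signature. The sketch below is purely illustrative: the attributes accessed on request_output are assumptions, so refer to the output formatter documentation for the actual RequestOutput fields.

```python
# Hypothetical custom output formatter under the new single-argument
# signature. The attributes accessed below (best_sequence, text,
# finish_reason) are illustrative assumptions, not the documented
# RequestOutput API.
import json


def custom_output_formatter(request_output):
    # RequestOutput bundles the input (text, token_ids, parameters) with the
    # generated output (tokens, log probabilities, finish reason, ...).
    sequence = request_output.best_sequence        # assumed attribute
    chunk = {
        "generated_text": sequence.text,           # assumed attribute
        "finish_reason": sequence.finish_reason,   # assumed attribute
    }
    return json.dumps(chunk) + "\n"
```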

Breaking Changes

  • Inference CX for rolling batch:
    • The token id changed from a list to an integer in the rolling batch response (see the sketch below).
    • Error handling: "finish_reason: error" is returned during rolling batch inference when a request fails.
  • The DeepSpeed container has been deprecated; its functionality is now generally available in the LMI container.
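
To illustrate the rolling batch changes above, the sketch below consumes a streamed JSON Lines response and reflects the new shape: each token id is a single integer rather than a list, and a request that fails mid-stream reports finish_reason set to "error". The endpoint path, request payload, and field layout are assumptions for illustration; see the input/output schema documentation for the authoritative format.

```python
# Sketch: consume a streamed (JSON Lines) rolling batch response, reflecting
# the 0.28.0 breaking changes: token ids are single integers (previously
# lists), and failed requests report finish_reason == "error".
# The endpoint, payload, and field names are illustrative assumptions.
import json
import requests

with requests.post(
    "http://localhost:8080/predictions/my-llm",
    json={
        "inputs": "What is Deep Java Library?",
        "parameters": {"max_new_tokens": 64},
        "stream": True,
    },
    stream=True,
    timeout=300,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        token = chunk.get("token", {})
        # "id" is now a single integer per token; with speculative decoding a
        # generation step may emit several token objects in one line.
        print(token.get("id"), token.get("text"))
        if chunk.get("details", {}).get("finish_reason") == "error":
            print("request failed during generation")
            break
```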

Known Issues

  • TRTLLM periodically crashes during model compilation when using A100 GPUs
  • TRTLLM AWQ quantization currently crashes due to an internal error

Enhancements

Bug Fixes

Documentation

  • [doc] Add chat completions input output schema doc by @xyang16 in #1760
  • [doc][fix] fix formatting of list for mkdocs, add link to conceptual … by @siddvenk in #1763
  • [doc] Fix chat completions input output schema doc by @xyang16 in #1778
  • [0.27.0-DLC] update user guide for TRTLLM by @lanking520 in #1780
  • Updates DJL version to 0.28.0 by @xyang16 in #1775
  • removing paddlepaddle and tflite reference by @lanking520 in #1805
  • [docs][lmi] update backend selection guide to recommend lmi-dist over… by @siddvenk in #1819
  • [docs] Updates model server configuration document by @frankfliu in #1855
  • [Doc] Update LoRA docs by @zachgk in #1898
  • [docs] fix wrong hf token env var in doc by @siddvenk in #1957
  • [docs][lmi] update input/output schema docs with error response details by @siddvenk in #1962

CI/CD

New Contributors

Full Changelog: v0.27.0...v0.28.0