Releases: huggingface/optimum-nvidia

v0.1.0b8

17 Sep 13:09
761847a

Optimum-Nvidia v0.1.0 Beta 8

Highlight

  • Exporting a model is now more robust and better defined overall compared to previous versions. All parameters are now exposed through optimum.nvidia.ExportConfig
  • Brought back quantization and sparsity through the integration of Nvidia's ModelOpt
  • Added examples of quantization and sparsification recipes under examples/quantization
  • Integrated optimum-nvidia with the latest optimum-cli interface to support exporting engines without any code through optimum-cli export trtllm.
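A minimal sketch of the export path described above. Only `ExportConfig` and the `optimum-cli export trtllm` command are named in these notes; the field names (`dtype`, `max_batch_size`), the `export_config` keyword, and the model ID are illustrative assumptions, and the import is guarded so the sketch degrades gracefully without a CUDA/TensorRT-LLM environment:

```python
# Assumed ExportConfig fields; the real dataclass may differ.
config_kwargs = {"dtype": "float16", "max_batch_size": 8}

try:
    from optimum.nvidia import AutoModelForCausalLM, ExportConfig
except ImportError:  # optimum-nvidia requires a CUDA/TensorRT-LLM environment
    AutoModelForCausalLM = ExportConfig = None

if ExportConfig is not None:
    # Build an engine with explicit export parameters (names assumed above).
    export_config = ExportConfig(**config_kwargs)
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", export_config=export_config
    )

# The no-code equivalent (arguments illustrative, see
# `optimum-cli export trtllm --help`):
#   optimum-cli export trtllm meta-llama/Llama-2-7b-hf ./llama-engines
```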

Known Issues

  • ModelOpt v0.15, as integrated in optimum-nvidia, has an issue when quantizing with the AWQ scheme; this is fixed in v0.17. The dependency will be upgraded in the next release

What's Changed

Full Changelog: v0.1.0b7...v0.1.0b8

v0.1.0b7

24 May 11:24
d19ce46

Highlights

  • Mixtral models are now supported (requires a multi-gpu setup)
  • Tensor Parallelism & Pipeline Parallelism are supported on from_pretrained and pipeline through the tp=<int> and pp=<int> arguments
  • Models from transformers are now loaded in their checkpoint's data type rather than float32, avoiding most of the out-of-memory errors that occurred in 0.1.0b6
  • Intermediate TensorRT-LLM checkpoints and engines are now saved in two separate folders (checkpoints/ and engines/) to avoid issues when building multiple checkpoints from the same config.json (TP / PP setups)
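The parallelism arguments above can be sketched as follows. The `tp`/`pp` keywords on `from_pretrained` and `pipeline` come from these notes; the model ID, the parallelism degrees, and the exact `pipeline` signature are illustrative assumptions, and the import is guarded so the sketch runs without a multi-GPU setup:

```python
# Tensor / pipeline parallelism degrees (values illustrative).
parallel_kwargs = {"tp": 2, "pp": 1}

try:
    from optimum.nvidia import AutoModelForCausalLM, pipeline
except ImportError:  # needs a multi-GPU CUDA environment with optimum-nvidia
    AutoModelForCausalLM = pipeline = None

if AutoModelForCausalLM is not None:
    # Mixtral requires a multi-GPU setup: shard it across 2 GPUs with TP.
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mixtral-8x7B-v0.1", **parallel_kwargs
    )
    # The same keywords are accepted by pipeline().
    pipe = pipeline(
        "text-generation", model="mistralai/Mixtral-8x7B-v0.1", **parallel_kwargs
    )
```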

What's Changed

New Contributors

Full Changelog: v0.1.0b6...v0.1.0b7

v0.1.0b6 - Whisper, CodeGemma and QoL improvements

11 Apr 21:05
1065a8e

Highlights

Models

  • Whisper
  • CodeGemma

Quality Improvements

  • Generated outputs should now be closer to those produced by transformers

What's Changed

Full Changelog: v0.1.0b4...v0.1.0b6

v0.1.0b4

21 Mar 14:29
5ee2ff0

Highlights

  • Update to TensorRT-LLM version 03-19-2024
  • pip installation
  • Float8 quantization workflow updated and made more robust
  • Save and restore prebuilt engines from the Hugging Face Hub or locally on the machine
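The save/restore flow above might look like this, assuming the usual Hugging Face `save_pretrained`/`from_pretrained` convention. The method names, model ID, and local path are assumptions, not confirmed API from these notes, and the import is guarded so the sketch runs without a TensorRT-capable GPU:

```python
# Local directory for prebuilt engines (path illustrative).
engine_dir = "./llama-engines"

try:
    from optimum.nvidia import AutoModelForCausalLM
except ImportError:  # requires optimum-nvidia and a TensorRT-capable GPU
    AutoModelForCausalLM = None

if AutoModelForCausalLM is not None:
    # First run: export the checkpoint to a TensorRT-LLM engine, then save it.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    model.save_pretrained(engine_dir)
    # Later runs: restore the prebuilt engine instead of rebuilding it.
    model = AutoModelForCausalLM.from_pretrained(engine_dir)
```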

What's Changed

New Contributors

Full Changelog: v0.1.0b3...v0.1.0b4

Optimum-Nvidia 0.1.0b3 Release, welcome Google Gemma!

28 Feb 21:44

Highlights

  • This release brings support for Google's recently released Gemma model
  • optimum-nvidia went through a major refactor which will make it much easier to support new models and integrate the latest ones in the long run

TensorRT-LLM

  • Update underlying TensorRT-LLM dependency to b7c309d1c9baa9c030680988cb73e461f6253b98 (v0.9.0)

Known issues

  • The current float8 flow is disabled until the next release in order to support the new calibration workflow

What's Changed

New Contributors

Full Changelog: v0.1.0b2...v0.1.0b3

Optimum-Nvidia 0.1.0b2 Release, bug fix release

21 Dec 13:11
938878e

This release focuses on improving the previous one with additional test coverage, bug fixes and usability improvements.

TensorRT-LLM

Improvements

  • Generally improve unittest coverage
  • Initial documentation and updated build instructions
  • The prebuilt container now supports the Volta and Turing (experimental) architectures for V100 and T4 GPUs
  • More in-depth usage of the TensorRT-LLM runtime's Python/C++ bindings

Bug Fixes

  • Fixed an issue with pipeline returning only the first output when provided with a batch
  • Fixed an issue with bfloat16 conversion not loading weights in the right format for the TRT engine builder
  • Fixed an issue with non-multi-head attention setups where the heads were not replicated by the proper factor

Engine Builder changes

  • The RMSNorm plugin is being deprecated by Nvidia for performance reasons, so we will no longer attempt to enable it

Model Support

  • The Mistral family of models should theoretically work, but it is not yet extensively tested through our CI/CD. We plan to add official support in the next release

What's Changed

New Contributors

Full Changelog: v0.1.0b1...v0.1.0b2

0.1.0b1 - Initial Release

18 Dec 15:05
4aa44f6

This release is the first for optimum-nvidia and focuses on bringing the latest performance improvements, such as float8, for Llama-based models on the latest generation of Nvidia Tensor Core GPUs.