Releases · huggingface/optimum-nvidia
v0.1.0b8
Optimum-Nvidia v0.1.0 Beta 8
Highlights
- Exporting a model is now more robust and better defined overall compared to previous versions. All the parameters are now exposed through `optimum.nvidia.ExportConfig` (see the sketch after this list)
- Brought back quantization and sparsity through the integration of Nvidia's ModelOpt
- Added examples of quantization and sparsification recipes under `examples/quantization`
- Integrated `optimum-nvidia` with the latest `optimum-cli` interface to support exporting engines without any code through `optimum-cli export trtllm`.
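A minimal sketch of the new export path. The release only confirms that the parameters live on `optimum.nvidia.ExportConfig`; the field names and the `export_config` keyword below are assumptions, not confirmed API:

```python
from optimum.nvidia import AutoModelForCausalLM, ExportConfig

# Hypothetical field names: the release only confirms that all export
# parameters are exposed through optimum.nvidia.ExportConfig.
export_config = ExportConfig(
    dtype="float16",    # assumed field
    max_batch_size=8,   # assumed field
)

# Passing the config through an `export_config` keyword is also an assumption.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export_config=export_config,
)
```

The `optimum-cli export trtllm` command announced above is the no-code equivalent of this flow.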
Known Issues
- ModelOpt v0.15, as integrated in optimum-nvidia, has an issue when quantizing with the AWQ scheme; it is fixed in v0.17. This dependency will be upgraded in the next release.
What's Changed
- feat(package): make sure we dont have init as optimum level by @mfuntowicz in #132
- Enable trufflehog scanner CI on GA by @mfuntowicz in #136
- Enable automatic build of container at each release by @mfuntowicz in #137
- Refactor the overall Hugging Face -> TRTLLM export workflow by @mfuntowicz in #133
- feat(tests) : Update CI to use new workflow and silicon. by @mfuntowicz in #145
- move to new cluster by @glegendre01 in #150
- Bring back quantization with Nvidia ModelOpt by @mfuntowicz in #147
- (misc) disable xQA kernels for now as they seem to hang by @mfuntowicz in #152
- Add CLI quantization option by @mfuntowicz in #153
- tests(cli): uncomment out tests for CLI by @mfuntowicz in #154
- Fix license detection path by @mfuntowicz in #155
- Fix test again by @mfuntowicz in #156
- chore: remove invalid examples by @mfuntowicz in #157
- Bump version to 0.1.0b8 by @mfuntowicz in #158
- chore: update README badges by @mfuntowicz in #159
Full Changelog: v0.1.0b7...v0.1.0b8
v0.1.0b7
Highlights
- Mixtral models are now supported (requires a multi-GPU setup)
- Tensor Parallelism & Pipeline Parallelism are supported on `from_pretrained` and `pipeline` through the use of `tp=<int>`, `pp=<int>` (see the sketch after this list)
- Models from `transformers` are now loaded in their respective checkpoint data type rather than `float32`, avoiding most of the memory errors that were happening in 0.1.0b6
- Intermediate TensorRT-LLM checkpoints and engines are now saved in two different folders (`checkpoints/` and `engines/`) to avoid issues when building multiple checkpoints with the same `config.json` (TP / PP setup)
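A minimal multi-GPU loading sketch, assuming `from_pretrained` accepts the `tp` / `pp` integers exactly as named in the note above:

```python
from optimum.nvidia import AutoModelForCausalLM

# Shard the engine over two GPUs with tensor parallelism; pipeline
# parallelism stays at 1. Mixtral requires a multi-GPU setup.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    tp=2,
    pp=1,
)
```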
What's Changed
- Fix checking output limits for #114 by @zaycev in #115
- Test batched causallm inference by @fxmarty in #117
- Remove claim of Turing support by @laikhtewari in #118
- Mention important additional parameters for engine config in README by @zaycev in #113
- Update to TensorRT-LLM v0.9.0 by @mfuntowicz in #124
- Use a percentage based matching rather than exact token match for tests by @mfuntowicz in #125
- Mixtral by @mfuntowicz in #131
Full Changelog: v0.1.0b6...v0.1.0b7
v0.1.0b6 - Whisper, CodeGemma and QoL improvements
Highlights
Models
- Whisper
- CodeGemma
Quality Improvements
- Generated outputs should now be closer to the ones produced by transformers
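As an illustration only, a hypothetical way to run the newly supported Whisper through the pipelines interface; the task name and call signature below are assumed to mirror transformers' automatic-speech-recognition pipeline and are not confirmed by these notes:

```python
from optimum.nvidia.pipelines import pipeline

# Hypothetical usage: task name and call signature assumed to mirror
# transformers' ASR pipeline.
transcriber = pipeline("automatic-speech-recognition", "openai/whisper-large-v3")
print(transcriber("sample.wav"))
```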
What's Changed
- Add back the ability to build Whisper from Transformers checkpoints by @fxmarty in #101
- Fix invalid dependencies by @mfuntowicz in #104
- Whisper inference by @fxmarty in #107
- Fix quality on the main branch by @mfuntowicz in #108
- Use pinned version for huggingface-hub by @mfuntowicz in #109
- Avoid reloading available transformers config by @fxmarty in #111
- Test CausalLM generate & pipeline by @fxmarty in #110
Full Changelog: v0.1.0b4...v0.1.0b6
v0.1.0b4
Highlights
- Update to TensorRT-LLM version 03-19-2024
- pip installation
- Float8 quantization workflow updated and made more robust
- Save and restore prebuilt engines from the Hugging Face Hub or locally on the machine
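A sketch of the save/restore flow, assuming it follows the usual Hugging Face `save_pretrained` / `from_pretrained` conventions (the method names here are assumptions):

```python
from optimum.nvidia import AutoModelForCausalLM

# Build the TensorRT-LLM engine once, then persist it locally
# (save_pretrained is assumed here, mirroring the usual Hugging Face API).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model.save_pretrained("./llama-2-7b-trtllm")

# Later, on a machine with the same GPU, reload the prebuilt engine
# instead of rebuilding it from scratch.
model = AutoModelForCausalLM.from_pretrained("./llama-2-7b-trtllm")
```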
What's Changed
- Add ability to save local prebuilt engines by @mfuntowicz in #87
- Make float8 quantization back in the game. by @mfuntowicz in #92
- Fixed Repetition Penalty default value by @leopra in #66
- Update instructions for pip install by @mfuntowicz in #97
- Update to TensorRT-LLM v031224 by @mfuntowicz in #98
Full Changelog: v0.1.0b3...v0.1.0b4
Optimum-Nvidia 0.1.0b3 Release, welcome Google Gemma!
Highlights
- This release brings support for Google's recently released model, Gemma
- `optimum-nvidia` went through a major refactor which will make it much easier to support new models and integrate the latest ones in the long run
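A minimal generation sketch with the newly supported Gemma, assuming `optimum.nvidia.AutoModelForCausalLM` behaves as a drop-in for its transformers counterpart:

```python
from transformers import AutoTokenizer
from optimum.nvidia import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

# generate() is assumed to mirror the transformers API.
inputs = tokenizer("Write a haiku about GPUs.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```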
TensorRT-LLM
- Update underlying TensorRT-LLM dependency to b7c309d1c9baa9c030680988cb73e461f6253b98 (v0.9.0)
Known issues
- The current `float8` flow is disabled until the next release in order to support the new calibration workflow
What's Changed
- Bug fixes in readme. by @Anindyadeep in #63
- Bump TRTLLM to latest version #d879430 by @mfuntowicz in #65
- Ability to build Whisper encoder/decoder TRT engine by @fxmarty in #70
- Refactoring of the overall structure to better align with the new TRTLLM workflow moving forward by @mfuntowicz in #74
- Fix gemma 7b by @mfuntowicz in #77
- Update license by @mfuntowicz in #78
- Make pipelines compatible with the new workflow by @mfuntowicz in #79
- Fix repo code quality by @mfuntowicz in #80
- Bring back CI to a normal state by @mfuntowicz in #82
- Fix hardcoded embedding scale with value from config by @mfuntowicz in #85
- Make overall `optimum-nvidia` pip installable by @mfuntowicz in #83
New Contributors
- @Anindyadeep made their first contribution in #63
- @fxmarty made their first contribution in #70
Full Changelog: v0.1.0b2...v0.1.0b3
Optimum-Nvidia 0.1.0b2 Release, bug fix release
This release focuses on improving the previous one with additional test coverage, bug fixes, and usability improvements.
TensorRT-LLM
- Updated TensorRT-LLM to version f7eca56161d496cbd28e8e7689dbd90003594bd2
Improvements
- Generally improve unittest coverage
- Initial documentation and updated build instructions
- The prebuilt container now supports the Volta and Turing (experimental) architectures for V100 and T4 GPUs
- More in-depth usage of the TensorRT-LLM Runtime Python/C++ bindings
Bug Fixes
- Fixed an issue with pipeline returning only the first output when provided with a batch (illustrated in the sketch after this list)
- Fixed an issue with `bfloat16` conversion not loading weights in the right format for the TRT engine builder
- Fixed an issue with non-multi-head-attention setups where the heads were not replicated with the proper factor
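To illustrate the batching fix, a small sketch, assuming the pipelines interface mirrors transformers':

```python
from optimum.nvidia.pipelines import pipeline

pipe = pipeline("text-generation", "meta-llama/Llama-2-7b-chat-hf")

# Before #29, a batched call returned only the first item; each prompt
# should now yield its own output.
results = pipe(["What is a GPU?", "What is TensorRT-LLM?"])
assert len(results) == 2
```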
Engine Builder changes
- The RMSNorm plugin is being deprecated by Nvidia for performance reasons, so we will no longer attempt to enable it
Model Support
- The Mistral family of models should theoretically work, but it is currently not extensively tested through our CI/CD. We plan to add official support in the next release
What's Changed
- bump trt llm version to 0.6.1 by @laikhtewari in #27
- Fix issue returning only the first batch item after pipeline call. by @mfuntowicz in #29
- Update README.md by @eltociear in #31
- Missing comma in setup.py by @IlyasMoutawwakil in #19
- Quality by @mfuntowicz in #30
- Fix typo by @mfuntowicz in #40
- Update to latest trtllm f7eca56161d496cbd28e8e7689dbd90003594bd2 by @mfuntowicz in #41
- Enable more SM architectures in the prebuild docker by @mfuntowicz in #35
- Add initial set of documentation to build the `optimum-nvidia` container by @mfuntowicz in #39
- Fix caching for docker by @mfuntowicz in #15
- Initial set of unittest in CI by @mfuntowicz in #43
- Build from source instructions by @laikhtewari in #38
- Enable testing on GPUs by @mfuntowicz in #45
- Enable HF Transfer in tests by @mfuntowicz in #51
- Let's make sure to use the repeated heads tensor when in a non-mha scenario by @mfuntowicz in #48
- Bump version to 0.1.0b2 by @mfuntowicz in #53
- Add more unittest by @mfuntowicz in #52
- Disable RMSNorm plugin as deprecated for performance reasons by @mfuntowicz in #55
- Rename LLamaForCausalLM to LlamaForCausalLM to match transformers by @mfuntowicz in #54
- AutoModelForCausalLM instead of LlamaForCausalLM by @laikhtewari in #24
- Use the new runtime handled allocation by @mfuntowicz in #46
New Contributors
- @eltociear made their first contribution in #31
- @IlyasMoutawwakil made their first contribution in #19
Full Changelog: v0.1.0b1...v0.1.0b2
0.1.0b1 - Initial Release
This release is the first for `optimum-nvidia` and focuses on bringing the latest performance improvements, such as `float8`, to Llama-based models on the latest generation of Nvidia Tensor Core GPUs.
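A closing sketch of the float8 path; the `use_fp8` flag is taken from the project README and is an assumption here:

```python
from optimum.nvidia import AutoModelForCausalLM

# use_fp8 is assumed from the project README; it enables float8 execution
# on the latest generations of Nvidia Tensor Core GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    use_fp8=True,
)
```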