Releases · huggingface/optimum-nvidia
v0.1.0b8
Optimum-Nvidia v0.1.0 Beta 8
Highlights
- Exporting a model is now more robust and better defined overall compared to previous versions. All the parameters are now exposed through `optimum.nvidia.ExportConfig` (see the sketch after this list)
- Brought back quantization and sparsity through the integration of Nvidia's ModelOpt
- Added examples of quantization and sparsification recipes under `examples/quantization`
- Integrated `optimum-nvidia` with the latest `optimum-cli` interface to support exporting engines without any code through `optimum-cli export trtllm`.
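A minimal sketch of the new export path. The release only confirms that the parameters live on `optimum.nvidia.ExportConfig`; the field names and the `export_config` keyword below are assumptions, not confirmed API:

```python
from optimum.nvidia import AutoModelForCausalLM, ExportConfig

# Hypothetical field names: the release only confirms that all export
# parameters are exposed through optimum.nvidia.ExportConfig.
export_config = ExportConfig(
    dtype="float16",    # assumed field
    max_batch_size=8,   # assumed field
)

# Passing the config through an `export_config` keyword is also an assumption.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export_config=export_config,
)
```

The `optimum-cli export trtllm` command announced above is the no-code equivalent of this flow.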
Known Issues
- ModelOpt v0.15, as integrated in optimum-nvidia, has an issue when quantizing with the AWQ scheme; it is fixed in v0.17. This dependency will be upgraded in the next release.
What's Changed
- feat(package): make sure we dont have init as optimum level by @mfuntowicz in #132
- Enable trufflehog scanner CI on GA by @mfuntowicz in #136
- Enable automatic build of container at each release by @mfuntowicz in #137
- Refactor the overall Hugging Face -> TRTLLM export workflow by @mfuntowicz in #133
- feat(tests) : Update CI to use new workflow and silicon. by @mfuntowicz in #145
- move to new cluster by @glegendre01 in #150
- Bring back quantization with Nvidia ModelOpt by @mfuntowicz in #147
- (misc) disable xQA kernels for now as they seem to hang by @mfuntowicz in #152
- Add CLI quantization option by @mfuntowicz in #153
- tests(cli): uncomment out tests for CLI by @mfuntowicz in #154
- Fix license detection path by @mfuntowicz in #155
- Fix test again by @mfuntowicz in #156
- chore: remove invalid examples by @mfuntowicz in #157
- Bump version to 0.1.0b8 by @mfuntowicz in #158
- chore: update README badges by @mfuntowicz in #159
Full Changelog: v0.1.0b7...v0.1.0b8
v0.1.0b7
Highlights
- Mixtral models are now supported (requires a multi-GPU setup)
- Tensor Parallelism & Pipeline Parallelism are supported on `from_pretrained` and `pipeline` through the use of `tp=<int>`, `pp=<int>` (see the sketch after this list)
- Models from `transformers` are now loaded in their respective checkpoint data type rather than `float32`, avoiding most of the memory errors that were happening in 0.1.0b6
- Intermediate TensorRT-LLM checkpoints and engines are now saved in two different folders (`checkpoints/` and `engines/`) to avoid issues when building multiple checkpoints with the same `config.json` (TP / PP setup)
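A minimal multi-GPU loading sketch, assuming `from_pretrained` accepts the `tp` / `pp` integers exactly as named in the note above:

```python
from optimum.nvidia import AutoModelForCausalLM

# Shard the engine over two GPUs with tensor parallelism; pipeline
# parallelism stays at 1. Mixtral requires a multi-GPU setup.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    tp=2,
    pp=1,
)
```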
What's Changed
- Fix checking output limits for #114 by @zaycev in #115
- Test batched causallm inference by @fxmarty in #117
- Remove claim of Turing support by @laikhtewari in #118
- Mention important additional parameters for engine config in README by @zaycev in #113
- Update to TensorRT-LLM v0.9.0 by @mfuntowicz in #124
- Use a percentage based matching rather than exact token match for tests by @mfuntowicz in #125
- Mixtral by @mfuntowicz in #131
Full Changelog: v0.1.0b6...v0.1.0b7
v0.1.0b6 - Whisper, CodeGemma and QoL improvements
Highlights
Models
- Whisper
- CodeGemma
Quality Improvements
- Generated outputs should now be closer to the ones produced by transformers
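As an illustration only, a hypothetical way to run the newly supported Whisper through the pipelines interface; the task name and call signature below are assumed to mirror transformers' automatic-speech-recognition pipeline and are not confirmed by these notes:

```python
from optimum.nvidia.pipelines import pipeline

# Hypothetical usage: task name and call signature assumed to mirror
# transformers' ASR pipeline.
transcriber = pipeline("automatic-speech-recognition", "openai/whisper-large-v3")
print(transcriber("sample.wav"))
```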
What's Changed
- Add back the ability to build Whisper from Transformers checkpoints by @fxmarty in #101
- Fix invalid dependencies by @mfuntowicz in #104
- Whisper inference by @fxmarty in #107
- Fix quality on the main branch by @mfuntowicz in #108
- Use pinned version for huggingface-hub by @mfuntowicz in #109
- Avoid reloading available transformers config by @fxmarty in #111
- Test CausalLM generate & pipeline by @fxmarty in #110
Full Changelog: v0.1.0b4...v0.1.0b6
v0.1.0b4
Highlights
- Update to TensorRT-LLM version 03-19-2024
- pip installation
- Float8 quantization workflow updated and made more robust
- Save and restore prebuilt engines from the Hugging Face Hub or locally on the machine
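A sketch of the save/restore flow, assuming it follows the usual Hugging Face `save_pretrained` / `from_pretrained` conventions (the method names here are assumptions):

```python
from optimum.nvidia import AutoModelForCausalLM

# Build the TensorRT-LLM engine once, then persist it locally
# (save_pretrained is assumed here, mirroring the usual Hugging Face API).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model.save_pretrained("./llama-2-7b-trtllm")

# Later, on a machine with the same GPU, reload the prebuilt engine
# instead of rebuilding it from scratch.
model = AutoModelForCausalLM.from_pretrained("./llama-2-7b-trtllm")
```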
What's Changed
- Add ability to save local prebuilt engines by @mfuntowicz in #87
- Make float8 quantization back in the game. by @mfuntowicz in #92
- Fixed Repetition Penalty default value by @leopra in #66
- Update instructions for pip install by @mfuntowicz in #97
- Update to TensorRT-LLM v031224 by @mfuntowicz in #98
Full Changelog: v0.1.0b3...v0.1.0b4
Optimum-Nvidia 0.1.0b3 Release, welcome Google Gemma!
Highlights
- This release brings support for Google's recently released model, Gemma
- `optimum-nvidia` went through a major refactor which will make it much easier to support new models and integrate the latest ones in the long run
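A minimal generation sketch with the newly supported Gemma, assuming `optimum.nvidia.AutoModelForCausalLM` behaves as a drop-in for its transformers counterpart:

```python
from transformers import AutoTokenizer
from optimum.nvidia import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

# generate() is assumed to mirror the transformers API.
inputs = tokenizer("Write a haiku about GPUs.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```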
TensorRT-LLM
- Update underlying TensorRT-LLM dependency to b7c309d1c9baa9c030680988cb73e461f6253b98 (v0.9.0)
Known issues
- The current `float8` flow is disabled until the next release in order to support the new calibration workflow
What's Changed
- Bug fixes in readme. by @Anindyadeep in #63
- Bump TRTLLM to latest version #d879430 by @mfuntowicz in #65
- Ability to build Whisper encoder/decoder TRT engine by @fxmarty in #70
- Refactoring of the overall structure to better align with the new TRTLLM workflow moving forward by @mfuntowicz in #74
- Fix gemma 7b by @mfuntowicz in #77
- Update license by @mfuntowicz in #78
- Make pipelines compatible with the new workflow by @mfuntowicz in #79
- Fix repo code quality by @mfuntowicz in #80
- Bring back CI to a normal state by @mfuntowicz in #82
- Fix hardcoded embedding scale with value from config by @mfuntowicz in #85
- Make overall `optimum-nvidia` pip installable by @mfuntowicz in #83
New Contributors
- @Anindyadeep made their first contribution in #63
- @fxmarty made their first contribution in #70
Full Changelog: v0.1.0b2...v0.1.0b3
Optimum-Nvidia 0.1.0b2 Release, bug fix release
This release focuses on improving the previous one with additional test coverage, bug fixes, and usability improvements.
TensorRT-LLM
- Updated TensorRT-LLM to version f7eca56161d496cbd28e8e7689dbd90003594bd2
Improvements
- Generally improve unittest coverage
- Initial documentation and updated build instructions
- The prebuilt container now supports the Volta and Turing (experimental) architectures for V100 and T4 GPUs
- More in-depth usage of the TensorRT-LLM Runtime Python/C++ bindings
Bug Fixes
- Fixed an issue with pipeline returning only the first output when provided with a batch (illustrated in the sketch after this list)
- Fixed an issue with `bfloat16` conversion not loading weights in the right format for the TRT engine builder
- Fixed an issue with non-multi-head-attention setups where the heads were not replicated with the proper factor
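To illustrate the batching fix, a small sketch, assuming the pipelines interface mirrors transformers':

```python
from optimum.nvidia.pipelines import pipeline

pipe = pipeline("text-generation", "meta-llama/Llama-2-7b-chat-hf")

# Before #29, a batched call returned only the first item; each prompt
# should now yield its own output.
results = pipe(["What is a GPU?", "What is TensorRT-LLM?"])
assert len(results) == 2
```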
Engine Builder changes
- The RMSNorm plugin is being deprecated by Nvidia for performance reasons, so we will no longer attempt to enable it
Model Support
- The Mistral family of models should theoretically work, but it is currently not extensively tested through our CI/CD. We plan to add official support in the next release
What's Changed
- bump trt llm version to 0.6.1 by @laikhtewari in #27
- Fix issue returning only the first batch item after pipeline call. by @mfuntowicz in #29
- Update README.md by @eltociear in #31
- Missing comma in setup.py by @IlyasMoutawwakil in #19
- Quality by @mfuntowicz in #30
- Fix typo by @mfuntowicz in #40
- Update to latest trtllm f7eca56161d496cbd28e8e7689dbd90003594bd2 by @mfuntowicz in #41
- Enable more SM architectures in the prebuild docker by @mfuntowicz in #35
- Add initial set of documentation to build the `optimum-nvidia` container by @mfuntowicz in #39
- Fix caching for docker by @mfuntowicz in #15
- Initial set of unittest in CI by @mfuntowicz in #43
- Build from source instructions by @laikhtewari in #38
- Enable testing on GPUs by @mfuntowicz in #45
- Enable HF Transfer in tests by @mfuntowicz in #51
- Let's make sure to use the repeated heads tensor when in a non-mha scenario by @mfuntowicz in #48
- Bump version to 0.1.0b2 by @mfuntowicz in #53
- Add more unittest by @mfuntowicz in #52
- Disable RMSNorm plugin as deprecated for performance reasons by @mfuntowicz in #55
- Rename LLamaForCausalLM to LlamaForCausalLM to match transformers by @mfuntowicz in #54
- AutoModelForCausalLM instead of LlamaForCausalLM by @laikhtewari in #24
- Use the new runtime handled allocation by @mfuntowicz in #46
New Contributors
- @eltociear made their first contribution in #31
- @IlyasMoutawwakil made their first contribution in #19
Full Changelog: v0.1.0b1...v0.1.0b2
0.1.0b1 - Initial Release
This release is the first for `optimum-nvidia` and focuses on bringing the latest performance improvements, such as `float8`, to Llama-based models on the latest generation of Nvidia Tensor Core GPUs.
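A closing sketch of the float8 path; the `use_fp8` flag is taken from the project README and is an assumption here:

```python
from optimum.nvidia import AutoModelForCausalLM

# use_fp8 is assumed from the project README; it enables float8 execution
# on the latest generations of Nvidia Tensor Core GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    use_fp8=True,
)
```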