Optimum-Nvidia 0.1.0b2 (bug-fix release)
This release focuses on improving the previous one with additional test coverage, bug fixes, and usability improvements.
TensorRT-LLM
- Updated TensorRT-LLM to version f7eca56161d496cbd28e8e7689dbd90003594bd2
Improvements
- Generally improved unit test coverage
- Initial documentation and updated build instructions
- The prebuilt container now supports the Volta and Turing (experimental) architectures, covering V100 and T4 GPUs
- More in-depth usage of the TensorRT-LLM runtime's Python bindings to the C++ runtime
Bug Fixes
- Fixed an issue where the pipeline returned only the first output when given a batch
- Fixed an issue with `bfloat16` conversion not loading weights in the right format for the TRT engine builder
- Fixed an issue with non-multi-head-attention setups (e.g. grouped-query attention) where the key/value heads were not replicated by the proper factor
Engine Builder changes
- The RMSNorm plugin is being deprecated by Nvidia for performance reasons, so we no longer attempt to enable it
Model Support
- The Mistral family of models should work in theory, but it is not yet extensively tested in our CI/CD. We plan to add official support in the next release
What's Changed
- bump trt llm version to 0.6.1 by @laikhtewari in #27
- Fix issue returning only the first batch item after pipeline call. by @mfuntowicz in #29
- Update README.md by @eltociear in #31
- Missing comma in setup.py by @IlyasMoutawwakil in #19
- Quality by @mfuntowicz in #30
- Fix typo by @mfuntowicz in #40
- Update to latest trtllm f7eca56161d496cbd28e8e7689dbd90003594bd2 by @mfuntowicz in #41
- Enable more SM architectures in the prebuild docker by @mfuntowicz in #35
- Add initial set of documentation to build the `optimum-nvidia` container by @mfuntowicz in #39
- Fix caching for docker by @mfuntowicz in #15
- Initial set of unittest in CI by @mfuntowicz in #43
- Build from source instructions by @laikhtewari in #38
- Enable testing on GPUs by @mfuntowicz in #45
- Enable HF Transfer in tests by @mfuntowicz in #51
- Let's make sure to use the repeated heads tensor when in a non-mha scenario by @mfuntowicz in #48
- Bump version to 0.1.0b2 by @mfuntowicz in #53
- Add more unittest by @mfuntowicz in #52
- Disable RMSNorm plugin as deprecated for performance reasons by @mfuntowicz in #55
- Rename LLamaForCausalLM to LlamaForCausalLM to match transformers by @mfuntowicz in #54
- AutoModelForCausalLM instead of LlamaForCausalLM by @laikhtewari in #24
- Use the new runtime handled allocation by @mfuntowicz in #46
New Contributors
- @eltociear made their first contribution in #31
- @IlyasMoutawwakil made their first contribution in #19
Full Changelog: v0.1.0b1...v0.1.0b2