v14.test2: latest TensorRT library
This is a preview release for TensorRT 9.1.0, following the v14.test release.
- Same as the v14.test release, it requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
- TensorRT 9.1.0 is officially documented as being "for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only" on Linux. The Windows build is downloaded from here and can be used on other GPU models.
  - On Windows, some users have reported crashes when using it in mpv (#65). This occurred in an earlier build of this release and has since been fixed.
- Added parameters `bf16` (#64), `custom_env` and `custom_args` to the `TRT` backend; a usage sketch follows this item.
  - fp16 execution of `Waifu2xModel.swin_unet_art` is more accurate, faster and uses less GPU memory than bf16 execution (benchmark).
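A minimal usage sketch of the new backend parameters, assuming the usual `vsmlrt.py` calling convention; the environment variable and trtexec flag shown are illustrative values, not recommendations:

```python
import vapoursynth as vs
from vsmlrt import Waifu2x, Waifu2xModel, Backend

core = vs.core

# any RGBS clip works; BlankClip keeps the sketch self-contained
clip = core.std.BlankClip(width=1280, height=720, format=vs.RGBS)

backend = Backend.TRT(
    fp16=True,                                   # preferred over bf16 for swin_unet_art (see above)
    bf16=False,                                  # new in this release (#64)
    custom_env={"CUDA_MODULE_LOADING": "LAZY"},  # extra environment variables for engine building (illustrative)
    custom_args=["--noTF32"],                    # extra trtexec command-line arguments (illustrative)
)

flt = Waifu2x(clip, noise=-1, scale=2, model=Waifu2xModel.swin_unet_art, backend=backend)
```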
- Device memory usage of model `Waifu2xModel.swin_unet_art` is reduced compared to TensorRT 9.0.1 on A10G with 1080p input (at 2.66 fps with 7.0 GB VRAM usage) with the default auxiliary stream heuristic.
  - By default, TensorRT 9.0.1 uses 7 auxiliary streams and TensorRT 9.1.0 uses 3; the extra streams consume significantly more device memory with no performance gain.
  - Setting `max_aux_streams=3` lowers the device memory usage of TensorRT 9.0.1 to ~8.9 GB, and `max_aux_streams=0` corresponds to ~7.3 GB usage.
  - TensorRT 9.1.0 with `max_aux_streams=0` uses ~6.7 GB of device memory, as sketched below.
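For the memory figures above, a sketch of capping auxiliary streams via the backend parameter; `clip` is assumed to be an RGBS clip as in the previous sketch:

```python
from vsmlrt import Waifu2x, Waifu2xModel, Backend

# max_aux_streams=0 disables auxiliary streams entirely, trading a possible
# small speedup for the lowest device memory usage (~6.7 GB in the figures above)
backend = Backend.TRT(fp16=True, max_aux_streams=0)

flt = Waifu2x(clip, noise=-1, scale=2, model=Waifu2xModel.swin_unet_art, backend=backend)
```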
- Users should use the same version of TensorRT as provided (9.1.0), because runtime version checking is disabled in this release; a version-check sketch follows this item.
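One way to check which TensorRT runtime is actually loaded is to query the vstrt plugin; this sketch assumes the plugin's `Version()` function reports the TensorRT version:

```python
import vapoursynth as vs

core = vs.core

# print the plugin's version report and confirm it matches the bundled TensorRT 9.1.0
# (assumes vstrt exposes a Version() function with this information)
print(core.trt.Version())
```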
- Added support for RIFE v4.8 - v4.12 and v4.12 ~ v4.13 lite (ensemble) models, which are also available for previous vs-mlrt releases (simply download the new model file here and update `vsmlrt.py`). A model-selection sketch follows this item.
  - v4.8 and v4.9 models should have the same execution speed as v4.7, while v4.10 - v4.12 models are all heavier than previous models. Ensemble models are heavier than their non-ensemble counterparts.
  - Starting from RIFE v4.11, all RIFE models are temporarily moved here and packaged individually.
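A sketch of selecting one of the newly added models; the `RIFEModel.v4_12` member name is an assumption based on the existing naming scheme in `vsmlrt.py`:

```python
from vsmlrt import RIFE, RIFEModel, Backend

# 2x frame interpolation with one of the newly supported models;
# RIFEModel.v4_12 is assumed to follow the existing enum naming
flt = RIFE(clip, multi=2, model=RIFEModel.v4_12, backend=Backend.TRT(fp16=True))
```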
- RIFE models with the v2 representation for the `TRT` backend now have improved accuracy, contributed by @charlessuh (#66 (comment)). This has been backported to master.
  - This improvement may not take effect under onnx file renaming. It is advised to keep the onnx file name unchanged and change the `vsmlrt.RIFE()` function call instead.
  - By default, `vsmlrt.RIFE()` in `vsmlrt.py` uses the v1 representation. The v2 representation is enabled with the `vsmlrt.RIFE(_implementation=2)` function call, as sketched below.

    Sample error message:

    ```
    input: for dimension number 1 in profile 0 does not match network definition (got min=11, opt=11, max=11), expected min=opt=max=7)
    ```

  - The v2 representation is still considered experimental.
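A sketch of opting in to the v2 representation through the function-call interface quoted above, keeping the onnx file names unchanged:

```python
from vsmlrt import RIFE, RIFEModel, Backend

# _implementation=2 selects the experimental v2 representation;
# omitting the argument keeps the default v1 representation
flt = RIFE(clip, multi=2, model=RIFEModel.v4_7, backend=Backend.TRT(fp16=True), _implementation=2)
```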
- Added support for the SAFA v0.1 video enhancement model; a usage sketch follows this item.
  - This model takes arbitrarily sized video and uses both spatial and temporal information to improve visual quality.
  - Note that this model is non-deterministic by nature, and existing backends do not support manual seeding.
  - ~17 fps on RTX 4090 with `TRT(fp16)`, 1080p input and non-adaptive mode. Adaptive mode is about 2x slower than non-adaptive mode, uses more memory, and does not support CUDA graphs execution.
  - This model is not supported by the `NCNN_VK` backend, due to the same issue as the RIFE v2 representation.
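A sketch of calling the new model; the `SAFA`, `SAFAModel`, and `SAFAAdaptiveMode` names are assumptions following `vsmlrt.py` conventions:

```python
from vsmlrt import SAFA, SAFAModel, SAFAAdaptiveMode, Backend

flt = SAFA(
    clip,
    model=SAFAModel.v0_1,                    # assumed member name for the v0.1 model
    adaptive=SAFAAdaptiveMode.non_adaptive,  # adaptive mode is ~2x slower and disables CUDA graphs
    backend=Backend.TRT(fp16=True, use_cuda_graph=True),
)
```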
- Also check the release notes of the v14.test release.
- This pre-release uses TensorRT 9.1.0 + CUDA 12.2.2 + cuDNN 8.9.5, which can only run on driver >= 525 and 10 series and later GPUs, with improved support for the self-attentions found in transformer models. `vsmlrt.py` in all branches can be used interchangeably.
- TensorRT 9.0.1 is documented as being "for Large Language Models (LLMs) on A100, A10G, L4, L40, and H100 GPUs only" on x86 Linux. Model `Waifu2xModel.swin_unet_art` is 1.2x faster compared to TensorRT 8.6.1 on A10G with 720p input (at 6.3 fps with 4 GB VRAM usage), thanks to multi-head attention fusion (requires fp16).
This pre-release is now feature complete. Development has switched to the v14.test3 pre-release.