
TensorRT speedup is not significant #3

Open
zzzzzyh111 opened this issue Jun 20, 2024 · 3 comments

@zzzzzyh111

Thank you for your excellent work!
Recently I have been trying to use TensorRT to accelerate Depth Anything on a Jetson Orin NX, but I found that the inference speed of the converted trt file shows no significant improvement over the onnx file, and even drops slightly. Specifically:

ONNX Inference Time: 2.7s per image
TRT Inference Time: 3.0s per image

The library versions are as follows:

- JetPack: 5.1
- CUDA: 11.4.315
- cuDNN: 8.6.0.166
- TensorRT: 8.5.2.2
- VPI: 2.2.4
- Vulkan: 1.3.204
- OpenCV: 4.5.4 - with CUDA: NO
- torch: 2.1.0
- torchvision: 0.16.0
- onnx: 1.16.1
- onnxruntime: 1.8.0

The code used to convert the pth file to an onnx file is as follows:

import torch
# get_config and build_model come from the ZoeDepth codebase
from zoedepth.utils.config import get_config
from zoedepth.models.builder import build_model

model_name = "zoedepth"
pretrained_resource = "local::./checkpoints/ZoeDepthIndoor_05-Jun_15-11-ebbebc6c1002_best.pt"
dataset = None
overwrite = {"pretrained_resource": pretrained_resource}
config = get_config(model_name, "eval", dataset, **overwrite)
model = build_model(config)
model.eval()
dummy_input = torch.randn(1, 3, 392, 518)
_ = model(dummy_input)  # warm-up forward pass to make sure the model runs
torch.onnx.export(model, dummy_input, "ZoeDepth_indoor.onnx", verbose=True)
torch.onnx.export(
    model,
    dummy_input,
    "./checkpoints/ZoeDepth_indoor_jetson.onnx",
    opset_version=11,
    input_names=["input"],
    output_names=["output"],
)
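
Before building the engine, it can help to sanity-check that the exported ONNX file produces the same output as the PyTorch model. A minimal sketch, assuming the export path above and that the ZoeDepth forward pass returns a dict containing a "metric_depth" entry (adjust the key if your model differs):

# Sanity-check the exported ONNX model against PyTorch (paths and output key assumed as above).
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "./checkpoints/ZoeDepth_indoor_jetson.onnx",
    providers=["CPUExecutionProvider"],
)
onnx_out = sess.run(None, {"input": dummy_input.numpy()})[0]

with torch.no_grad():
    torch_out = model(dummy_input)
if isinstance(torch_out, dict):  # ZoeDepth typically returns a dict of tensors
    torch_out = torch_out["metric_depth"]

print("max abs diff:", np.abs(onnx_out - torch_out.numpy()).max())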

The code used to convert the onnx file into a TensorRT engine is as follows:

from pathlib import Path

import tensorrt as trt


def build_engine(onnx_file_path):
    onnx_file_path = Path(onnx_file_path)
    # Parse the ONNX model into a TensorRT network
    logger = trt.Logger(trt.Logger.VERBOSE)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open(onnx_file_path, "rb") as model:
        if not parser.parse(model.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            raise ValueError("Failed to parse the ONNX model.")

    # Set up the builder config
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # build in FP16
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)  # 2 GB workspace

    # Build and serialize the engine
    serialized_engine = builder.build_serialized_network(network, config)

    with open(onnx_file_path.with_suffix(".trt"), "wb") as f:
        f.write(serialized_engine)
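
After build_engine runs, a quick way to confirm that the serialized engine has the expected input/output shapes and dtypes is to deserialize it and print its bindings; the trtexec tool that ships with TensorRT can also build and profile the same ONNX file as a cross-check. A minimal sketch, assuming the .trt path produced above and using the TensorRT 8.x binding-level API:

# Load the serialized engine and print its bindings (TensorRT 8.x binding API).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("./checkpoints/ZoeDepth_indoor_jetson.trt", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_bindings):
    kind = "input " if engine.binding_is_input(i) else "output"
    print(kind, engine.get_binding_name(i),
          engine.get_binding_shape(i), engine.get_binding_dtype(i))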

The code used to run inference with the trt engine is as follows:

import numpy as np
import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
import pycuda.driver as cuda
import tensorrt as trt
import torch


def infer_trt(engine, input_image):
    input_image = input_image.cpu().numpy().astype(np.float32)
    context = engine.create_execution_context()
    height, width = input_image.shape[2], input_image.shape[3]
    output_shape = (1, 1, height, width)

    # Allocate page-locked host memory
    h_input = cuda.pagelocked_empty(trt.volume((1, 3, height, width)), dtype=np.float32)
    h_output = cuda.pagelocked_empty(trt.volume((1, 1, height, width)), dtype=np.float32)

    # Allocate device memory
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)

    bindings = [int(d_input), int(d_output)]
    stream = cuda.Stream()

    # Copy to device, run the engine asynchronously, copy the result back
    def perform_inference(images_np):
        np.copyto(h_input, images_np.ravel())
        cuda.memcpy_htod_async(d_input, h_input, stream)
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        stream.synchronize()
        return torch.tensor(h_output).view(output_shape)

    # Run inference on the original images
    pred1 = perform_inference(input_image)

    # Run inference on horizontally flipped images and average the two predictions
    flipped_images_np = np.flip(input_image, axis=3)
    pred2 = perform_inference(flipped_images_np)
    pred2 = torch.flip(pred2, [3])
    mean_pred = 0.5 * (pred1 + pred2)
    return mean_pred
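
One thing worth noting about infer_trt is that it creates an execution context, allocates page-locked host memory, device buffers, and a stream on every call, and it also runs the network twice per image for the flip-averaged prediction. A minimal sketch of hoisting the one-time setup out of the per-image path (the class and method names are illustrative, not from the original code), assuming a fixed 1x3xHxW input:

# One-time setup reused across images; only the copies and execute_async_v2 run per call.
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
import pycuda.driver as cuda
import tensorrt as trt
import torch


class TrtRunner:
    def __init__(self, engine, height, width):
        self.context = engine.create_execution_context()
        self.h_input = cuda.pagelocked_empty(trt.volume((1, 3, height, width)), dtype=np.float32)
        self.h_output = cuda.pagelocked_empty(trt.volume((1, 1, height, width)), dtype=np.float32)
        self.d_input = cuda.mem_alloc(self.h_input.nbytes)
        self.d_output = cuda.mem_alloc(self.h_output.nbytes)
        self.bindings = [int(self.d_input), int(self.d_output)]
        self.stream = cuda.Stream()
        self.output_shape = (1, 1, height, width)

    def __call__(self, images_np):
        np.copyto(self.h_input, images_np.ravel())
        cuda.memcpy_htod_async(self.d_input, self.h_input, self.stream)
        self.context.execute_async_v2(bindings=self.bindings, stream_handle=self.stream.handle)
        cuda.memcpy_dtoh_async(self.h_output, self.d_output, self.stream)
        self.stream.synchronize()
        return torch.tensor(self.h_output).view(self.output_shape)

With the setup reused, each image only pays the two host/device copies plus the engine execution, and the flip-averaged second pass can be dropped if a single forward is acceptable.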

Apart from some warnings when converting to the onnx file, everything runs normally, but the final results are still unsatisfactory. Looking forward to your reply!

@thinvy
Owner

thinvy commented Jun 28, 2024

On linux arm64, the onnxruntime-gpu that pip installs by default is itself accelerated through TensorRT (see https://onnxruntime.ai/getting-started). If it was installed that way, its performance is basically the same as simply exporting a model and running it directly with TensorRT, especially for Python inference.

In addition, compared with desktop GPUs, TensorRT's INT8 inference on Orin gives a noticeably larger speedup over FP16, so for actual deployment it is best to apply INT8 quantization or mixed INT8/FP16 quantization.
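
One way to check whether the installed onnxruntime build is really running through the TensorRT execution provider (rather than CUDA or CPU) is to print the available and active providers. A minimal sketch, with the ONNX path assumed from the issue above:

# Check which execution providers onnxruntime offers and which the session actually uses.
import onnxruntime as ort

print(ort.get_available_providers())

sess = ort.InferenceSession(
    "./checkpoints/ZoeDepth_indoor_jetson.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # providers actually in use, in priority order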

@zzzzzyh111
Author

zzzzzyh111 commented Jun 30, 2024

Thank you for your answer, but I found that the TensorRT and pth inference speeds are actually also basically the same, so:
1. I suspect that the data loading and preprocessing parts of the code may take up most of the time; I will print the time of each step to find out which part takes the longest (see the timing sketch below).
2. The suggestion to use INT8 quantization for acceleration is very good, but my task has high accuracy requirements, so I will probably only consider it if the current situation cannot be improved.

Thanks for your quick reply!
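
For the per-step timing mentioned in point 1, a minimal sketch (load_image and preprocess are hypothetical placeholders for the actual pipeline functions):

# Per-stage timing sketch; load_image / preprocess are hypothetical placeholders.
import time

def timed(label, fn, *args, **kwargs):
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    print(f"{label}: {(time.perf_counter() - t0) * 1000:.1f} ms")
    return out

# image  = timed("load image", load_image, image_path)
# tensor = timed("preprocess", preprocess, image)
# depth  = timed("trt inference", infer_trt, engine, tensor)

Note that the first TensorRT call also pays one-off context creation and CUDA initialization, so it is worth running a few warm-up iterations and discarding them before averaging.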

@striving19

Hello, have you solved this problem? I am also running into slow inference on a Jetson Orin Nano and would like to ask how to solve it.
