
[Bug Report] Model with split op fails to convert #203

Open
XiaotaoChen opened this issue Dec 23, 2024 · 1 comment

Background

  1. We are trying to deploy minicpm-v-2-6 to bm1684x, using our own tooling to export the vit, embedding, lm_head, block, and block_cache modules separately;
  2. each module is then converted to mlir and bmodel with model_transform.py and model_deploy.py;
  3. the conversion problem mainly occurs with the bf16 vit model; the cause appears to be that the onnx optimizations do not handle bfloat16 onnx models correctly, so the split-related ops in the vit part are not optimized away;
  4. when model_transform.py converts the vit model, the first failure is at the Python level while converting the split op. Our exported vit model is a bfloat16 onnx model whose split op input is a constant, while OnnxConverter.py:convert_split_op in tpu-mlir only handles the case where the input is an activation tensor. The code change below fixes it:
# op = self.getOperand(onnx_node.inputs[0])
# the split input may be a constant tensor, so look it up with getOp
op = self.getOp(onnx_node.inputs[0])
  5. After that fix, tpuc-opt fails in an optimization pass with the error shown below;
  6. a minimal model reproducing the issue is attached: vision_tower_bf16-cut.onnx (a sketch of building a similar model follows this list).
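
For reference, here is a minimal sketch of building an ONNX model in which a Split op consumes a Constant rather than an activation, which is the case convert_split_op has to handle after the change above. This is not the attached vision_tower_bf16-cut.onnx; the node names and opset choice are illustrative assumptions, and float32 is used instead of bfloat16 only to keep the sketch simple (in the real model the bfloat16 data is what prevents the earlier onnx optimizations from folding the Split away).

import numpy as np
import onnx
from onnx import helper, TensorProto

# A Constant node feeds the Split, so the converter sees a weight rather
# than an activation as inputs[0] of the Split op.
const_tensor = helper.make_tensor(
    "const_data", TensorProto.FLOAT, dims=[4],
    vals=np.arange(4, dtype=np.float32).tolist())
const_node = helper.make_node(
    "Constant", inputs=[], outputs=["const_out"], value=const_tensor)
split_node = helper.make_node(
    "Split", inputs=["const_out"], outputs=["s0", "s1"],
    axis=0, num_outputs=2)

graph = helper.make_graph(
    [const_node, split_node], "split_const_repro",
    inputs=[],
    outputs=[
        helper.make_tensor_value_info("s0", TensorProto.FLOAT, [2]),
        helper.make_tensor_value_info("s1", TensorProto.FLOAT, [2]),
    ])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 18)])
onnx.checker.check_model(model)
onnx.save(model, "split_const_repro.onnx")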

Full error log

2024/12/23 17:05:04 - INFO : TPU-MLIR v1.12.beta.0-29-g5dacf2a47-20241223
2024/12/23 17:05:04 - INFO : 
         _____________________________________________________ 
        | preprocess:                                           |
        |   (x - mean) * scale                                  |
        '-------------------------------------------------------'
  config Preprocess args : 
        resize_dims           : same to net input dims
        keep_aspect_ratio     : False
        keep_ratio_mode       : letterbox
        pad_value             : 0
        pad_type              : center
        --------------------------
        mean                  : [0.0, 0.0, 0.0]
        scale                 : [1.0, 1.0, 1.0]
        --------------------------
        pixel_format          : bgr
        channel_format        : nchw

2024/12/23 17:05:04 - INFO : Input_shape assigned
2024/12/23 17:05:04 - WARNING : ConstantFolding failed.
2024/12/23 17:05:04 - INFO : ConstantFolding finished
2024/12/23 17:05:04 - INFO : skip_fuse_bn:False
2024/12/23 17:05:04 - INFO : Onnxsim opt finished
2024/12/23 17:05:04 - WARNING : ConstantFolding failed.
2024/12/23 17:05:04 - INFO : ConstantFolding finished
name:/visual/Constant_1_output_0
2024/12/23 17:08:20 - INFO : Save mlir file: /workspace/tpu-mlir/tmp_origin.mlir
[Running]: tpuc-opt /workspace/tpu-mlir/tmp_origin.mlir --shape-infer --canonicalize --extra-optimize -o /workspace/tpu-mlir/tmp.mlir
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: tpuc-opt /workspace/tpu-mlir/tmp_origin.mlir --init --shape-infer --canonicalize --extra-optimize --deinit --mlir-print-debuginfo -o /workspace/tpu-mlir/tmp.mlir
 #0 0x000062af85a6ee87 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/workspace/tpu-mlir/install/bin/tpuc-opt+0x869e87)
 #1 0x000062af85a6cbae llvm::sys::RunSignalHandlers() (/workspace/tpu-mlir/install/bin/tpuc-opt+0x867bae)
 #2 0x000062af85a6f80a SignalHandler(int) Signals.cpp:0:0
 #3 0x00007177ab158520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x000062af86eefe98 tpu_mlir::top::RangeOp::shape_inference() (/workspace/tpu-mlir/install/bin/tpuc-opt+0x1ceae98)
 #5 0x000062af86e00e23 tpu_mlir::detail::ShapeInterfaceInterfaceTraits::Model<tpu_mlir::top::RangeOp>::shape_inference(tpu_mlir::detail::ShapeInterfaceInterfaceTraits::Concept const*, mlir::Operation*) (/workspace/tpu-mlir/install/bin/tpuc-opt+0x1bfbe23)
 #6 0x000062af86f78910 tpu_mlir::top::ShapeInferPass::runOnOperation()::'lambda'(tpu_mlir::ShapeInterface)::operator()(tpu_mlir::ShapeInterface) const (/workspace/tpu-mlir/install/bin/tpuc-opt+0x1d73910)
 #7 0x000062af85bc0c5e void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) (/workspace/tpu-mlir/install/bin/tpuc-opt+0x9bbc5e)
 #8 0x000062af86f76964 tpu_mlir::top::ShapeInferPass::runOnOperation() (/workspace/tpu-mlir/install/bin/tpuc-opt+0x1d71964)
 #9 0x000062af8703c904 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) (/workspace/tpu-mlir/install/bin/tpuc-opt+0x1e37904)
#10 0x000062af8703cf31 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) (/workspace/tpu-mlir/install/bin/tpuc-opt+0x1e37f31)
#11 0x000062af8703f3d8 mlir::PassManager::run(mlir::Operation*) (/workspace/tpu-mlir/install/bin/tpuc-opt+0x1e3a3d8)
#12 0x000062af85a6053b performActions(llvm::raw_ostream&, std::shared_ptr<llvm::SourceMgr> const&, mlir::MLIRContext*, mlir::MlirOptMainConfig const&) MlirOptMain.cpp:0:0
#13 0x000062af85a5f904 mlir::LogicalResult llvm::function_ref<mlir::LogicalResult (std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>::callback_fn<mlir::MlirOptMain(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, mlir::DialectRegistry&, mlir::MlirOptMainConfig const&)::$_2>(long, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&) MlirOptMain.cpp:0:0
#14 0x000062af872532c8 mlir::splitAndProcessBuffer(std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::function_ref<mlir::LogicalResult (std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>, llvm::raw_ostream&, bool, bool) (/workspace/tpu-mlir/install/bin/tpuc-opt+0x204e2c8)
#15 0x000062af85a59c0a mlir::MlirOptMain(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, mlir::DialectRegistry&, mlir::MlirOptMainConfig const&) (/workspace/tpu-mlir/install/bin/tpuc-opt+0x854c0a)
#16 0x000062af85a5a0d4 mlir::MlirOptMain(int, char**, llvm::StringRef, mlir::DialectRegistry&) (/workspace/tpu-mlir/install/bin/tpuc-opt+0x8550d4)
#17 0x000062af85a58b1a main (/workspace/tpu-mlir/install/bin/tpuc-opt+0x853b1a)
#18 0x00007177ab13fd90 (/lib/x86_64-linux-gnu/libc.so.6+0x29d90)
#19 0x00007177ab13fe40 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e40)
#20 0x000062af85a57f25 _start (/workspace/tpu-mlir/install/bin/tpuc-opt+0x852f25)
Floating point exception (core dumped)
XiaotaoChen changed the title from [Bug Report] Model with range op fails to convert to [Bug Report] Model with split op fails to convert on Dec 23, 2024

XiaotaoChen commented Dec 23, 2024

It looks like the cause is in lib/Dialect/Top/Interfaces/Range.cpp, where the start, delta, and limit parameters are converted from float to int64 and lose precision. For example, at line 66 the limit value 0.999998987 is truncated to 0.
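
As a rough illustration of that truncation (a sketch in Python, not the actual Range.cpp code): converting the parameters to integers before computing the element count collapses a limit of 0.999998987 to 0, so the Range ends up with zero elements instead of one.

import math

# Values from the report: limit 0.999998987, start 0, delta 1.
start, limit, delta = 0.0, 0.999998987, 1.0

# Converting the parameters to integers first truncates limit to 0,
# so the computed number of Range elements comes out as 0.
num_truncated = (int(limit) - int(start)) // int(delta)        # -> 0

# Keeping the computation in floating point and rounding up yields the
# expected single element, matching ONNX Range semantics.
num_expected = max(int(math.ceil((limit - start) / delta)), 0)  # -> 1

print(num_truncated, num_expected)

A zero count produced this way would also be consistent with the "Floating point exception" in the backtrace if it is later used as a divisor.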
