Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Add new operator ml_nms #325

Open
wants to merge 41 commits into
base: master
Choose a base branch
from

Conversation

wenzhengyin
Copy link
Collaborator

@wenzhengyin wenzhengyin commented Dec 13, 2022

Thanks for your contribution and we appreciate it a lot.

1. Motivation

Add new operator ml_nms

2. Modification

Code to add a new operator
Add file
bangc-ops\kernels\ml_nms\ml_nms.cpp
bangc-ops\kernels\ml_nms\ml_nms.mlu
bangc-ops\kernels\ml_nms\ml_nms.h
bangc-ops\test\mlu_op_gtest\pb_gtest\src\zoo\ml_nms\ml_nms.cpp
bangc-ops\test\mlu_op_gtest\pb_gtest\src\zoo\ml_nms\ml_nms.h
bangc-ops\test\mlu_op_gtest\pb_gtest\src\zoo\ml_nms\test_case\case_0.prototxt
Update file
bangc-ops\mlu_op.h
bangc-ops\mlu_op_kernel.h
bangc-ops\test\mlu_op_gtest\pb_gtest\mlu_op_test_proto\mlu_op_test.proto

3. Test Report

If you want to know how to do operator testing, you can see GTest-User-Guide-zh.

3.1 Modification Details

3.1.1 Accuracy Acceptance Standard

For static threshold standard details, see: MLU-OPS Accuracy Acceptance Standard.

  • diff1, diff1 <= 0
  • diff2, diff2 <= 0

3.1.2 Operator Scheme checklist

No. Details Check Results
1 Supported hardware MLU270
MLU290
MLU370
2 Job types block
U1
U4
3 Layouts NCHW
4 Whether multi-dimensions are supported yes
5 Whether element zero is supported yes
6 Data type(half/float) half / float
7 Whether there is size limit no

3.1.3 New Feature Test

If you have checked the following items, please tick the relevant box.

  • [yes ] Data type test
  • [yes ] Multidimensional tensor test
  • [yes ] Layout test
  • [yes ] Different size/integer remainder end segment/alignment misalignment test
  • [yes ] Zero dimensional tensor test/zero element test
  • [yes ] stability test
  • [no ] Multiple platform test
  • [no ] Gen_case module test
  • [no ] Nan/INF tests
  • [yes ] Bug fix tests
  • [no ] For memory leak check details, seeGTest-User-Guide-zh.
  • [no ] For code coverage check details, see: GTest-User-Guide-zh.
  • [no ] For I/O calculation efficiency check details, see: MLU-OPS Performance Acceptance Standard.

3.1.4 Parameter Check

When a new operator is submitted, the test points are given and the test results are stated.

Test Point Acceptance Standard Test Result (Error Message)
Whether it conforms to the operator restriction Normal error
Whether illegal parameters are passed Normal error

3.2 Performance Test

See MLU-OPS Performance Acceptance Standard for details.

Platform :MLU270

Operator Mlu_hardware_time(us) Mlu_interface_time(us) Mlu_io_efficiency Mlu_compute_efficiency Mlu_workwpace_size(Bytes) Data_type Shape
op_name
op_name

Platform :MLU290

Operator Mlu_hardware_time(us) Mlu_interface_time(us) Mlu_io_efficiency Mlu_compute_efficiency Mlu_workwpace_size(Bytes) Data_type Shape
op_name
op_name

Platform:MLU370

Operator Mlu_hardware_time(us) Mlu_interface_time(us) Mlu_io_efficiency Mlu_compute_efficiency Mlu_workwpace_size(Bytes) Data_type Shape
op_name
op_name

3.3 Summary Analysis

Please give a brief overview here, if you want to note and summarize the content.

return false;
}

mluOpStatus_t MlNmsParamCheck(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mluOpStatus_t MlNmsParamCheck(
mluOpStatus_t mlNmsParamCheck(

@@ -25,6 +25,7 @@

#include <stdint.h>
#include "cnrt.h"
#include "mlu_op.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

需要包含该头文件吗?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

需要mlu_op.h中的结构体

Copy link
Collaborator

@defei-coder defei-coder Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因为mluOpDataType_t的使用才包含的mlu_op.h,实际上mluOpKernelMlNmsFloatFast中的Float已经指明了数据类型,这个参数可以删除。

}

__mlu_func__ void getSegNumMlNmsFast(int input_boxes_num, int* seg) {
if (taskDim > 1) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因为使用的是U1任务,这里taskDim一定会大于1。建议将该判断去除,默认使用taskDim > 1时的计算。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为后续block任务预留

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果taskDim = 1,uint32_t((input_boxes_num % taskDim) > taskId)结果为0,此时*seg通过(input_boxes_num / taskDim) + uint32_t((input_boxes_num % taskDim) > taskId)得到的值和input_boxes_num相同。

__nram__ char worke_space[MAX_NRAM_SIZE / 16];
__memcpy((T*)worke_space,
boxes_data_ptr + (offset * 4),
seg * 4 * sizeof(T),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seg是通过所有box的个数在taskDim上进行拆分得到的,这里考虑不严谨(不应该认为计算得到的seg数量能够完全保存在nram上),这里实际需要的是片上单次处理的数据量,应该和内存大小相关。

#include "mlu_op_kernel.h"
#include "kernels/kernel.h"

#define NRAM_SIZE 2 * 1024
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个宏没有被使用

Comment on lines +104 to +118
if (boxes_data_ptr_desc->dtype == MLUOP_DTYPE_HALF) {
mluOpFuncKernel = mluOpKernelMlNmsHalfFast;
apply_nram_size = (input_boxes_num * 6 * 2) + (input_boxes_num * 14 * 2);
} else {
mluOpFuncKernel = mluOpKernelMlNmsFloatFast;
apply_nram_size = (input_boxes_num * 6 * 4) + (input_boxes_num * 14 * 4);
}
if (apply_nram_size > MAX_NRAM_SIZE) {
if ((apply_nram_size % MAX_NRAM_SIZE) !=0) {
loop_num = (apply_nram_size / MAX_NRAM_SIZE) + 1;
} else {
loop_num = apply_nram_size / MAX_NRAM_SIZE;
}
}
if (loop_num > 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这部分循环放在 kernel 内(mluOpFuncKernel)处理更合适,减少kernl launch的开销

__mlu_device__ void unionImple(T* boxes_data_ptr,
T nms_thres, int offset, int seg, int input_boxes_num,
int boxes_start_position, uint8_t* output_boxes_index) {
__nram__ char worke_space[MAX_NRAM_SIZE / 16];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
__nram__ char worke_space[MAX_NRAM_SIZE / 16];
__nram__ char work_space[MAX_NRAM_SIZE / 16];

Comment on lines +121 to +124
x1 = worke_space + data_len;
y1 = worke_space + (data_len + compute_len);
x2 = worke_space + (data_len + (compute_len * 2));
y2 = worke_space + (data_len + (compute_len * 3));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
x1 = worke_space + data_len;
y1 = worke_space + (data_len + compute_len);
x2 = worke_space + (data_len + (compute_len * 2));
y2 = worke_space + (data_len + (compute_len * 3));
x1 = worke_space + data_len;
y1 = x1 + compute_len;
x2 = y1 + compute_len;
y2 = x2 + compute_len;

result = (uint8_t*)worke_space + ((data_len + (compute_len * 8)) *
(sizeof(T) / sizeof(uint8_t)) + seg);
}
for (i = 0, j = 0; i < seg * 6; i+=6, j++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (i = 0, j = 0; i < seg * 6; i+=6, j++) {
for (i = 0, j = 0; i < seg * 6; i += 6, j++) {

Comment on lines +161 to +163
__bang_sub(h, y1, y2, compute_len);
__bang_sub(w, x2, x1, compute_len);
__bang_mul(area_ptr, h, w, compute_len);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

面积计算公式为:

area = (a[2] - a[0] + 1) * (a[3] - a[1] + 1); 

/是否少了 + 1操作?


// max x1
__bang_write_value(scores_max_boxes_ptr, compute_len, scores_max_boxes[0]);
__bang_cycle_sub(x1, x1, scores_max_boxes_ptr, compute_len, compute_len);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
__bang_cycle_sub(x1, x1, scores_max_boxes_ptr, compute_len, compute_len);
__bang_sub(x1, x1, scores_max_boxes_ptr);

Comment on lines +212 to +213
__bang_cycle_add(tem, area_ptr, scores_max_boxes_area_ptr,
compute_len, compute_len);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
__bang_cycle_add(tem, area_ptr, scores_max_boxes_area_ptr,
compute_len, compute_len);
__bang_add(tem, area_ptr, scores_max_boxes_area_ptr);

综合代码来看,tem这块内存并不需要,其他位置可以使用scores_max_boxes_area_ptr进行操作,这里可以将结果保存在area_ptr中。

} else {
result[i] = 0;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

该for循环是否可以使用 bang_mul 处理?

auto output_ptr = parser_->getMetaTensor(1).cpu_ptr;
int input_boxes_num = input_desc->dims[0];
std::vector<std::vector<float>> boxes_data_ptr;
for (int i = 0; i < input_boxes_num * 4; i+=4) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (int i = 0; i < input_boxes_num * 4; i+=4) {
for (int i = 0; i < input_boxes_num * 6; i += 6) {

单个box含有6个数据,这里应该修改为6


float area1 = abs(box1[0] - box1[2]) * abs(box1[1] - box1[3]);
float area2 = abs(box2[0] - box2[2]) * abs(box2[1] - box2[3]);
float inter = abs(x1 - x2) * abs(y1 - y2);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

没有考虑两个box不相交的情况

boxes_data_ptr.push_back(data_ptr);
}
for (int i = 0; i < input_boxes_num ; i++) {
float iou = iouCompute(boxes_data_ptr[0], boxes_data_ptr[i]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

只需要和第一个box计算交并比就可以了吗?

@defei-coder
Copy link
Collaborator

@wenzhengyin 有下面几个内容需要确认一下:
1、ml_nms 只需要和第一个box进行相比较吗?根据 iouCompute,看到cpu的计算逻辑是仅和第一个box进行计算。
2、当前的测例无法通过测试;
3、麻烦根据comment进行检查。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants