
Refactor: Decode StreamChunk proto into primitive data types with Java instead of Rust/JNI #16738

Open
fuyufjh opened this issue May 14, 2024 · 4 comments
Labels: help wanted (Issues that need help from contributors), no-issue-activity, type/feature

fuyufjh (Member) commented May 14, 2024

Is your feature request related to a problem? Please describe.

I still think it's better to use Java to decode the StreamChunk protobuf into arrays of primitive values, in order to eliminate the overhead of Java-to-native calls. Let me create a separate issue for it.

Currently, a JNI function (in class Binding) is called to decode every single value. That overhead sounds high to me.

[screenshot attached in the original issue]

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response
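
For illustration only, here is a minimal Java-side sketch of the idea described above, assuming the generated protobuf classes expose a column's values and null flags as repeated fields (protobuf-java surfaces repeated scalars as Lists). The class and method names are placeholders, not the actual RisingWave proto or Binding API; the point is that the whole column is materialized into primitive arrays in one pass of plain Java, so the per-value hot path never crosses the JNI boundary.

```java
import java.util.List;

// Sketch only: the List inputs stand in for whatever the generated proto message
// for an int column actually exposes; real field names will differ.
final class DecodedIntColumn {
    final int[] values;     // decoded primitive values
    final boolean[] isNull; // per-row null flags

    private DecodedIntColumn(int[] values, boolean[] isNull) {
        this.values = values;
        this.isNull = isNull;
    }

    // Decode the whole column at once; no native calls inside the loop.
    static DecodedIntColumn decode(List<Integer> protoValues, List<Boolean> protoNulls) {
        int n = protoValues.size();
        int[] values = new int[n];
        boolean[] isNull = new boolean[n];
        for (int i = 0; i < n; i++) {
            isNull[i] = protoNulls.get(i);
            values[i] = isNull[i] ? 0 : protoValues.get(i);
        }
        return new DecodedIntColumn(values, isNull);
    }
}
```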

wenym1 (Contributor) commented May 14, 2024

Since we are using the embedded connector node, we can simply pass a pointer to the stream chunk from Rust to Java, so that we don't need to serialize and deserialize the stream chunk at all.

We already have code to convert a stream chunk to Arrow. We can try encoding to Arrow in Rust, passing the result to Java, and decoding it on the Java side with the Arrow Java SDK.

We can compare the cost of encode/decode against the cost of JNI native method calls to decide which approach to adopt.
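
A rough sketch of the Arrow route, assuming the Rust side writes a single record batch in the Arrow IPC stream format (e.g. via arrow-rs's StreamWriter) and hands the bytes to Java in one JNI call; the Java side then decodes with the Arrow Java SDK. The helper name and the single non-nullable Int32 column are assumptions for illustration:

```java
import java.io.ByteArrayInputStream;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;

public final class ArrowChunkDecoder {
    // Decode the first column of the first record batch into an int[].
    // ipcBytes is assumed to be an Arrow IPC stream produced on the Rust side.
    public static int[] decodeFirstIntColumn(byte[] ipcBytes) throws Exception {
        try (RootAllocator allocator = new RootAllocator();
             ArrowStreamReader reader =
                 new ArrowStreamReader(new ByteArrayInputStream(ipcBytes), allocator)) {
            if (!reader.loadNextBatch()) {
                return new int[0];
            }
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            IntVector vector = (IntVector) root.getVector(0);
            int[] out = new int[root.getRowCount()];
            for (int i = 0; i < out.length; i++) {
                out[i] = vector.get(i); // plain buffer reads, no JNI call per value
            }
            return out;
        }
    }
}
```

This turns the trade-off into one bulk encode on the Rust side plus one bulk decode on the Java side per chunk, versus one JNI call per value.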

fuyufjh added the help wanted (Issues that need help from contributors) label on May 16, 2024
xxhZs (Contributor) commented May 27, 2024

Did some benchmarks with Arrow. It seems there is a slight improvement with Arrow.
The results are:
50 columns, type varchar
arrow [ "25409 rows/s", "27161 rows/s", "26551 rows/s", "26806 rows/s", "27137 rows/s", "27042 rows/s", "26806 rows/s", "27066 rows/s", "26323 rows/s", "25707 rows/s", "26876 rows/s", "24458 rows/s", "26414 rows/s","29767 rows/s", "30445 rows/s", "29257 rows/s","23414 rows/s"]
streamchunk pointers[ "25259 rows/s", "25404 rows/s", "25307 rows/s", "23141 rows/s", "24789 rows/s", "24743 rows/s", "23141 rows/s", "23676 rows/s", "22755 rows/s", "22797 rows/s", "23011 rows/s", "22755 rows/s", "22086 rows/s", "22393 rows/s", "22505 rows/s", "22629 rows/s", "22341 rows/s", "22393 rows/s"]

50 columns, type int
arrow["25707 rows/s", "26144 rows/s", "25880 rows/s", "23432 rows/s", "26033 rows/s", "25902 rows/s", "25600 rows/s", "25430 rows/s", "25664 rows/s", "25536 rows/s", "26078 rows/s", "25685 rows/s", "25771 rows/s", "25902 rows/s", "25924 rows/s"]
streamchunk pointers["28791 rows/s", "28899 rows/s", "29090 rows/s", "28710 rows/s", "27650 rows/s", "29118 rows/s", "28444 rows/s", "29063 rows/s", "28523 rows/s", "28818 rows/s", "28845 rows/s", "29118 rows/s", "28235 rows/s", "26853 rows/s", "26551 rows/s"]

5 columns, types int, int, bigint, varchar, varchar
arrow ["223825 rows/s", "238139 rows/s", "230977 rows/s", "228401 rows/s", "233612 rows/s", "244294 rows/s", "234503 rows/s", "241125 rows/s", "211521 rows/s", "244294 rows/s", "207214 rows/s", "230112 rows/s", "229517 rows/s", "226715 rows/s", "206000 rows/s", "216231 rows/s", "215957 rows/s", "220215 rows/s", "221805 rows/s", "207918 rows/s", "214076 rows/s", "215578 rows/s", "214450 rows/s", "213950 rows/s", "217081 rows/s", "237243 rows/s", "187007 rows/s", "189981 rows/s", "223106 rows/s", "191767 rows/s", "210051 rows/s", "205829 rows/s", "205793 rows/s", "217102 rows/s", "219037 rows/s"]
streamchunk pointers ["171001 rows/s", "201972 rows/s", "203034 rows/s", "189981 rows/s", "200391 rows/s", "229587 rows/s", "196007 rows/s", "228228 rows/s", "167780 rows/s", "184721 rows/s", "195738 rows/s", "207622 rows/s", "217872 rows/s", "208979 rows/s", "208979 rows/s", "219143 rows/s", "204800 rows/s", "216142 rows/s", "180527 rows/s", "191401 rows/s", "238798 rows/s", "177935 rows/s", "203174 rows/s", "237858 rows/s", "186514 rows/s", "201178 rows/s", "213333 rows/s", "206148 rows/s", "211134 rows/s", "203781 rows/s", "208979 rows/s", "238798 rows/s", "188534 rows/s", "232369 rows/s", "180082 rows/s", "208671 rows/s"]
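
For context, a hypothetical throughput harness of the kind that could produce numbers like the above (not the actual benchmark code); `decode` stands in for either the Arrow decoding path or the per-value JNI path:

```java
import java.util.function.Consumer;

// Hypothetical micro-benchmark sketch: time repeated decodes of one serialized
// chunk and report rows per second. Warm-up, JIT effects, and allocation noise
// are ignored here; a real comparison would use JMH or similar.
final class ThroughputBench {
    static double rowsPerSecond(Consumer<byte[]> decode, byte[] payload,
                                int rowsPerChunk, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            decode.accept(payload);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return (double) rowsPerChunk * iterations / seconds;
    }
}
```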

github-actions bot commented

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄
