Dispatcher decompressing data when not needed #31

Open
dimitrijejankov opened this issue Jul 7, 2018 · 6 comments

@dimitrijejankov
Collaborator

Hi, I've been adding functionality to the Dispatcher to get my new algorithm running.
I noticed that the dispatcher decompresses the data even though it never uses the decompressed result.

In the file
https://github.com/riceplinygroup/plinycompute/blob/master/pdb/src/serverFunctionalities/source/DispatcherServer.cc

Lines 88 to 95 decompress the snappy-compressed data stored in tempPage, and the decompressed result is stored in readToHere. But if we look at lines 130 to 136, only tempPage (the compressed data) is sent, so we are just doing useless work on the dispatcher.
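
For reference, a minimal sketch of the flow being described, using the issue's variable names; the snappy calls are the real snappy C++ API, while dispatchPage and forwardToStorage are hypothetical stand-ins for the surrounding DispatcherServer code:

```cpp
#include <snappy.h>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for the code that forwards the page to storage.
void forwardToStorage(const char* data, size_t numBytes) { /* send elided */ }

void dispatchPage(const char* tempPage, size_t compressedSize) {
    // Lines ~88-95: decompress the snappy-compressed page into readToHere.
    size_t uncompressedSize = 0;
    snappy::GetUncompressedLength(tempPage, compressedSize, &uncompressedSize);
    std::unique_ptr<char[]> readToHere(new char[uncompressedSize]);
    snappy::RawUncompress(tempPage, compressedSize, readToHere.get());

    // Lines ~130-136: only tempPage (the compressed buffer) is actually
    // sent, so the decompressed bytes in readToHere are never used.
    forwardToStorage(tempPage, compressedSize);
}
```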

@dimitrijejankov
Collaborator Author

Another issue: even when we are sending the uncompressed bytes of a shallow copy, bool DispatcherServer::sendBytes has a hardcoded value of true for the compressedOrNot indicator, so it will probably not work.

@jiazou-bigdata
Collaborator

Thanks Dimitrije.

(1) The reason it decompresses the data before forwarding the compressed data is to give the dispatcher a chance to look into the content, for example to do size checking or other things like the checks at lines 98 to 108 (see the sketch below).

(2) I once wrote an example that successfully uses a shallow copy and sends uncompressed bytes directly; please check it here: https://github.com/riceplinygroup/plinycompute/blob/master/applications/TPCHBench/tpchDataGenerator.cc

Let me know if you have further questions.
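
To make point (1) concrete, here is a hedged sketch of the kind of content-level check that requires decompression: the compressed bytes are opaque, so nothing like this can run on tempPage directly. The PageHeader layout is purely illustrative; the real code at lines 98 to 108 inspects PDB's own record format:

```cpp
#include <cstddef>
#include <cstring>
#include <iostream>

// Hypothetical header layout at the front of a decompressed page; an
// assumption for illustration only, not the repo's actual format.
struct PageHeader {
    size_t numObjects;
};

bool passesSizeCheck(const char* readToHere, size_t numBytes) {
    if (numBytes < sizeof(PageHeader)) {
        std::cerr << "page is too small to contain a header\n";
        return false;
    }
    PageHeader header;
    std::memcpy(&header, readToHere, sizeof(header));
    if (header.numObjects == 0) {
        std::cerr << "dispatcher received an empty page\n";
        return false;
    }
    return true;
}
```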

@dimitrijejankov
Collaborator Author

Hi Jia,
Thanks for the reply.

(1) The only check we do is whether the vector is empty, and that could just as well be done on the receiving side. The type-validation code (lines 267-292), which is called from line 110, is commented out, so it will always pass.

(2) If you look at the sendBytes method (lines 349-352), you will see that the second-to-last argument is set to true, which corresponds to the compressedOrNot parameter in StorageAddData. This means the data will always be marked as compressed, regardless of whether the ENABLE_COMPRESSION flag is enabled.
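
A minimal sketch of the change point (2) implies: pass the actual compression state through rather than hardcoding true. The request type below is a hypothetical stand-in that mirrors the compressedOrNot parameter, not the actual StorageAddData signature:

```cpp
// Hypothetical stand-in for the StorageAddData request.
struct StorageAddDataRequest {
    bool compressedOrNot;
};

// Before (lines ~349-352): the flag is hardcoded, so every send is
// marked as compressed even when the bytes were never compressed.
StorageAddDataRequest buildRequestHardcoded() {
    return StorageAddDataRequest{ /*compressedOrNot=*/ true };
}

// After: the caller passes the real state, e.g. driven by whether
// ENABLE_COMPRESSION was in effect when the bytes were produced.
StorageAddDataRequest buildRequest(bool isCompressed) {
    return StorageAddDataRequest{ isCompressed };
}
```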

@jiazou-bigdata
Collaborator

Hey Dimitrije,

I'm not entirely sure what exactly your concerns are or how important this is. Just to clarify a bit and share some information and ideas from my side:

For your first point, we could easily change the code to drop the size check and forward the compressed bytes directly, but I'm afraid that for more complex dispatching features, such as partitioning or building indexes while dispatching data, decompression is unavoidable.
In addition, size checking and other validation on the dispatcher side are often reasonable, since it is much easier to obtain and track error logs there than across many worker servers. Logging on the client side is a separate matter from the server design.

For your second point, I guess it would only require a small code change to make the flag in sendBytes configurable, and I suspect it once worked that way.
Some additional comments: in most of our experiments for the SIGMOD paper, which process hundreds of gigabytes of intermediate data (also our target machine-learning scenarios), adding compression gave better performance; even in cases where the compression ratio was poor, the (de)compression overhead did not seem significant. So if asked for a default configuration, I would recommend keeping compression on, unless our target scenario changes.

Hope this helps.
Jia

@dimitrijejankov
Collaborator Author

Hi Jia,
Thanks for your time. Good point: if the (de)compression is not a large overhead, we can keep it. For the second issue, is it fine if I go ahead and change that?

@jiazou-bigdata
Collaborator

Sure.
