Fix scalability limitations of current implementation #46
Comments
One approach would be to bring back the original shuffle code and then add a mechanism for reading data from another node by implementing a gRPC-based service (such as the Arrow Flight protocol) in each worker.
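A minimal sketch of what that could look like in Python, assuming each worker keeps its shuffle output in local Arrow IPC files and serves them over Arrow Flight. The `shuffle_files` map, the ticket format, and port 8815 are illustrative assumptions, not part of the current codebase:

```python
import pyarrow as pa
import pyarrow.flight as flight


class ShuffleFlightServer(flight.FlightServerBase):
    """Hypothetical per-worker Flight service exposing local shuffle output."""

    def __init__(self, location, shuffle_files):
        super().__init__(location)
        # Map from "stage_id/partition_id" ticket strings to local IPC file paths.
        self.shuffle_files = shuffle_files

    def do_get(self, context, ticket):
        path = self.shuffle_files[ticket.ticket.decode()]
        # Stream the record batches straight from the local shuffle file.
        table = pa.ipc.open_file(path).read_all()
        return flight.RecordBatchStream(table)


def read_remote_partition(worker_host: str, stage_id: int, partition_id: int) -> pa.Table:
    """What a shuffle reader on another node might do to fetch one partition."""
    client = flight.connect(f"grpc://{worker_host}:8815")
    ticket = flight.Ticket(f"{stage_id}/{partition_id}".encode())
    return client.do_get(ticket).read_all()


# Each worker would start the service alongside its Ray worker process, e.g.:
# ShuffleFlightServer("grpc://0.0.0.0:8815", shuffle_files).serve()
```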
Would it be possible to use a Ray generator for the shuffle writer and reader execution plans? Rather than serializing all of the data to disk in the writer and reading it back in the reader, we would yield record batches from the task. Ray then stores those in the in-memory object store (spilling if required). We could then pipeline the execution of the writers and readers so that stages need not wait for completion of dependent stages before they start processing. Would this work? Willing to work on this if you think it's a valid approach.
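A minimal sketch of the generator idea, assuming a hypothetical `execute_partition` helper that streams pyarrow record batches for one partition of the child plan. Whether downstream tasks can start consuming before the writer finishes depends on which Ray generator flavor is used (dynamic vs. streaming generators):

```python
import ray
import pyarrow as pa


@ray.remote(num_returns="dynamic")
def shuffle_write(plan_bytes: bytes, partition_id: int):
    # `execute_partition` is a hypothetical helper standing in for executing
    # the child DataFusion plan for this partition.
    for batch in execute_partition(plan_bytes, partition_id):
        # Each yielded batch becomes its own object in Ray's object store,
        # which can spill to disk under memory pressure.
        yield batch


@ray.remote
def shuffle_read(batch_refs):
    # ObjectRefs nested inside a list are not auto-resolved by Ray, so fetch
    # them here and reassemble the partition.
    batches = ray.get(batch_refs)
    return pa.Table.from_batches(batches)
```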
It would be nice if Ray facilitated streaming over the network between Actors, in addition to the exchange of messages via the object store. If it's only to exchange streams of batches, would Arrow Flight be necessary, or could something simpler be used, assuming the Actors exist only ephemerally?

EDIT: Is the serialization of Arrow to/from the object store zero-copy? If so, that might be good enough to start. This wasn't clear to me when reading the Ray docs, which mention zero-copy for numpy data structures.

EDIT 2: It seems that the pickle protocol used has out-of-band support for zero-copy serialization of Arrow data: https://peps.python.org/pep-0574/
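Regarding the second edit: pickle protocol 5 (PEP 574) lets an object expose its underlying buffers out of band instead of copying them into the pickle stream. A quick way to check whether a given pyarrow version does this for record batches (and hence whether Ray's serializer can avoid copying the column data) is a sketch like the following:

```python
import pickle
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]})

out_of_band = []
payload = pickle.dumps(batch, protocol=5, buffer_callback=out_of_band.append)

# If `out_of_band` is non-empty, the column data was exposed as PickleBuffers
# rather than being copied into `payload`.
print(len(payload), len(out_of_band))

# Round trip: the buffers are handed back to the unpickler.
restored = pickle.loads(payload, buffers=out_of_band)
assert restored.equals(batch)
```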
Query execution works by building up a tree of futures to execute each partition in each query stage.

The root node of each query stage is a `RayShuffleWriterExec`. It works by executing its child plan and fetching all of the results into memory (this is already not scalable because there could be millions or billions of rows). It then concatenates all of the batches into one large batch, which is returned. This large batch is then stored in Ray's object store and will be fetched by the next query stage.

The original disk-based shuffle mechanism (removed in #19) did not suffer from any of these issues because query results were streamed to disk in the writer and then streamed back out in the reader. However, that approach assumes that all workers have access to the same local file system.
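A simplified Python illustration of the pattern described above (the actual `RayShuffleWriterExec` is not this code): every batch of a partition is collected, concatenated into one object, and placed in the object store for the next stage to fetch.

```python
import ray
import pyarrow as pa


def write_partition(batches: list) -> ray.ObjectRef:
    """Collect-then-concatenate, as described above (illustrative only)."""
    # The whole partition must fit in memory as a single contiguous object;
    # with millions or billions of rows this is the scalability limit.
    big_batch = pa.Table.from_batches(batches).combine_chunks()
    return ray.put(big_batch)
```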