Make shuffle robust to small partitions #8259

Open
mrocklin opened this issue Oct 11, 2023 · 1 comment

My understanding is that benchmark performance increased substantially in shuffle workloads when we reduced the number of partitions. This led to optimizations like dask/dask-expr#327. This seems great.

I'm curious though if we can also make p2p shuffle more robust to partition size. For example, one dumb idea would be to add yet-another-buffer and just accrue partitions until we had, say, 512 MB of data, and only then start sharding things. Presumably we then wouldn't care whether we were given very small partitions or not.
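A minimal sketch of that buffering idea, purely for illustration. The `InputBuffer` class, the `shard_and_send` hook, and the 512 MB default are assumptions for this sketch, not the actual P2P shuffle code:

```python
import pandas as pd


class InputBuffer:
    """Hypothetical sketch: accrue small input partitions and only shard
    once enough data has accumulated, so per-shard overhead is amortized."""

    def __init__(self, shard_and_send, target_bytes=512 * 2**20):
        # shard_and_send stands in for whatever splits a frame into
        # output shards and hands them to comms/disk.
        self.shard_and_send = shard_and_send
        self.target_bytes = target_bytes
        self._frames = []
        self._nbytes = 0

    def add_partition(self, df: pd.DataFrame) -> None:
        self._frames.append(df)
        self._nbytes += int(df.memory_usage(deep=True).sum())
        if self._nbytes >= self.target_bytes:
            self.flush()

    def flush(self) -> None:
        if not self._frames:
            return
        combined = pd.concat(self._frames)
        self._frames.clear()
        self._nbytes = 0
        # Only now split the (large) combined frame into output shards.
        self.shard_and_send(combined)
```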

Reducing the number of input partitions is a great decision regardless. However, I suspect that there will be other times when we'll get lots of small partitions (dask/dask#10542 comes to mind). If we can become less sensitive here, that seems good.

(Please excuse any ignorance on my part here. It's been a while since I've looked at P2P code)

@fjetter (Member) commented Oct 11, 2023

Yes, this is a known issue and I agree that it should be robust to this. Early versions of it were indeed robust, but handling null columns, categoricals, and other schema-related foo required us to include schema metadata for every shard in the payload stream for comms and disk.
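As a rough, standalone illustration of why per-shard schema metadata hurts with tiny shards (this uses plain Arrow IPC streams, not the actual P2P serialization path): each independently serialized shard carries a full copy of the schema, so splitting the same rows into many small shards repeats that fixed cost many times.

```python
import pyarrow as pa

table = pa.table({"x": list(range(1000)), "y": ["a"] * 1000})


def ipc_size(tables):
    """Serialize each table as its own IPC stream and sum the bytes."""
    total = 0
    for t in tables:
        sink = pa.BufferOutputStream()
        with pa.ipc.new_stream(sink, t.schema) as writer:
            writer.write_table(t)
        total += sink.getvalue().size
    return total


# One big shard vs. the same rows split into 100 tiny shards.
one_shard = ipc_size([table])
many_shards = ipc_size([table.slice(i, 10) for i in range(0, 1000, 10)])
# many_shards is much larger than one_shard because every tiny shard
# repeats the schema metadata.
print(one_shard, many_shards)
```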

#7990 is a very extreme case that shows how metadata kills performance right now when the output shards are too heavily fragmented. That's one of the next things to tackle after we deal with the lingering stability issues.

Concatenating / buffering more is definitely an option and, I assume, the most likely candidate here. Particularly on the disk side, I frequently observe that we're actually not buffering much at all, for various reasons.
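A rough sketch of that disk-side direction, with hypothetical names (`DiskShardBuffer`, the `write` callback, and the 16 MB threshold are assumptions, not the buffer in distributed): collect shards per output partition and only write once enough bytes have accumulated, so each disk write is one large concatenated table instead of many tiny ones.

```python
from collections import defaultdict

import pyarrow as pa


class DiskShardBuffer:
    """Hypothetical sketch: buffer incoming shards per output partition and
    flush each partition to disk in large concatenated chunks."""

    def __init__(self, write, min_write_bytes=16 * 2**20):
        # write stands in for a function that appends a table to the
        # file backing one output partition.
        self.write = write
        self.min_write_bytes = min_write_bytes
        self.shards = defaultdict(list)
        self.sizes = defaultdict(int)

    def put(self, partition_id: int, shard: pa.Table) -> None:
        self.shards[partition_id].append(shard)
        self.sizes[partition_id] += shard.nbytes
        if self.sizes[partition_id] >= self.min_write_bytes:
            self.flush(partition_id)

    def flush(self, partition_id: int) -> None:
        shards = self.shards.pop(partition_id, [])
        self.sizes.pop(partition_id, None)
        if shards:
            # One concatenated write instead of many tiny ones.
            self.write(partition_id, pa.concat_tables(shards))
```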
