Initial Implementation #2
Conversation
This looks good overall!
Some nits about documentation/comments that were copied over from Tape, and a few small questions.
-------
`dask_nested.NestedFrame`
"""
nested = nested.map_partitions(lambda x: pack_flat(x)).rename(name)
What happens if we try adding a flat Dask DataFrame? How would `pack_flat` handle it?
It seems to fail if given a Dask DataFrame instead of a nested-dask NestedFrame: `ValueError: max() arg is an empty sequence`. I can open a ticket to support this.
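For context, here is a minimal stdlib reproduction of that error class. The variable name below is illustrative only, not from the real nested-dask code; the guess is that some internal step calls `max()` over a collection that is empty when no nested structure is detected:

```python
# Illustrative only: reproduce the reported error class with plain Python.
# `detected_nested_columns` is a made-up name, not from nested-dask itself.
detected_nested_columns = []  # a plain Dask DataFrame yields nothing to pack

try:
    widest = max(detected_nested_columns)
except ValueError as err:
    message = str(err)

print(message)  # max() arg is an empty sequence
```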
Yeah, just a ticket should be fine for now.
But this seems like a blocker for out-of-memory datasets at the moment, which is unfortunate.
I think out-of-memory datasets are still feasible. It seems to work fine with a flat Nested-Dask NestedFrame object, so the main hiccup is just that we will need to convert any input Dask DataFrame to a NestedFrame before calling `add_nested`.
We should think about a shortcut for index-sorted datasets; it should work for out-of-memory datasets.
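A hedged sketch of what such a shortcut could look like at the pandas level. `maybe_fast_pack` is a hypothetical name, not part of either library, and `groupby(...).agg(list)` is only a stand-in for the real `nested_pandas.pack_flat` packing:

```python
import pandas as pd

def maybe_fast_pack(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: skip the global sort when the index is already sorted.

    Real packing is done by nested_pandas.pack_flat; agg(list) here just
    groups each column's values per index label as a simplified stand-in.
    """
    if not df.index.is_monotonic_increasing:
        df = df.sort_index()  # only pay for a sort when we must
    return df.groupby(level=0, sort=False).agg(list)

flat = pd.DataFrame(
    {"t": [1, 2, 3, 4], "flux": [0.1, 0.2, 0.3, 0.4]},
    index=[0, 0, 1, 1],  # already sorted, so the sort is skipped
)
packed = maybe_fast_pack(flat)
print(packed.loc[0, "t"])  # [1, 2]
```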
}
layer_nf = npd.NestedFrame(data=layer_data).set_index("index").sort_index()

base_dn = dn.NestedFrame.from_nested_pandas(base_nf, npartitions=5)
Nice that we're using a diversity of `npartitions` here!
Looks great, thank you!!!
src/dask_nested/core.py (outdated)
`dask_nested.NestedFrame`
"""
nested = nested.map_partitions(lambda x: pack_flat(x)).rename(name)
return self.join(nested, how="outer")
Why is it `outer` here? Should we make it configurable?
`outer` is here mainly to avoid rejecting data from either table, but I think making this configurable is a good idea. Is there a better/more intuitive default value for it?
Made this a kwarg, but still defaulting to `outer`. We could default to `left` to follow Dask, but I'm not sure that's the most sensible default.
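To illustrate the difference at the pandas level (tiny made-up frames; `how="outer"` keeps the union of both indices, while `how="left"` matches Dask's join default and keeps only the base frame's rows):

```python
import pandas as pd

base = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
nested = pd.DataFrame({"n": [[10], [20]]}, index=[1, 2])

outer = base.join(nested, how="outer")  # keeps indices 0, 1, 2
left = base.join(nested, how="left")    # keeps only base's indices 0, 1

print(len(outer), len(left))  # 3 2
```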
Resolves #3. Resolves #8
Change Description
This PR lays down the foundation for the nested-dask package (currently dask-nested, but we will change this). It implements a Dask API for the v0.1 Nested-Pandas high-level API and provides a limited "nest" accessor object. The vast majority of the functionality is just lean `map_partitions` wrappings of nested-pandas; this is by design, as we intend to focus development on the nested-pandas side where applicable and mainly use Dask to handle partitioning. There are (and likely will be) some exceptions to this, such as `to_parquet`. This PR also establishes a basic unit test and benchmarking suite. Documentation is the notable exception, but this PR is already way too large, so it will come in a later PR.
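The "lean `map_partitions` wrapping" pattern described above can be sketched in library-free Python. Both stand-ins below are simplified and illustrative: real nested-dask delegates to Dask's `map_partitions` and nested-pandas's `pack_flat`, neither of which is reproduced here:

```python
# Toy, library-free illustration of the thin-wrapper pattern.

def map_partitions(partitions, func):
    """Stand-in for Dask's map_partitions: apply func to each partition."""
    return [func(part) for part in partitions]

def pack_flat(rows):
    """Toy stand-in for nested_pandas.pack_flat: group (index, value) rows
    into a nested mapping keyed by index."""
    packed = {}
    for idx, value in rows:
        packed.setdefault(idx, []).append(value)
    return packed

# Two "partitions" of flat rows, keyed by index.
partitions = [[(0, "a"), (0, "b")], [(1, "c")]]
nested = map_partitions(partitions, pack_flat)
print(nested)  # [{0: ['a', 'b']}, {1: ['c']}]
```

The point of the design choice: because each Dask method is only a per-partition dispatch, almost all behavioral fixes land in nested-pandas and the Dask layer stays thin.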