Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add EnsembleFrame Support to Tape #308

Merged
merged 50 commits into from
Dec 8, 2023
Merged

Add EnsembleFrame Support to Tape #308

merged 50 commits into from
Dec 8, 2023

Conversation

wilsonbb
Copy link
Collaborator

@wilsonbb wilsonbb commented Dec 4, 2023

Refactors TAPE to add support for tape.EnsembleFrames, a TAPE-specific extension of Dask DataFrames

While the Ensemble-level API for manipulating the Source and Object frames is still supported, users now

An EnsembleFrame is a Dask DataFrame tracked by the Ensemble which has

  • Direct support for using the Dask API on the underlying frame
  • tracking for whether or not it is a 'dirty' table due to an update rows and whether the table may need to be synced
  • an associated identifying label
  • a reference back to the Ensemble object which tracks the frame. This allows the frame to inform the Ensemble when a sync between tables (currently only Object and Source) may be required.

The Ensemble now can track multiple EnsembleFrames

  • Supporting functions such as Ensemble.select_frame(label), Ensemble.update_frame(frame), Ensemble.drop_frame(frame)
  • Access to the Object and Source frames as Ensemble.object and Ensemble.source which are subclasses of `Ensemble
  • Calls to the Object and Source tables such as result = Ensemble.source.query() now return a view of the updated frame without affecting the underlying Source and Object tables, until result.update_ensemble() is called
  • Minimal wrapper subclasses of Pandas dataframes such as TapeFrame, TapeObjectFrame, and TapeSourceFrame are used as Dask meta to help Dask infer which EnsembleFrame object to instantiate
  • A unique label is generated for tracking EnsembleFrames which are the output of Ensemble.batch. Users can specify their own label as an argument.

Note that the commits in this PR were independently reviewed as they were added to the tape_ensemble_refactor branch. (One exception is setting client=False in test_analysis.py to reduce flakiness from reusing a disconnected client).

wilsonbb and others added 30 commits August 25, 2023 15:38
…working

A minimal Dask Dataframe subclass for the Ensemble
…working

Make convert_flux_to_mag part of the EnsembleFrame
…working

Ensembles can now track a group of labeled frames
…working

Preserve EnsembleFrame metadata after assign()
Merge main into tape_ensemble_refactor
Merge updates from main into refactor working branch
…working

Subclass EnsembleFrames for Object and Source Tables
* Adds EnsembleFrame.update_ensemble()

* Use EnsembleFrames throughout the Ensemble

* Udpdate ensemble test

* Extends update_ensemble test cases

* Unpin sphinx to address docs build fail

* Fix minor test error

* Remove debug line
* EnsembleFrames should propagate is_dirty

* Test that a frame's dirty status propagates

* Update doc strings

* Address review comment
* Add ensemble loader functions for dataframes

* Updated unit tests

* Lint fixes

* Always update column mapping

* Addressed review comments

* Ensure object frame is indexed

* adds a dask_on_ray tutorial

* add performance comp; add use_map comment

---------

Co-authored-by: Doug Branton <[email protected]>
* Add ensemble loader functions for dataframes

* Updated unit tests

* Lint fixes

* Always update column mapping

* Addressed review comments

* Ensure object frame is indexed

* adds a dask_on_ray tutorial

* add performance comp; add use_map comment

* sync with map_partitions

* sync with map_partitions

* sync with map_partitions

* sync with map_partitions

* coalesce with map_partitions

* use dataframes instead of series

* add descriptive comments

* implement suggestions

* Update TAPE README.md

Update the project description for TAPE to better reflect the current state and goals of the project.

* Set object table index for from_dask_dataframe

* add zero_point as float input
:q
q

* add ensemble default cols

* S82 RRLyr notebook

* Move rrlyr nb to examples

* Update requirements.txt to unpin sphinx

* Update pyproject.toml to unpin sphinx

* add calc_nobs

* add calc_nobs

* add calc_nobs

* reduce scope of sync_tables

* address divisions issue

* add temporary cols test

* improve coverage

* add temporary kwarg to assign

* add temporary kwarg to assign

* drop divisions

* drop brackets

* fix bug in sync

* Issue 199: Added static Ensemble read constructors to tape namespace (#256)

* Added static read constructors to tape namespace
* Removed @staticmethod as python 3.9 didn't like it
* Reformatted via black
* Changed read_dask_dataframe to call from_ method
* Collapsed create dask client args to single arg
* Fixed dask_client parameter
* reformatted via black
* Added missing unit test
* Resolved code review comments from PR 256

* Fixed failing unit test

Removed reference to Ensemble._nobs_band_cols field

* fix bug in sync

---------

Co-authored-by: Doug Branton <[email protected]>
Co-authored-by: Konstantin Malanchev <[email protected]>
Co-authored-by: Olivia R. Lynn <[email protected]>
Co-authored-by: Chris Wenneman <[email protected]>
…280)

* FIx _Frame.set_dirty

* Update propgating data in map_partitions

* Fix typo
…g if row count changed (#281)

* Mark frames dirty without len() call

* Move calls to set_dirty to EnsembleFrame
Merge main into tape_ensemble_refactor
* Add meta handling for batch

* Add unit tests for custom meta

* Remove unit test sanity check, fix warning output

* Provide default labels for result frames.
wilsonbb and others added 7 commits November 30, 2023 16:32
* check divisions, enable lazy syncs

* check divisions, enable lazy syncs

* initial tests

* add tests; calc_nobs preserve divisions

* batch with divisions

* cleanup

* fix sf2 tests

* add sync_tables check

* cleanup

* fix calc_nobs reset_index issue

* per table warnings; index comments

* add map_partitions mode for calc_nobs when divisions are known

* build metadata

* build metadata

* add multi partition test

* add version file to init

* add small test

* Fix table syncing to use inner joins. (#303)

* Fix table syncing to use inner joins.

* fix lint error

* Update test

---------

Co-authored-by: Doug Branton <[email protected]>
…_the_merge

Revert "Merge Main into Ensemble Refactor Branch"
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link

codecov bot commented Dec 5, 2023

Codecov Report

Attention: 26 lines in your changes are missing coverage. Please review.

Comparison is base (626904c) 94.03% compared to head (a415078) 94.46%.

Files Patch % Lines
src/tape/ensemble.py 91.19% 17 Missing ⚠️
src/tape/ensemble_frame.py 96.28% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #308      +/-   ##
==========================================
+ Coverage   94.03%   94.46%   +0.43%     
==========================================
  Files          23       24       +1     
  Lines        1207     1518     +311     
==========================================
+ Hits         1135     1434     +299     
- Misses         72       84      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@wilsonbb wilsonbb marked this pull request as ready for review December 5, 2023 00:20
@wilsonbb wilsonbb requested review from dougbrn and hombit December 5, 2023 00:20
@wilsonbb wilsonbb requested a review from dougbrn December 5, 2023 23:51
Copy link
Contributor

@hombit hombit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I have few minor comments, but we could convert them to issues to not block the PR. Thank you, @wilsonbb and @dougbrn!

src/tape/__init__.py Show resolved Hide resolved
src/tape/ensemble.py Show resolved Hide resolved
src/tape/ensemble_frame.py Show resolved Hide resolved
src/tape/ensemble_frame.py Show resolved Hide resolved
new_frame.set_dirty(self.is_dirty())
return new_frame

def copy(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to implement __copy__()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that logically we should, but neither Dask in its overrides of Pandas nor dask-geopandas did so. I was a bit surprised by this and forgot about it, but I should revisit.

src/tape/ensemble_frame.py Show resolved Hide resolved
src/tape/ensemble_frame.py Show resolved Hide resolved
src/tape/ensemble_frame.py Show resolved Hide resolved
src/tape/ensemble_frame.py Outdated Show resolved Hide resolved
src/tape/ensemble_frame.py Show resolved Hide resolved
Copy link
Collaborator

@dougbrn dougbrn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, one question on whether to remove some Ensemble functions, but this can just be logged in an issue if we end up wanting to do that. Thanks so much for all the great work on this @wilsonbb !

src/tape/ensemble.py Outdated Show resolved Hide resolved
src/tape/ensemble.py Show resolved Hide resolved
@wilsonbb
Copy link
Collaborator Author

wilsonbb commented Dec 8, 2023

LGTM! I have few minor comments, but we could convert them to issues to not block the PR. Thank you, @wilsonbb and @dougbrn!

Thank you for the time taken to look at this!

@wilsonbb wilsonbb merged commit 1e8abff into main Dec 8, 2023
10 checks passed
@dougbrn dougbrn deleted the final_merge branch April 4, 2024 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants