-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix table syncing to use inner joins. #303
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #303 +/- ##
=======================================
Coverage 94.03% 94.03%
=======================================
Files 23 23
Lines 1207 1207
=======================================
Hits 1135 1135
Misses 72 72 ☔ View full report in Codecov by Sentry. |
dougbrn
approved these changes
Dec 1, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for this!
wilsonbb
added a commit
that referenced
this pull request
Dec 1, 2023
* check divisions, enable lazy syncs * check divisions, enable lazy syncs * initial tests * add tests; calc_nobs preserve divisions * batch with divisions * cleanup * fix sf2 tests * add sync_tables check * cleanup * fix calc_nobs reset_index issue * per table warnings; index comments * add map_partitions mode for calc_nobs when divisions are known * build metadata * build metadata * add multi partition test * add version file to init * add small test * Fix table syncing to use inner joins. (#303) * Fix table syncing to use inner joins. * fix lint error * Update test --------- Co-authored-by: Doug Branton <[email protected]>
wilsonbb
added a commit
that referenced
this pull request
Dec 8, 2023
* A minimal Dask Dataframe subclass for the Ensemble * Addressed comments, added test fixture. * Make convert_flux_to_mag part of the EnsembleFrame * Ensembles can now track a group of labeled frames * Preserve EnsembleFrame metadata after assign() * Parquet support for frame subclasses checkpoint * Reverting changes to tests * Adds test for objsor_from_parquet * Addressed comments * Removed adding column via apply * Fix comment typo * Fix EnsembleFrame.set_index * Add update_ensemble() and Use EnsembleFrames (#252) * Adds EnsembleFrame.update_ensemble() * Use EnsembleFrames throughout the Ensemble * Udpdate ensemble test * Extends update_ensemble test cases * Unpin sphinx to address docs build fail * Fix minor test error * Remove debug line * Propagate EnsembleFrame._is_dirty (#264) * EnsembleFrames should propagate is_dirty * Test that a frame's dirty status propagates * Update doc strings * Address review comment * Have update_frame mark frames as dirty (#267) * Remove calls to set_dirty in ensemble (#269) * Update refactor (#274) * Add ensemble loader functions for dataframes * Updated unit tests * Lint fixes * Always update column mapping * Addressed review comments * Ensure object frame is indexed * adds a dask_on_ray tutorial * add performance comp; add use_map comment --------- Co-authored-by: Doug Branton <[email protected]> * Merge main into tape_ensemble_refactor (#277) * Add ensemble loader functions for dataframes * Updated unit tests * Lint fixes * Always update column mapping * Addressed review comments * Ensure object frame is indexed * adds a dask_on_ray tutorial * add performance comp; add use_map comment * sync with map_partitions * sync with map_partitions * sync with map_partitions * sync with map_partitions * coalesce with map_partitions * use dataframes instead of series * add descriptive comments * implement suggestions * Update TAPE README.md Update the project description for TAPE to better reflect the current state and goals of the project. * Set object table index for from_dask_dataframe * add zero_point as float input :q q * add ensemble default cols * S82 RRLyr notebook * Move rrlyr nb to examples * Update requirements.txt to unpin sphinx * Update pyproject.toml to unpin sphinx * add calc_nobs * add calc_nobs * add calc_nobs * reduce scope of sync_tables * address divisions issue * add temporary cols test * improve coverage * add temporary kwarg to assign * add temporary kwarg to assign * drop divisions * drop brackets * fix bug in sync * Issue 199: Added static Ensemble read constructors to tape namespace (#256) * Added static read constructors to tape namespace * Removed @staticmethod as python 3.9 didn't like it * Reformatted via black * Changed read_dask_dataframe to call from_ method * Collapsed create dask client args to single arg * Fixed dask_client parameter * reformatted via black * Added missing unit test * Resolved code review comments from PR 256 * Fixed failing unit test Removed reference to Ensemble._nobs_band_cols field * fix bug in sync --------- Co-authored-by: Doug Branton <[email protected]> Co-authored-by: Konstantin Malanchev <[email protected]> Co-authored-by: Olivia R. Lynn <[email protected]> Co-authored-by: Chris Wenneman <[email protected]> * Fix EnsembleFrame.set_dirty and map_partitions metadata propagation (#280) * FIx _Frame.set_dirty * Update propgating data in map_partitions * Fix typo * Ensemble.update_frame no longer infers if a frame is dirty by checking if row count changed (#281) * Mark frames dirty without len() call * Move calls to set_dirty to EnsembleFrame * Support storing batch results for custom meta (#285) * Add meta handling for batch * Add unit tests for custom meta * Remove unit test sanity check, fix warning output * Provide default labels for result frames. * Update Remaining TAPE Documentation Notebooks for the Refactor (#298) * Remove ._source and ._object * Update notebooks for refactor * Fix find-replace error * Update Docs for TAPE EnsembleFrame Refactor (#290) * Initial commit for notebooks with refactor API * Removed _object and _source references * Added sync tables example * Address comment * Addressed frame renaming * Update docs/tutorials/working_with_the_ensemble.ipynb Co-authored-by: Konstantin Malanchev <[email protected]> * Addressed comments * Clear output --------- Co-authored-by: Konstantin Malanchev <[email protected]> * Allow EnsembleFrame.compute to Trigger Object-Source Table Syncing (#295) * Allow EnsembleFrame.compue to sync tables * Fixed docstring * Add Explicit Metadata Propagation for EnsembleFrame joins (#301) * Support propagating frame metadata in joins * Update doc strings and test * Update test * Merge Main into Ensemble Refactor Branch (#304) * check divisions, enable lazy syncs * check divisions, enable lazy syncs * initial tests * add tests; calc_nobs preserve divisions * batch with divisions * cleanup * fix sf2 tests * add sync_tables check * cleanup * fix calc_nobs reset_index issue * per table warnings; index comments * add map_partitions mode for calc_nobs when divisions are known * build metadata * build metadata * add multi partition test * add version file to init * add small test * Fix table syncing to use inner joins. (#303) * Fix table syncing to use inner joins. * fix lint error * Update test --------- Co-authored-by: Doug Branton <[email protected]> * Revert "Merge Main into Ensemble Refactor Branch (#304)" This reverts commit 5c847e1. * Fix linting * Remove unsupported type annotations * Fix merge error * Use client=False in test_analysis * Remove '_object' and '_source' fields * Fix linting errors * Address review comments, add tests --------- Co-authored-by: Doug Branton <[email protected]> Co-authored-by: Konstantin Malanchev <[email protected]> Co-authored-by: Olivia R. Lynn <[email protected]> Co-authored-by: Chris Wenneman <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
In #283, we introduced logic to allow for lazy syncing of tables with known divisions. This was accomplished via a
join
call which was a left join by default.However since we always sync the object table before the source table, it's possible to create an arrangement of operations values removed from the source table's index can be synced back from a dirty object table. This PR updates
test_ensemble.test_sync_tables
to have such a failing case and fixes it by switching to inner joins.