Ensemble Methods to retrieve a data subset #361

Merged
merged 15 commits into from
Jan 31, 2024
Conversation

@dougbrn (Collaborator) commented Jan 30, 2024

Change Description

Resolves #356. Adds new methods for retrieving a subset of Ensemble data.

  • My PR includes a link to the issue that I am addressing

Solution Description

I wrote some solution details in #356. In summary, this PR adds one new method, Ensemble.sample_objects, for returning a subset of objects and their sources. It also overrides the behavior of Dask's partition slicing so that a slice sets the dirty flag, meaning partition slices propagate from object to source and vice versa. Slicing object or source partitions therefore works as another valid way to sample the data.
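
The idea can be illustrated with a toy, pandas-only sketch. This is not TAPE's implementation: the `ToyEnsemble` class, the `sample_objects(frac, seed)` signature, the `take_object_rows` helper, and the `object_dirty`/`source_dirty` attributes are all assumptions made for illustration. Only the names `Ensemble.sample_objects` and `sync_tables` come from this PR. The sketch shows a random subset of objects pulling their sources along, and a slice on one table marking it dirty so that the other table is filtered on the next sync.

```python
import pandas as pd


class ToyEnsemble:
    """Toy stand-in for an object/source table pair (not the TAPE Ensemble)."""

    def __init__(self, objects: pd.DataFrame, sources: pd.DataFrame):
        # Both tables are indexed by the same object id.
        self.objects = objects
        self.sources = sources
        self.object_dirty = False
        self.source_dirty = False

    def sync_tables(self) -> None:
        """Propagate pending row removals between the two tables."""
        if self.object_dirty:
            # Keep only sources whose object still exists.
            self.sources = self.sources[self.sources.index.isin(self.objects.index)]
        if self.source_dirty:
            # Keep only objects that still have at least one source.
            self.objects = self.objects[self.objects.index.isin(self.sources.index)]
        self.object_dirty = False
        self.source_dirty = False

    def take_object_rows(self, start: int, stop: int) -> None:
        """Slice the object table (a stand-in for a partition slice) and mark it dirty."""
        self.objects = self.objects.iloc[start:stop]
        self.object_dirty = True

    def sample_objects(self, frac: float, seed: int = 0) -> "ToyEnsemble":
        """Return a new toy ensemble with a random subset of objects and their sources."""
        picked = self.objects.sample(frac=frac, random_state=seed)
        subset = ToyEnsemble(picked, self.sources)
        subset.object_dirty = True
        subset.sync_tables()
        return subset


objects = pd.DataFrame(
    {"nobs": [2, 1, 3, 1]},
    index=pd.Index([10, 11, 12, 13], name="id"),
)
sources = pd.DataFrame(
    {"flux": [1.0, 1.1, 2.0, 3.0, 3.1, 3.2, 4.0]},
    index=pd.Index([10, 10, 11, 12, 12, 12, 13], name="id"),
)

ens = ToyEnsemble(objects, sources)

# Route 1: random sampling returns a new, already synced subset (2 of 4 objects).
sampled = ens.sample_objects(frac=0.5, seed=42)

# Route 2: slicing one table marks it dirty; the sync filters the other table.
ens.take_object_rows(0, 2)  # keep objects 10 and 11 only
ens.sync_tables()           # sources belonging to 12 and 13 are dropped
```

In this toy version the sampled subset is a separate object, so the original tables are left untouched, while the slice route modifies the ensemble in place.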

Code Quality

  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

  • I have added a function that requires a sync_tables command, and have added the necessary sync_tables call

Bug Fix Checklist

  • My fix includes a new test that breaks as a result of the bug (if possible)
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

New Feature Checklist

  • I have added or updated the docstrings associated with my feature using the NumPy docstring format
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover my new feature
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

Documentation Change Checklist

Build/CI Change Checklist

  • If required or optional dependencies have changed (including version numbers), I have updated the README to reflect this
  • If this is a new CI setup, I have added the associated badge to the README

Other Change Checklist

  • Any new or updated docstrings use the NumPy docstring format.
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover any changes
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

@dougbrn dougbrn changed the title from "Sample ensemble" to "Ensemble Methods to retrieve a data subset" Jan 30, 2024
github-actions bot commented Jan 30, 2024

Before [6a694c4]   After [d88c99c]   Ratio   Benchmark (Parameter)
39.5±0.2ms         39.4±0.3ms        1       benchmarks.time_batch
43.8±2ms           42.4±0.2ms        0.97    benchmarks.time_prune_sync_workflow

codecov bot commented Jan 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f4109c6) 95.16% compared to head (fe82855) 95.26%.
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #361      +/-   ##
==========================================
+ Coverage   95.16%   95.26%   +0.09%     
==========================================
  Files          24       25       +1     
  Lines        1634     1667      +33     
==========================================
+ Hits         1555     1588      +33     
  Misses         79       79              

@dougbrn dougbrn marked this pull request as ready for review January 31, 2024 18:19
@dougbrn dougbrn requested a review from wilsonbb January 31, 2024 18:28
@wilsonbb wilsonbb left a comment

Overall, looks good to me.

I am worried about the client management issues, but I feel we can file a separate issue for that if preferred.

Comment on lines +508 to +513
# turn off cleanups -- in the case where multiple ensembles are
# using a client, an individual ensemble should not close the
# client during an __exit__ or __del__ event. This means that
# the client will not be closed without an explicit client.close()
# call, which is unfortunate... not sure of an alternative way
# forward.
@wilsonbb wilsonbb Jan 31, 2024

I feel there is a path to some ugly answers to this (implementing our own client manager with reference counting, having each ensemble keep track of its parents/children in a tree-like structure that gets updated on exit, etc.), but I'm not sure we need to solve this now.
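
For reference, the reference-counting option could look roughly like the sketch below. Everything here is hypothetical: `SharedClient`, `EnsembleLike`, and `FakeClient` are illustrative names, not TAPE or Dask APIs, and the only assumption about the real client is that it exposes a `close()` method. The last user to exit closes the client, so no individual ensemble closes it while others still hold it.

```python
class SharedClient:
    """Hypothetical reference-counting wrapper around any object with close()."""

    def __init__(self, client):
        self._client = client
        self._refs = 0

    def acquire(self):
        """Register one more user and hand back the wrapped client."""
        self._refs += 1
        return self._client

    def release(self):
        """Drop one user; close the client when the last user is gone."""
        if self._refs == 0:
            raise RuntimeError("release() called more times than acquire()")
        self._refs -= 1
        if self._refs == 0:
            self._client.close()


class FakeClient:
    """Stand-in for a real client so the sketch runs without a cluster."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class EnsembleLike:
    """Hypothetical ensemble that borrows the shared client for its lifetime."""

    def __init__(self, shared: SharedClient):
        self._shared = shared
        self.client = shared.acquire()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Only the last exiting user actually triggers client.close().
        self._shared.release()
        return False


fake = FakeClient()
shared = SharedClient(fake)

with EnsembleLike(shared) as parent:
    with EnsembleLike(shared) as child:
        pass
    closed_after_child = fake.closed   # the parent still holds a reference
closed_after_parent = fake.closed      # the last reference has been released
```

This keeps the bookkeeping in one place, at the cost of every ensemble having to go through the wrapper rather than holding the client directly.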

@dougbrn dougbrn (Collaborator, Author) replied
Yeah, agreed that this is something to keep an eye on, but it feels out of scope for this PR. I'll make an issue.

@dougbrn dougbrn requested a review from wilsonbb January 31, 2024 20:38
@wilsonbb wilsonbb left a comment

LGTM, thanks Doug!

@dougbrn dougbrn merged commit 482f368 into main Jan 31, 2024
13 checks passed
@dougbrn dougbrn deleted the sample_ensemble branch February 8, 2024 19:54
Successfully merging this pull request may close these issues.

Add possibility to do computation on subset of the dataset