Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add WaveformBrowser compatibility to DataLoader #484

Merged
merged 30 commits into from
May 3, 2023
Merged

Conversation

iguinn
Copy link
Collaborator

@iguinn iguinn commented May 1, 2023

Implemented DataLoader.browse, added DataLoader.load_iterator, and made necessary changes to LH5Iterator and WaveformBrowser to make these work well.

Details:

  • Improved ability of LH5Iterator to work across tiers and groups. It can now take a list of groups (enabling different groups from the same file). It can also "friend" together multiple LH5Iterators to read them in parallel, much like friending TChains in ROOT
  • lgdo.expand_path can now work with a base_path. This fixed some bugs associated with using LH5_Store with a base path
  • Added load_iterator to DataLoader. This constructs and returns an LH5Iterator across all tiers/groups/files. This can be used for iterating over files in chunks (although I see @gipert working on something similar). Right now this is only implemented for hits mode. This also doesn't solve the problem of needing to separately load the entry list first (some sort of lazy entry_list would be really nice here). Since the WaveformBrowser is built around LH5 iterator, this is useful to have.
  • WaveformBrowser can now accept an LH5Iterator as an input as an alternative to a file list and group list
  • Implemented DataLoader.browse to return a WaveformBrowser object. Note that this ignores the field_mask created by the DataLoader in favor of constructing it from the requested lines to draw.
  • Made LH5Iterator initialize faster by being "lazy" about mapping files. Before, a map of file lengths and entries in each file was constructed on start up; with large numbers of files, this caused a long hang from opening up all of the files and reading the size of each group. Now, the maps are constructed as you iterate through files; this gets rid of the hang on start up (although you still see a hang whenever it reaches a new file). This is a quality of life improvement for waveform browsing mainly.

@codecov
Copy link

codecov bot commented May 1, 2023

Codecov Report

Patch coverage: 70.05% and project coverage change: +0.23 🎉

Comparison is base (ecff792) 47.93% compared to head (ab95295) 48.17%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     legend-exp/pygama#484      +/-   ##
==========================================
+ Coverage   47.93%   48.17%   +0.23%     
==========================================
  Files         104      104              
  Lines       12314    12443     +129     
==========================================
+ Hits         5903     5994      +91     
- Misses       6411     6449      +38     
Impacted Files Coverage Δ
src/pygama/flow/data_loader.py 75.93% <38.46%> (-3.58%) ⬇️
src/pygama/vis/waveform_browser.py 71.03% <64.28%> (+1.03%) ⬆️
src/pygama/lgdo/lh5_store.py 86.23% <80.53%> (+0.05%) ⬆️
src/pygama/lgdo/compression/radware.py 21.53% <100.00%> (ø)
src/pygama/lgdo/compression/varlen.py 49.60% <100.00%> (ø)
src/pygama/lgdo/lgdo_utils.py 96.36% <100.00%> (+0.36%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@gipert gipert added vis Data visualization flow High-level data management labels May 2, 2023
@gipert
Copy link
Member

gipert commented May 2, 2023

Hi Ian, very nice!

Don't you think we should replace .next() with your .load_iterator()?

some sort of lazy entry_list would be really nice here

I totally agree.

@gipert gipert added this to the v1.3 milestone May 2, 2023
@gipert
Copy link
Member

gipert commented May 2, 2023

Oh, seems like Numba throws a new warning in the latest release...

@iguinn
Copy link
Collaborator Author

iguinn commented May 3, 2023

The error that is currently causing the checks to fail is due to a new warning that appears in numba 0.57.0 (released yesterday). I've submitted a bug report for it: numba/numba#8936. For now, I've added this to the list of numba versions to avoid in setup.

@iguinn
Copy link
Collaborator Author

iguinn commented May 3, 2023

Ok, now that all checks are passing again:

next() and load_iterator() currently aren't exactly the same, since next() returns data formatted as requested by the user, while get_iterator() returns only an LH5Iterator object (which returns lgdo.Table, i_entry, n_entries). It seems like next() will be easier for users, while LH5Iterator might be more useful as an internal function (it would be very useful for skimming files). Another option we could consider is to make the LH5Iterator one of the options for the output of load(). @gracesong312 or @jasondet do you have any thoughts?

Either way, if there are no objections I would suggest merging this quickly, as I think @erin717 is planning on using browse() in her tutorial.

@gipert
Copy link
Member

gipert commented May 3, 2023

In my view, DataLoader should always return LGDOs, which can be later viewed in alternative formats by the user (there is already Table.to_dataframe(), but we should add more functionality in the future, see legend-exp/legend-pydataobj#4). Looks a cleaner interface to me.

In this view, I like very much the idea that .load() returns an LGDO and .next() returns an LH5Iterator chunking the same LGDO.

Anyways, we can always change things at some point in the future, if you like the idea. You can go ahead and merge from my point of view!

@iguinn iguinn merged commit f22d9f6 into legend-exp:main May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flow High-level data management vis Data visualization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants