Skip to content
This repository has been archived by the owner on Jan 14, 2025. It is now read-only.

Enable Ensemble.from_parquet to use parquet metadata files when available #348

Open
dougbrn opened this issue Jan 17, 2024 · 2 comments
Open
Labels
enhancement New feature or request

Comments

@dougbrn
Copy link
Collaborator

dougbrn commented Jan 17, 2024

Ensemble.from_parquet has evolved to do a lot of TAPE-specific things, from setting up the column mapper to setting the index on the chosen id column. As a result, there's some friction between the function and the ability to support reading from the parquet _metadata files that sometimes are present in parquet data directories. In particular, within Ensemble.save_ensemble we generate these _metadata files for each EnsembleFrame, but Ensemble.from_ensemble only uses them for EnsembleFrames that aren't object and source, as object and source are handled by Ensemble.from_parquet. We should investigate making use of these files when they are available, as they can offer speedups like pre-defined divisions which help us avoid needing to do any up-front divisions calculations.

Without this, we are doing some needless computation which will hurt the usefulness of ensemble.save_ensemble as being a time-saving option for users.

@dougbrn dougbrn added the enhancement New feature or request label Jan 17, 2024
@dougbrn dougbrn self-assigned this Jan 20, 2024
@dougbrn
Copy link
Collaborator Author

dougbrn commented Jan 20, 2024

Investigation Update

Performance Improvement

I've been digging into this today. Some really encouraging benefits to this are clear. Avoiding the initial set_index call seems to have significant performance and memory-usage improvements when we can use a _metadata file instead. When testing locally on the eclipsing binary demo, I got these results:

* Workflow with set_index divisions generation: 
    * Timing: 42 seconds
    * Memory:
        * Total Peak use: ~1.5 GB
* Workflow with _metadata divisions generation:
    * Timing: 27 seconds
    * Memory
        * Total Peak use: ~600MB
* Comparisons:
    * ~50% speed up
    * ~2x Improvement in peak memory efficiency

Dataset Details:
Total Size: ~10 GBs
Number of Sources: 453,653,104
Number of Objects: 3,492,890

The memory improvement is particularly noteworthy, as set_index (for this workflow at least) seems to be the main culprit of the excessive memory usage we've been noticing.

Challenges to Overcome

  1. Being able to utilize these _metadata file requires that the input parquet data is already sorted and cohesive (no source split across partitions) for the chosen index. The sorted aspect is of consequence as different chosen ids will sort in a different global ordering, and the cohesive element is particularly important as in the case where we have metadata but our parquet data is not cohesive, the generated divisions will cause a lot of sources to be un-locatable from the Dask API. Things like groupby's, locs, etc will only use the divisions to find a given id, missing that other sources for that id live in different partitions from the divisions-specified partition. This is all to say that using this metadata will not be a default that we can support, and that users will need to either verify themselves that it's usable, or save their dataset to disk using Ensemble.save_ensemble which will guarantee these needed priors and generate the correct metadata to use.
  2. There is some concern in the dask community about the usage of _metadata files at very large scale. Particularly in the case where the _metadata becomes so large that it can't fit into memory itself. I'm not too concerned with this at the moment as in the eclipsing binary case 10GBs of parquet data only produces ~0.5 MBs _metadata files. So the scale where this starts to become a potential issue is much larger than we're currently operating at.

Proposed Solution

From a usability perspective, it seems unacceptable to ask users to figure out for themselves whether they can use _metadata or not. It's really deep in the internals of the Ensemble, and failure states of incorrectly using it are not obvious at all. From this, my thought is to restructure the kwargs of Ensemble.from_parquet such that the sort, sorted, and whatever kwarg would have given a user control over this should be rolled into an divisions_method kwarg.

The main usability improvement of this, is that we would add a new default method called "infer". With infer, we would internally use the Ensemble.check_sorted and Ensemble.check_lightcurve_cohesion functions to determine the most efficient path to generating divisions. These functions may need to move into utility functions to not require a whole ensemble being spun up to use them. Executing these function will add overhead to the performance, but should ensure that loading will almost always succeed without them needing to know how to set the kwargs. Other options for divisions_method would be:

  • False; don’t calculate divisions at own risk
  • “metadata”; tells tape to load from the available metadata
  • “sorted”; tells tape to construct via a set index call when the index is known to be sorted
  • “sort”; tells tape to construct via a set index call with a full sort of the index

"infer" is the robust option that should take some decision-making off the user to know what their data looks like, the goal for speed is to probably be able to use “metadata” which would be the default of Ensemble.from_ensemble. This would encourage users to save to disk to get a version of their data that is most efficient (obviously this is a big ask for extremely large datasets). It's very likely we can enable this by default for workflows loading from hipscat/LSDB as well. But we should also consider some tooling to generate metadata files for a parquet directory if empty, but that will only work if check_lightcurve_cohesion is true so unsure how much it would really help.

@dougbrn
Copy link
Collaborator Author

dougbrn commented Jan 22, 2024

For the "infer" option, it would be good to also do some logging of infer results, letting users know which option to pick for the next run to avoid the overhead of determining that option each time.

@dougbrn dougbrn removed their assignment Aug 9, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant