Enable Ensemble.from_parquet to use parquet metadata files when available #348
Comments
Investigation Update
Performance Improvement
I've been digging into this today. Some really encouraging benefits to this are clear. Avoiding the initial
Dataset Details: The memory improvement is particularly noteworthy, as
Challenges to Overcome
Proposed Solution
From a usability perspective, it seems unacceptable to ask users to figure out for themselves whether they can use _metadata or not. It's really deep in the internals of the Ensemble, and the failure states of using it incorrectly are not obvious at all. From this, my thought is to restructure the kwargs of
The main usability improvement of this is that we would add a new default method called
For the "infer" option, it would be good to also do some logging of infer results, letting users know which option to pick for the next run to avoid the overhead of determining that option each time.
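To make the "infer plus logging" idea concrete, here is a rough sketch. The thread doesn't spell out how "infer" would decide, so this assumes the simplest possible rule: use the metadata only if a `_metadata` file exists in the data directory. The helper `resolve_use_metadata` and the `use_metadata` keyword are hypothetical names for illustration, not existing TAPE API.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("from_parquet_sketch")


def resolve_use_metadata(data_dir, use_metadata="infer"):
    """Decide whether to read the parquet _metadata file in data_dir.

    Hypothetical helper: accepts True, False, or "infer".
    """
    # An explicit choice from the user is returned untouched.
    if isinstance(use_metadata, bool):
        return use_metadata
    if use_metadata != "infer":
        raise ValueError(f"use_metadata must be True, False or 'infer', got {use_metadata!r}")

    # Assumed inference rule: the file is usable if it is present.
    found = (Path(data_dir) / "_metadata").is_file()

    # Tell the user what was inferred so they can skip the check next run.
    logger.info(
        "Inferred use_metadata=%s for %s; pass this value explicitly to skip inference.",
        found,
        data_dir,
    )
    return found


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    without_file = resolve_use_metadata(tmp)
    (Path(tmp) / "_metadata").touch()
    with_file = resolve_use_metadata(tmp)
    forced_off = resolve_use_metadata(tmp, use_metadata=False)
```

A real check would probably need more than file existence (for example, confirming the stored index is the chosen id column), but the logging pattern would stay the same.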
Ensemble.from_parquet has evolved to do a lot of TAPE-specific things, from setting up the column mapper to setting the index on the chosen id column. As a result, there's some friction between the function and the ability to support reading from the parquet _metadata files that are sometimes present in parquet data directories. In particular, Ensemble.save_ensemble generates a _metadata file for each EnsembleFrame, but Ensemble.from_ensemble only uses them for EnsembleFrames other than object and source, because those two are handled by Ensemble.from_parquet. We should investigate making use of these files when they are available, as they can offer speedups such as pre-defined divisions, which let us avoid any up-front divisions calculations.
Without this, we are doing needless computation, which undermines ensemble.save_ensemble as a time-saving option for users.
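For anyone unfamiliar with why divisions matter: in Dask, divisions are the sorted index boundaries of the partitions, with `npartitions + 1` entries. When they are known, finding the partition that holds a given id is a binary search instead of a scan over every partition. The toy function below only illustrates that idea with the standard library; `partition_for` and the example boundaries are made up for this sketch and are not part of TAPE or Dask.

```python
from bisect import bisect_right


def partition_for(divisions, key):
    """Return the index of the partition whose index range contains key."""
    if not divisions[0] <= key <= divisions[-1]:
        raise KeyError(key)
    npartitions = len(divisions) - 1
    # Partition i covers [divisions[i], divisions[i + 1]);
    # the last partition also includes its upper bound.
    return min(bisect_right(divisions, key) - 1, npartitions - 1)


# Three partitions covering ids 0-99, 100-199 and 200-300.
divisions = (0, 100, 200, 300)
first = partition_for(divisions, 0)
middle = partition_for(divisions, 150)
boundary = partition_for(divisions, 100)
last = partition_for(divisions, 300)
```

Without stored divisions, these boundaries have to be computed from the data itself before any such lookup is possible, which is the up-front cost described above.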