EnsembleFrame Refactor Milestone 2: Add Ensemble Functionality for EnsembleFrame objects #193

Closed
dougbrn opened this issue Aug 11, 2023 · 0 comments · Fixed by #217 or #222
Labels
enhancement New feature or request

dougbrn commented Aug 11, 2023

#191 Milestone 2 focuses on adding functionality to the Ensemble that allows it to track and manage EnsembleFrame objects. For the purposes of this milestone, the new functionality would be added without worrying much about removing current Ensemble functionality. The goals are fairly straightforward:

  • An Ensemble.frames property added to track the EnsembleFrames tied to the Ensemble
    • Ensemble.frames = {"source": SourceFrame, "object": ObjectFrame, "Result1": EnsembleFrame, "model_params": EnsembleFrame}  # where each value is an instance of the class
  • Ensemble.source and Ensemble.object should be shorthands to access the required SourceFrame and ObjectFrame objects
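
As a rough illustration of the bookkeeping described above, here is a minimal sketch using plain-Python stand-ins rather than the real Dask-backed classes. The class bodies, the constructor arguments, and the `label` attribute are assumptions made for this example, not the existing implementation.

```python
class EnsembleFrame:
    """Hypothetical stand-in for the Dask-backed EnsembleFrame."""

    def __init__(self, rows, label=None):
        self.rows = list(rows)  # placeholder for the real dataframe contents
        self.label = label      # assumed: each frame knows its own label


class SourceFrame(EnsembleFrame):
    """Stand-in for the frame holding source (per-observation) data."""


class ObjectFrame(EnsembleFrame):
    """Stand-in for the frame holding object (per-object) data."""


class Ensemble:
    def __init__(self, source, object_):
        # frames maps each label to the frame instance tracked under it
        self.frames = {"source": source, "object": object_}

    @property
    def source(self):
        # Shorthand for the required SourceFrame
        return self.frames["source"]

    @property
    def object(self):
        # Shorthand for the required ObjectFrame
        return self.frames["object"]


ens = Ensemble(
    SourceFrame([{"id": 1, "flux": 2.0}], label="source"),
    ObjectFrame([{"id": 1}], label="object"),
)
```

Because the shorthands read from Ensemble.frames rather than holding separate references, replacing a frame in the dictionary keeps Ensemble.source and Ensemble.object in sync automatically.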

The following functions would be a minimal API:

  • Ensemble.select_frame(frame_label): Returns the associated EnsembleFrame object, which would allow a user to work with the EnsembleFrame directly using the Dask API. We'd need to make sure that the Ensemble itself is updated cleanly as the user works with their data.
  • Ensemble.frame_info(frames=None): Returns the information for a subset or all of the available frames, showing column information, memory usage, etc.
  • Ensemble.add_frame(dataframe, frame_label): Adds a dataframe to the Ensemble, useful for filtering an EnsembleFrame and adding the result to a new view.
  • Ensemble.update_frame(EnsembleFrame): Similar to add_frame, but uses the EnsembleFrame's own label to automatically update Ensemble.frames.
  • Ensemble.drop_frame(frame_label): Drops a frame from the Ensemble and closes it so that it doesn't persist in memory.
  • Ensemble.from_parquet(file_path, col_mapper=None): Loader functions would be at the Ensemble level and would yield a new EnsembleFrame tied to the provided label. This may not be the best format for loading Object and Source data. [from_parquet already exists as an Ensemble function; it may also be fine to wait on implementing this until further milestones.]
  • Ensemble.objsor_from_parquet(source_file, object_file, column_mapper) (?): A more structured function that loads in the ObjectFrame and SourceFrame data, with associated column_mappings. I will admit the function name here leaves a lot to be desired; maybe there's a better way to approach this.
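
To make the frame-management part of the proposal concrete, here is a minimal sketch of how select_frame, frame_info, add_frame, update_frame, and drop_frame could fit together. It uses a plain-Python stand-in for EnsembleFrame (a list of row dicts with an assumed `label` attribute) instead of a Dask dataframe, omits memory usage from frame_info, and leaves out the parquet loaders entirely. Everything beyond the method names and arguments listed above is an assumption.

```python
class EnsembleFrame:
    """Hypothetical stand-in for the Dask-backed EnsembleFrame."""

    def __init__(self, rows, label=None):
        self.rows = list(rows)  # one dict per row
        self.label = label      # assumed: each frame carries its own label

    @property
    def columns(self):
        # Sorted union of the keys seen across all rows
        return sorted({key for row in self.rows for key in row})


class Ensemble:
    def __init__(self):
        self.frames = {}

    def add_frame(self, frame, frame_label):
        # Register the frame under the given label and stamp the label on it
        frame.label = frame_label
        self.frames[frame_label] = frame
        return self

    def update_frame(self, frame):
        # Like add_frame, but the label comes from the frame itself
        if frame.label is None:
            raise ValueError("frame has no label; use add_frame instead")
        self.frames[frame.label] = frame
        return self

    def select_frame(self, frame_label):
        return self.frames[frame_label]

    def drop_frame(self, frame_label):
        # Only the Ensemble's reference is removed here; a Dask-backed
        # version would likely need extra cleanup to release memory
        del self.frames[frame_label]
        return self

    def frame_info(self, frames=None):
        # Summarize all tracked frames, or only the requested labels
        labels = list(self.frames) if frames is None else frames
        return {
            label: {
                "columns": self.frames[label].columns,
                "n_rows": len(self.frames[label].rows),
            }
            for label in labels
        }


ens = Ensemble()
ens.add_frame(EnsembleFrame([{"id": 1, "flux": 2.0}, {"id": 1, "flux": 9.0}]), "source")

# Filter a frame and add the result as a new view
source = ens.select_frame("source")
bright = EnsembleFrame([row for row in source.rows if row["flux"] > 5.0])
ens.add_frame(bright, "bright")

# Later changes to a labeled frame can be pushed back with update_frame
ens.update_frame(EnsembleFrame(bright.rows, label="bright"))
info = ens.frame_info(["bright"])
```

Returning `self` from the mutating methods is a design choice of this sketch that allows chaining; the issue does not specify return values.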

Finally, the above API may not be the optimal implementation. If there are thoughts on alternatives that may feel more intuitive to users, please feel free to explore them!
