Improvements to Table Syncing #245

Open
dougbrn opened this issue Sep 28, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@dougbrn
Collaborator

dougbrn commented Sep 28, 2023

Related to #243

Table Syncs are a key component of the TAPE Ensemble design, where Object and Source are kept up to date with one another as the user operates on one or both tables. Currently with Table Syncs, we do the following:

  • Object to Source Syncs: When Object is modified, take the Object index and filter Source down to the rows whose ids appear in that index.
  • Source to Object Syncs: When Source is modified, group by each (object, band) pair and count the rows to produce a skinny object table with an nobs column for each band plus nobs_total, then join the current object table to this new table.
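For illustration, the two syncs described above can be sketched in plain pandas. This is not TAPE's actual implementation (TAPE operates on Dask dataframes); the column names `object_id`, `band`, and `flux`, the helper function names, and the choice of an inner join are all assumptions made for this example.

```python
import pandas as pd

# Toy tables. Column and index names here are illustrative assumptions.
obj = pd.DataFrame(
    {"ra": [10.0, 20.0, 30.0]},
    index=pd.Index([1, 2, 3], name="object_id"),
)
source = pd.DataFrame(
    {
        "object_id": [1, 1, 1, 2, 2, 4],
        "band": ["g", "g", "r", "g", "r", "r"],
        "flux": [1.0, 1.1, 0.9, 2.0, 2.1, 5.0],
    }
)


def sync_object_to_source(obj, source):
    """Keep only source rows whose object id is still in the object table."""
    return source[source["object_id"].isin(obj.index)]


def sync_source_to_object(obj, source):
    """Rebuild per-band nobs columns from source and join them onto object."""
    counts = (
        source.groupby(["object_id", "band"])
        .size()
        .unstack("band", fill_value=0)
        .add_prefix("nobs_")
    )
    counts["nobs_total"] = counts.sum(axis=1)
    # Drop any stale nobs columns before joining the fresh ones.
    stale = [c for c in obj.columns if c.startswith("nobs_")]
    # Inner join (an assumption): objects with no sources are dropped.
    return obj.drop(columns=stale).join(counts, how="inner")


synced_source = sync_object_to_source(obj, source)
synced_obj = sync_source_to_object(obj, synced_source)
```

The groupby, unstack, and join in `sync_source_to_object` are the extra work that the Source to Object direction pays for on every sync, whether or not the nobs columns are ever read.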

With the Source to Object Sync, we are doing much more work to perform the sync, because we assume that these nobs columns will be useful to users. However, from discussion and some real workflows, it's clear that some users will not need these nobs columns for their work. In these cases, we are introducing needless operations to the workflow.

Given this, we should introduce more options for ensemble syncing.

  • At minimum, users should be able to use a "lean" sync mode, which only drops rows from Object/Source whose ids do not occur in the other table (except when the user has set the flag to have Object preserve ids with no source information).
  • Another available option should be to track just nobs_total for source to object syncs
  • And, dependent on Source->Object Sync Slowness #243, have the option to sync the full nobs by band as is currently done by TAPE
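As a rough sketch of the first two options, a lean sync and an nobs_total-only sync might look like the following in plain pandas. The function names, the `keep_empty_objects` flag name, and the `object_id` column are hypothetical and not taken from TAPE.

```python
import pandas as pd


def lean_sync(obj, source, id_col="object_id", keep_empty_objects=False):
    """Drop rows whose id does not appear in the other table.

    keep_empty_objects is a hypothetical stand-in for the flag that lets
    Object preserve ids with no source information.
    """
    source = source[source[id_col].isin(obj.index)]
    if not keep_empty_objects:
        obj = obj[obj.index.isin(source[id_col])]
    return obj, source


def add_nobs_total(obj, source, id_col="object_id"):
    """Track only the total number of source rows per object."""
    counts = source[id_col].value_counts()
    return obj.assign(nobs_total=counts.reindex(obj.index, fill_value=0))


# Toy data: object 3 has no sources, and source id 4 has no object.
obj = pd.DataFrame(
    {"ra": [10.0, 20.0, 30.0]},
    index=pd.Index([1, 2, 3], name="object_id"),
)
source = pd.DataFrame({"object_id": [1, 1, 2, 4], "flux": [1.0, 1.1, 2.0, 5.0]})

lean_obj, lean_source = lean_sync(obj, source)
kept_obj, _ = lean_sync(obj, source, keep_empty_objects=True)
counted_obj = add_nobs_total(kept_obj, lean_source)
```

The lean path needs only two membership filters and no groupby or join, which is why it is the cheapest of the three options.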

Beyond this, nobs information is likely not the only information that a user may want to sync. As @hombit mentioned, duration (per band), mean mag, variability index, etc. are all examples of potentially trackable metrics. There is potential to abstract syncs so that users can set up their own sync criteria: given a set of input columns from one table, a function, and a set of output columns, TAPE would be instructed to run that function whenever the dirty flag is set. We should make sure that any implementation has use cases demonstrating a clear upside over the alternative of users manually running functions to update columns when they need them for the next analysis step.
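One possible shape for such a user-defined sync is sketched below, purely as a discussion aid. `SyncRule`, `ToyEnsemble`, `register_sync`, and the `mag` column are all hypothetical names invented for this example; none of them exist in TAPE.

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd


@dataclass
class SyncRule:
    """Hypothetical rule: input columns, a function, and output columns."""

    input_cols: list[str]
    func: Callable[..., pd.DataFrame]  # receives a groupby over input_cols
    output_cols: list[str]


class ToyEnsemble:
    """Minimal stand-in for an ensemble with a source dirty flag."""

    def __init__(self, obj, source, id_col="object_id"):
        self.object, self.source, self.id_col = obj, source, id_col
        self.rules = []
        self.source_dirty = False

    def register_sync(self, rule):
        self.rules.append(rule)
        self.source_dirty = True  # force the rule to run on the next sync

    def filter_source(self, mask):
        self.source = self.source[mask]
        self.source_dirty = True

    def sync(self):
        """Re-run every registered rule, but only if source changed."""
        if not self.source_dirty:
            return
        for rule in self.rules:
            grouped = self.source.groupby(self.id_col)[rule.input_cols]
            result = rule.func(grouped)[rule.output_cols]
            self.object = self.object.drop(
                columns=rule.output_cols, errors="ignore"
            ).join(result)
        self.source_dirty = False


obj = pd.DataFrame({"ra": [10.0, 20.0]}, index=pd.Index([1, 2], name="object_id"))
source = pd.DataFrame({"object_id": [1, 1, 2], "mag": [10.0, 12.0, 20.0]})

ens = ToyEnsemble(obj, source)
ens.register_sync(
    SyncRule(
        input_cols=["mag"],
        func=lambda g: g.mean().rename(columns={"mag": "mean_mag"}),
        output_cols=["mean_mag"],
    )
)
ens.sync()
first_mean = ens.object.loc[1, "mean_mag"]

ens.filter_source(ens.source["mag"] > 10.5)
ens.sync()
second_mean = ens.object.loc[1, "mean_mag"]
```

In this sketch every registered rule runs on each dirty sync, so the cost scales with the number of rules. That is the trade-off against users calling the same function manually only when they need the column.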

@dougbrn dougbrn added the enhancement New feature or request label Sep 28, 2023
@dougbrn
Collaborator Author

dougbrn commented Oct 2, 2023

Additional points from meeting discussion:

  • I still don't necessarily think we have strong use cases where having sync generate additional information is more helpful to users than having an explicit function that generates that information when the user needs it.
  • A downside to the explicit function route is that these result columns can become out of date. For example, if I calculate nobs_total, then filter the source table, nobs_total is no longer correct.
  • In the above case, one potential consensus was that it would be good to flag the nobs_total column, when it's generated, as a column that can become out of date ("transient" column? Degeneracy with astro lingo...). When the parent table of a "transient" column is dirty, the sync should drop that column. The motivation is that it's better for users to fail to grab bad information than to be able to operate on bad information.
  • Implementation-wise, the above would mean that explicit tape functions should have a kwarg to flag their outputs as "transient", defaulting to true in instances where it makes sense (nobs_total, nobs_by_band). This way users can turn it off if they'd like to proceed with potentially out of date data. We should also give users a section of the API to manually flag columns, though I suspect it wouldn't be used much.
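A minimal sketch of the "transient" idea follows, assuming a toy ensemble class written only for this example. `ToyEnsemble`, `calc_nobs_total`, `flag_transient`, and the `transient` kwarg are hypothetical names here, not existing TAPE API.

```python
import pandas as pd


class ToyEnsemble:
    """Toy ensemble tracking object columns that go stale when source changes."""

    def __init__(self, obj, source, id_col="object_id"):
        self.object, self.source, self.id_col = obj, source, id_col
        self.source_dirty = False
        self.transient_cols = set()  # object columns derived from source

    def calc_nobs_total(self, transient=True):
        """Explicitly compute nobs_total; flag it as transient by default."""
        counts = self.source[self.id_col].value_counts()
        self.object = self.object.assign(
            nobs_total=counts.reindex(self.object.index, fill_value=0)
        )
        if transient:
            self.transient_cols.add("nobs_total")
        else:
            self.transient_cols.discard("nobs_total")

    def flag_transient(self, column):
        """Manually flag an object column as transient."""
        self.transient_cols.add(column)

    def filter_source(self, mask):
        self.source = self.source[mask]
        self.source_dirty = True

    def sync(self):
        """Drop transient columns if their parent table (source) is dirty."""
        if self.source_dirty:
            stale = [c for c in self.transient_cols if c in self.object.columns]
            self.object = self.object.drop(columns=stale)
            self.transient_cols.clear()
            self.source_dirty = False


def make_ensemble():
    obj = pd.DataFrame({"ra": [10.0, 20.0]}, index=pd.Index([1, 2], name="object_id"))
    source = pd.DataFrame({"object_id": [1, 1, 2], "flux": [1.0, 5.0, 2.0]})
    return ToyEnsemble(obj, source)


# Default: the stale column is dropped, so later access fails loudly.
ens = make_ensemble()
ens.calc_nobs_total()
ens.filter_source(ens.source["flux"] < 3.0)
ens.sync()
dropped = "nobs_total" not in ens.object.columns

# Opt out: the column survives the sync but may be out of date.
ens_keep = make_ensemble()
ens_keep.calc_nobs_total(transient=False)
ens_keep.filter_source(ens_keep.source["flux"] < 3.0)
ens_keep.sync()
stale_value = ens_keep.object.loc[1, "nobs_total"]
```

With the default, a later `ens.object["nobs_total"]` raises a `KeyError` instead of silently returning a count that no longer matches the filtered source table.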

Overall, it seems like we're leaning towards minimizing the amount of additional effort TAPE puts into workflows, and instead making sure that users have the tooling to do things more explicitly. Any fluff/flourish we're doing on the backend can really pile up for users looking to work with large datasets.

@dougbrn
Collaborator Author

dougbrn commented Oct 4, 2023

I'm going to take this and implement a lean sync (mentioned above) as a replacement for the current sync. We can still think about adding more sync options/generalization, but for now the lean mode seems to be the correct baseline behavior.

@dougbrn dougbrn self-assigned this Oct 4, 2023
@dougbrn
Collaborator Author

dougbrn commented Oct 6, 2023

As of #254, we are now syncing just the IDs against one another (the "lean" sync described above).

@dougbrn dougbrn removed their assignment Aug 9, 2024