Improvements to Table Syncing #245

dougbrn · 2023-09-28T22:03:35Z

Related to #243

Table Syncs are a key component of the TAPE Ensemble design, where Object and Source are kept up to date with one another as the user operates on one or both tables. Currently with Table Syncs, we do the following:

Object to Source Syncs: When Object is modified, grab the index of Object and filter Source down to the rows whose ids appear in that index.
Source to Object Syncs: When Source is modified, group by each (object, band) pairing and count the rows to produce a skinny object table with nobs_total and one nobs column per band. Join the current object table to the new object table.
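
As a rough illustration of the two syncs described above, here is a minimal pandas sketch. It is not TAPE's actual implementation (TAPE works on Dask-backed tables); the function names, the `object_id`/`band` column names, and the toy data are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy tables; column names are illustrative, not TAPE's schema.
obj = pd.DataFrame(
    {"ra": [10.0, 20.0, 30.0]},
    index=pd.Index([1, 2, 3], name="object_id"),
)
src = pd.DataFrame(
    {
        "object_id": [1, 1, 1, 2, 2, 3],
        "band": ["g", "g", "r", "g", "r", "r"],
        "flux": [1.0, 1.1, 0.9, 2.0, 2.1, 3.0],
    }
)


def object_to_source(obj, src):
    """Keep only the source rows whose object id is still in the object index."""
    return src[src["object_id"].isin(obj.index)]


def source_to_object(obj, src):
    """Count rows per (object, band), pivot to one nobs column per band,
    add nobs_total, and join the result onto the object table."""
    counts = (
        src.groupby(["object_id", "band"]).size().unstack("band", fill_value=0)
    )
    counts.columns = [f"nobs_{band}" for band in counts.columns]
    counts["nobs_total"] = counts.sum(axis=1)
    # The inner join drops objects that no longer have any source rows.
    return obj.join(counts, how="inner")


synced_obj = source_to_object(obj, src)
synced_src = object_to_source(obj.loc[[1]], src)
```

The groupby, pivot, and join in `source_to_object` are the extra work that the rest of this issue is concerned with; `object_to_source` is a single index filter.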

With the Source to Object Sync, we are doing much more work to perform the sync, because we assume that these nobs columns will be useful to users. However, from discussion and some real workflows, it's clear that some users will not need these nobs columns for their work. In these cases, we are introducing needless operations to the workflow.

Given this, we should introduce more options for ensemble syncing.

At minimum, users should be able to have a "lean" sync mode that only drops rows from object/source when their ids do not occur in the other table (except in the case where the user has set the flag to have Object preserve ids with no source information).
Another available option should be to track just nobs_total for source to object syncs.
And, dependent on Source->Object Sync Slowness #243, have the option to sync the full nobs by band, as is currently done by TAPE.
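
To make the three options above concrete, here is a hedged pandas sketch of what a mode switch could look like. The function name `sync_source_to_object`, the `mode` strings, and the `keep_empty_objects` flag are hypothetical names chosen for this example, not an existing TAPE API.

```python
import pandas as pd


def sync_source_to_object(obj, src, mode="lean", keep_empty_objects=False,
                          id_col="object_id", band_col="band"):
    """Hypothetical source-to-object sync with three levels of effort."""
    if mode not in ("lean", "nobs_total", "nobs_by_band"):
        raise ValueError(f"unknown sync mode: {mode!r}")

    if mode == "lean":
        # Lean: only drop objects that have no rows left in source.
        if keep_empty_objects:
            return obj
        return obj[obj.index.isin(src[id_col])]

    if mode == "nobs_total":
        # One count per object, no per-band breakdown.
        stats = src.groupby(id_col).size().rename("nobs_total").to_frame()
    else:
        # Full per-band counts plus the total.
        stats = src.groupby([id_col, band_col]).size().unstack(band_col, fill_value=0)
        stats.columns = [f"nobs_{band}" for band in stats.columns]
        stats["nobs_total"] = stats.sum(axis=1)

    how = "left" if keep_empty_objects else "inner"
    out = obj.join(stats, how=how)
    # Objects kept without sources get a count of zero rather than NaN.
    zero_fill = {col: 0 for col in stats.columns}
    return out.fillna(zero_fill).astype({col: int for col in stats.columns})


# Toy data (assumed column names): object 3 has no source rows.
demo_obj = pd.DataFrame({"ra": [10.0, 20.0, 30.0]},
                        index=pd.Index([1, 2, 3], name="object_id"))
demo_src = pd.DataFrame({"object_id": [1, 1, 2], "band": ["g", "r", "g"]})

lean = sync_source_to_object(demo_obj, demo_src, mode="lean")
totals = sync_source_to_object(demo_obj, demo_src, mode="nobs_total",
                               keep_empty_objects=True)
```

In this sketch the lean mode never touches a groupby, which is the point of making it the cheapest option.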

Beyond this, nobs information is likely not the only information that a user may want to sync. As @hombit mentioned, duration (per band), mean mag, variability index, etc. are all examples of potentially trackable metrics. There is potential to abstract syncs so that users can set up their own sync criteria: given a set of input columns from one table, a function, and a set of output columns, TAPE is instructed to run that function whenever the dirty flag is set. We should make sure that any implementation has use cases demonstrating a clear upside over the alternative, where users just manually run functions to update columns when they need that column for the next analysis step.
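
One way to picture the abstraction described above is the following sketch. Everything in it is hypothetical: `SyncRule`, `RuleSyncEnsemble`, and their methods are names invented here to show the (input columns, function, output columns) idea, not a proposed or existing TAPE interface.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

import pandas as pd


@dataclass
class SyncRule:
    """Hypothetical user-defined sync criterion."""
    input_cols: Sequence[str]                # source columns the function reads
    func: Callable[[pd.DataFrame], tuple]    # one object's rows -> one value per output
    output_cols: Sequence[str]               # object columns the function fills


class RuleSyncEnsemble:
    """Toy stand-in for an Ensemble that reruns rules when source is dirty."""

    def __init__(self, obj, src, id_col="object_id"):
        self.obj, self.src, self.id_col = obj, src, id_col
        self.rules = []
        self.source_dirty = False

    def add_rule(self, rule):
        self.rules.append(rule)
        self.source_dirty = True  # force the new rule to run on the next sync

    def filter_source(self, mask):
        self.src = self.src[mask]
        self.source_dirty = True

    def sync(self):
        if not self.source_dirty:
            return
        # Lean part of the sync: drop objects with no remaining sources.
        self.obj = self.obj[self.obj.index.isin(self.src[self.id_col])]
        for rule in self.rules:
            rows = {
                oid: rule.func(group[list(rule.input_cols)])
                for oid, group in self.src.groupby(self.id_col)
            }
            result = pd.DataFrame.from_dict(
                rows, orient="index", columns=list(rule.output_cols)
            )
            # Replace any previous values of the output columns.
            self.obj = self.obj.drop(
                columns=list(rule.output_cols), errors="ignore"
            ).join(result)
        self.source_dirty = False


ens = RuleSyncEnsemble(
    obj=pd.DataFrame({"ra": [10.0, 20.0]},
                     index=pd.Index([1, 2], name="object_id")),
    src=pd.DataFrame({"object_id": [1, 1, 2],
                      "mjd": [0, 10, 5],
                      "mag": [15.0, 17.0, 20.0]}),
)
ens.add_rule(SyncRule(
    input_cols=["mjd", "mag"],
    func=lambda g: (g["mjd"].max() - g["mjd"].min(), g["mag"].mean()),
    output_cols=["duration", "mean_mag"],
))
ens.sync()
```

Note that every rule reruns on every dirty sync here, which is exactly the hidden cost the paragraph above asks us to justify against users calling such functions explicitly.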

dougbrn · 2023-10-02T18:13:33Z

Additional points from meeting discussion:

I still don't necessarily think we have strong use cases where having the sync generate any additional information is helpful to users, beyond just having an explicit function to generate that information when the user needs it.
A downside to the explicit function route is that these result columns can become out of date. For example, if I calculate nobs_total, then filter the source table, nobs_total is no longer correct.
In the above case, one potential consensus was that it would be good to flag the nobs_total column, when it's generated, as a column that can become out of date (a "transient" column? That term collides with astro lingo...). When the parent table of a "transient" column is dirty, the sync should drop that column. The motivation is that it's better for users to fail to grab bad information than to be able to operate on bad information.
Implementation-wise, the above would mean that explicit TAPE functions should have a kwarg to flag their outputs as "transient", defaulting to true in instances where it makes sense (nobs_total, nobs_by_band). This way users can turn it off if they'd like to proceed with potentially out-of-date data. We should also give users a section of the API to manually flag columns, though I suspect it wouldn't be used much.
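
A minimal sketch of the "transient" column idea, assuming a toy pandas-backed class. The names `TransientEnsemble`, `flag_transient`, `calc_nobs_total`, and the `transient` kwarg are hypothetical and only illustrate the behavior discussed above.

```python
import pandas as pd


class TransientEnsemble:
    """Toy stand-in for an Ensemble that drops stale derived columns on sync."""

    def __init__(self, obj, src, id_col="object_id"):
        self.obj, self.src, self.id_col = obj, src, id_col
        self.transient_cols = set()
        self.source_dirty = False

    def flag_transient(self, *cols):
        """Manually mark object columns as derived from source."""
        self.transient_cols.update(cols)

    def calc_nobs_total(self, transient=True):
        """Explicit calculator; its output is flagged transient by default."""
        self.sync()  # compute against up-to-date tables
        nobs = self.src.groupby(self.id_col).size()
        self.obj = self.obj.assign(
            nobs_total=nobs.reindex(self.obj.index, fill_value=0)
        )
        if transient:
            self.flag_transient("nobs_total")

    def filter_source(self, mask):
        self.src = self.src[mask]
        self.source_dirty = True

    def sync(self):
        if not self.source_dirty:
            return
        # Lean sync: keep only objects that still have source rows.
        self.obj = self.obj[self.obj.index.isin(self.src[self.id_col])]
        # Drop transient columns instead of leaving stale values behind.
        stale = [c for c in self.obj.columns if c in self.transient_cols]
        self.obj = self.obj.drop(columns=stale)
        self.transient_cols.difference_update(stale)
        self.source_dirty = False


def make_demo():
    """Build a small example ensemble (assumed column names)."""
    return TransientEnsemble(
        obj=pd.DataFrame({"ra": [10.0, 20.0]},
                         index=pd.Index([1, 2], name="object_id")),
        src=pd.DataFrame({"object_id": [1, 1, 2], "flux": [1.0, 5.0, 2.0]}),
    )


demo = make_demo()
demo.calc_nobs_total()                          # flagged transient by default
demo.filter_source(demo.src["flux"] < 4.0)
demo.sync()                                     # nobs_total is dropped here
```

With `transient=False`, the column survives the sync with its old values, which is the opt-out for users who accept potentially out-of-date data.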

Overall, it seems like we're leaning towards minimizing the amount of additional effort TAPE puts into workflows, and instead making sure that users have the tooling to do things more explicitly. Any fluff/flourish we're doing on the backend can really pile up on users looking to work with large datasets.

dougbrn · 2023-10-04T17:20:03Z

I'm going to take this and implement a lean sync (mentioned above) as a replacement for the current sync. We can still think about adding more sync options/generalization, but for now the lean mode seems to be the correct baseline behavior.

dougbrn · 2023-10-06T21:56:27Z

As of #254, we are now syncing just the IDs against one another (the "lean" sync described above).

dougbrn added the enhancement (New feature or request) label Sep 28, 2023

dougbrn mentioned this issue Sep 28, 2023

Add explicit nobs calculator functions #246

Closed

dougbrn mentioned this issue Oct 2, 2023

Source->Object Sync Slowness #243

Closed

dougbrn self-assigned this Oct 4, 2023

dougbrn mentioned this issue Oct 4, 2023

Reduce scope of sync_tables #254

Merged

dougbrn removed their assignment Aug 9, 2024
