Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EPIC - Time Grid #1057

Open
1 of 2 tasks
Tracked by #1025
jawache opened this issue Oct 18, 2024 · 14 comments
Open
1 of 2 tasks
Tracked by #1025

EPIC - Time Grid #1057

jawache opened this issue Oct 18, 2024 · 14 comments
Assignees
Labels
EPIC Used to denote an issue that represents a whole epic. Core team only
Milestone

Comments

@jawache
Copy link
Contributor

jawache commented Oct 18, 2024

Sub of #1025

Tasks:

  • Look at a deeper level to understand why using time sync in different places produces different results
  • Document the explanation / best practice

Terminology

  • Time Series - A set of observations that can have duplicate entries for the same time and duration.
  • Unique Time Series - A set of observations with only one entry for each time and duration.
  • Time Grid - A Time Series aligned to a globally configured grid.
  • Time Slot - A single time slot of a given duration.
  • Observation - A set of data that sits in a Time Bucket.
  • Time Syncing - The process of aligning a Time Series to a Time Grid.

Background

Since close to the inception of Impact Framework, we've ensured time is one of the dimensions that the IF reports impacts. We want to surface where it makes sense how the impacts change over time to make sure we can highlight the moments where it's high and allow people to dive into the why.

The problem is that we need to sync up the start, end, durations, and number of observations across each component to ensure everything snaps to a grid so that we can sum up data across all components for every time bucket.

When is having a synced set of observations essential?

  • It's essential for aggregation to function; all observations must be on the time grid.
  • It's essential for a tiny set of plugins, e.g., WattTime. But most plugins don't care about the observations being on a time grid.

Plugins don't need the data snapped to a grid, it's primarily useful for aggregation only.

This is not easy; it has caused several problems and complexities, making writing manifest files unusually hard.

Problem

What determines the time window and durations of a component?

  • The static observations OR the first plugin in the pipeline defines the time window and duration.
  • So, if we want everything on a particular grid, we need to configure every component so the first plugin returns the same time series OR statically add the observations to every component with the same time series.
  • There is the helper TimeSync plugin, but you must be careful where you put it in the pipeline, it can only go in certain places or the manifest errors out - it's tough for the end user to know where to place it in the pipeline intuitively.
  • It is very manual, something you must ensure is aligned for every component. If even one component is off, the whole manifest and aggregation fails.

It's up to the user, component by component, to make sure the observations are snapped to one standard global time grid.

There is a complex relationship between grouping and time-syncing

  • Since TimeSync bleeds from cells next to it, a unique time series is needed for it to work.
  • If a Time Series is not unique, we need to group it into components with a Unique Time Series to run TimeSync on the Unique Time Series.
  • So, grouping must ensure every component has a unique time series for TimeSync to work. We can't just group by whatever makes sense for the end user to group by; grouping also ensures that every component ends up with a unique time series.
  • We DON'T need a unique time-series for aggregation. A time series can have multiple observations for the same time bucket. If all the observations are on the same time-grid, horizontal and vertical syncing can happen.

Grouping is needed for time syncing, and it's not always clear where you have to put the TimeSync plugin, it has to be after grouping, but it might not be immediately after. Time syncing reduces the usefulness of grouping since you can't group it however you want.

Solution

Have the user define a global time grid and make the framework (including plugins) responsible for automatically aligning all the time series to that time grid.

We define a global time-grid setting where we define the start, end, and window. start and end can be hardcoded date times or relative offsets, e.g., start = now, end can be 30 mins ago, and window can be 60.

We start by making sure everything is snapped to the globally configured time grid:

  • By default, the framework passes a set of observations snapped to that time grid to each component, even if the component has no configured inputs.
  • If the component has static data already configured, and that static data is not snapped to that time grid, the framework errors out.
  • So, straight away, the starting point for all components is a set of observations snapped to a time grid.

Plugins are responsible for ensuring that any returning observations are snapped to that global time grid:

  • If a plugin returns observations that are not snapped to that time grid, the framework errors out
  • Importer plugins can look at their inputs to see the start, end, and window and use that information to determine the structure of the outputs it returns.
  • The WattTime plugin already uses the input data to decide what to return, so this solves that problem.

We provide plugin authors some support in helping them align their time series to the time grid:

  • Provide a TimeGrid utility function to every plugin, which they can use to snap data to that grid if it isn't.
  • We can configure a plugin in the initialize section with snap-to-time-grid: true, and then we would automatically run time-sync on the output.
    • This is also useful for backward compatibility, we can make all existing plugins work int the new arch by running TimeSync on their outputs.
  • Or a plugin with their returned config can tell IF to snap-to-time-grid: true, but again, that would only work if the plugin returns a unique time series.

How does aggregation work in a world where all observations are guaranteed to snap to a time-grid?

For example, a component does not have a unique time series.

  • T0 is the 0'index cell in the Time Grid.
  • There can be three observations at the T0 cell for the same component
  • This means this component does NOT have a unique time series, but it doesn't matter since we just aggregate all T0 cells up to the parent grouping node.

6bca1e2c3bf018391f0969b43f108244

E.g. Component has a sparse time series

The component is missing some data in cell, that's ok - we are just summing up every cell it does have, up the tree.

E.g. Component represents servers spinning up and then down.

9bcd1a52235099b0999adc39a3696cb8

  • This is an example of a sparse, non-unique-timeseries.
  • I imagine importers will often return sparse non-unique-timeseries like these, i.e. if you import all VM data for a Kubernetes cluster, you will import data that represents servers spinning up and spinning down.
  • The pressure to make sense of this data should now be put into the importers themselves instead of to the watchers and the other plugins in the framework.

91ddf05480c83bd4cb3b1f27b5187fe9

Advantages of this approach

  • All observations become atomic through this approach, and watchers don't need to care about TimeSync'ng. It takes the pressure off the end user, they don't have to worry about time syncing, it's implicit. Some plugins might need to work harder but that's ok.
  • Plugins are the best ones for figuring out how to TimeSync their own data. We don't need to maintain a units.yml to provide the data needed to understand how to TimeSync.
  • You could do re-grouping without re-computation. You could even do zero grouping, one single component for every server in your fleet, and still get a time series.
  • We return to just one "pipeline" followed by an optional grouping.
  • The time ID (T0, T1, index of the TimeGrid) connects back to the visualizer.
@jawache jawache added the EPIC Used to denote an issue that represents a whole epic. Core team only label Oct 18, 2024
@jawache jawache self-assigned this Oct 18, 2024
@jawache
Copy link
Contributor Author

jawache commented Oct 18, 2024

FYI @jmcook1186 and @zanete I've added the design doc here for the time-grid proposal.

@jmcook1186
Copy link
Contributor

jmcook1186 commented Oct 21, 2024

@jawache @zanete @narekhovhannisyan

Some additional context on time-sync and time-grid - eventually getting around to the potential pain points that we ought to discuss in refinement:

Ok, agree wholeheartedly that the current time-sync is not perfect because of the need to position it correctly in each pipeline.

what I understand from your design ticket is that the actual functionality of time-sync is ok, but it needs refactoring so that it runs as a part of the framework execution flow and NOT as a plugin, which I also agree would be an improvement.

So, now let's get into the nitty gritty, which is the mechanics of "snapping to the grid", which is a phrase hiding a lot of complexity.

When we say "snap to grid" we are proposing to execute all, or a subset of the following operations:

  • time series trimming (dumping unnecessary data from the start or end of the time series to make the start and end times match the time-grid config)
  • time series extrapolation (extending the available data so the start and end times match the time-grid config)
  • time series resampling (adjusting the number of elements in the time series array so that the observations are spaced according to the interval defined in time-grid config)

Let's first look at the "right" way to do each of these before examining whats right for IF manifest files. It's logical to start wth resampling, because unless your initial time series already has an interval equal to that defined in the time-grid config, you can't just trim from the start and end and have timestamps that synchronize - correcting the interval is a necessary precondition.

resampling

Resampling can mean coarsening data to a lower time resolution (e.g. seconds to minutes) - this is "downsampling" - or adding finer resolution ticks between the existing ones (e.g. minutes to seconds) - this is "upsampling".

Both require the creation of new information, which is typically done by applying some function to the existing data and then evaluating that function at the new set of points. A common example is using a least square regression. You find a function f(x) that minimizes the squared error between the known values at a set of x values and the estimates yielded by the function at those same x values. Depending on the type of data the evaluation might be at a point or it might be a definite integral between t and t+duration.

Why don't we do this in IF? Several reasons:

  • It requires the users to define in advance the type of function that's most appropriate to use to resample the time series (e.g. linear, polynomial, spline) and - if a polynomial, it's order. This is typically done by manually interrogating the data and it put a heavy burden on the user to do relatively deep pre-investigations of their time series before launching their IF run.
  • It is inflexible to different types of data we encounter in a typical observation. For example, some of our values remain constant even after resampling, some are additive, and some are proportions, meaning one function is not universally applicable even within a single observation.
  • It's computationally expensive to algorithmically explore a large parameter space to find the most suitable function to fit to each time series.

What do we do instead?

Today we assign each parameter a method that is used for both aggregation and time sync. The method reveals whether a value is additive, constant or proportional, and we dealwith each type individually.

For all time series, we divide up time into intervals of 1s duration. Then, we recompose a time series to the desired interval according to the method for each value:

  • for method==sum we divide the given value by the ratio of the existing interval to the desired interval and assign the result to each new timestep, yielding an even distribution of the value across the "new" time points - e.g. if we have carbon measured at 1 minute resolution we multiply the value by 1/60 to yield the value in each 1s timestep. Then, let's say we want to resample to 30s intervals, each timestep has a carbon value that's 30x the 1s value.
  • For method==copy we simply copy the same value into every timestep, regardless of the interval. For example, it would not make sense to scale down grid-intensity to smaller values during the upsampling to 1s resolution, or it would create spurious pockets of low intensity energy that did not really exist.
  • For method==avg we take the average of adjacent values and use that as the fill value for new time steps. This is used for value such as cpu-util where we want to capture a value expressed as a percentage, but don't want to sum them across time steps because that would create erroneously high values that were not really observed.

The resampling works as long as the time series is continuous, meaning tn+1 === tn + duration_n + 1. The complication comes when we have gaps in the time series. Then, we have to decide how to fill them. Today, we infill with "zeroish objects" which have the desired timestamps and durations required by the time-syncconfig but the other values are all set to zero, if they have method==sum or method==avg but set to their constant values if they have method==none. This isn't a perfect solution - for example, cpu-util == 0 can be interpreted as a server that is shut down (i.e. power==0.) or a server that is idle (and using ~50-70% of the power at 100% utilization).

Some combination of the above procedures is used to adjust the intervals of each time series so they match the interval of the time-sync config.

In summary, we decompose every time series to 1s resolution, then batch those 1s observations into new timestamps with the desired interval using an aggregation method defined for each parameter.

trimming

Once we have resampled, trimming is easy - we just discard any data whose timestamp is earlier than the defined start time, or later than the defined end time.

extrapolating

We don't actually extrapolate any data to extend time series - we just fill with our aforementioned zeroish objects until the first timestamp matches the defined start, and the last timestamp matches the define end. There are many ways we could choose to extrapolate the dataset, but we decided that it's best not to create data we don't have any knowledge of outside of the time range of the given observations.

How this works with aggregation

The reason this time-sync protocol works well with our time and component aggregation is because the methods used to resample each parameter are the same as the methods used to aggregate those parameters over time, or across components. It's very important to be able to configure those methods, because as we have seen for SCI vs PCF, sometimes it's right to average observations within a time series but sum them across components, or vice-versa, and some parameters only make sense to hold constant. We need a way to label these choices for each parameter and the right choice is context-dependent so it must be easy to change. It will also break things if you change the method for aggregation but not for time-sync or the other way around, so it really does make sense for both time-sync and aggregation to refer to the same variable to determine their method, rather than being defined separately.

Implications for time-grid

We need to persist the coupling between time series resampling and aggregation, because:

  • the method used is parameter-specific
  • the method used for each parameter is context-specific
  • the method for each parameter MUST be consistent between time-sync and aggregation, or we will inadvertently create artefacts in our resampled and aggregated results

All the logic I described above amounts to a lot of complicated code. It doesn't feel great to offload this to plugins unless we can move it into if-core and expose it in a simple way to plugins, similar to how we provide error handling.

The ability to use "fills" as well as interpolations and copies is also very important, as we have to be able to maintain the "spikiness" of time series after resampling, for example, if in a month's work of daily observations, a server is only spun up once for 30 mins, our time series must be flat except for a burst coincident with the real moment of activity for that server, and not artificially smooth, redistribute or extent the values elsewhere in the time series.

There are nuances to how we treat duration too that need to be more tightly specified as we refine this ticket. Today it's not completely clear how to use duration correctly, especially when we are using static time series hardcoded into the manifest. For example, say we have daily timesteps covering a month, and on five days in the month we hold zoom calls that last 3 minutes. We might set the duration to 1800 seconds to indicate that the call lasted that long, but since there is no energy used outside of that 30 minutes at all, and our data is at daily resolution, it's acceptable to have duration set to 86400 seconds (1 day), with the advantage that this yields a continuous time series. However, when time-grid is used to resample the time series by default, this will be problematic, as it will often create artefacts that distort the original distribution.

Non-unique time series might pose some challenges for resampling, as we'll have to be accepting of repeated timestamps but also redistribute values over time appropriately. There will need to be some logic to separate out time series with repeating timestamps into multiple unique time series first, resample them all individually, and then recombine. Otherwise, I'm struggling to imagine what the resampling algorithm looks like.

Since we can have non-unique time series in time-grid we need to be very robust in handling zeros and ensure we never error out because of zero values for any parameter during aggregation, as this situation will probably be more common.

We will also have to be able to provide custom fill values for certain named parameters. A good example is cpu-util which could be infilled with 0, off or an interpolate value, each having very different outcomes for the time series that can make the difference between realistic and plain wrong.

Integration issues with some plugins

Some plugins such as Watt-time, Github and others have limits on the time range and interval served by their API. This is a problem if we resample before making requests, but ok if we resample after making requests, assuming the initial time series is in range for the target APIs. We will need some strategy to gracefully handle failing API requests in those cases.

@zanete zanete moved this to In Refinement in IF Oct 21, 2024
@zanete zanete added this to the IF 1.0 milestone Oct 21, 2024
@jawache
Copy link
Contributor Author

jawache commented Oct 21, 2024

Good points @jmcook1186, I agree we need to think through all these points - regardless of where we end up, we need to document the snapping process and the complex nuanced issues that arise from different choices.

This isn't a perfect solution - for example, cpu-util == 0 can be interpreted as a server that is shut down (i.e. power==0.) or a server that is idle (and using ~50-70% of the power at 100% utilization).

Very interesting point. I agree; if we are sampling, then we are making assumptions, and there are many ways those assumptions can be wrong.

The only people who know for sure what data to use is the plugin or the manifest builder, IF as a framework doesn't know and will need to be configured to act correctly.

Some plugins such as Watt-time, Github and others have limits on the time range and interval served by their API. This is a problem if we resample before making requests, but ok if we resample after making requests, assuming the initial time series is in range for the target APIs. We will need some strategy to gracefully handle failing API requests in those cases.

The above also implies that the responsibility for returning data snapped correctly lies with the plugin/data source itself. We do quite a "rough" sampling, but maybe the WattTime folks have strong opinions that any sampling our framework does wouldn't do the data any good. I also think both those APIs will return data in the grid you want (I think, not dived into them myself).

There are nuances to how we treat duration too that need to be more tightly specified as we refine this ticket. Today it's not completely clear how to use duration correctly, especially when we are using static time series hardcoded into the manifest. For example, say we have daily timesteps covering a month, and on five days in the month we hold zoom calls that last 3 minutes. We might set the duration to 1800 seconds to indicate that the call lasted that long, but since there is no energy used outside of that 30 minutes at all, and our data is at daily resolution, it's acceptable to have duration set to 86400 seconds (1 day), with the advantage that this yields a continuous time series. However, when time-grid is used to resample the time series by default, this will be problematic, as it will often create artefacts that distort the original distribution.

This is a super interesting point, and I hadn't considered it.

There is an expectation, say for a time-series of CPU util, that the CPU util is the average for the duration. There might be lots of variability inside the duration, but by the time we get the data, it's been averaged out over the duration.

When it comes to a time series of the number of visits to a website, this is the number of visits for the duration. If gathering manually via Google Analytics, GA does the sampling/aggregation to get the data in the granularity you want.

Regarding the zoom call, something for which "duration" has a special meaning, we perhaps mean "number of minutes of Zoom call for that duration", so in a day, we might have 160 minutes of zoom calls?

So in a way, we are already getting some form of aggregated data, with their own logic for how to aggregate depending on the type of data. CPU util does an average, visits and zoom-minutes does a sum. So even before we get the data, there is already some timeseries maths going on.

Who's responsible?

All the logic I described above amounts to a lot of complicated code. It doesn't feel great to offload this to plugins.

I agree it's like a hot potato and we're figuring out where it should land.

If we put the responsibility on plugin authors, they just need to do this once properly for each plugin at code time. If we put the responsibility on the end user then they will each need to build up expertise in time snapping to know what to chose for each situation. If we put the responsibility on ourselves then we will need a framework which works for all use cases and my worry is that to do this properly we'll end up building a competitor, or a very poor cousin, to one of the existing time series databases. (Also putting this on ourselves, also means putting this on the end user, since we are are just making ourselves configurable).

So basically what I'm proposing is not to move TimeSyncing to the Framework from a Plugin, but to remove it entirely from the Framework and into Plugins.

In "Grid Mode" all we would do is:

  • If a component has no inputs, generate a series of just timestamp and duration snapped to our grid and that's what get's passed to the first plugin.
  • If a plugin returns a time-series that is not snapped to the grid, we error out.

My assumption is that most existing importer plugins either don't care, and already return data pre-snapped to a grid or can be easily converted to do so.

My second assumption is that anyone who needs to do some complex snapping, can either use a library version of TimeSync or perhaps another solution like an in-mem DB or maybe another 3rd party library.

I suppose my third assumption is that snapping to a grid is a rare activity, it's the first plugins job and the rest of the plugins just care about individual observations. The only plugin I know of that cares about snapping post-importing, is WattTime and their new API already does the snapping for you (I think) but I could be wrong.

There is also the more manual use case. If you can't gather data pre-snapped we could create a helper generic csv-ts-importer plugin which imports data, but also snaps it to a grid via config rules and TimeSync. But importantly, the plugin snaps the data to the grid and the framework itself doesn't do the snapping.

@jawache
Copy link
Contributor Author

jawache commented Oct 21, 2024

There is also the option one just dropping time-series support, it's so complex with so many edge cases.

As in calculating a single value is very easy, spreading that over a time-series seems to add an order of magnitude complexity and i'm not 100% it's worth it.

(I know I've been the one banging my drum about timeseires/grids for a while).

@jawache
Copy link
Contributor Author

jawache commented Oct 21, 2024

@jmcook1186 I can't stop thinking about what if we just dropped snapping/syncing all together for now, maybe it's a 2.0 feature once we've got many more users.

  • It makes writing a manifest file much much easier.
  • It makes teaching someone how to write a manifest file, much much easier.
  • It makes writing the visualiser easier (visualise data as a pie chart, rather than a time series). The visualiser only works right now if data is a grid, if it was just a pie chart of the different components then it would work for all manifests.
  • Removing the dimension of time makes comparison between two manifests easier, you just need to match nodes not nodes and times.
  • It aligns more with how PCFs are written/computed, you could just translate a PCF into an IMP if we didn't have to worry about time.
  • I suspect all the hackathon solutions didn't have grids or didn't take advantage of vertical aggregation.
  • So far we don't really have a great example of why time series analysis surfaces useful actions. The coarser node level data has surfaced useful data, e.g. it's all zoom.

I write all the above with the full knowledge I've been championing time series all along, but it's hitting me that if we just drop it for now, we make a simpler product that's much easier to adopt.

@jawache
Copy link
Contributor Author

jawache commented Oct 21, 2024

And sorry, you're getting a stream of consciousness now 😅 One more thing, I just realised what if we treated time as a continuous variable we sampled instead of a discrete variable we bucketed?

I still think this is 2.0, I like dropping time completely for now.

What if instead of snapping data to discrete buckets and aggregating up, what if grouping nodes just sampled data from underlying components?

So for the website SCI example, it's 31 discrete day buckets, but if you want you could just say "what was the carbon at midnight every Monday" and we just take carbon from whatever observations that match.

That would result in the grouping nodes having 4 observations instead of 31, and the grouping node observations would not have a duration, it's just the data from that instance of time, sampled from the sub components.

The totals for each component would aggregate up, so the overall total would be accurate.

The sampled time series for each grouping node would not be accurate, but they would be indicators of how the data is changing underneath.

So we basically stop treating time series as the source of truth, but just as a way to visualise how the underlying data is changing.

Sorry, again I still think this is 2.0 but didn't want to lose this thought.

@jmcook1186
Copy link
Contributor

@jawache

My gut reaction is that it will be a real shame to drop time series support. I understand that, if you are using IF as an alternative route to a PCF or LCA, then our treatment of time can be simplified a lot. But, we are not just a reporting tool - we are perhaps most useful as a tool for introspection, decision making and mitigation analysis. In particular, we prioritise the SCI as a computable metric because of its appropriateness as a KPI. A useful KPI is addressable in near real time. If we don't handle data that's granular in time, we don't support real time course corrections or the ability to model mitigation strategies into the future.

So here's a counter-proposal with some simplification, better UX for users, and visualizer support but without dropping time series altogether:

Confine time-sync to the observe phase

If you already have static data you can just use it as normal, as you skip observe anyway.
This also frees people that are already skilled to do their own preprocessing of their time series externally to IF, e.g. in Pandas or whatever, and then add it to their manifest, as they can just skip observe.

To support cases where people have data that's not naturally time-series have incomplete or uneven time series and they are fine with that, we can revisit aggregate to handle those data with clear caveats (i.e., aggregate just naively does sum/avg/copy of whatever it finds in each component's output array according to each parameter's method) and then aggregates the across components, without trying to do time slot by time slot aggregation - just don't configure time-sync and we'll still do what we can.

If you have patchy or incomplete time series that you do want to synchronize, then we do this in the observe phase as it is part of the data gathering and preprocessing stage of executing the manifest, not the compute phase. In this case, you "set the scene" early with correctly configured time series - even if we have to fall back to fill values because of failing importers etc, guaranteeing that if time sync is toggled on, then the compute phase will receive identically distributed data across all components and aggregate can work as it does today.

This has the additional benefit of syncing fewer parameters as most of the parameters that exist in a computed manifest are intermediates and results from compute plugins. As well as reducing the computational load on time-sync it also reduces the frequency of weird edge cases, because there are fewer parameters to handle.

The only config needed is start, stop and interval. This is applied globally so that by the end of observe we always have synchronized time series and we error out if that's somehow not true. Assuming the time stamps match across all components, it's fine to retroactively add a time index to each observation to support the visualizer if necessary.

Deeper dive into time-sync in observe

There's two ways this can happen.

sync first, then import

One is to do time-sync on the time series skeleton - i.e. before any data exists. it only generates an array of objects containing timestamp and duration. Then, the importers and other observe plugins have to generate input data for each of those moments.
The benefit of this is that time-sync is easy and fast and it doesn't have to deal with any edge cases - the plugins have to return the right value given the timestamp and duration.

The downside is that we'd have to handle cases where the plugin can't return valid results for every timestamp - maybe its an API that has a fixed range that doesn't match, or an API failure. Somewhere, we have to code some logic to adjust the API return value to a new time slot, or return a fill value. maybe this happens in the importer plugin - we make it a requirement that an importer can handle any time range or interval, even if it's not a native feature of the API the plugin wraps. So this does add some burden to importer plugins to ensure they can handle most start, stop and interval values gracefully.

This makes some sense, as each plugin will typically only have to impose logic for a small subset of parameters, which is a much smaller lift than writing a universal time-sync protocol that generalizes to everything.

Another significant benefit is that this is robust to any weirdness with plugins migrating in and out of observe and compute (e.g. we have had situations where plugins we would expect to be in observe have had to be moved into compute so they can use data yielded from other plugins in the pipeline).

import first, then sync

The alternative is to run the importers first, then sync the time series after it is enriched with input data. This is effectively what happens now, and I think we are basically lucky that we don't have known examples of use-cases where there are multiple conflicting importers that have to sync within a single component. We have a time range and interval that we pass to each importer and capture the results in our inputs array. The time series can still be gappy, incomplete or uneven at this point, but we pass the data to compute anyway and let each plugin operate over the data it sees. Then later, we pass everything to time-sync to make sense of and redistribute into correct time slots to support aggregate. There's a plausible halfway-house where we run our observe plugins and time sync them before running any compute plugins.

The benefit of this is good UX and DX as it's all abstracted away from plugin builders and users. The downside is we have to maintain a universal time-syncer as we don't know what importers people will build and use.

summary

I could be convinced to drop time series support, but it will be a heavy lift to add it again later on, and I think it's an important differentiator that is worth the work to maintain. I disagree that the time series hasn't surfaced important observations - in your Vienna talk you identified a day with anomalously high netlify builds explaining a peak in our website's SCI score. That's a great example of time series yielding information that can lead to behavioural changes that can immediately reduce SCI!

@jawache
Copy link
Contributor Author

jawache commented Oct 23, 2024

@jmcook1186 so based on our conversation yesterday, my position now is that the base case use of IF should be one that doesn't need time-syncing/snapping but I agree that time-syncing does add a lot of value, but also a lot of complexity for the end user.

So basically there at least needs to be some steps in the journey, and if we have the attitude that time-syncing is mandatory, the current first step is really high. One way we've made time-syncing mandatory is the visualizer is mostly useless without time-synced data.

The conversation we've been having in this ticket was about making that first step easier, by making time-syncing easier, because time-syncing was seen by me as mandatory to getting any usefulness out of IF. But the other way to make the first step easier is just to say that the first step is without time-syncing, and time-syncing is an advanced feature, a second step in the journey.

It's subtle but it changes the priority of things and gives us a path. For instance it means that we really should make the visualizer more useful even without time-synced data, it also helps us to prioritize the training material, but it also makes me feel ok about time-syncing just being fundamentally a complex activity. It's step 2, more advanced, as a user you're going to have to think through some complex scenarios.

Also, BTW, reading through your last response "sync first, then import" feels to me the same as what I was proposing above - I suspect we're actually much closer aligned and need to just whiteboard it together.

@jmcook1186
Copy link
Contributor

jmcook1186 commented Oct 24, 2024

@jawache @zanete

To summarise the outcomes of our conversation yesterday:

  • @narekhovhannisyan / @manushak will check whether time-sync is still sensitive to position in the pipeline - try with a long pipeline of combination of builtins / incl all of them
  • @narekhovhannisyan / @manushak will check whether we can aggregate components with un-sync'd time series, and if not (which I expect) I will refine a ticket to adjust this behaviour
  • TBC will try writing a manifest with only single-observation input data and see what issues arise

Assuming this only yields minor refinements to the existing time-sync plugin, we will maintain the current status quo, with small tweaks to the behaviour to ensure we support simpler manifests.

@narekhovhannisyan
Copy link
Member

narekhovhannisyan commented Nov 3, 2024

@jmcook1186 it seems the first point is fixed, tried to move the time sync to different positions in the pipeline (pipeline-with-mocks file), got the same result (checked by online yaml diff).

@zanete zanete moved this from In Refinement to In Progress in IF Nov 4, 2024
@jmcook1186
Copy link
Contributor

The current status of this is that we have confirmed time-sync can now run anywhere in a pipeline, but the position can lead to non-negligible differences in the final aggregated SCI value because it changes exactly what data is operated on by the time-sync resampling. We just need to decide what we want to recommend as the accepted best practise. My base case is that we ougt to recommend executing time-sync as the first step in the compute pipeline, but am open to counter-suggestions.

@jawache @narekhovhannisyan @manushak any opinions?

@zanete
Copy link

zanete commented Nov 22, 2024

all that remains is a decision on what we want to recommend as the best practice, due to the side effects. Confirm with @jawache that we should recommend it running first (more #1057 (comment)) Then it becomes a sentence update in the docs

@zanete zanete assigned jawache and unassigned jawache Nov 22, 2024
@zanete
Copy link

zanete commented Nov 25, 2024

status update: what remains is updating the documentation with the above best practice - @jmcook1186 🙏

@jawache
Copy link
Contributor Author

jawache commented Nov 26, 2024

@jmcook1186 I think as a stop gap "this will 100% change in the future, alpha" solution then yeah agreed with the start of the compute pipeline seems reasonable.

Time Sync to function correctly needs a unique time series and for some use cases that can only be achieved after the grouping step so first in the compute seems reasonable.

I remember originally proposing that this should happen together with grouping as an implicit "IF wide" config with global TimeSync settings so the same syncing is applied to all, I can't remember the rationale at the time but during development the decision was made to not make it implicitly and still keep it as a plugin. Let's not lose what those reasons were, perhaps make sure they are documented in the training so others can learn from them?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
EPIC Used to denote an issue that represents a whole epic. Core team only
Projects
Status: Pending Review
Development

No branches or pull requests

5 participants