EPIC - Time Grid #1057
FYI @jmcook1186 and @zanete I've added the design doc here for the time-grid proposal.
@jawache @zanete @narekhovhannisyan Some additional context on time-sync. OK, I agree wholeheartedly that the current time-sync is not perfect because of the need to position it correctly in each pipeline. What I understand from your design ticket is that the actual functionality of time-sync is OK, but it needs refactoring so that it runs as part of the framework execution flow and NOT as a plugin, which I also agree would be an improvement. So, now let's get into the nitty gritty, which is the mechanics of "snapping to the grid" - a phrase hiding a lot of complexity. When we say "snap to grid" we are proposing to execute all, or a subset, of the following operations: resampling, trimming and extrapolating.
Let's first look at the "right" way to do each of these before examining what's right for IF manifest files. It's logical to start with resampling, because unless your initial time series already matches the target resolution, resampling is needed before anything else.

Resampling

Resampling can mean coarsening data to a lower time resolution (e.g. seconds to minutes) - this is "downsampling" - or adding finer-resolution ticks between the existing ones (e.g. minutes to seconds) - this is "upsampling". Both require the creation of new information, which is typically done by fitting some function to the existing data and then evaluating that function at the new set of points. A common example is using a least-squares regression: you find a function that fits the existing observations and evaluate it at the new timestamps. Why don't we do this in IF? Several reasons:
What do we do instead? Today we assign each parameter an aggregation method (e.g. sum, average or copy). For all time series, we divide up time into intervals of 1s duration. Then we recompose a time series to the desired interval according to the method assigned to each parameter.
The resampling works as long as the time series is continuous, meaning there are no gaps between observations. Some combination of the above procedures is used to adjust the intervals of each time series so they match the target interval. In summary, we decompose every time series to 1s resolution, then batch those 1s observations into new timestamps with the desired interval.

Trimming

Once we have resampled, trimming is easy - we just discard any data whose timestamp is earlier than the defined start time, or later than the defined end time.

Extrapolating

We don't actually extrapolate any data to extend time series - we just fill with our aforementioned zeroish objects until the first timestamp matches the defined start, and the last timestamp matches the defined end. There are many ways we could choose to extrapolate the dataset, but we decided that it's best not to create data we don't have any knowledge of outside of the time range of the given observations.

How this works with aggregation

The method assigned to each parameter is used both to recompose values during resampling and to combine values during aggregation.

Implications for time-grid

We need to persist the coupling between time series resampling and aggregation.
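The decompose-to-1s-then-rebatch procedure described above can be sketched roughly as follows. This is a minimal illustration with a simplified observation shape and a hypothetical `resample` helper - not the actual time-sync implementation:

```typescript
type Observation = { timestamp: number; duration: number; value: number };
type Method = "sum" | "avg";

// Decompose each observation into 1-second ticks, then rebatch the ticks
// into the target interval. "sum" parameters are spread evenly across their
// ticks and re-summed; "avg" parameters are copied and re-averaged.
function resample(series: Observation[], targetInterval: number, method: Method): Observation[] {
  const start = Math.min(...series.map((obs) => obs.timestamp));

  // 1. Decompose to 1s resolution.
  const ticks: number[] = [];
  for (const obs of series) {
    for (let i = 0; i < obs.duration; i++) {
      ticks[obs.timestamp - start + i] = method === "sum" ? obs.value / obs.duration : obs.value;
    }
  }

  // 2. Rebatch ticks into observations of the target interval.
  const out: Observation[] = [];
  for (let t = 0; t < ticks.length; t += targetInterval) {
    const chunk: number[] = [];
    for (let i = t; i < Math.min(t + targetInterval, ticks.length); i++) {
      chunk.push(ticks[i] ?? 0); // gaps are filled with zero
    }
    const total = chunk.reduce((a, b) => a + b, 0);
    out.push({
      timestamp: start + t,
      duration: targetInterval,
      value: method === "sum" ? total : total / chunk.length,
    });
  }
  return out;
}
```

Downsampling works the same way: the 1s ticks are simply rebatched into a coarser interval, so the per-parameter method fully determines the result.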
All the logic I described above amounts to a lot of complicated code. It doesn't feel great to offload this to plugins unless we can move it into some shared utility they can all call. The ability to use "fills" as well as interpolations and copies is also very important, as we have to be able to maintain the "spikiness" of a time series after resampling. For example, if in a month's worth of daily observations a server is only spun up once for 30 mins, our time series must be flat except for a burst coincident with the real moment of activity for that server; we must not artificially smooth, redistribute or extend the values elsewhere in the time series.

Non-unique time series

Non-unique time series might pose some challenges for resampling, as we'll have to accept repeated timestamps but also redistribute values over time appropriately. There will need to be some logic to separate out time series with repeating timestamps into multiple unique time series first, resample them all individually, and then recombine. Otherwise, I'm struggling to imagine what the resampling algorithm looks like. We will also have to be able to provide custom fill values for certain named parameters.

Integration issues with some plugins

Some plugins such as WattTime, GitHub and others have limits on the time range and interval served by their API. This is a problem if we resample before making requests, but OK if we resample after making requests, assuming the initial time series is in range for the target APIs. We will need some strategy to gracefully handle failing API requests in those cases.
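The separate-resample-recombine idea for non-unique time series could be sketched like this. The helper name and the greedy bucketing strategy are assumptions for illustration; the real algorithm would also need to redistribute values appropriately:

```typescript
type Observation = { timestamp: number; duration: number; value: number };

// Split a series containing repeated timestamps into sub-series in which
// every timestamp is unique. Each sub-series can then be resampled
// individually and the results recombined (hypothetical helper).
function splitNonUnique(series: Observation[]): Observation[][] {
  const buckets: Observation[][] = [];
  for (const obs of series) {
    // Place the observation in the first sub-series that doesn't already
    // contain its timestamp; open a new sub-series if none exists.
    let target = buckets.find((b) => !b.some((o) => o.timestamp === obs.timestamp));
    if (!target) {
      target = [];
      buckets.push(target);
    }
    target.push(obs);
  }
  return buckets;
}
```
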
Good points @jmcook1186, I agree we need to think through all these points - regardless of where we end up, we need to document the snapping process and the complex nuanced issues that arise from different choices.
Very interesting point. I agree; if we are sampling, then we are making assumptions, and there are many ways those assumptions can be wrong. The only people who know for sure what data to use are the plugin author or the manifest builder; IF as a framework doesn't know and will need to be configured to act correctly.
The above also implies that the responsibility for returning data snapped correctly lies with the plugin/data source itself. We do quite a "rough" sampling, and maybe the WattTime folks have strong opinions that any sampling our framework does wouldn't do the data any good. I also think both those APIs will return data in the grid you want (I think - I haven't dived into them myself).
This is a super interesting point, and I hadn't considered it. There is an expectation, say for a time series of CPU util, that the CPU util is the average for the duration. There might be lots of variability inside the duration, but by the time we get the data, it's been averaged out over the duration. When it comes to a time series of the number of visits to a website, this is the number of visits for the duration. If gathering manually via Google Analytics, GA does the sampling/aggregation to get the data in the granularity you want. Regarding the Zoom call - something for which "duration" has a special meaning - we perhaps mean "number of minutes of Zoom calls for that duration", so in a day we might have 160 minutes of Zoom calls. So in a way, we are already getting some form of aggregated data, each with its own logic for how to aggregate depending on the type of data: CPU util does an average; visits and Zoom minutes do a sum. So even before we get the data, there is already some time-series maths going on. Who's responsible?
I agree it's like a hot potato and we're figuring out where it should land. If we put the responsibility on plugin authors, they just need to do this once, properly, for each plugin at code time. If we put the responsibility on the end user, then they will each need to build up expertise in time snapping to know what to choose for each situation. If we put the responsibility on ourselves, then we will need a framework which works for all use cases, and my worry is that to do this properly we'll end up building a competitor, or a very poor cousin, to one of the existing time series databases. (Putting this on ourselves also means putting this on the end user, since we are just making ourselves configurable.) So what I'm proposing is not to move TimeSyncing to the framework from a plugin, but to remove it entirely from the framework and into plugins. In "Grid Mode" all we would do is:
My assumption is that most existing importer plugins either don't care and already return data pre-snapped to a grid, or can be easily converted to do so. My second assumption is that anyone who needs to do some complex snapping can either use a library version of TimeSync or perhaps another solution like an in-mem DB or another 3rd-party library. I suppose my third assumption is that snapping to a grid is a rare activity: it's the first plugin's job, and the rest of the plugins just care about individual observations. The only plugin I know of that cares about snapping post-importing is WattTime, and their new API already does the snapping for you (I think, but I could be wrong). There is also the more manual use case: if you can't gather data pre-snapped, we could create a generic helper.
There is also the option of just dropping time-series support; it's so complex, with so many edge cases. Calculating a single value is very easy; spreading that over a time series seems to add an order of magnitude of complexity, and I'm not 100% sure it's worth it. (I know I've been the one banging my drum about time series/grids for a while.)
@jmcook1186 I can't stop thinking about what if we just dropped snapping/syncing altogether for now - maybe it's a 2.0 feature once we've got many more users.
I write all the above with the full knowledge that I've been championing time series all along, but it's hitting me that if we just drop it for now, we make a simpler product that's much easier to adopt.
And sorry, you're getting a stream of consciousness now 😅 One more thing - I just realised: what if we treated time as a continuous variable we sampled, instead of a discrete variable we bucketed? I still think this is 2.0; I like dropping time completely for now. Instead of snapping data to discrete buckets and aggregating up, what if grouping nodes just sampled data from underlying components? For the website SCI example, it's 31 discrete day buckets, but if you wanted you could just say "what was the carbon at midnight every Monday" and we just take carbon from whatever observations match. That would result in the grouping nodes having 4 observations instead of 31, and the grouping node observations would not have a duration - each is just the data from that instant of time, sampled from the sub-components. The totals for each component would aggregate up, so the overall total would be accurate. The sampled time series for each grouping node would not be accurate, but it would be an indicator of how the data is changing underneath. So we basically stop treating time series as the source of truth, and just treat them as a way to visualise how the underlying data is changing. Sorry - again, I still think this is 2.0, but I didn't want to lose this thought.
My gut reaction is that it would be a real shame to drop time series support. I understand that, if you are using IF as an alternative route to a PCF or LCA, then our treatment of time can be simplified a lot. But we are not just a reporting tool - we are perhaps most useful as a tool for introspection, decision making and mitigation analysis. In particular, we prioritise the SCI as a computable metric because of its appropriateness as a KPI. A useful KPI is addressable in near real time. If we don't handle data that's granular in time, we don't support real-time course corrections or the ability to model mitigation strategies into the future. So here's a counter-proposal with some simplification, better UX for users, and visualizer support, but without dropping time series altogether:
@jmcook1186 so based on our conversation yesterday, my position now is that the base-case use of IF should be one that doesn't need time-syncing/snapping. I agree that time-syncing adds a lot of value, but it also adds a lot of complexity for the end user. So there at least need to be some steps in the journey, and if we take the attitude that time-syncing is mandatory, the current first step is really high. One way we've made time-syncing mandatory is that the visualizer is mostly useless without time-synced data. The conversation we've been having in this ticket was about making that first step easier by making time-syncing easier, because time-syncing was seen by me as mandatory to getting any usefulness out of IF. But the other way to make the first step easier is just to say that the first step is without time-syncing, and time-syncing is an advanced feature - a second step in the journey. It's subtle, but it changes the priority of things and gives us a path. For instance, it means that we really should make the visualizer more useful even without time-synced data, and it helps us prioritize the training material, but it also makes me feel OK about time-syncing just being a fundamentally complex activity. It's step 2 - more advanced - and as a user you're going to have to think through some complex scenarios. Also, BTW, reading through your last response, "sync first, then import" feels to me the same as what I was proposing above - I suspect we're actually much closer aligned and just need to whiteboard it together.
To summarise the outcomes of our conversation yesterday:
Assuming this only yields minor refinements to the existing time-sync plugin, we will maintain the current status quo, with small tweaks to the behaviour to ensure we support simpler manifests.
@jmcook1186 it seems the first point is fixed - I tried moving time-sync to different positions in the pipeline (pipeline-with-mocks file) and got the same result (checked with an online YAML diff).
The current status of this is that we have confirmed time-sync can now run anywhere in a pipeline, but the position can lead to non-negligible differences in the final aggregated SCI value, because it changes exactly what data is operated on by the time-sync resampling. We just need to decide what we want to recommend as the accepted best practice. My base case is that we ought to recommend executing time-sync as the first step in the compute pipeline, but I am open to counter-suggestions. @jawache @narekhovhannisyan @manushak any opinions?
All that remains is a decision on what we want to recommend as the best practice, due to the side effects. Confirm with @jawache that we should recommend running it first (more in #1057 (comment)). Then it becomes a sentence update in the docs.
Status update: what remains is updating the documentation with the above best practice - @jmcook1186 🙏
@jmcook1186 I think as a stop-gap, "this will 100% change in the future, alpha" solution, then yeah, agreed - the start of the compute pipeline seems reasonable. TimeSync needs a unique time series to function correctly, and for some use cases that can only be achieved after the grouping step, so first in the compute pipeline seems reasonable. I remember originally proposing that this should happen together with grouping, as an implicit "IF-wide" config with global TimeSync settings so the same syncing is applied to all. I can't remember the rationale at the time, but during development the decision was made not to make it implicit and to keep it as a plugin. Let's not lose what those reasons were - perhaps make sure they are documented in the training so others can learn from them?
Sub of #1025
Tasks:
Terminology
Background
Since close to the inception of Impact Framework, we've ensured that time is one of the dimensions along which IF reports impacts. Where it makes sense, we want to surface how impacts change over time, so we can highlight the moments where impact is high and allow people to dive into why.
The problem is that we need to sync up the start, end, durations, and number of observations across each component to ensure everything snaps to a grid so that we can sum up data across all components for every time bucket.
When is having a synced set of observations essential?
Plugins don't need the data snapped to a grid; it's primarily useful for aggregation.
This is not easy; it has caused several problems and complexities, making writing manifest files unusually hard.
Problem
What determines the time window and durations of a component?
Currently, the `TimeSync` plugin - but you must be careful where you put it in the pipeline. It can only go in certain places or the manifest errors out, and it's tough for the end user to intuitively know where to place it. It's up to the user, component by component, to make sure the observations are snapped to one standard global time grid.
There is a complex relationship between grouping and time-syncing. Because `TimeSync` bleeds data from adjacent cells, it needs a unique time series to work. Grouping is needed for time syncing, and it's not always clear where you have to put the `TimeSync` plugin: it has to be after grouping, but it might not be immediately after. Time syncing also reduces the usefulness of grouping, since you can't group however you want.
Solution
Have the user define a global time grid and make the framework (including plugins) responsible for automatically aligning all the time series to that time grid.
We define a global `time-grid` setting where we define the `start`, `end`, and `window`. `start` and `end` can be hardcoded datetimes or relative offsets - e.g. `start` = `now`, `end` = `30 mins ago` - and `window` can be `60`. We start by making sure everything is snapped to the globally configured time grid.
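For illustration, resolving such a config into a concrete set of grid timestamps might look like the following. The types, the offset syntax parsing, and the seconds-based `window` are all assumptions - the actual `time-grid` syntax is not yet defined:

```typescript
// Hypothetical shape for the proposed global `time-grid` setting.
type TimeGridConfig = { start: string; end: string; window: number }; // window in seconds

// Resolve "now", relative offsets like "30 mins ago", or hardcoded datetimes.
function resolveBound(spec: string, now: Date): Date {
  if (spec === "now") return now;
  const m = spec.match(/^(\d+)\s*mins? ago$/);
  if (m) return new Date(now.getTime() - Number(m[1]) * 60_000);
  return new Date(spec); // fall back to parsing a hardcoded datetime
}

// Expand the config into the grid timestamps every time series must snap to.
function gridTimestamps(cfg: TimeGridConfig, now = new Date()): Date[] {
  let start = resolveBound(cfg.start, now);
  let end = resolveBound(cfg.end, now);
  if (start > end) [start, end] = [end, start]; // tolerate reversed bounds
  const out: Date[] = [];
  for (let t = start.getTime(); t <= end.getTime(); t += cfg.window * 1000) {
    out.push(new Date(t));
  }
  return out;
}
```

Everything downstream - plugins, aggregation, the visualizer - would then share this one set of timestamps rather than each component defining its own window.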
Plugins are responsible for ensuring that any observations they return are snapped to that global time grid. The `WattTime` plugin already uses the input data to decide what to return, so this solves that problem. We provide plugin authors some support in aligning their time series to the time grid:
- We pass a `TimeGrid` utility function to every plugin, which they can use to snap data to that grid if it isn't already.
- A plugin could be configured with `snap-to-time-grid: true`, and then we would automatically run time-sync on the output.
- Plugin authors could run `TimeSync` on their outputs.
- Users could set `snap-to-time-grid: true` themselves, but again, that would only work if the plugin returns a unique time series.

How does aggregation work in a world where all observations are guaranteed to snap to a time-grid?
For example, what if a component does not have a unique time series?

E.g. a component has a sparse time series: the component is missing some data in a cell. That's OK - we just sum up every cell it does have, up the tree.

E.g. a component represents servers spinning up and then down.
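A sketch of how summing sparse, grid-snapped components up the tree could work (assumed semantics: missing cells simply contribute nothing to their grid timestamp):

```typescript
type Cell = { timestamp: number; value: number };

// Aggregate several components' grid-snapped time series into one parent
// series: for each grid timestamp, sum whichever cells each component
// actually has. Sparse components just contribute nothing where they have
// no data (hypothetical helper illustrating the proposal).
function aggregate(components: Cell[][], grid: number[]): Cell[] {
  return grid.map((timestamp) => ({
    timestamp,
    value: components.reduce(
      (total, series) =>
        total + series.filter((c) => c.timestamp === timestamp).reduce((s, c) => s + c.value, 0),
      0,
    ),
  }));
}
```

Because every cell is guaranteed to sit on a grid timestamp, the parent's totals stay accurate even when children are sparse or briefly active.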
Advantages of this approach