Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define switch data link timeseries in TOML #6120

Merged
merged 3 commits into from
Jul 24, 2024

Conversation

bnaecker
Copy link
Collaborator

Note that this renames and reorganizes the timeseries a good deal. We want to include more of the switch / sled identifiers, and make the name of the timeseries more consistent with the existing sled data link and planned instance data link timeseries.

Note that this renames and reorganizes the timeseries a good deal. We
want to include more of the switch / sled identifiers, and make the name
of the timeseries more consistent with the existing sled data link and
planned instance data link timeseries.
@bnaecker
Copy link
Collaborator Author

Here's a bit more context. This completely abandons the existing timeseries like data_link:rx_bytes. Instead, it creates a new timeseries switch_data_link:bytes_received, to mirror the planned timeseries sled_data_link:bytes_received and guest_data_link:bytes_received. We don't currently have the ability to actually update timeseries yet (that's in-progress), but we can pretty easily abandon old ones and redefine new ones. I've taken that path here.

Once we're all happy with the actual new definition (fields, names, descriptions, etc), I have some other WIP in Dendrite to fetch the needed identifiers (for example, switch serial numbers), and populate the new timeseries definitions with them. At some point after that, I will explicitly expunge the existing data_link:* timeseries.

The other big change here is that I've removed all the per-FSM-state timeseries and made them all one, that keeps the state itself as a field. That's because I kind of expect we'll want to more easily aggregate across these states. That would be kind of tricky in OxQL now, because we don't currently support aggregating across timeseries, but you can aggregate within a timeseries. For example, we can still do this to get the same data as before:

> get switch_data_link:link_fsm | filter state = "ber_check_done"

That would be equivalent to this today:

> get data_link:b_e_r_check_done

We can then also do aggregations, say if we wanted to count all the transitions of any kind, not just in a particular state:

> get switch_data_link:link_fsm | align mean_within(1m) | group_by [state]

Anyway, this is all a bit of an experiment, but hopefully that provides some more insight into where we want to go and why this is the next step along the way.

@bnaecker
Copy link
Collaborator Author

Friendly ping on this one folks

Copy link
Contributor

@Nieuwejaar Nieuwejaar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's obvious enough that I probably don't even need to mention it, but... it sure would be nice if there were an automated way to ensure that this definition stays in sync with the upstream data source. I did a side-by-side visual comparison, but that's not the most confidence inspiring mechanism.

@bnaecker
Copy link
Collaborator Author

@Nieuwejaar What is the "upstream data source" in your comment?

@Nieuwejaar
Copy link
Contributor

@Nieuwejaar What is the "upstream data source" in your comment?

In this case, dendrite. Since our counters are defined by the ASIC, they are unlikely to change (until/unless we switch ASICs), but I was thinking of the more general problem of keeping these schema in sync with with those used by the producers.

@bnaecker
Copy link
Collaborator Author

Ok, I think I understand. One of the main goals of all the work to move the timeseries definitions into one central location is to ensure consistency. I have a follow-up PR in the works which will remove the existing timeseries definitions, e.g., the ones like make_cumulative!(Capacity);, and replace them with the corresponding timeseries from the new TOML definitions.

So if we decided to publish a new ASIC counter, we would add that to the TOML definitions, and then consume it in Dendrite and start publishing that when Oximeter comes knocking. Hopefully that assuages the concern about keeping things in sync: the text-file definitions in Omicron will be the single source of truth.

@Nieuwejaar
Copy link
Contributor

Hopefully that assuages the concern about keeping things in sync: the text-file definitions in Omicron will be the single source of truth.

Very much so. Thanks.

@zeeshanlakhani
Copy link
Collaborator

guest_data_link:bytes_received

This is what we called instance previously?

@zeeshanlakhani
Copy link
Collaborator

The other big change here is that I've removed all the per-FSM-state timeseries and made them all one, that keeps the state itself as a field. That's because I kind of expect we'll want to more easily aggregate across these states. That would be kind of tricky in OxQL now, because we don't currently support aggregating across timeseries, but you can aggregate within a timeseries. For example, we can still do this to get the same data as before:

Should we update/collate a readme to update OxQL queries here to match the changes? I had to planned to do this at some point too.

Copy link
Collaborator

@zeeshanlakhani zeeshanlakhani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjaminleonard This essentially maps what we have today and streamlines how we want to capture all these across the system, so 👍🏽. I know you mentioned that you're working on schema update, as we'll probably want to update fields over time as we add more.

@zeeshanlakhani
Copy link
Collaborator

Approved here, but had one question on documentation really.

@bnaecker
Copy link
Collaborator Author

guest_data_link:bytes_received

This is what we called instance previously?

This timeseries does not exist yet. We'll need to create it soon, and probably publish those samples to oximeter from the Propolis zone. I made this change here so the name and fields match up more directly with that to-be-created timeseries.

Should we update/collate a readme to update OxQL queries here to match the changes? I had to planned to do this at some point too.

The current place to do that is RFD 463, which I still owe an update on. Part of the motivation of the TOML-formatted timeseries definitions is so that we can auto-generate documentation too. Maybe it would be worth coordinating with @benjaminleonard and / or @david-crespo on how to create those and get them into our documentation pages and / or the console itself. That would be really sweet.

@zeeshanlakhani
Copy link
Collaborator

This timeseries does not exist yet. We'll need to create it soon, and probably publish those samples to oximeter from the Propolis zone. I made this change here so the name and fields match up more directly with that to-be-created timeseries.

Oh, I know that it didn't. I meant what I called it in https://github.com/orgs/oxidecomputer/projects/55/views/1?filterQuery=&pane=issue&itemId=68336554, maybe?

@bnaecker
Copy link
Collaborator Author

Oh, I see. Yeah instance_data_link:* might be better, I was going off my memory.

@bnaecker bnaecker merged commit 18f2520 into main Jul 24, 2024
23 checks passed
@bnaecker bnaecker deleted the move-switch-data-link-timeseries-to-toml branch July 24, 2024 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants