
Investigate how to integrate better with releases #3758

Open

QuentinBisson opened this issue Nov 4, 2024 · 15 comments

Comments

@QuentinBisson

QuentinBisson commented Nov 4, 2024

Motivation

See this for context giantswarm/releases#1459 (comment)

Our current config mechanism broke releases really badly. We need to find a better way forward, because there is currently no way to be sure we do not break releases when making changes in the operators.

Let's investigate what we actually miss in releases today and fix that, instead of staying outside the process.

Todo

  • Investigate what configuration we actually need to be dynamic / regularly adapt (like secrets, etc)
    • create a list of configuration that our operators are actually handling
  • Investigate what configuration we want to keep dynamic as a safeguard for fast reaction times on incidents
  • Investigate how we technically can push configuration
    • this can be the observability bundle - but since we want to get rid of it some day, is it still the best place?
    • since the MC overrides some configuration in the WCs, this might be tricky.

Open Questions to clear with tenet

  • How do we configure apps in releases? Maybe through the Observability Bundle, but that has its own problems?
  • How can we get automatic patches? We don't want PRs every night, with a lot of blockers, just for config changes.
  • There will be some dynamic config - how do we handle those changes?

Outcome

  • We have a better understanding of how we can better follow the given release process
@github-project-automation github-project-automation bot moved this to Inbox 📥 in Roadmap Nov 4, 2024
@QuentinBisson QuentinBisson added the needs/refinement Needs refinement in order to be actionable label Nov 4, 2024
@AverageMarcus
Member

@QuentinBisson Would you be able to outline in this issue what features y'all are missing when it comes to Releases and what your current pain points are? I think that'd really help to focus the ideas.

Some things I'd like to know more about - are the problems you're seeing more related to getting changes out to MCs (where we've typically been really slow at rolling out updates, which is maybe what we need to fix) or, for example, are you seeing problems getting changes out to existing workload clusters that are needed to build consistency across our platform? (These are just examples to get the thoughts flowing 😉)

@Rotfuks Rotfuks removed the needs/refinement Needs refinement in order to be actionable label Nov 5, 2024
@QuentinBisson
Author

@AverageMarcus you posted just as we started refining this :D @Rotfuks added some things at the top, but we are starting a discovery :)

@Rotfuks
Contributor

Rotfuks commented Nov 5, 2024

Thanks for looking into this issue @AverageMarcus ! Talking to the lovely people of Tenet about this issue is definitely on our list - but we identified some preparation we can do first, to do you guys justice. So we'll reach out after some initial investigation, as soon as we're confident we can answer all the relevant questions at this stage.

@TheoBrigitte
Member

  • create a list of configuration that our operators are actually handling

Here is a list of what we are currently doing with our observability related operators

Observability operator

  • OpsGenie heartbeats link
  • Mimir ingress and basic auth secrets link
  • Observability bundle config to select monitoring agent (prometheus-agent or alloy) link
  • Prometheus agent configmap secret
    • Shard scaling up/down
    • Version
    • External labels
    • Mimir write URL + credentials
    • Queue config
  • Alloy (very similar to Prometheus agent) link
    • Vertical pod autoscaler
    • Cilium network policy
    • Service and pod monitors selectors
    • External labels
    • Mimir write URL + credentials
    • Queue config
  • Grafana Organizations link
    • Organization CRUD operations
    • Datasources
    • RBAC

Logging operator

  • Observability bundle config to select logging agent (promtail or alloy) link
  • Alloy config secret
    • PodLogs default resources
    • PodLogs support toggle
    • Structured metadata
    • External labels
    • Loki write URL + basic auth
  • Promtail config secret
    • Structured metadata
    • External labels
    • Loki write URL + basic auth
  • Event logger config secret
    • Loki write URL + basic auth
  • Logging credentials
  • Proxy auth link
    • read and write credentials and org ids
  • Loki datasource with credentials link

Prometheus meta operator

  • Alertmanager config and notification template
    • Config with opsgenie apikey, slack token, proxy url
  • OpsGenie heartbeats
    • Alertmanager heartbeat receiver
  • Prometheus Cilium network policy
  • Prometheus ingress with oauth
  • Prometheus
    • Scrape config
    • Service and pod monitors selector
    • Rules selector
    • RBAC
  • Remote write
    • Basic auth
    • Queue config
    • Write URL
    • Ingress with auth
  • Remote write CRs
    • CRD definition
    • CR reconciliation with Prometheus configuration
  • Prometheus agent
    • Remote write URL and basic auth
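
To give a concrete feel for the kind of objects these operators render, here is a minimal, purely illustrative Go sketch of a per-cluster remote-write secret (write URL plus basic auth), in the spirit of the "Mimir write URL + credentials" items above. The function, the secret naming scheme and the key names are assumptions for the example, not the real operator code.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// remoteWriteSecret builds the per-cluster secret an operator could hand to
// the agent: where to write metrics and which generated credentials to use.
// Naming scheme and keys are hypothetical.
func remoteWriteSecret(clusterName, writeURL, user, password string) *corev1.Secret {
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterName + "-remote-write", // hypothetical naming scheme
			Namespace: clusterName,
		},
		StringData: map[string]string{
			"remote-write-url":      writeURL,
			"remote-write-user":     user,
			"remote-write-password": password,
		},
	}
}

func main() {
	s := remoteWriteSecret("demo01", "https://mimir.example/api/v1/push", "demo01", "generated-api-key")
	fmt.Println(s.Name, "holds", len(s.StringData), "keys")
}
```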

@QuentinBisson
Author

@giantswarm/team-tenet this is what we discussed today based on what we currently know, but it might change with what we do not know yet :p

What we really need to integrate with releases:

  • Tenet - We need to be able to roll back or automatically deploy a patched version if the observability version is causing trouble (crazy alerting or instability), because when observability components are misbehaving on 200 clusters, this is unbearable. The root of the issue is that it is impossible to test a new release with big clusters or big MCs, so the impact usually only shows up when the releases are getting deployed, which is usually 6 months to a year after we brought the change into the release.

    • ideally, we need to be able to push a patch automatically to all affected clusters
    • could be a cluster release rollback (would need to be deployed automatically across all clusters) <- Maybe this is something we need in case we see more errors during an upgrade
    • could be an observability bundle release rollback (would need to be deployed automatically across all clusters) <- Could be a new cluster release, but this needs to be automatically applied
    • could be a manual app change as a workaround if the number of affected clusters is low so we can provide a proper fix
  • Tenet - We have to configure some dynamic configs and secrets with an operator for all clusters like:

    • Be able to disable metrics and logs at the cluster level
    • Agent authentication and storage backend host
    • Agent shard scaling up/down (could be replaced by Keda but it's on no one's radar)
    • Maybe external labels

    How can we achieve that?

  • Atlas - We need to be able to configure individual user values for observability bundle apps without going through the observability bundle user values

    • Do we need the observability bundle at all?

Let's wait for @Rotfuks to give his thoughts on that, but we're open to having a meeting with you guys to discuss it :)

@AverageMarcus
Member

it is impossible to test a new release with big clusters or big MCs

Why? Can we introduce load testing into our CI/CD pipelines? Do you have examples of what isn't possible to cover with testing? This sounds like something Tenet might be able to help with.

ideally, we need to be able to push a patch automatically to all affected clusters

This is a product decision and needs to be checked with customers, etc. This isn't something Tenet can decide to do, but can work on the technical implementation if that is the choice made. (paging @giantswarm/sig-product)

could be a cluster release rollback

Rollbacks will only be possible if all changed components support it (e.g. CRDs haven't changed or KubeVersion hasn't been bumped)

We have to configure some dynamic configs and secrets with an operator for all clusters like:

Why isn't this able to use our existing config-based approach? I assume it needs to be extended to WCs which I don't think we currently have but it sounds like something we should have generically rather than specifically for observability. Maybe Honeybadger would be good to discuss this with?

What is the actual problem this is needing to solve? This is a solution and not a requirement. A real-world example would be useful here.

@QuentinBisson
Author

Why? Can we introduce load testing into our CI/CD pipelines? Do you have examples of what isn't possible to cover with testing? This sounds like something Tenet might be able to help with.

One of the reasons is that to get actual load-testing results on daemonsets we need a lot of nodes, and the one time we created a 200-node cluster, it was not well received.

This is a product decision and needs to be checked with customers, etc. This isn't something Tenet can decide to do, but can work on the technical implementation if that is the choice made. (paging @giantswarm/sig-product)

I agree with you, but this is the only thing that blocks us from moving back to releases. This is not just about new features: the stability of the monitoring pipeline is quite hard to maintain, especially when it is the first thing that pages on tens of clusters when something goes wrong due to some unforeseen issue, and not being able to fix an issue for years is not okay when people constantly get woken up at night.

Why isn't this able to use our existing config-based approach? I assume it needs to be extended to WCs which I don't think we currently have but it sounds like something we should have generically rather than specifically for observability. Maybe Honeybadger would be good to discuss this with?

Because most of those are dynamic by nature. The scaling of shards is operator-driven, as it is based on the number of series in Mimir for a particular cluster, and that is not static over time.
The API keys used to authenticate each workload cluster are generated by the operators as well, so this would never work with the config management we have.
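
To illustrate why this kind of configuration is dynamic by nature, here is a rough Go sketch of deriving the shard count from the live series count; the one-million-series-per-shard target and the function shape are assumptions for the example, not the actual operator logic.

```go
package main

import (
	"fmt"
	"math"
)

// seriesPerShard is an assumed target per agent shard; the real operator
// tunes this value, this is only for illustration.
const seriesPerShard = 1_000_000

// desiredShards grows or shrinks the shard count with the live series count
// reported by Mimir for a cluster, which is why it cannot be static config.
func desiredShards(headSeries float64) int {
	if headSeries <= 0 {
		return 1
	}
	return int(math.Ceil(headSeries / seriesPerShard))
}

func main() {
	// In a real reconcile loop the series count would come from querying
	// Mimir for the cluster's tenant; here it is a hard-coded example.
	fmt.Println(desiredShards(3_400_000)) // prints 4
}
```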

The requirement is that we need to be able to deploy a patch on all clusters when hell breaks loose. I can't give you an example, because we built the things we did precisely so that it would not happen again.

What I remember is that about two years ago we had some buggy prometheus-agent releases (incorrect settings like scraping too often, or dropping a KSM label we thought we did not need), and the decision from Phoenix back then was to not let us create a patch release because of the PSP/PSS work and the death-of-vintage migration. We got hundreds of alerts about monitoring pipeline instability due to that, and the situation lasted for more than a year for various reasons. We definitely do not want that to happen again, and the current state of the CAPI releases, and more specifically the lack of customer upgrades, does not give me confidence that it will not happen again.

@Rotfuks
Contributor

Rotfuks commented Nov 28, 2024

Thanks for challenging this @AverageMarcus !
I think in general it's an insurance thing - if we break stuff it gets very expensive very fast, be it a lot of false pages, no pages at all while there are actual incidents, or cost explosions in large clusters because of data flooding.

So we see two options on how we can handle this risk:

    1. Have extensive testing to make sure we NEVER release something that might break.
    2. Make sure if something breaks we can fix it as fast as possible.

We sadly have no high confidence of ever achieving 1), because the observability platform allows so much configuration unique to each customer (talking about the number of nodes monitored, the number of servicemonitors/podlogs on customer-specific workload apps, etc.) and operates on edge cases like alba at a scale that we can hardly reproduce. We've already seen that even with Loki and Mimir, which should be safer when it comes to scaling, we still uncovered a lot of issues only at a larger scale that we have nowhere besides production installations.

This is why in the past we always relied heavily on 2) and want to continue relying on fast fixes as a fallback. That historically led us to that long list of configuration managed by operators and to the need to realign with releases, because we've gone so far that the dynamic nature of our platform leads to its own issues. That's also why we're afraid of going too far in the other direction again and becoming too static, losing the ability to fix things fast when something happens that was unforeseeable in our test environments.

Now I think the future is both 1 and 2. We need to become way better at testing - we already had a full epic focused on giving us better and earlier feedback loops, and we're really happy with the progress of the e2e test setup you're building - but in order to be really confident in our platform we still need to be able to fix things fast, without a general release dependency and process as a bottleneck.

But maybe we just need to better understand how we see patches and how fast we can be when patching a release? What do you think about that?

@AverageMarcus
Member

So we see two options on how we can handle this risk:

    1. Have extensive testing to make sure we NEVER release something that might break.
    2. Make sure if something breaks we can fix it as fast as possible.

I see a third option we can (also) include - build guardrails into our actual product to prevent these issues at runtime.

But I think realistically we need a combination of all three approaches, to some degree.

We sadly have no high confidence of ever achieving 1), because the observability platform allows so much configuration unique to each customer (talking about the number of nodes monitored, the number of servicemonitors/podlogs on customer-specific workload apps, etc.) and operates on edge cases like alba at a scale that we can hardly reproduce. We've already seen that even with Loki and Mimir, which should be safer when it comes to scaling, we still uncovered a lot of issues only at a larger scale that we have nowhere besides production installations.

This sounds like we NEED to invest in load testing. Otherwise our customers are doing it for us in production. Other companies are capable of achieving this, so why exactly do you think we couldn't? What makes us so special or unique in this regard?

This is why in the past we always relied heavily on 2) and want to continue relying on fast fixes as a fallback.

Just remember that every time we do this (and it's been a fair few times in recent memory) it erodes some of the confidence our customers have in our abilities. This should ALWAYS be the last choice.

In general I think we need something a bit in the middle between what we have today and what releases provide. I think we still need releases that deploy a known and fixed version of specific operators but those operators should be able to self-heal within the clusters as needed. They shouldn't change behaviour but they should include guardrails to prevent things blowing up and in those situations we get notified and can work with the customers to resolve (via upgrading) the issue without the need to rush a quick fix.
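
To make the guardrail idea a bit more tangible, a very small Go sketch along those lines could look like the following; the shard ceiling and the notification hook are purely hypothetical, not something that exists today.

```go
package main

import "fmt"

// maxShards is an assumed safety ceiling; in this idea it would be pinned by
// the release rather than changed at runtime.
const maxShards = 10

// applyShardGuardrail clamps the desired shard count and reports whether the
// ceiling was hit, i.e. whether humans should be notified and follow up.
func applyShardGuardrail(desired int) (applied int, needsAttention bool) {
	if desired > maxShards {
		return maxShards, true
	}
	return desired, false
}

func main() {
	applied, alert := applyShardGuardrail(14)
	fmt.Printf("scaling to %d shards\n", applied)
	if alert {
		// Placeholder for "we get notified": emit an event or page instead
		// of silently scaling past the guardrail.
		fmt.Println("guardrail hit: notify and resolve with the customer via an upgrade")
	}
}
```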

I'm not sure exactly what that looks like though, as I don't understand enough about your team's stuff, so maybe this is completely impossible, but I want to make sure it is at least thought about and maybe PoC'd if possible.


On a side note, this might be a more general-purpose discussion that should be had in KaaS-sync or a similar meeting, as I suspect other teams have similar "break glass" desires that should be taken into consideration.


I'd also like to hear thoughts from my other @giantswarm/team-tenet friends as they are more knowledgeable about the actual technical side of Releases compared to me. I'm mostly just focusing on the customer experience side of things 😄

@AverageMarcus
Member

(Also, I'm really liking this discussion! Great bit of cross-team collab! 💙 )

@puja108
Member

puja108 commented Dec 3, 2024

I still feel that mid-term we do need a better approach to being able to roll out specific bundles or apps more independently from the full WC release (I know this would create a test matrix), or alternatively a way to easily create atomic cluster releases that change only a single app - so more of a roll-forward than a rollback.

That would enable some of what Atlas is asking for here, and as those atomic releases should not roll anything but the app(s) themselves, it should also be easier to argue with customers that we are rolling them out quickly to remediate current issues and could skip the typical change processes that might take too long, like what we're currently facing with the move to CAPI and then to v29 and beyond. Still, it feels like the tying into the WC release has created big dependencies between teams and areas, and has also made customers very wary of doing any upgrade that is a WC version change. IIRC, when we were rolling out apps separately, we often saw customers being happy to roll an app much quicker than their normal change process.

@puja108
Member

puja108 commented Dec 3, 2024

cc @alex-dabija as this is somehow also related to the current release/upgrade topics you are looking into

@Rotfuks
Contributor

Rotfuks commented Dec 3, 2024

I see a third option we can (also) include - build guardrails into our actual product to prevent these issues at runtime.

Ah yeah, definitely. I'm not yet fully sure how we can do that, but I think a lot of our postmortems are designed to do exactly that - change our product in a way that the same thing won't happen again. Maybe we can leverage Kyverno a bit better or generally use the @giantswarm/team-shield policy API in a way that stabilises our releases a bit more, but at this stage I have a feeling it's more of an "it has to break for us to learn" thing.
Maybe we should think about some chaos engineering day/week for the observability platform to see more things break and get some directions on where we can improve. Do we have some experience with chaos engineering in @giantswarm/team-atlas ? :D

They shouldn't change behaviour but they should include guardrails to prevent things blowing up and in those situations we get notified and can work with the customers to resolve (via upgrading) the issue without the need to rush a quick fix.

I really like that scenario. And I guess it's also a good direction for us to regularly challenge the "dynamic configuration" of the operators - does this change the behaviour in a way that the customer will notice a difference? Then it should be a release. Does it just quickfix some stuff that's not working anyway and the customer won't see it? Then it's okay. Something like that. What do you think @QuentinBisson ?

@QuentinBisson
Author

Hey sorry all, it took me a while to get back to this so I'll try to answer you all :D

@AverageMarcus

Guardrails:
I really like the idea of guardrails in our product, and maybe we could use some kind of rollout tooling in our clusters to stop a rollout if we see more issues happening during the upgrade? I would love, for example, to have some customer-configurable thresholds as well as internal ones to know when to stop, and I guess when to automatically roll back as well.
I am not confident today that our product could do that, but I would definitely love this, and I hope customers would as well.
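
Purely as a hypothetical Go sketch of such a threshold-based rollout guard (the thresholds, the alert-count input and the three outcomes are all assumptions, not an existing tool or a decided design):

```go
package main

import "fmt"

// rolloutGuard holds the thresholds mentioned above: one the customer could
// configure and a stricter internal one. Both values are made up.
type rolloutGuard struct {
	customerMaxFiringAlerts int
	internalMaxFiringAlerts int
}

// decision maps the number of alerts firing since the upgrade started to an
// action for the rollout tooling.
func (g rolloutGuard) decision(firingAlerts int) string {
	switch {
	case firingAlerts >= g.customerMaxFiringAlerts:
		return "halt rollout and roll back"
	case firingAlerts >= g.internalMaxFiringAlerts:
		return "pause rollout and page on-call"
	default:
		return "continue rollout"
	}
}

func main() {
	g := rolloutGuard{customerMaxFiringAlerts: 20, internalMaxFiringAlerts: 5}
	fmt.Println(g.decision(7)) // prints "pause rollout and page on-call"
}
```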

Load testing:
I am totally with you here that we need to invest in load testing (maybe not on all components), at least for platform components and networking components, and having it automated would probably save us a lot of hassle in the long run. I also do not like that customers have to be the testers. We will for sure always find unexpected stuff in production, but we should reduce that to a minimum, so both load tests and guardrails will help improve customers' opinion of us.

Fast releases and fixes
I think this is important to have in any case, not specifically because we currently test with high load on customers, but because waiting 6 months for a fix or a feature (could be security, reliability and so on) to reach a WC because of freezes, major versions blocking us and so on, is not acceptable.

I think this last point takes @puja108's remark into account.
I really like the release concept, but releases are currently a bit too big and cumbersome, with a lot of changes that prevent us from moving forward faster, so having smaller releases with auto-apply (as long as nodes are not rolled, for instance) would be awesome. And we could still implement guardrails, load tests, deployment channels and so on here. It's possible the issue we have now also comes from waiting for customers to upgrade to the latest v29 release, but that's at least months away. I expect the guardrails and load tests to be even farther away, while we need a solution short to mid-term.

@Rotfuks I don't think we have experience with chaos testing, but we probably should do game days - though I would think onprem is a good enough game day :D
We do have some load-testing knowledge (mostly @TheoBrigitte) with k6, but that is something we could build at the company level.

I also really like this collaboration

@AverageMarcus
Member

Random thought after coming back to this in the new year - I wonder if we need to move towards a more "managed service" approach where we have free rein over the control plane but don't touch worker space without a release. Maybe even go so far as to actually hide the control plane from our customers (do they even need access to those nodes / workloads?)
