
Investigate how to integrate better with releases #3758

Open

QuentinBisson opened this issue Nov 4, 2024 · 15 comments

Comments

@QuentinBisson

QuentinBisson commented Nov 4, 2024

Motivation

See this for context giantswarm/releases#1459 (comment)

Our current config mechanism broke releases really badly. We need to find a better way forward, because there is currently no way to be sure we do not break releases when making changes in the operators.

Let's investigate what we actually miss in releases today and fix that, instead of staying outside the process.

Todo

  • Investigate what configuration we actually need to be dynamic / regularly adapt (like secrets, etc)
    • create a list of configuration that our operators are actually handling
  • Investigate what configuration we want to keep dynamic as a safeguard for fast reaction times on incidents
  • Investigate how we technically can push configuration
    • this can be the observability bundle - but since we want to get rid of it some day, is it still the best place?
    • since the MC overrides some configuration in the WCs, this might be tricky.

Open Questions to clear with tenet

  • How do we configure apps in releases? Maybe through the Observability Bundle, but that has its own problems?
  • How can we get automatic patches? We don't want PRs every night, with a lot of blockers, just for config changes.
  • There will be some dynamic config - how do we handle those changes?

Outcome

  • We have a better understanding of how we can better follow the given release process
@github-project-automation github-project-automation bot moved this to Inbox 📥 in Roadmap Nov 4, 2024
@QuentinBisson QuentinBisson added the needs/refinement Needs refinement in order to be actionable label Nov 4, 2024
@AverageMarcus
Member

@QuentinBisson Would you be able to outline in this issue what features y'all are missing when it comes to Releases and what your current pain points are? I think that'd really help to focus the ideas.

Some things I'd like to know more about - are the problems you're seeing more related to getting changes out to MCs (where we've typically been really slow at rolling out updates, which is maybe what we need to fix) or, for example, are you seeing problems getting changes out to existing workload clusters that are needed to build consistency across our platform? (These are just examples to get the thoughts flowing 😉)

@Rotfuks Rotfuks removed the needs/refinement Needs refinement in order to be actionable label Nov 5, 2024
@QuentinBisson
Author

@AverageMarcus you posted just as we started refining this :D @Rotfuks added some things at the top, but we are starting a discovery :)

@Rotfuks
Contributor

Rotfuks commented Nov 5, 2024

Thanks for looking into this issue @AverageMarcus ! Talking to the lovely people of Tenet about this issue is definitely on our list - but we identified some preparation we can do first, to do you guys justice. So we'll reach out after some initial investigation, as soon as we're confident we can answer all the relevant questions at this stage.

@TheoBrigitte
Member

  • create a list of configuration that our operators are actually handling

Here is a list of what we are currently doing with our observability related operators

Observability operator

  • OpsGenie heartbeats link
  • Mimir ingress and basic auth secrets link
  • Observability bundle config to select monitoring agent (prometheus-agent or alloy) link
  • Prometheus agent configmap secret
    • Shard scaling up/down
    • Version
    • External labels
    • Mimir write URL + credentials
    • Queue config
  • Alloy (very similar to Prometheus agent) link
    • Vertical pod autoscaler
    • Cilium network policy
    • Service and pod monitors selectors
    • External labels
    • Mimir write URL + credentials
    • Queue config
  • Grafana Organizations link
    • Organization CRUD operations
    • Datasources
    • RBAC

Logging operator

  • Observability bundle config to select logging agent (promtail or alloy) link
  • Alloy config secret
    • PodLogs default resources
    • PodLogs support toggle
    • Structured metadata
    • External labels
    • Loki write URL + basic auth
  • Promtail config secret
    • Structured metadata
    • External labels
    • Loki write URL + basic auth
  • Event logger config secret
    • Loki write URL + basic auth
  • Logging credentials
  • Proxy auth link
    • read and write credentials and org ids
  • Loki datasource with credentials link

Prometheus meta operator

  • Alertmanager config and notification template
    • Config with opsgenie apikey, slack token, proxy url
  • OpsGenie heartbeats
    • Alertmanager heartbeat receiver
  • Prometheus Cilium network policy
  • Prometheus ingress with oauth
  • Prometheus
    • Scrape config
    • Service and pod monitors selector
    • Rules selector
    • RBAC
  • Remote write
    • Basic auth
    • Queue config
    • Write URL
    • Ingress with auth
  • Remote write CRs
    • CRD definition
    • CR reconciliation with Prometheus configuration
  • Prometheus agent
    • Remote write URL and basic auth
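
To give a concrete feel for the kind of objects these operators render, here is a minimal, purely illustrative Go sketch of a per-cluster remote-write secret (write URL plus basic auth), in the spirit of the "Mimir write URL + credentials" items above. The function, the secret naming scheme and the key names are assumptions for the example, not the real operator code.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// remoteWriteSecret builds the per-cluster secret an operator could hand to
// the agent: where to write metrics and which generated credentials to use.
// Naming scheme and keys are hypothetical.
func remoteWriteSecret(clusterName, writeURL, user, password string) *corev1.Secret {
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterName + "-remote-write", // hypothetical naming scheme
			Namespace: clusterName,
		},
		StringData: map[string]string{
			"remote-write-url":      writeURL,
			"remote-write-user":     user,
			"remote-write-password": password,
		},
	}
}

func main() {
	s := remoteWriteSecret("demo01", "https://mimir.example/api/v1/push", "demo01", "generated-api-key")
	fmt.Println(s.Name, "holds", len(s.StringData), "keys")
}
```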

@QuentinBisson
Author

@giantswarm/team-tenet this is what we discussed today based on what we currently know, but it might change with what we do not know yet :p

What we really need to integrate with releases:

  • Tenet - We need to be able to roll back or automatically deploy a patched version if the observability version is causing trouble (crazy alerting or instability), because when observability components are misbehaving on 200 clusters, this is unbearable. The root of the issue is that it is impossible to test a new release with big clusters or big MCs, so the impact usually only shows up when the releases are getting deployed, which is usually 6 months to a year after we brought the change into the release.

    • ideally, we need to be able to push a patch automatically to all affected clusters
    • could be a cluster release rollback (would need to be deployed automatically across all clusters) <- Maybe this is something we need in case we see more errors during an upgrade
    • could be an observability bundle release rollback (would need to be deployed automatically across all clusters) <- Could be a new cluster release, but this needs to be automatically applied
    • could be a manual app change as a workaround if the number of affected clusters is low so we can provide a proper fix
  • Tenet - We have to configure some dynamic configs and secrets with an operator for all clusters like:

    • Be able to disable metrics and logs at the cluster level
    • Agent authentication and storage backend host
    • Agent shard scaling up/down (could be replaced by Keda but it's on no one's radar)
    • Maybe external labels

    How can we achieve that?

  • Atlas - We need to be able to configure individual user values for observability bundle apps without going through the observability bundle user values

    • Do we need the observability bundle at all?

Let's wait for @Rotfuks to give his thoughts on that, but we're open to having a meeting with you guys to discuss it :)

@AverageMarcus
Member

it is impossible to test a new release with big clusters or big MCs

Why? Can we introduce load testing into our CI/CD pipelines? Do you have examples of what isn't possible to cover with testing? This sounds like something Tenet might be able to help with.

ideally, we need to be able to push a patch automatically to all affected clusters

This is a product decision and needs to be checked with customers, etc. This isn't something Tenet can decide to do, but can work on the technical implementation if that is the choice made. (paging @giantswarm/sig-product)

could be a cluster release rollback

Rollbacks will only be possible if all changed components support it (e.g. CRDs haven't changed or KubeVersion hasn't been bumped)

We have to configure some dynamic configs and secrets with an operator for all clusters like:

Why isn't this able to use our existing config-based approach? I assume it needs to be extended to WCs which I don't think we currently have but it sounds like something we should have generically rather than specifically for observability. Maybe Honeybadger would be good to discuss this with?

What is the actual problem this is needing to solve? This is a solution and not a requirement. A real-world example would be useful here.

@QuentinBisson
Author

Why? Can we introduce load testing into our CI/CD pipelines? Do you have examples of what isn't possible to cover with testing? This sounds like something Tenet might be able to help with.

One of the reasons is that to get actual load-testing results on daemonsets we need a lot of nodes, and the one time we created a 200-node cluster, it was not well received.

This is a product decision and needs to be checked with customers, etc. This isn't something Tenet can decide to do, but can work on the technical implementation if that is the choice made. (paging @giantswarm/sig-product)

I agree with you, but this is the only thing that blocks us from moving back to releases. This is not just about new features: the stability of the monitoring pipeline is quite hard to maintain, especially when it is the first thing that pages on tens of clusters when something goes wrong due to some unforeseen issue, and not being able to fix an issue for years is not okay when people constantly get woken up at night.

Why isn't this able to use our existing config-based approach? I assume it needs to be extended to WCs which I don't think we currently have but it sounds like something we should have generically rather than specifically for observability. Maybe Honeybadger would be good to discuss this with?

Because most of those are dynamic by nature. The scaling of shards is operator-driven, as it is based on the number of series in Mimir for a particular cluster, and that is not static over time.
The API keys used to authenticate each workload cluster are generated by the operators as well, so this would never work with the config management we have.
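
To illustrate why this kind of configuration is dynamic by nature, here is a rough Go sketch of deriving the shard count from the live series count; the one-million-series-per-shard target and the function shape are assumptions for the example, not the actual operator logic.

```go
package main

import (
	"fmt"
	"math"
)

// seriesPerShard is an assumed target per agent shard; the real operator
// tunes this value, this is only for illustration.
const seriesPerShard = 1_000_000

// desiredShards grows or shrinks the shard count with the live series count
// reported by Mimir for a cluster, which is why it cannot be static config.
func desiredShards(headSeries float64) int {
	if headSeries <= 0 {
		return 1
	}
	return int(math.Ceil(headSeries / seriesPerShard))
}

func main() {
	// In a real reconcile loop the series count would come from querying
	// Mimir for the cluster's tenant; here it is a hard-coded example.
	fmt.Println(desiredShards(3_400_000)) // prints 4
}
```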

The requirement is that we need to be able to deploy a patch on all clusters when hell breaks loose. I can't give you an example, because we built the things we did precisely so that it would not happen again.

What I remember is that about two years ago we had some buggy prometheus-agent releases (incorrect settings like scraping too often, or dropping a KSM label we thought we did not need), and the decision from Phoenix back then was to not let us create a patch release because of the PSP/PSS work and the death-of-vintage migration. We got hundreds of alerts about monitoring pipeline instability due to that, and the situation lasted for more than a year for various reasons. We definitely do not want that to happen again, and the current state of the CAPI releases, and more specifically the lack of customer upgrades, does not give me confidence that it will not happen again.

@Rotfuks
Contributor

Rotfuks commented Nov 28, 2024

Thanks for challenging this @AverageMarcus !
I think in general it's an insurance thing - if we break stuff it gets very expensive very fast, be it a lot of false pages, no pages at all while there are actual incidents, or cost explosions in large clusters because of data flooding.

So we see two options on how we can handle this risk:

    1. Have extensive testing to make sure we NEVER release something that might break.
    2. Make sure if something breaks we can fix it as fast as possible.

We sadly have no high confidence of ever achieving 1), because the observability platform allows so much configuration unique to each customer (talking about the number of nodes monitored, the number of servicemonitors/podlogs on customer-specific workload apps, etc.) and operates on edge cases like alba at a scale that we can hardly reproduce. We've already seen that even with Loki and Mimir, which should be safer when it comes to scaling, we still uncovered a lot of issues only at a larger scale that we have nowhere besides production installations.

This is why in the past we always relied heavily on 2) and want to continue relying on fast fixes as a fallback. That historically led us to that long list of configuration managed by operators and to the need to realign with releases, because we've gone so far that the dynamic nature of our platform leads to its own issues. That's also why we're afraid of going too far in the other direction again and becoming too static, losing the ability to fix things fast when something happens that was unforeseeable in our test environments.

Now I think the future is both 1 and 2. We need to become way better at testing - we already had a full epic focused on giving us better and earlier feedback loops, and we're really happy with the progress of the e2e test setup you're building - but in order to be really confident in our platform we still need to be able to fix things fast, without a general release dependency and process as a bottleneck.

But maybe we just need to better understand how we see patches and how fast we can be when patching a release? What do you think about that?

@AverageMarcus
Member

So we see two options on how we can handle this risk:

    1. Have extensive testing to make sure we NEVER release something that might break.
    2. Make sure if something breaks we can fix it as fast as possible.

I see a third option we can (also) include - build guardrails into our actual product to prevent these issues at runtime.

But I think realistically we need a combination of all three approaches, to some degree.

We sadly have no high confidence of ever achieving 1), because the observability platform allows so much configuration unique to each customer (talking about the number of nodes monitored, the number of servicemonitors/podlogs on customer-specific workload apps, etc.) and operates on edge cases like alba at a scale that we can hardly reproduce. We've already seen that even with Loki and Mimir, which should be safer when it comes to scaling, we still uncovered a lot of issues only at a larger scale that we have nowhere besides production installations.

This sounds like we NEED to invest in load testing. Otherwise our customers are doing it for us in production. Other companies are capable of achieving this, so why exactly do you think we couldn't? What makes us so special or unique in this regard?

This is why in the past we always relied heavily on 2) and want to continue relying on fast fixes as a fallback.

Just remember that every time we do this (and it's been a fair few times in recent memory) it erodes some of the confidence our customers have in our abilities. This should ALWAYS be the last choice.

In general I think we need something a bit in the middle between what we have today and what releases provide. I think we still need releases that deploy a known and fixed version of specific operators but those operators should be able to self-heal within the clusters as needed. They shouldn't change behaviour but they should include guardrails to prevent things blowing up and in those situations we get notified and can work with the customers to resolve (via upgrading) the issue without the need to rush a quick fix.
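
To make the guardrail idea a bit more tangible, a very small Go sketch along those lines could look like the following; the shard ceiling and the notification hook are purely hypothetical, not something that exists today.

```go
package main

import "fmt"

// maxShards is an assumed safety ceiling; in this idea it would be pinned by
// the release rather than changed at runtime.
const maxShards = 10

// applyShardGuardrail clamps the desired shard count and reports whether the
// ceiling was hit, i.e. whether humans should be notified and follow up.
func applyShardGuardrail(desired int) (applied int, needsAttention bool) {
	if desired > maxShards {
		return maxShards, true
	}
	return desired, false
}

func main() {
	applied, alert := applyShardGuardrail(14)
	fmt.Printf("scaling to %d shards\n", applied)
	if alert {
		// Placeholder for "we get notified": emit an event or page instead
		// of silently scaling past the guardrail.
		fmt.Println("guardrail hit: notify and resolve with the customer via an upgrade")
	}
}
```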

I'm not sure exactly what that looks like though, as I don't understand enough about your team's stuff, so maybe this is completely impossible, but I want to make sure it is at least thought about and maybe PoC'd if possible.


On a side note, this might be a more general-purpose discussion that should be had in KaaS-sync or a similar meeting, as I suspect other teams have similar "break glass" desires that should be taken into consideration.


I'd also like to hear thoughts from my other @giantswarm/team-tenet friends as they are more knowledgeable about the actual technical side of Releases compared to me. I'm mostly just focusing on the customer experience side of things 😄

@AverageMarcus
Member

(Also, I'm really liking this discussion! Great bit of cross-team collab! 💙 )

@puja108
Member

puja108 commented Dec 3, 2024

I still feel that mid-term we do need a better approach to being able to roll out specific bundles or apps more independently from the full WC release (I know this would create a test matrix), or alternatively a way to easily create atomic cluster releases that change only a single app - so more of a roll-forward than a rollback.

That would enable some of what Atlas is asking for here, and as those atomic releases should not roll anything but the app(s) themselves, it should also be easier to argue with customers that we are rolling them out quickly to remediate current issues and could skip the typical change processes that might take too long, like what we're currently facing with the move to CAPI and then to v29 and beyond. Still, it feels like the tying into the WC release has created big dependencies between teams and areas, and has also made customers very wary of doing any upgrade that is a WC version change. IIRC, when we were rolling out apps separately, we often saw customers being happy to roll an app much quicker than their normal change process.

@puja108
Member

puja108 commented Dec 3, 2024

cc @alex-dabija as this is somehow also related to the current release/upgrade topics you are looking into

@Rotfuks
Contributor

Rotfuks commented Dec 3, 2024

I see a third option we can (also) include - build guardrails into our actual product to prevent these issues at runtime.

Ah yeah, definitely. I'm not yet fully sure how we can do that, but I think a lot of our postmortems are designed to do exactly that - change our product in a way that the same thing won't happen again. Maybe we can leverage Kyverno a bit better or generally use the @giantswarm/team-shield policy API in a way that stabilises our releases a bit more, but at this stage I have a feeling it's more of an "it has to break for us to learn" thing.
Maybe we should think about some chaos engineering day/week for the observability platform to see more things break and get some directions on where we can improve. Do we have some experience with chaos engineering in @giantswarm/team-atlas ? :D

They shouldn't change behaviour but they should include guardrails to prevent things blowing up and in those situations we get notified and can work with the customers to resolve (via upgrading) the issue without the need to rush a quick fix.

I really like that scenario. And I guess it's also a good direction for us to regularly challenge the "dynamic configuration" of the operators - does this change the behaviour in a way that the customer will notice a difference? Then it should be a release. Does it just quickfix some stuff that's not working anyway and the customer won't see it? Then it's okay. Something like that. What do you think @QuentinBisson ?

@QuentinBisson
Author

Hey sorry all, it took me a while to get back to this so I'll try to answer you all :D

@AverageMarcus

Guardrails:
I really like the idea of guardrails in our product, and maybe we could use some kind of rollout tooling in our clusters to stop a rollout if we see more issues happening during the upgrade? I would love, for example, to have some customer-configurable thresholds as well as internal ones to know when to stop, and I guess when to automatically roll back as well.
I am not confident today that our product could do that, but I would definitely love this, and I hope customers would as well.
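
Purely as a hypothetical Go sketch of such a threshold-based rollout guard (the thresholds, the alert-count input and the three outcomes are all assumptions, not an existing tool or a decided design):

```go
package main

import "fmt"

// rolloutGuard holds the thresholds mentioned above: one the customer could
// configure and a stricter internal one. Both values are made up.
type rolloutGuard struct {
	customerMaxFiringAlerts int
	internalMaxFiringAlerts int
}

// decision maps the number of alerts firing since the upgrade started to an
// action for the rollout tooling.
func (g rolloutGuard) decision(firingAlerts int) string {
	switch {
	case firingAlerts >= g.customerMaxFiringAlerts:
		return "halt rollout and roll back"
	case firingAlerts >= g.internalMaxFiringAlerts:
		return "pause rollout and page on-call"
	default:
		return "continue rollout"
	}
}

func main() {
	g := rolloutGuard{customerMaxFiringAlerts: 20, internalMaxFiringAlerts: 5}
	fmt.Println(g.decision(7)) // prints "pause rollout and page on-call"
}
```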

Load testing:
I am totally with you here that we need to invest in load testing (maybe not on all components), at least for platform components and networking components, and having it automated would probably save us a lot of hassle in the long run. I also do not like that customers have to be the testers. We will for sure always find unexpected stuff in production, but we should reduce that to a minimum, so both load tests and guardrails will help improve customers' opinion of us.

Fast releases and fixes
I think this is important to have in any case, not specifically because we currently test with high load on customers, but because waiting 6 months for a fix or a feature (could be security, reliability and so on) to reach a WC because of freezes, major versions blocking us and so on, is not acceptable.

I think this last point takes @puja108's remark into account.
I really like the release concept, but releases are currently a bit too big and cumbersome, with a lot of changes that prevent us from moving forward faster, so having smaller releases with auto-apply (as long as nodes are not rolled, for instance) would be awesome. And we could still implement guardrails, load tests, deployment channels and so on here. It's possible the issue we have now also comes from waiting for customers to upgrade to the latest v29 release, but that's at least months away. I expect the guardrails and load tests to be even farther away, while we need a solution short to mid-term.

@Rotfuks I don't think we have experience with chaos testing, but we probably should do game days - though I would think onprem is a good enough game day :D
We do have some load-testing knowledge (mostly @TheoBrigitte) with k6, but that is something we could build at the company level.

I also really like this collaboration

@AverageMarcus
Member

Random thought after coming back to this in the new year - I wonder if we need to move towards a more "managed service" approach where we have free rein over the control plane but don't touch worker space without a release. Maybe even go so far as to actually hide the control plane from our customers (do they even need access to those nodes / workloads?)
