Investigate how to integrate better with releases #3758
Comments
@QuentinBisson Would you be able to outline in this issue what features y'all are missing when it comes to Releases and what your current pain points are? I think that'd really help to focus the ideas. Some things I'd like to know more about: are the problems you're seeing more related to getting changes out to MCs (where we've typically been really slow at rolling out updates, which is maybe what we need to fix), or, for example, are you seeing problems getting changes out to existing workload clusters that are needed to build consistency across our platform? (These are just examples to get the thoughts flowing 😉)
@AverageMarcus you posted right when we started refining this :D @Rotfuks added some things at the top, but we are just starting the discovery :)
Thanks for looking into this issue @AverageMarcus! Talking to the lovely people of Tenet is definitely on our list for this issue, but we identified some preparation we can do first, to do you guys justice. So we'll reach out after some initial investigation, as soon as we're confident we can answer all the relevant questions at this stage.
Here is a list of what we are currently doing with our observability-related operators:
- Observability operator
- Logging operator
- Prometheus meta operator
@giantswarm/team-tenet this is what we discussed today based on what we currently know, but it might change with what we do not know yet :p What we really need to integrate with releases:
Let's wait for @Rotfuks to give his thoughts on that, but we're open to having a meeting with you guys to discuss it :)
Why? Can we introduce load testing into our CI/CD pipelines? Do you have examples of what isn't possible to cover with testing? This sounds like something Tenet might be able to help with.
This is a product decision and needs to be checked with customers, etc. This isn't something Tenet can decide to do, but can work on the technical implementation if that is the choice made. (paging @giantswarm/sig-product)
Rollbacks will only be possible if all changed components support it (e.g. CRDs haven't changed or KubeVersion hasn't been bumped) - see the sketch after this comment.
Why isn't this able to use our existing …? What is the actual problem this needs to solve? This is a solution and not a requirement. A real-world example would be useful here.
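As an illustration of the rollback precondition mentioned two points above, here is a minimal, hypothetical Go sketch (types and field names invented for illustration, not an existing Releases API): a release is rollback-safe only if none of its changed components made an irreversible change.

```go
package rollback

// ComponentChange is a hypothetical summary of what changed for one
// component between two release versions.
type ComponentChange struct {
	Name            string
	CRDsChanged     bool
	KubeVersionBump bool
}

// SafeToRollBack applies the rule stated in the comment: rolling back is
// only possible if no changed component did something irreversible, such
// as changing CRDs or bumping the Kubernetes version.
func SafeToRollBack(changes []ComponentChange) bool {
	for _, c := range changes {
		if c.CRDsChanged || c.KubeVersionBump {
			return false
		}
	}
	return true
}
```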
One of the reasons is that to get actual load-testing results on daemonsets we need a lot of nodes, and the one time we created a 200-node cluster, it was not seen well.
I agree with you, but this is the only thing that blocks us from moving back to releases. This is not just about new features: the stability of the monitoring pipeline is quite hard to maintain, especially when it is the first thing that pages on tens of clusters when something goes wrong due to some unforeseen issue. Not being able to fix an issue for years is not okay when people constantly get woken up at night.
Because most of those are dynamic by nature. The scaling of shards is operator driven, as it is based on the number of series in Mimir for a particular cluster, and this is not static over time. The requirement is that we need to be able to deploy a patch on all clusters when all hell breaks loose. I can't give you a real-world example because we built what we built precisely so it would not happen again. What I remember is that around 2 years ago we had some buggy prometheus-agent releases (incorrect settings like scraping too often, or dropping some KSM labels we thought we did not need), and the decision from Phoenix then was to not let us create a patch release because of the PSP/PSS work and the death-of-vintage migration. We got hundreds of alerts on monitoring pipeline instability due to that, and the situation lasted for more than a year for various reasons. We definitely do not want that to happen again, and the current state of the CAPI releases, and more specifically the lack of customer upgrades, does not give me confidence that it will not happen again.
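To make "operator driven" scaling a bit more concrete for readers outside Atlas, here is a minimal, hypothetical Go sketch (names and numbers invented for illustration, not the actual observability-operator code): the operator derives the shard count from the current per-cluster series count, which is exactly the kind of value that cannot be pinned once in a static release.

```go
package sharding

import "math"

// seriesPerShard is a hypothetical target of time series per metrics-agent
// shard; a real operator would make this configurable.
const seriesPerShard = 1_000_000

// ShardsForCluster derives how many agent shards a workload cluster needs
// from the number of time series it currently sends to Mimir. Because the
// series count changes over time, the result changes with it.
func ShardsForCluster(headSeries float64) int {
	shards := int(math.Ceil(headSeries / seriesPerShard))
	if shards < 1 {
		shards = 1
	}
	return shards
}
```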
Thanks for challenging this @AverageMarcus! So we see two options for how we can handle this risk:
1) Catch problems before they reach production, through testing that gives us enough confidence up front.
2) Keep the ability to ship fast fixes as a fallback when something breaks in production.
We sadly have no high confidence of ever achieving 1), because the observability platform allows so much configuration that is unique to a customer (the number of nodes monitored, the number of servicemonitors/podlogs on customer-specific workload apps, etc.) and operates on edge cases like alba at a scale that we can hardly reproduce. We've already seen that even with Loki and Mimir, which should be safer when it comes to scaling, we still uncovered a lot of issues simply at a larger scale than we have anywhere besides production installations.

This is why in the past we always relied heavily on 2), and we want to continue relying on fast fixes as a fallback. That historically led us to that long list of configuration managed by operators and to the need to realign with releases, because we've gone so far that the dynamic nature of our platform creates its own issues. That's also why we're afraid of going too far in the other direction again and becoming too static, losing the ability to fix things fast when something that was unforeseeable in our test environments happens.

Now I think the future is both 1 and 2. We need to become way better at testing - we already had a full epic just focused on giving us better and earlier feedback loops, and we're really happy with the progress of the e2e test setup you're building. We're on a good track there - but in order to be really confident in our platform we still need to be able to fix things fast and not have a general release dependency and process as a bottleneck. But maybe we just need to better understand how we see patches and how fast we can be when patching a release? What do you think about that?
I see a third option we can (also) include - building guardrails into our actual product to prevent these issues at runtime. But I think realistically we need a combination of all three approaches, to some degree.
This sounds like we NEED to invest in load testing. Otherwise our customers are doing it for us in production. Other companies are capable of achieving this, so why exactly do you think we couldn't? What makes us so special or unique in this regard?
Just remember that every time we do this (and it's been a fair few times in recent memory) it erodes some of the confidence our customers have in our abilities. This should ALWAYS be the last choice.

In general I think we need something a bit in the middle between what we have today and what releases provide. I think we still need releases that deploy a known and fixed version of specific operators, but those operators should be able to self-heal within the clusters as needed. They shouldn't change behaviour, but they should include guardrails to prevent things blowing up, and in those situations we get notified and can work with the customers to resolve the issue (via upgrading) without the need to rush a quick fix. I'm not sure exactly what that looks like, though, as I don't understand enough about your team's stuff, so maybe this is completely impossible - but I want to make sure it is at least thought about and maybe PoC'd if possible.

On a side note, this might be a more general-purpose discussion that should be had in KaaS-sync or a similar meeting, as I suspect other teams have similar "break glass" desires that should be taken into consideration. I'd also like to hear thoughts from my other @giantswarm/team-tenet friends, as they are more knowledgeable about the actual technical side of Releases compared to me. I'm just mostly focussing on the customer experience side of things 😄
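To give the "guardrails" idea above a concrete shape, here is a minimal, hypothetical Go sketch (all names invented, not an existing API): the release pins an allowed envelope, the operator stays free to self-heal inside it, and anything outside the envelope is clamped and reported rather than silently applied.

```go
package guardrail

import "log"

// Limits is a hypothetical guardrail shipped (and pinned) with a release:
// the operator may tune a value on its own, but only inside this envelope.
type Limits struct {
	Min, Max int
}

// Clamp returns the value to actually apply and whether the guardrail had
// to intervene.
func (l Limits) Clamp(desired int) (int, bool) {
	switch {
	case desired < l.Min:
		return l.Min, true
	case desired > l.Max:
		return l.Max, true
	default:
		return desired, false
	}
}

// ApplyAndReport clamps the desired value and, when the guardrail triggers,
// reports it so that humans get notified and can follow up via a proper
// upgrade instead of a rushed quick fix.
func (l Limits) ApplyAndReport(name string, desired int) int {
	v, clamped := l.Clamp(desired)
	if clamped {
		// In a real operator this would be an alert, event, or metric.
		log.Printf("guardrail hit for %s: requested %d, applying %d", name, desired, v)
	}
	return v
}
```

The interesting property of this shape is that behaviour stays within what the release promised, while the operator keeps its dynamic nature inside those bounds.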
(Also, I'm really liking this discussion! Great bit of cross-team collab! 💙)
I still feel that mid-term we do need a better approach to rolling out specific bundles or apps more independently from the full WC release (I know this would create a test matrix), or alternatively a way to easily create atomic cluster releases that change only a single app - so more of a roll forward than a rollback. That would enable some of what Atlas is asking for here, and as those atomic releases should not roll anything but the app(s) themselves, it should also be easier to argue with customers that we are rolling them out quickly to remediate current issues and could skip typical change processes that might take too long, like what we're currently facing with the move to CAPI and then to v29 and beyond.

Still, it feels like the tying into WC releases has created big dependencies between teams and areas, and has also made customers very wary of doing any upgrade that is a WC version change. IIRC, when we were rolling out apps separately, we often saw customers being happy to roll out an app much quicker than their normal change process.
cc @alex-dabija as this is somehow also related to the current release/upgrade topics you are looking into |
Ah yeah, definitely. I'm not yet fully sure how we can do that, but I think a lot of our postmortems are designed to do exactly that - change our product in a way that a thing won't happen again. Maybe we can leverage Kyverno a bit better or generally use the @giantswarm/team-shield policy API in a way that stabilises our releases a bit more, but at this stage I have a feeling it's more of an "it has to break for us to learn it" thing.
I really like that scenario. And I guess it's also a good direction for us to regularly challenge the "dynamic configuration" of the operators - does this change the behaviour in a way that the customer will notice a difference? Then it should be a release. Does this just quickfix some stuff that isn't working anyway and a customer won't see it? Then it's okay. Something like that. What do you think @QuentinBisson?
Hey, sorry all, it took me a while to get back to this, so I'll try to answer you all :D

Guardrails:

Load testing:

Fast release and fixes:

I think this last point took @puja108's remark into account. @Rotfuks I don't think we have experience with chaos testing, but we probably should do game days - though I would think onprem is a good enough game day :D I also really like this collaboration.
Random thought after coming back to this in the new year - I wonder if we need to move towards a more "managed service" approach where we have free rein over the control plane but don't touch worker space without a release. Maybe even go so far as to actually hide the control plane from our customers (do they even need access to those nodes / workloads?)
Motivation
See this for context giantswarm/releases#1459 (comment)
Our current config mechanism broke releases really badly. We need to find a better way forward, because there is currently no way to be sure we do not break releases when making changes in the operators.
Let's investigate what we really miss in releases today and fix it, instead of staying out of process.
Todo
Open questions to clear with Tenet
Outcome