Reconcile configuration drift in Helm releases #289

Open
robinkb opened this issue Sep 30, 2024 · 11 comments · May be fixed by #334
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments


robinkb commented Sep 30, 2024

User Story

As an operator I would like CAAPH to reconcile configuration drift in Helm releases.

Detailed Description

If I modify or delete a Kubernetes resource on the target cluster that is managed by CAAPH through a HelmReleaseProxy, it is not reconciled to the desired state.
For example, if I delete a Deployment deployed through CAAPH, it is not recreated.

When using Helm directly, a release can be repaired by running helm upgrade with the same version and values as before.
CAAPH seems to skip executing any actions on the target cluster if there is no change in the HelmReleaseProxy resource.

Making any change in the HelmReleaseProxy resource (like bumping the version) does trigger a reconcile and restores any modified or deleted resources back to the desired state.

It would be nice to have some way for CAAPH to automatically reconcile configuration drift without having to make changes in the HelmReleaseProxy. Perhaps as a toggle on the HelmReleaseProxy, or a feature flag on the controller manager.

Anything else you would like to add:

Not really, just that this project is very cool :)

/kind feature

@k8s-ci-robot added the kind/feature label Sep 30, 2024

robinkb commented Oct 24, 2024

Thanks for reaching out via Slack, @Jont828! Here's my further analysis and proposal on this request.

I think that the issue lies in the shouldUpgradeHelmRelease function. It does some validation before calling the Helm client, and I wonder if this is all necessary. In particular, I think that the snippet below should be removed.

	klog.V(2).Infof("Diff between values is:\n%s", cmp.Diff(existing.Config, values))

	// TODO: Comparing yaml is not ideal, but it's the best we can do since DeepEquals fails. This is because int64 types
	// are converted to float64 when returned from the helm API.
	oldValues, err := yaml.Marshal(existing.Config)
	if err != nil {
		return false, errors.Wrapf(err, "failed to marshal existing release values")
	}
	newValues, err := yaml.Marshal(values)
	if err != nil {
		return false, errors.Wrapf(err, "failed to new release values")
	}

	return !cmp.Equal(oldValues, newValues), nil

Making the decision on whether or not to run the Helm Upgrade action based on comparing the old and new values of the release is the root of the problem. It doesn't consider the state of the resources on the target cluster, so there is no way of detecting drift in the resources managed by the Helm release.

This behaviour makes sense when compared to the ClusterResourceSet resource, or if operators want to avoid calling out to the workload clusters, because it only triggers a change when the resource changes on the management cluster. But this is not what I expect from CAAPH, as I expect it to behave like Helm. The meat of Helm's upgrade function is very good at determining exactly what needs to be created, updated, or deleted. The CAPI controllers also do drift reconciliation. Or at least the Cluster API for AWS controller does so with EKS. I do not have experience with others.

In my opinion, it makes sense to remove the snippet above completely, and rely on Helm to determine what should or should not be changed. This does change the behaviour of CAAPH a bit, so if you feel that it's better to place this behind a controller feature flag, or a flag on HelmReleaseProxy, then I will happily implement it that way.


Jont828 commented Oct 24, 2024

/triage accepted

@k8s-ci-robot added the triage/accepted label Oct 24, 2024

Jont828 commented Oct 24, 2024

@robinkb That makes sense, and that's a nice catch that I didn't notice. I'm curious though, in what case did you end up deleting a deployment from the Helm chart?

In terms of implementation, I think it would be a good idea to try to leverage the performUpgrade function to avoid duplicating code. I do want to make sure that the behavior of Helm upgrade would preserve the existing use case: if the values are the same and nothing has been deleted, it should do nothing, to avoid reinstalling components in every reconciliation loop. I see that it can use the isDryRun() flag to do the checks without installing anything. My initial thoughts would be to try to use a dry run of RunWithContext() and see if we can get a boolean indicating whether anything would have changed, and if not, leave it be. If we can make sure that the dry run would not impact performance, I think we can rely on that instead of the existing snippet above.
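
Roughly what I have in mind, as an untested sketch: assuming a configured *action.Configuration plus the chart and values we already pass for the upgrade, one way to turn the dry run into a boolean is to compare its rendered manifest with the manifest Helm stored for the current release. The wouldChange helper below is illustrative, not existing CAAPH code:

```go
package drift

import (
	"context"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart"
)

// wouldChange performs a dry-run upgrade and compares the manifest it renders
// with the manifest Helm recorded for the currently deployed release.
// Caveat: this only reflects changes to the chart or values; it does not see
// drift in the live objects on the workload cluster.
func wouldChange(ctx context.Context, cfg *action.Configuration, name string, ch *chart.Chart, vals map[string]interface{}) (bool, error) {
	// Render what an upgrade would apply, without touching the cluster.
	upgrade := action.NewUpgrade(cfg)
	upgrade.DryRun = true
	dryRun, err := upgrade.RunWithContext(ctx, name, ch, vals)
	if err != nil {
		return false, err
	}

	// Fetch the manifest stored for the current release revision.
	current, err := action.NewGet(cfg).Run(name)
	if err != nil {
		return false, err
	}

	return dryRun.Manifest != current.Manifest, nil
}
```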


robinkb commented Oct 31, 2024

The case of deleting a deployment from a Helm chart was more of an example. It probably makes sense to elaborate on my use-case:

We are building an internal platform for managing EKS clusters for departments within our company. On each of these clusters we install a common set of resources that we call the runtime. This is your typical stuff like a CNI, CoreDNS, ArgoCD, all the way up to observability agents and policy rules. Pretty common use-case, but we don't want to allow departments to modify this runtime. This will partially be solved with RBAC within the EKS clusters, but we also want to make sure that even if they somehow modify (delete or edit) a resource that we manage, it automatically gets reconciled back. The vast majority of our runtime resources will be deployed by ArgoCD, and ArgoCD can already do what we want it to do. But we need to use CAAPH to manage a few resources that are deployed before ArgoCD, and also ArgoCD itself. And for those resources, we want to have the same level of certainty that their configuration is consistent across all of our clusters.

Now back to the issue at hand...

My initial thought was to just run Helm's Upgrade action with every reconciliation, so the equivalent of running helm upgrade constantly. Helm won't reinstall anything if no changes are detected, so I'm not worried about that. But that would result in a new Helm release every time in the workload cluster. Not very nice.

Unfortunately, it doesn't look like Helm's Upgrade action returns any useful information when performing a dry run. So I don't think that we can easily rely on that...

Another project with a mature Helm controller is Flux. Flux's Helm controller has a pretty simple Upgrade action. As stated in the documentation though, it relies on the caller to determine if an upgrade actually needs to happen. That might happen in the Diff action? I'm not sure. I started to dive into the controller code, and it's a lot.

There's also the quite popular helm-diff plugin which appears to have a function that returns a simple boolean to report whether or not it expects changes. I don't know how reliable this is, though.

So I'm not really sure how to proceed here. The base Helm code is not very useful for use in a controller, and Flux's Helm controller is quite complex.

> My initial thoughts would be to try to use a dry run of RunWithContext() and see if we can get a boolean indicating whether anything would have changed, and if not, leave it be.

And looking back on my initial comment about the performUpgrade function, I'm not even sure whether it would detect that changes to resources need to be made...


Jont828 commented Nov 7, 2024

I took a cursory pass through the code for Flux. It seems like the Diff function isn't tied to the Upgrade action; rather, it does a check for drift here, after the release is deployed, and it has its own controller dedicated to drift reconciliation. We could implement our own solution similar to how Flux does it, but I think it would be best to look into the easiest options first before we go that way.

  1. If we just run Helm upgrade on every reconciliation, like you said, it would likely repair the release on its own. What do you mean that it would result in a new Helm release every time? Do you mean to say that it would bump the revision/version count every time, even if the upgrade was a no-op?
  2. We could use the helm-diff plugin. It would be very nice to leverage an existing solution so we don't have to maintain one ourselves. Could you play around with it to see if it would cover your use cases? It has test cases, which inspires some confidence that it works. Even if it's not 100% reliable and we find edge cases, we could PR them to the repo.
  3. Just a thought, but what would happen if we tried to diff the results of helm template or helm upgrade --dry-run with helm get manifest on the existing release? They would both spit out a lot of YAML which we can compare easily by marshalling into JSON.
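
For option 3, a rough sketch of the comparison step, assuming the two strings come from helm template (or a dry-run upgrade) and from helm get manifest. The helper names, and the use of sigs.k8s.io/yaml plus go-cmp (the latter already used in the snippet above), are illustrative choices, not existing CAAPH code:

```go
package drift

import (
	"strings"

	"github.com/google/go-cmp/cmp"
	"sigs.k8s.io/yaml"
)

// manifestsDiffer reports whether two multi-document YAML manifests differ once
// each document is unmarshalled into a generic map, so formatting and key
// ordering do not cause false positives. It assumes both manifests list their
// documents in the same (deterministic) order, as Helm renders them.
func manifestsDiffer(rendered, deployed string) (bool, error) {
	a, err := normalize(rendered)
	if err != nil {
		return false, err
	}
	b, err := normalize(deployed)
	if err != nil {
		return false, err
	}
	return !cmp.Equal(a, b), nil
}

// normalize splits a multi-document YAML string and unmarshals each document.
func normalize(manifest string) ([]map[string]interface{}, error) {
	var docs []map[string]interface{}
	for _, doc := range strings.Split(manifest, "\n---\n") {
		if strings.TrimSpace(doc) == "" {
			continue
		}
		parsed := map[string]interface{}{}
		if err := yaml.Unmarshal([]byte(doc), &parsed); err != nil {
			return nil, err
		}
		docs = append(docs, parsed)
	}
	return docs, nil
}
```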


dmvolod commented Dec 4, 2024

Let me add my two cents on this topic

In the case of the HelmChartProxy and HelmReleaseProxy reconcile loops, reconciliation is triggered by events generated by these resources themselves or by the resources they watch (the Cluster API Cluster).

If these objects do not change (CRUD), no reconciliation event occurs. Therefore, the main problem of this task is not whether to update the Helm release, but how to receive events when some object managed by Helm changes or is deleted.

Yes, we could subscribe to all of these events and filter them by source (external changes versus the controller itself), but this would put a huge load on the cluster and the controller. We need to look for mechanisms that let us receive events when objects managed by Helm have changed.


Jont828 commented Dec 5, 2024

@dmvolod I agree that ideally we would like to get updates when the underlying resources change, but that would not be scalable. In CAPZ (Azure), we have a similar case where we need to detect changes in the Azure resources from the CAPZ controller, but we are unable to use a K8s watch. The precedent in that case is to have the controller run when it requeues automatically every 10 minutes, pull all the Azure resources, compute a diff between that and the spec, and update if needed. If we can find a way to watch updates to the Helm resources that isn't too expensive, that would be great, but otherwise I think we would need to follow the CAPZ approach.
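
For reference, a minimal sketch of that requeue pattern with controller-runtime; the reconciler type, the driftDetected helper, and the ten-minute interval are placeholders rather than existing CAAPH code:

```go
package drift

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Reconciler is a placeholder for the drift-checking reconciler discussed above.
type Reconciler struct{}

// driftDetected stands in for whatever comparison ends up being chosen
// (dry-run manifest diff, helm-diff, etc.).
func (r *Reconciler) driftDetected(ctx context.Context, req ctrl.Request) (bool, error) {
	return false, nil
}

// Reconcile requeues itself on a fixed interval so drift on the workload
// cluster is eventually noticed even when nothing changes on the management
// cluster, mirroring the CAPZ approach described above.
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	drifted, err := r.driftDetected(ctx, req)
	if err != nil {
		return ctrl.Result{}, err
	}
	if drifted {
		// Placeholder: trigger a Helm upgrade to repair the release.
	}
	return ctrl.Result{RequeueAfter: 10 * time.Minute}, nil
}
```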


Jont828 commented Dec 5, 2024

@robinkb Do you need help with finding a place to start? If so, I'm happy to hop on a call or Slack if you would like to work on this together.


dombott commented Dec 12, 2024

We have the exact same use case as @robinkb and would be glad to find a solution for it.
We would prefer to be able to configure on a HelmChartProxy whether and how often the drift correction should run.
The HelmChartProxy could then maybe be requeued after whatever timespan is configured if the drift correction is enabled.
The reconcile loop could then check if there is a drift using one of the above mentioned options and upgrade the helm release if needed.
If there are any open questions about the use case or what the benefit is for us, feel free to ask. I'll gladly share some more details.
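
To make that concrete, a purely hypothetical shape for such a knob on the HelmChartProxy spec could look roughly like this; the type, field names, and the metav1.Duration choice are assumptions, not part of the current API:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// DriftCorrection is a hypothetical addition to the HelmChartProxy spec,
// sketching the toggle and interval suggested above.
type DriftCorrection struct {
	// Enabled turns periodic drift detection on or off for this chart.
	Enabled bool `json:"enabled,omitempty"`

	// Interval is how often the release should be checked for drift and,
	// if needed, repaired, e.g. "10m".
	Interval *metav1.Duration `json:"interval,omitempty"`
}
```

The reconciler could then requeue after the configured interval whenever Enabled is set, along the lines of the requeue sketch earlier in this thread.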


robinkb commented Dec 16, 2024

I'm glad that there is interest in this issue. I would love to pick it up, but unfortunately, I will absolutely not have time to do so at my job. If anyone wants to pick this up instead, feel free to.


dmvolod commented Dec 23, 2024

I'm working on a Helm release drift controller manager with the following optimizations:

  • Watches and caches only objects matching a label selector based on the Helm release name and type (see the sketch at the end of this comment)
  • Stores only ObjectMeta in the cache
  • Starts and stops the drift controller manager dynamically based on manifest changes (if the GVKs do not change due to a new release install, no new manager is created)
  • Interacts with the HelmReleaseController via a channel (no external object reconciliation and watch)
  • Reduces the number of CRUD events with a ManagedFields.Manager filter (not implemented yet). This is a client-side filter, but it cuts off watches caused by HelmReleaseProxy's own update operations.

Please have a look at the WIP PR linked to this issue.
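
As a rough illustration of the first two points (not the actual PR code), a label-scoped, metadata-only watch with controller-runtime could look like this; the label keys and the Deployment example are assumptions about what the drift manager would watch:

```go
package drift

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// newDriftManager builds a manager whose cache only holds objects labelled as
// belonging to one Helm release, keeping memory and API load small.
func newDriftManager(cfg *rest.Config, releaseName string) (ctrl.Manager, error) {
	sel := labels.SelectorFromSet(labels.Set{
		"app.kubernetes.io/managed-by": "Helm",
		"app.kubernetes.io/instance":   releaseName,
	})
	return ctrl.NewManager(cfg, ctrl.Options{
		Cache: cache.Options{
			// Only cache objects carrying the release's labels.
			ByObject: map[client.Object]cache.ByObject{
				&appsv1.Deployment{}: {Label: sel},
			},
		},
	})
}

// addDriftController watches Deployments with builder.OnlyMetadata, so only
// ObjectMeta (no full specs) is stored in the informer cache.
func addDriftController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.Deployment{}, builder.OnlyMetadata).
		Complete(r)
}
```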
