Reconcile configuration drift in Helm releases #289
Comments
Thanks for reaching out via Slack, @Jont828! Here's my further analysis and proposal on this request. I think that the issue lies in this snippet:

```go
klog.V(2).Infof("Diff between values is:\n%s", cmp.Diff(existing.Config, values))

// TODO: Comparing yaml is not ideal, but it's the best we can do since DeepEquals fails. This is because int64 types
// are converted to float64 when returned from the helm API.
oldValues, err := yaml.Marshal(existing.Config)
if err != nil {
	return false, errors.Wrapf(err, "failed to marshal existing release values")
}
newValues, err := yaml.Marshal(values)
if err != nil {
	return false, errors.Wrapf(err, "failed to marshal new release values")
}
return !cmp.Equal(oldValues, newValues), nil
```

Making the decision on whether or not to run the Helm upgrade action based on comparing the old and new values of the release is the root of the problem. It doesn't consider the state of the resources on the target cluster, so there is no way of detecting drift in the resources managed by the Helm release. This behaviour makes sense when compared to the

In my opinion, it makes sense to remove the snippet above completely and rely on Helm to determine what should or should not be changed. This does change the behaviour of CAAPH a bit, so if you feel that it's better to place this behind a controller feature flag, or a flag on
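For illustration, here is a minimal sketch of what "relying on Helm" could look like with the Helm v3 Go SDK. The helper name and the way the namespace is passed are invented for this example; the only real SDK calls are action.NewUpgrade and Upgrade.Run, and it assumes an action.Configuration is already wired up against the workload cluster.

```go
package example

import (
	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart"
)

// upgradeRelease is a hypothetical helper: instead of comparing stored values
// against desired values, it always runs Helm's upgrade action and lets Helm
// decide what, if anything, has to change on the target cluster.
func upgradeRelease(cfg *action.Configuration, name, namespace string, ch *chart.Chart, values map[string]interface{}) error {
	upgrade := action.NewUpgrade(cfg)
	upgrade.Namespace = namespace

	// Even with an identical chart and values, the upgrade re-applies the
	// rendered manifests, so resources deleted out of band are re-created.
	// Whether edited fields are reverted depends on Helm's three-way merge
	// behaviour.
	_, err := upgrade.Run(name, ch, values)
	return err
}
```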
/triage accepted
@robinkb That makes sense, and that's a nice catch that I didn't notice. I'm curious though, in what case did you end up deleting a deployment from the Helm chart? In terms of implementation, I think it would be a good idea to try to leverage the
The case of deleting a deployment from a Helm chart was more of an example. It probably makes sense to elaborate on my use-case: we are building an internal platform for managing EKS clusters for departments within our company. On each of these clusters we install a common set of resources that we call the runtime. This is your typical stuff like a CNI, CoreDNS, ArgoCD, all the way up to observability agents and policy rules.

Pretty common use-case, but we don't want to allow departments to modify this runtime. This will partially be solved with RBAC within the EKS clusters, but we also want to make sure that even if they somehow modify (delete or edit) a resource that we manage, it automatically gets reconciled back. The vast majority of our runtime resources will be deployed by ArgoCD, and ArgoCD can already do what we want it to do. But we need to use CAAPH to manage a few resources that are deployed before ArgoCD, and also ArgoCD itself. And for those resources, we want to have the same level of certainty that their configuration is consistent across all of our clusters.

Now back to the issue at hand... My initial thought was to just run Helm's

Unfortunately, it doesn't look like Helm's

Another project with a mature Helm controller is Flux. Flux's Helm controller has a pretty simple

There's also the quite popular

So I'm not really sure how to proceed here. The base Helm code is not very useful for use in a controller, and Flux's Helm controller is quite complex.
And looking back on my initial comment about the
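For reference, a very rough sketch of the dry-run style drift check discussed above, loosely inspired by Flux's approach but not its actual implementation, might look something like this with controller-runtime. The function name, the field-owner string, and the spec-only comparison are all simplifications made up for this example.

```go
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/api/equality"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// hasDrifted sketches one way to detect drift for a single object rendered by
// a Helm release: server-side dry-run apply the desired object and compare the
// result against what is currently on the cluster.
func hasDrifted(ctx context.Context, c client.Client, desired *unstructured.Unstructured) (bool, error) {
	live := desired.DeepCopy()
	err := c.Get(ctx, client.ObjectKeyFromObject(desired), live)
	if apierrors.IsNotFound(err) {
		// The resource was deleted out of band: that is drift.
		return true, nil
	}
	if err != nil {
		return false, err
	}

	// A server-side apply in dry-run mode returns what the object would look
	// like after applying the desired state, without persisting anything.
	dryRun := desired.DeepCopy()
	if err := c.Patch(ctx, dryRun, client.Apply, client.DryRunAll,
		client.FieldOwner("caaph-drift-check"), client.ForceOwnership); err != nil {
		return false, err
	}

	// Comparing only the spec is a simplification; metadata such as
	// resourceVersion and managedFields will always differ.
	return !equality.Semantic.DeepEqual(dryRun.Object["spec"], live.Object["spec"]), nil
}
```

A real implementation would have to run a check like this over every object rendered by the release and deal with objects that have no spec field, which hints at why Flux's version is considerably more involved.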
I took a cursory pass through the code for Flux. It seems like the
Let me add my two cents on this topic. In the case of the HelmChartProxy and HelmReleaseProxy reconcile loops, reconciliation is triggered by events generated by these resources themselves or by watched Kubernetes resources (the Cluster API Cluster). If these objects do not change (CRUD), then no reconciliation event occurs. Therefore, the main problem in this task is not whether or not to update the Helm release, but how to receive events when some object managed by Helm changes or is deleted. Yes, we could subscribe to all of these events and filter them by source (external or the controller itself), but this would be a huge load on the cluster and the controller. We need to look for some mechanism that will help us receive events when objects managed by Helm have changed or been deleted.
@dmvolod I agree that ideally, what we would like to do is get updates when the underlying resource changes, but I also agree that it would not be scalable. In CAPZ (Azure), we have a similar case where we need to detect changes in the Azure resources from the CAPZ controller, but we are unable to use a K8s watch. The precedent in that case is to have the controller run when it requeues automatically every 10 minutes, pull all the Azure resources, compute a diff between that and the spec, and update if needed. If we can find a way to watch updates to the Helm resources that isn't too expensive, that would be great, but otherwise, I think we would need to follow the CAPZ approach.
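To make that concrete, here is a minimal controller-runtime sketch of the CAPZ-style periodic resync. The reconciler type is a stand-in, not CAAPH's actual code, and the 10-minute constant simply mirrors the interval mentioned above.

```go
package example

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// driftResyncPeriod mirrors the 10-minute interval mentioned above; in a real
// implementation this would likely be configurable.
const driftResyncPeriod = 10 * time.Minute

// helmReleaseProxyReconciler is a stand-in for the real CAAPH reconciler; only
// the periodic requeue behaviour is sketched here.
type helmReleaseProxyReconciler struct{}

func (r *helmReleaseProxyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... fetch the HelmReleaseProxy, check the workload cluster for drift,
	// and re-run the Helm upgrade if anything diverged ...

	// Returning RequeueAfter makes controller-runtime call Reconcile again
	// after the interval, even if no watch event fires in the meantime.
	return ctrl.Result{RequeueAfter: driftResyncPeriod}, nil
}
```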
@robinkb Do you need help with finding a place to start? If so, I'm happy to hop on a call or Slack if you would like to work on this together.
We have the same exact use case as @robinkb and would be glad to find a solution for that. |
I'm glad that there is interest in this issue. I would love to pick it up, but unfortunately, I will absolutely not have time to do so at my job. If anyone wants to pick this up instead, feel free to. |
I'm working on the helm release drift controller manager with optimization options which are:
Please have a look at the WIP MR linked to this issue.
User Story
As an operator I would like CAAPH to reconcile configuration drift in Helm releases.
Detailed Description
If I modify or delete a Kubernetes resource on the target cluster that is managed by CAAPH through a HelmReleaseProxy, it is not reconciled to the desired state.
For example, if I delete a Deployment deployed through CAAPH, it is not recreated.
When using Helm directly, a release can be repaired by running helm upgrade with the same version and values as before.
CAAPH seems to skip executing any actions on the target cluster if there is no change in the HelmReleaseProxy resource.
Making any change in the HelmReleaseProxy resource (like bumping the version) does trigger a reconcile and restores any modified or deleted resources back to the desired state.
It would be nice to have some way for CAAPH to automatically reconcile configuration drift without having to make changes in the HelmReleaseProxy. Perhaps as a toggle on the HelmReleaseProxy, or a feature flag on the controller manager.
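Purely as a hypothetical illustration of that toggle (none of the field names below exist in CAAPH today), such an opt-in knob might look like this:

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// helmReleaseProxyDriftOptions is an invented example type, not CAAPH's actual
// API; it only illustrates what an opt-in drift-reconciliation toggle on the
// HelmReleaseProxy (or a controller-level feature flag) might expose.
type helmReleaseProxyDriftOptions struct {
	// ReconcileDrift, if true, tells the controller to periodically re-check
	// the target cluster and re-run the Helm upgrade when managed resources
	// have been modified or deleted out of band.
	ReconcileDrift bool `json:"reconcileDrift,omitempty"`

	// DriftResyncPeriod controls how often that check runs (hypothetical field).
	DriftResyncPeriod *metav1.Duration `json:"driftResyncPeriod,omitempty"`
}
```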
Anything else you would like to add:
Not really, just that this project is very cool :)
/kind feature