Add `ensure` Option to `spec.uninstall` for Ensuring Proper Cleanup of HelmChartProxy Resources #319
/triage accepted
Will take a look on Thurs.
If I understand this correctly, you would like to be able to block Cluster deletion until we've successfully done the Helm uninstall operations, is that right?
I'm not too clear on how the Helm resources on the workload Cluster would exist after the Cluster is deleted. The Helm chart is running on the VMs and such of the workload Cluster, I thought those would get deleted anyway. Do you have an example/use case where there's a security risk, or where a Helm resource persists after the Cluster is deleted?
Thank you for your response. As you correctly understood, my proposal aims to ensure that HelmChartProxy resources are fully uninstalled before the Cluster deletion process completes. Below, I provide detailed use cases and risks to clarify the necessity of this feature.

Specific Use Cases and Risks
Current Issues
Benefits of the Proposal
Summary

Proper cleanup of external resources during Cluster deletion is crucial for maintaining security, managing costs, and ensuring operational efficiency. Introducing the `ensure` option addresses these concerns.

I hope this clarifies the importance and benefits of the proposed feature. Please feel free to reach out with any further questions or feedback.

Best regards, kahirokunn
That makes sense, I think this would be a good feature to add. I'm not too clear on what the implementation would look like, however. I haven't used any of the hooks like BeforeClusterDelete, would that be something we implement in code, or is it something that's defined in YAML? And would you like to implement this feature?
Thank you for your response. I'm glad to hear that you think this would be a valuable feature to add. To address your questions, implementing the `BeforeClusterDelete` hook involves both YAML and code: the extension is registered with Cluster API through an `ExtensionConfig` resource defined in YAML, for example:

```yaml
apiVersion: runtime.cluster.x-k8s.io/v1alpha1
kind: ExtensionConfig
metadata:
  annotations:
    runtime.cluster.x-k8s.io/inject-ca-from-secret: default/test-runtime-sdk-svc-cert
  name: test-runtime-sdk-extensionconfig
spec:
  clientConfig:
    service:
      name: test-runtime-sdk-svc
      namespace: default # Assumes the extension is deployed in the default namespace
      port: 443
  namespaceSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values:
          - default # Assumes the extension is used by Clusters in the default namespace only
```

However, we also need to set up a webhook server to handle the incoming webhook requests. You can find a working sample implementation here: Runtime Hooks Sample Implementation. For detailed guidelines on implementing extensions, please refer to the official documentation: Cluster API Runtime SDK - Implement Extensions.

Additionally, we'll need to modify the Custom Resource Definitions (CRDs) and the Go structs to include the new optional `ensure` field under `spec.uninstall`, for example:
```yaml
apiVersion: addons.cluster.x-k8s.io/v1alpha1
kind: HelmChartProxy
metadata:
  name: karpenter
spec:
  clusterSelector:
    matchLabels:
      karpenterChart: enabled
  repoURL: oci://public.ecr.aws/karpenter
  chartName: karpenter
  options:
    waitForJobs: true
    wait: true
    timeout: 5m
    install:
      createNamespace: true
    uninstall:
      ensure: true # New optional field (default: false)
  valuesTemplate: |
    controller:
      replicaCount: 2
```

Both the CRD and the corresponding Go structs would need this change.
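To make the code side of this more concrete, here is a rough sketch of how the extension's `BeforeClusterDelete` handler could look. This is only my assumption of the shape it might take, not a final design; in particular the CAAPH import path, the cluster-name label key, and the idea of simply listing `HelmReleaseProxy` resources to decide whether to keep blocking are all things I would verify first:

```go
package lifecycle

import (
	"context"
	"fmt"

	addonsv1alpha1 "sigs.k8s.io/cluster-api-addon-provider-helm/api/v1alpha1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	runtimehooksv1 "sigs.k8s.io/cluster-api/exp/runtime/hooks/api/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ExtensionHandlers holds the client used to inspect CAAPH resources.
type ExtensionHandlers struct {
	Client client.Client
}

// DoBeforeClusterDelete is invoked by Cluster API before a Cluster is deleted.
// It keeps the deletion blocked (by asking CAPI to retry) while HelmReleaseProxy
// resources for the Cluster still exist, i.e. while Helm uninstalls are pending.
func (h *ExtensionHandlers) DoBeforeClusterDelete(ctx context.Context, req *runtimehooksv1.BeforeClusterDeleteRequest, resp *runtimehooksv1.BeforeClusterDeleteResponse) {
	proxies := &addonsv1alpha1.HelmReleaseProxyList{}
	if err := h.Client.List(ctx, proxies,
		client.InNamespace(req.Cluster.Namespace),
		// Assumption: HelmReleaseProxy objects carry the standard cluster-name label.
		client.MatchingLabels{clusterv1.ClusterNameLabel: req.Cluster.Name},
	); err != nil {
		resp.SetStatus(runtimehooksv1.ResponseStatusFailure)
		resp.SetMessage(err.Error())
		return
	}

	if len(proxies.Items) > 0 {
		// HelmReleaseProxy objects still exist: keep blocking and ask CAPI to retry.
		// (A real implementation would also check whether uninstall.ensure is set.)
		resp.SetStatus(runtimehooksv1.ResponseStatusSuccess)
		resp.SetMessage(fmt.Sprintf("waiting for %d Helm release(s) to be uninstalled", len(proxies.Items)))
		resp.RetryAfterSeconds = 10
		return
	}

	// Nothing left to uninstall; allow the Cluster deletion to proceed.
	resp.SetStatus(runtimehooksv1.ResponseStatusSuccess)
}
```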
There are some aspects of the current controller behavior I would like to confirm first: whether creating a HelmReleaseProxy is already blocked for a Cluster that has a deletion timestamp, and whether deleting the HelmChartProxy (or the Cluster) also deletes the corresponding HelmReleaseProxy resources. If both conditions are met, then the runtime webhook server would primarily only need to block the Cluster deletion, via the BeforeClusterDelete hook, until the Helm releases have been uninstalled.

I'm more than willing to begin by investigating these details and proceeding based on what I find. This would be my first time contributing to the Cluster API (CAPI) project, but I'd be happy to take on the implementation of this feature if that's alright with you. Please let me know your thoughts or if there's anything else you'd like me to address.

Best regards, kahirokunn
Currently, we do not have any logic to block the creation of a HelmReleaseProxy if a HelmChartProxy is newly created and selects a Cluster that has a deletion timestamp. However, I think that would be pretty simple to implement by adding some logic to block the HelmReleaseProxy creation that happens here. The HelmReleaseProxy is owned by both the Cluster and the HelmChartProxy, so I believe deleting either owner should also delete the HelmReleaseProxy. Come to think of it, I think you can use the blockOwnerDeletion field in the ownerRef to the Cluster to wait until the Helm resources are cleaned up, as described here.
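For reference, a sketch of what that ownerRef on a HelmReleaseProxy could look like (the names here are hypothetical; in practice the controller sets these fields):

```yaml
apiVersion: addons.cluster.x-k8s.io/v1alpha1
kind: HelmReleaseProxy
metadata:
  name: karpenter-workload-cluster # hypothetical name
  ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1beta1
      kind: Cluster
      name: workload-cluster # hypothetical owning Cluster
      uid: f0e1d2c3-0000-0000-0000-000000000000 # placeholder
      # With foreground cascading deletion, the owner (the Cluster) is not fully
      # removed until dependents that set blockOwnerDeletion: true are gone.
      blockOwnerDeletion: true
```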
Thank you for your valuable feedback. I have opened a PR for this, PR URL: #328. I would appreciate your review!
Sounds good! Just to clarify, are you already having the issue of Helm resources not being cleaned up with the existing version of CAAPH? Since the blockOwnerDeletion behavior is already set, if you are having that issue, then it seems like something else might need to be changed.
Thank you for following up. At the moment, I have not observed any specific issues with resources failing to be cleaned up by CAAPH. However, in my previous work I frequently created and deleted EKS clusters (sometimes via Argo CD, Flux CD, or manual Helm installations) and found myself manually deleting numerous residual resources. This has led me to the conclusion that while blockOwnerDeletion is useful, it does not by itself guarantee that leftover resources will be properly removed.

In particular, blockOwnerDeletion only prevents certain resource deletions in etcd; it does not stop the assignment of a deletionTimestamp. Much cleanup logic is triggered when a resource acquires a deletionTimestamp, and blockOwnerDeletion is not meant to define dependencies between finalizers. Consequently, if a Cluster or ControlPlane resource persists, CAAPH's cleanup logic may attempt to run but fail because the cluster is no longer accessible or cannot be processed correctly, leaving some resources behind.

The runtime webhook (as proposed here: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20220414-runtime-hooks.md) is meant to address precisely these issues. While blockOwnerDeletion plays a part in controlling ownership-based cleanup, I believe relying on it alone will not fully solve the problem of leftover resources.

On a related note, I do believe that intentionally reproducing the scenario and gathering evidence would be an excellent approach to verify whether the current setup could lead to cleanup failures.
User Story
As an operator, I would like an option to ensure the uninstallation of HelmChartProxy resources. This is to properly clean up external resources during cluster deletion and prevent unnecessary resource remnants or unexpected behavior.
Detailed Description
I propose adding a new option, `ensure`, to the `spec.uninstall` section of HelmChartProxy to control its uninstallation behavior. By enabling this option, specific Helm charts can be guaranteed to uninstall properly during cluster deletion. The default value of `ensure` should be `false`, allowing users to explicitly opt into this behavior.

An example configuration:
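(The Karpenter chart below is purely illustrative; only the fields relevant to this proposal are shown, mirroring the full example discussed in the thread above.)

```yaml
apiVersion: addons.cluster.x-k8s.io/v1alpha1
kind: HelmChartProxy
metadata:
  name: karpenter
spec:
  clusterSelector:
    matchLabels:
      karpenterChart: enabled
  repoURL: oci://public.ecr.aws/karpenter
  chartName: karpenter
  options:
    uninstall:
      ensure: true # New optional field (default: false)
```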
Challenges
Currently, the following issues can arise during cluster deletion, making proper cleanup difficult:
Cluster Resource Deletion Before HelmChartProxy Uninstallation
If the Cluster resource is deleted before HelmChartProxy has been fully uninstalled, external resources created by the Helm chart may remain orphaned, leading to inconsistencies and potential security risks.
Concurrency Between HelmChartProxy Uninstallation and Cluster Deletion
When the Cluster deletion process starts while the HelmChartProxy is still uninstalling, the Cluster deletion may execute too quickly, preventing the HelmChartProxy from completing its uninstallation process. This can result in incomplete cleanup of resources.
To address these challenges, I suggest leveraging the Cluster API [Runtime Hooks for Add-on Management](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20220414-runtime-hooks.md). Specifically, the `BeforeClusterDelete` hook can be used to block Cluster deletion until the relevant HelmChartProxy resources have been fully uninstalled.

Other Information
/kind feature