
Add ensure Option to spec.uninstall for Ensuring Proper Cleanup of HelmChartProxy Resources #319

Open
kahirokunn opened this issue Dec 2, 2024 · 9 comments · May be fixed by #328
Labels
kind/feature Categorizes issue or PR as related to a new feature.


@kahirokunn

User Story

As an operator, I would like an option to ensure the uninstallation of HelmChartProxy resources, so that external resources are properly cleaned up during cluster deletion and unnecessary resource remnants or unexpected behavior are avoided.

Detailed Description

I propose adding a new option, ensure, to the spec.uninstall section of HelmChartProxy to control its uninstallation behavior. By enabling this option, specific Helm charts can be guaranteed to uninstall properly during cluster deletion. The default value of ensure should be false, allowing users to explicitly opt into this behavior.

An example configuration:

apiVersion: addons.cluster.x-k8s.io/v1alpha1
kind: HelmChartProxy
metadata:
  name: karpenter
spec:
  clusterSelector:
    matchLabels:
      karpenterChart: enabled
  repoURL: oci://public.ecr.aws/karpenter
  chartName: karpenter
  options:
    waitForJobs: true
    wait: true
    timeout: 5m
    install:
      createNamespace: true
    uninstall:
      ensure: true  # New option (default: false)
  valuesTemplate: |
    controller:
      replicaCount: 2

Challenges

Currently, the following issues can arise during cluster deletion, making proper cleanup difficult:

  1. Cluster Resource Deletion Before HelmChartProxy Uninstallation
    If the Cluster resource is deleted before HelmChartProxy has been fully uninstalled, external resources created by the Helm chart may remain orphaned, leading to inconsistencies and potential security risks.

  2. Concurrency Between HelmChartProxy Uninstallation and Cluster Deletion
    When the Cluster deletion process starts while the HelmChartProxy is still uninstalling, the Cluster deletion may execute too quickly, preventing the HelmChartProxy from completing its uninstallation process. This can result in incomplete cleanup of resources.

To address these challenges, I suggest leveraging the Cluster API [Runtime Hooks for Add-on Management](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20220414-runtime-hooks.md). Specifically, the BeforeClusterDelete hook can be used to (see the sketch after this list):

  • Ensure that HelmChartProxy resources are fully uninstalled before proceeding with Cluster deletion.
  • Introduce synchronization points to prevent race conditions between HelmChartProxy uninstallation and Cluster resource deletion.
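
To make this concrete, here is a minimal Go sketch of what such a BeforeClusterDelete handler could look like. It assumes the Cluster API Runtime SDK hook types; the countPendingReleaseProxies helper is hypothetical and stands in for "list the HelmReleaseProxy objects that still reference this Cluster and have uninstall.ensure enabled".

package extension

import (
	"context"
	"fmt"

	runtimehooksv1 "sigs.k8s.io/cluster-api/exp/runtime/hooks/api/v1alpha1"
)

// doBeforeClusterDelete blocks Cluster deletion while matching HelmReleaseProxy
// resources still exist, by asking Cluster API to retry the hook later.
func doBeforeClusterDelete(ctx context.Context, req *runtimehooksv1.BeforeClusterDeleteRequest, resp *runtimehooksv1.BeforeClusterDeleteResponse) {
	pending, err := countPendingReleaseProxies(ctx, req.Cluster.Namespace, req.Cluster.Name)
	if err != nil {
		resp.Status = runtimehooksv1.ResponseStatusFailure
		resp.Message = err.Error()
		return
	}
	if pending > 0 {
		// A non-zero RetryAfterSeconds tells Cluster API to re-invoke the hook
		// later, which effectively pauses Cluster deletion until the releases
		// marked with uninstall.ensure have been removed.
		resp.Status = runtimehooksv1.ResponseStatusSuccess
		resp.RetryAfterSeconds = 30
		resp.Message = fmt.Sprintf("%d HelmReleaseProxy object(s) are still being uninstalled", pending)
		return
	}
	resp.Status = runtimehooksv1.ResponseStatusSuccess
}

// countPendingReleaseProxies is a placeholder; a real implementation would use a
// controller-runtime client to list HelmReleaseProxy resources for the Cluster.
func countPendingReleaseProxies(ctx context.Context, namespace, clusterName string) (int, error) {
	return 0, nil
}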

Other Information

  • Using Runtime Hooks can integrate the add-on uninstallation process with the cluster lifecycle, ensuring a smoother deletion process.
  • For Helm charts that create external resources (e.g., Karpenter), this feature guarantees resource consistency and proper cleanup.
  • Addressing these challenges reduces the risk of resource remnants, inconsistencies, and security issues during cluster lifecycle management.

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 2, 2024
@Jont828 (Contributor) commented Dec 4, 2024

/triage-accepted

Will take a look on Thurs.

@Jont828 (Contributor) commented Dec 5, 2024

If I understand this correctly, you would like to be able to block Cluster deletion until we've successfully done the Helm uninstall operations, is that right?

  1. Cluster Resource Deletion Before HelmChartProxy Uninstallation
    If the Cluster resource is deleted before HelmChartProxy has been fully uninstalled, external resources created by the Helm chart may remain orphaned, leading to inconsistencies and potential security risks.

I'm not too clear on how the Helm resources on the workload Cluster would exist after the Cluster is deleted. The Helm chart runs on the workload Cluster's VMs and related infrastructure, which I thought would get deleted anyway. Do you have an example/use case where there's a security risk, or where a Helm resource persists after the Cluster is deleted?

@kahirokunn (Author)

Thank you for your response.

As you correctly understood, my proposal aims to ensure that HelmChartProxy resources are fully uninstalled before the Cluster deletion process completes. Below, I provide detailed use cases and risks to clarify the necessity of this feature.

Specific Use Cases and Risks

  1. Using Karpenter

    • Role: Karpenter automatically provisions and manages AWS EC2 instances and Kubernetes node resources.
    • Risk: If the Cluster is deleted without properly uninstalling Karpenter’s Helm chart, the EC2 instances and nodes managed by Karpenter may remain running. This leads to unnecessary costs and potential resource wastage.
  2. Using Crossplane

    • Role: Crossplane manages cloud provider resources such as RDS instances and S3 buckets.
    • Risk: During Cluster deletion, if Crossplane's resources are not correctly cleaned up, resources like RDS instances or S3 buckets might persist unintentionally. This can result in data leakage, increased costs, and potential security vulnerabilities.
  3. Using external-dns

    • Role: external-dns automatically creates and manages DNS records.
    • Risk: If the Cluster is deleted without properly uninstalling external-dns, DNS records may remain. This can cause domain misrouting, security risks, and operational issues.

Current Issues

  • Uninstallation Timing: Currently, the Cluster resource deletion process may execute before HelmChartProxy has a chance to fully uninstall the associated Helm charts. This increases the risk of external resources remaining orphaned.

  • Manual Cleanup Challenges: In large-scale environments, where there are hundreds or thousands of cloud infrastructure and subsystem resources, manually deleting these resources is extremely tedious and time-consuming. From my experience, this process can be described as “hellish” due to its complexity and the high potential for human error.

Benefits of the Proposal

  • Automated Consistent Cleanup: By enabling the ensure option, the uninstallation of Helm charts will be guaranteed during Cluster deletion. This prevents external resources from being left behind, ensuring a clean and consistent environment.

  • Enhanced Security and Cost Optimization: Proper cleanup of cloud resources minimizes security risks associated with orphaned resources and helps in avoiding unnecessary expenses from unused infrastructure.

  • Reduced Operational Burden: Automating the cleanup process significantly lowers the operational workload for administrators, reducing the likelihood of errors and saving valuable time.

Summary

Proper cleanup of external resources during Cluster deletion is crucial for maintaining security, managing costs, and ensuring operational efficiency. Introducing the ensure option in the HelmChartProxy's spec.uninstall section will effectively address these challenges by automating the uninstallation process. This enhancement will lead to more reliable cluster lifecycle management and mitigate the risks associated with orphaned resources.

I hope this clarifies the importance and benefits of the proposed feature. Please feel free to reach out with any further questions or feedback.

Best regards,

kahirokunn

@Jont828 (Contributor) commented Dec 11, 2024

That makes sense; I think this would be a good feature to add. I'm not too clear on what the implementation would look like, however. I haven't used any of the hooks like BeforeClusterDelete; would that be something we implement in code, or is it something that's defined in YAML? And would you like to implement this feature?

@kahirokunn (Author)

Thank you for your response. I'm glad to hear that you think this would be a valuable feature to add.

To address your questions, implementing the BeforeClusterDelete hook involves both YAML configuration and some code implementation. Specifically, you can receive webhooks by defining an ExtensionConfig resource like the following:

apiVersion: runtime.cluster.x-k8s.io/v1alpha1
kind: ExtensionConfig
metadata:
  annotations:
    runtime.cluster.x-k8s.io/inject-ca-from-secret: default/test-runtime-sdk-svc-cert
  name: test-runtime-sdk-extensionconfig
spec:
  clientConfig:
    service:
      name: test-runtime-sdk-svc
      namespace: default  # Assumes the extension is deployed in the default namespace
      port: 443
  namespaceSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values:
          - default  # Assumes the extension is used by Clusters in the default namespace only

However, we also need to set up a webhook server to handle the incoming webhook requests. You can find a working sample implementation here:

Runtime Hooks Sample Implementation

For detailed guidelines on implementing extensions, please refer to the official documentation:

Cluster API Runtime SDK - Implement Extensions
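
To give a feel for the wiring, here is a rough Go sketch of registering a BeforeClusterDelete handler with the Runtime SDK webhook server, modeled loosely on the sample implementation above. The package paths and option fields reflect my understanding of the Runtime SDK and may differ between Cluster API versions, so treat this as orientation rather than final code:

package main

import (
	"context"
	"os"

	runtimecatalog "sigs.k8s.io/cluster-api/exp/runtime/catalog"
	runtimehooksv1 "sigs.k8s.io/cluster-api/exp/runtime/hooks/api/v1alpha1"
	"sigs.k8s.io/cluster-api/exp/runtime/server"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Register the runtime hook types so the server can decode/encode requests.
	catalog := runtimecatalog.New()
	_ = runtimehooksv1.AddToCatalog(catalog)

	// This is the HTTPS server that the ExtensionConfig's clientConfig points at;
	// the Service in the ExtensionConfig would forward its port 443 here.
	srv, err := server.New(server.Options{
		Catalog: catalog,
		Port:    9443,
		CertDir: "/tmp/k8s-webhook-server/serving-certs",
	})
	if err != nil {
		os.Exit(1)
	}

	// Wire the BeforeClusterDelete hook to the handler.
	if err := srv.AddExtensionHandler(server.ExtensionHandler{
		Hook:        runtimehooksv1.BeforeClusterDelete,
		Name:        "before-cluster-delete",
		HandlerFunc: doBeforeClusterDelete,
	}); err != nil {
		os.Exit(1)
	}

	if err := srv.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}

// doBeforeClusterDelete would hold the blocking logic sketched earlier in the
// issue description; a trivial stub is included so this snippet is self-contained.
func doBeforeClusterDelete(_ context.Context, _ *runtimehooksv1.BeforeClusterDeleteRequest, resp *runtimehooksv1.BeforeClusterDeleteResponse) {
	resp.Status = runtimehooksv1.ResponseStatusSuccess
}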

Additionally, we'll need to modify the Custom Resource Definitions (CRDs) and the Go structs to include the new optional ensure field in the uninstall section:

apiVersion: addons.cluster.x-k8s.io/v1alpha1
kind: HelmChartProxy
metadata:
  name: karpenter
spec:
  clusterSelector:
    matchLabels:
      karpenterChart: enabled
  repoURL: oci://public.ecr.aws/karpenter
  chartName: karpenter
  options:
    waitForJobs: true
    wait: true
    timeout: 5m
    install:
      createNamespace: true
    uninstall:
      ensure: true  # New optional field (default: false)
  valuesTemplate: |
    controller:
      replicaCount: 2
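
For the Go side, the change could look roughly like this; apart from the new Ensure field, the type and field names are my reading of the CAAPH v1alpha1 API and may not match the code exactly:

package v1alpha1

// HelmUninstallOptions corresponds to spec.options.uninstall. Existing fields
// are omitted here for brevity.
type HelmUninstallOptions struct {
	// Ensure, when true, blocks Cluster deletion (via the BeforeClusterDelete
	// runtime hook) until the Helm release managed through this proxy has been
	// uninstalled. Defaults to false.
	// +optional
	Ensure bool `json:"ensure,omitempty"`
}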

Both HelmChartProxy and HelmReleaseProxy are involved in this process. HelmChartProxy acts as a higher-level resource that creates HelmReleaseProxy instances for multiple Cluster resources.

There are some aspects of the HelmReleaseProxy lifecycle that I'm currently uncertain about:

  1. Creation During Cluster Deletion: If a HelmChartProxy is newly created when a Cluster resource has a deletionTimestamp (or if the selector is changed to match such a Cluster), does the controller prevent the creation of a HelmReleaseProxy for that Cluster? Essentially, is the controller designed to avoid creating HelmReleaseProxy resources for Clusters that are in the process of being deleted?

  2. Cleanup on Cluster Deletion: When a Cluster resource receives a deletionTimestamp, does the controller initiate the cleanup process so that, after cleanup is completed, the associated HelmReleaseProxy is deleted? I would like to know if the controller behaves this way.

If both conditions are met, then the runtime webhook server implementation would primarily just need to block the Cluster deletion.

I'm more than willing to begin by investigating these details and proceeding based on what I find.

This would be my first time contributing to the Cluster API (CAPI) project, but I'd be happy to take on the implementation of this feature if that's alright with you.

Please let me know your thoughts or if there's anything else you'd like me to address.

Best regards,

kahirokunn

@Jont828 (Contributor) commented Dec 17, 2024

Currently, we do not have any logic to block the creation of a HelmReleaseProxy if a HelmChartProxy is newly created and selects a Cluster that has a deletion timestamp. However, I think that would be pretty simple to implement by adding some logic to block the HelmReleaseProxy creation that happens here.
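
For illustration, the guard could be as small as the following sketch; the function and package names are hypothetical, not the actual CAAPH code referenced above:

package controllers

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// shouldCreateReleaseProxy returns false for Clusters that already have a
// deletionTimestamp, so no new HelmReleaseProxy is created for them.
func shouldCreateReleaseProxy(cluster *clusterv1.Cluster) bool {
	return cluster.DeletionTimestamp.IsZero()
}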

The HelmReleaseProxy is owned by both the Cluster and the HelmChartProxy, so I believe deleting either owner should also delete the HelmReleaseProxy. Come to think of it, I think you can use the blockOwnerDeletion field in the ownerRef to the Cluster to wait until the Helm resources are cleaned up, as described here.
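
As a sketch of that idea, an owner reference from the HelmReleaseProxy to its Cluster with blockOwnerDeletion set could be built like this; the helper name and placement are illustrative only:

package controllers

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// clusterOwnerRef returns an owner reference to the Cluster. With
// BlockOwnerDeletion set, a foreground deletion of the Cluster cannot complete
// while this dependent object still exists.
func clusterOwnerRef(cluster *clusterv1.Cluster) metav1.OwnerReference {
	return metav1.OwnerReference{
		APIVersion:         clusterv1.GroupVersion.String(),
		Kind:               "Cluster",
		Name:               cluster.Name,
		UID:                cluster.UID,
		BlockOwnerDeletion: ptr.To(true),
	}
}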

@kahirokunn linked a pull request Dec 18, 2024 that will close this issue
@kahirokunn (Author)

Thank you for your valuable feedback.

Since ownerReference and blockOwnerDeletion are set, it appears that handling deletion explicitly may not be necessary. Therefore, I have created a PR to skip only the creation process.

PR URL: #328

I would appreciate your review!

@Jont828 (Contributor) commented Dec 18, 2024

Sounds good! Just to clarify, are you already running into the issue of Helm resources not being cleaned up in the existing version of CAAPH? Since the blockOwnerDeletion behavior is already set, if you are having that issue, then it seems like something else might need to be changed.

@kahirokunn (Author) commented Dec 19, 2024

Thank you for following up. At the moment, I have not observed any specific issues with resources failing to be cleaned up by CAAPH. However, in my previous work, I frequently created and deleted EKS clusters—sometimes via Argo CD, Flux CD, or manual Helm installations—and found myself manually deleting numerous residual resources. This has led me to the conclusion that while blockOwnerDeletion is useful, it does not by itself guarantee that leftover resources will be properly removed.

In particular, blockOwnerDeletion only prevents certain resource deletions in etcd; it does not stop the assignment of a deletionTimestamp. Much cleanup logic is triggered when a resource acquires a deletionTimestamp, and blockOwnerDeletion is not meant to define dependencies between finalizers. Consequently, if a Cluster or ControlPlane resource persists, CAAPH’s cleanup logic may attempt to run but fail, because the cluster is no longer accessible or cannot be processed correctly, leaving some resources behind.

The runtime webhook (as proposed here: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20220414-runtime-hooks.md) is meant to address precisely these issues. While blockOwnerDeletion plays a part in controlling ownership-based cleanup, I believe relying on that alone will not fully solve the problem of leftover resources.

On a related note, I do believe that intentionally reproducing the scenario and gathering evidence would be an excellent approach to verify whether the current setup could lead to cleanup failures.
