Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

App gets stuck deleting if service account is not created #114

Open
benmoss opened this issue Feb 25, 2021 · 15 comments
Open

App gets stuck deleting if service account is not created #114

benmoss opened this issue Feb 25, 2021 · 15 comments
Labels
bug This issue describes a defect or unexpected behavior carvel-accepted This issue should be considered for future work and that the triage process has been completed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. priority/unprioritized-backlog Higher priority than priority/awaiting-more-evidence but not planned. Contributions are welcome.

Comments

@benmoss
Copy link
Contributor

benmoss commented Feb 25, 2021

What steps did you take:
Installed the "simple-app" from the instructions without first creating the service account for it.
https://github.com/vmware-tanzu/carvel-kapp-controller/blob/develop/docs/walkthrough.md

Tried to delete the app via kapp -a simple-app delete and see it hang/timeout.

What happened:

$ k get app
NAME         DESCRIPTION                                                                                         SINCE-DEPLOY   AGE
simple-app   Delete failed: Preparing kapp: Getting service account: serviceaccounts "default-ns-sa" not found   3s             56m

What did you expect:
Delete should succeed

Anything else you would like to add:

Environment:

  • kapp Controller version : v0.16.0
  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-21T01:11:42Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
@benmoss benmoss added bug This issue describes a defect or unexpected behavior carvel-triage This issue has not yet been reviewed for validity labels Feb 25, 2021
@danielhelfand danielhelfand added carvel-accepted This issue should be considered for future work and that the triage process has been completed and removed carvel-triage This issue has not yet been reviewed for validity labels Feb 25, 2021
@danielhelfand
Copy link
Contributor

I have also run into the issue described above in a similar scenario where I have used serviceAccount instead of serviceAccountName. This is definitely something we should look into prioritizing.

Could be similar to #69 but can dig a bit further to see if the issue itself has different nuance to it.

@vibhas
Copy link

vibhas commented Mar 15, 2021

Related to #71

@aaronshurley
Copy link
Contributor

aaronshurley commented Mar 22, 2021

Adding Explore to define/propose a solution.

Reasonable solutions could be:

  • improving the error message that's returned to the user
  • providing documentation about this

@aaronshurley aaronshurley changed the title App gets stuck deleting if service account is not created **Explore:** App gets stuck deleting if service account is not created Mar 22, 2021
@vibhas vibhas mentioned this issue Mar 24, 2021
32 tasks
@ewrenn8
Copy link
Contributor

ewrenn8 commented Mar 24, 2021

related: #34

@ewrenn8
Copy link
Contributor

ewrenn8 commented Apr 1, 2021

Hey @benmoss,

After some discussion, we've decided the best thing we could do here is to update the error message with a bit more info. Since kapp-controller will be deleting resources, we need a service account that has the roles needed to do the deletion, so that we know it is allowed from an rbac standpoint. This means if the service account can't be found, kapp-controller won't be able to delete things safely.

Going forward, the easiest remedy in this case is to create the service account again. We will be sure to call this information out in the error message when we update it.

Thanks!
Eli

@danielhelfand
Copy link
Contributor

We will also explicitly document how to resolve this issue in kapp-controller docs.

As part of documenting how to address this issue, something we will need to call out is what are minimum rbac permissions needed for the serviceaccount to resolve this issue so a user can proceed with deleting the App.

@danielhelfand danielhelfand changed the title **Explore:** App gets stuck deleting if service account is not created App gets stuck deleting if service account is not created Apr 1, 2021
@benmoss
Copy link
Contributor Author

benmoss commented Apr 1, 2021

If I had to guess why it's not straightforward:

Since it never had the service account in the first place it didn't create anything, but the tricky bit is knowing that for certain. You might also be in a situation where the service account existed, resources were created, and then the service account was deleted, and so knowing the difference would mean caching that state of "i created x, y, z" on the app itself?

@cppforlife
Copy link
Contributor

Since it never had the service account in the first place it didn't create anything, but the tricky bit is knowing that for certain.

ah, this is a good case to address.

@ewrenn8
Copy link
Contributor

ewrenn8 commented Apr 5, 2021

I will update this issue to call out fixing that specific case and create another one for improving docs or error messages for the case resources do exist.

Problem statement

Kapp controller currently doesn't distinguish between deletes that will delete resources and deletes that just require deleting the App CR. This means, when a service account is missing, kctrl refuses to perform a delete in both cases, even though the former is the only one that requires a service account.

Desired State

kapp controller is able to delete apps that have not had any resources deployed without a service account being present.

Use Case

  1. Deploy kapp-controller
  2. Skip RBAC setup
  3. Deploy simple app example
  4. Verify deploy failed before applying resources
  5. delete the simple app
  6. see that it deletes successfully

@julian-hj
Copy link

We saw another version of this bug when trying to delete a namespace that contains a kapp app and a service account that it refers to.

The whole mess hangs forever after the service account gets deleted before the kapp finalizer can run.

@github-project-automation github-project-automation bot moved this to To Triage in Carvel Feb 14, 2023
@neil-hickey neil-hickey moved this from To Triage to Unprioritized in Carvel Feb 22, 2023
@neil-hickey neil-hickey added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Feb 22, 2023
@heyjcollins
Copy link

+1 to @julian-hj ...
Can this be bumped in priority?
This is a serious blocker for customers working at scale.
It's not going to be feasible at scale for operators to manually inspect each namespace and delete all kaap apps before deleting the namespace itself.

@evankanderson
Copy link

One solution here might be to add the finalizer (which is what's blocking deletion) only after verifying that the service account exists and has access on the first reconciliation.

In combination with #1132, this could introduce the following model:

  • Create an App. Until it has has created at least one child resource, it can simply be deleted.
  • On first reconciliation / install, add a finalizer to the App resource immediately before writing other resources to the cluster, but after validating the service account's read access to the cluster.
    • At the same time, add an ownerReference with blockOwnerDeletion to the ServiceAccount used by the App.†
  • On deletion, the serviceAccount will still be valid (possibly with a deletionTimestamp) when the App is deleted.
  • When changing spec.serviceAccountName, the new ServiceAccount will need to be added as an ownerReference at the same time the old service account is removed. Note that if the new serviceAccountName does not point to a valid service account, we could have the same problem.
  • It's not clear to me what should be done with respect to ownerReferences when spec.serviceAccountName is empty.

† Note that adding this ownerReference may change an App from being an independent object to an object subject to garbage collection, i.e. if the ServiceAccount is deleted. If the App is part of some larger constellation like a PackageInstall, it will have two ownerReferences, and both will need to be deleted.

@ryanjbaxter
Copy link

What is the best way to resolve this once you get into this stat? You can't recreate the service account because the namespace is Terminating so I am not sure how you actually delete the apps to allow the namespace to be deleted.

@praveenrewar
Copy link
Member

@ryanjbaxter If you are already in this state, then you can remove the finalizers on the App CR. This might have side effect, if there were any resources created outside the namespace by the App, then those would still remain.

@ragaskar
Copy link

ragaskar commented Jan 13, 2024

So ... I got this wedged pretty good just now when a tanzu package install of cert manager failed on a vCenter 8 TKG 2.2 cluster. Not totally sure that is supposed to work -- but I couldn't edit the finalizers on the App CR because kubectl claimed it couldn't find the app (even though it was letting me edit it). I finally had to issue a "create namespace cert-manager" and THEN kubectl let me edit the app and remove the finalizers. I also had to remove some remaining cert manager service account stuff (clusterrole, maybe some other cluster-scoped stuff) by hand. n.b., you have to "replace" finalizers with an empty array -- just deleting the key will leave the original value there and it will stay stuck.

I'm fairly new to k8s, but this is definitely one of the trickier situations I've had to try to get myself out of. Any additional pointers or automated things that could have been done would have helped me.

EDIT: by the way, for others experiencing this problem, I'll share my root issue as a possible lead (as someone new to k8s, a hint like this would've help me a lot). Deep under the hood, (i.e., PackageInstall -> App -> ... -> ReplicaSet -> ) pods were failing to spin up in the cert-manager namespace due to lack of permissions (the namespace did not allow privileged pods). Instead of trying to "delete" my way out at the namespace level, I gather the "right" way to fix would have been to update the namespace to allow privileged pods (depends on your k8s version, but for me it required adding a label) and then retrying the pod deploy. I think this would have happened automatically on some cadence, but in my case, I was able to speed it up by deleting the failing ReplicaSets, which were then immediately re-created by the parent App and then, since the namespace now had the right permissions, the pods deployed within seconds. Again, being new to k8s, deleting this resource to "retry" seemed pretty scary to me (am I going to break something?), but might be an obvious step to someone more experienced. Anything the tool could have done to give me more hints around figuring this out would have been helpful. e.g., maybe errors from pods could be rolled up somehow and displayed at time of failure vs. me having to "know" where to look for the error I need to fix (some of this isn't really in kapp's domain, to be fair).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue describes a defect or unexpected behavior carvel-accepted This issue should be considered for future work and that the triage process has been completed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. priority/unprioritized-backlog Higher priority than priority/awaiting-more-evidence but not planned. Contributions are welcome.
Projects
Status: Unprioritized
Development

Successfully merging a pull request may close this issue.