Race condition with node startupTaints being restored #1772

Open
dpiddock opened this issue Oct 23, 2024 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@dpiddock

Description

Observed Behavior:
Karpenter restores a startupTaint on a node if that taint is removed too quickly at node startup. This leaves the node unusable. The node also never reaches a ready state, so Karpenter refuses to remove it: Cannot disrupt Node: state node isn't initialized

From AWS CloudWatch logs insights:

  • created state:
    requestObject.spec.taints.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
    requestObject.spec.taints.1 {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
    requestObject.spec.taints.2 {"key":"karpenter.sh/unregistered","effect":"NoExecute"}
    requestObject.spec.taints.3 {"key":"node.cloudprovider.kubernetes.io/uninitialized","value":"true","effect":"NoSchedule"}
    requestReceivedTimestamp 2024-10-22T13:21:21.946552Z
    
  • node-controller updates some taints
  • efs-csi removes its taint
    requestObject.0.op                test
    requestObject.0.path              /spec/taints
    requestObject.0.value.0.effect    NoExecute
    requestObject.0.value.0.key       ebs.csi.aws.com/agent-not-ready
    requestObject.0.value.1.effect    NoExecute
    requestObject.0.value.1.key       efs.csi.aws.com/agent-not-ready
    requestObject.0.value.2.effect    NoExecute
    requestObject.0.value.2.key       karpenter.sh/unregistered
    requestObject.0.value.3.effect    NoSchedule
    requestObject.0.value.3.key       node.kubernetes.io/not-ready
    requestObject.1.op                replace
    requestObject.1.path              /spec/taints
    requestObject.1.value.0.effect    NoExecute
    requestObject.1.value.0.key       ebs.csi.aws.com/agent-not-ready
    requestObject.1.value.1.effect    NoExecute
    requestObject.1.value.1.key       karpenter.sh/unregistered
    requestObject.1.value.2.effect    NoSchedule
    requestObject.1.value.2.key       node.kubernetes.io/not-ready
    requestReceivedTimestamp          2024-10-22T13:21:32.092879Z
    ...
    responseObject.spec.taints.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
    responseObject.spec.taints.1 {"key":"karpenter.sh/unregistered","effect":"NoExecute"}
    responseObject.spec.taints.2 {"key":"node.kubernetes.io/not-ready","effect":"NoSchedule"}
    
  • node-controller updates the not-ready taint
  • Karpenter removes the unregistered taint but also restores the efs taint for some reason:
    requestObject.spec.taints.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
    requestObject.spec.taints.1 {"key":"node.kubernetes.io/not-ready","effect":"NoExecute","timeAdded":"2024-10-22T13:21:34Z"}
    requestObject.spec.taints.2 {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
    requestReceivedTimestamp 2024-10-22T13:21:38.465463Z
    ...
    responseObject.spec.taints.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
    responseObject.spec.taints.1 {"key":"node.kubernetes.io/not-ready","effect":"NoExecute","timeAdded":"2024-10-22T13:21:34Z"}
    responseObject.spec.taints.2 {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
    
  • ebs removes its taint
  • node-controller removes the not-ready taint
  • Node never schedules pods
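
The sequence above is consistent with a write computed from a stale copy of the node. The following is a minimal sketch of that failure mode, not Karpenter's actual code: `Taint` is a simplified stand-in for the Kubernetes taint type, and `withoutKey` and `hasKey` are hypothetical helpers introduced only for illustration.

```go
package main

import "fmt"

// Taint is a simplified stand-in for a Kubernetes node taint (assumption:
// only key and effect matter for this illustration).
type Taint struct {
	Key    string
	Effect string
}

// withoutKey returns a copy of taints with every entry matching key removed.
// The input slice is not modified.
func withoutKey(taints []Taint, key string) []Taint {
	out := make([]Taint, 0, len(taints))
	for _, t := range taints {
		if t.Key != key {
			out = append(out, t)
		}
	}
	return out
}

// hasKey reports whether any taint in the list has the given key.
func hasKey(taints []Taint, key string) bool {
	for _, t := range taints {
		if t.Key == key {
			return true
		}
	}
	return false
}

func main() {
	const (
		ebs          = "ebs.csi.aws.com/agent-not-ready"
		efs          = "efs.csi.aws.com/agent-not-ready"
		unregistered = "karpenter.sh/unregistered"
		notReady     = "node.kubernetes.io/not-ready"
	)

	// Snapshot of the node taken before efs-csi removed its taint.
	snapshot := []Taint{
		{ebs, "NoExecute"},
		{efs, "NoExecute"},
		{unregistered, "NoExecute"},
		{notReady, "NoSchedule"},
	}

	// Live state after efs-csi removed its taint.
	live := withoutKey(snapshot, efs)

	// Writing a list derived from the stale snapshot brings the efs taint back.
	fromStale := withoutKey(snapshot, unregistered)
	// Deriving the list from the live state does not.
	fromLive := withoutKey(live, unregistered)

	fmt.Println("stale write restores efs taint:", hasKey(fromStale, efs)) // true
	fmt.Println("live write restores efs taint:", hasKey(fromLive, efs))   // false
}
```

Both writes remove karpenter.sh/unregistered; the only difference is which copy of the taint list they start from.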

Expected Behavior:
Karpenter updates the existing taints on a node to remove karpenter.sh/unregistered:NoExecute without restoring startup taints that other controllers have already removed.

Reproduction Steps (Please include YAML):
This is an unpredictable race condition that is nearly impossible to reproduce on demand.
Might be related to this code: https://github.com/rschalo/karpenter/blob/a652a4aa95dbe92159bb273a3b64ff8837d92660/pkg/controllers/nodeclaim/lifecycle/registration.go#L87

Versions:

  • Chart Version: 1.0.6
  • Kubernetes Version (kubectl version):
    Client Version: v1.31.1
    Kustomize Version: v5.4.2
    Server Version: v1.30.4-eks-a737599
    
@dpiddock dpiddock added the kind/bug Categorizes issue or PR as related to a bug. label Oct 23, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 23, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nuvme-devops

+1
