karpenter Pod fails to Start -The specified queue does not exist. #1799

Open
ssarbadh opened this issue Nov 7, 2024 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@ssarbadh

ssarbadh commented Nov 7, 2024

Description

Observed Behavior:

Pod fails to start - panic: AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist.

Expected Behavior:
Pod runs

Reproduction Steps (Please include YAML):
Follow this documentation.
https://karpenter.sh/docs/getting-started/migrating-from-cas/

The doc says to set an interruption queue:
--set "settings.interruptionQueue=${CLUSTER_NAME}"

But the policy for the service account doesn't mention anything about the queue (no SQS permissions).

** Extra info **
A service account is created -

# Source: karpenter/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: karpenter
  namespace: kube-system
  labels:
    helm.sh/chart: karpenter-1.0.7
    app.kubernetes.io/name: karpenter
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/version: "1.0.7"
    app.kubernetes.io/managed-by: Helm
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::797036517683:role/KarpenterControllerRole-tajawal-dev
    

Deployment refers to queue

            - name: INTERRUPTION_QUEUE
              value: "tajawal-dev"

Policy attached to the Service Account is copied from documentation

cat << EOF > controller-trust-policy.json
{
 "Version": "2012-10-17",
 "Statement": [
     {
         "Effect": "Allow",
         "Principal": {
             "Federated": "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ENDPOINT#*//}"
         },
         "Action": "sts:AssumeRoleWithWebIdentity",
         "Condition": {
             "StringEquals": {
                 "${OIDC_ENDPOINT#*//}:aud": "sts.amazonaws.com",
                 "${OIDC_ENDPOINT#*//}:sub": "system:serviceaccount:${KARPENTER_NAMESPACE}:karpenter"
             }
         }
     }
 ]
}
EOF

aws iam create-role --role-name "KarpenterControllerRole-${CLUSTER_NAME}" \
 --assume-role-policy-document file://controller-trust-policy.json

cat << EOF > controller-policy.json
{
 "Statement": [
     {
         "Action": [
             "ssm:GetParameter",
             "ec2:DescribeImages",
             "ec2:RunInstances",
             "ec2:DescribeSubnets",
             "ec2:DescribeSecurityGroups",
             "ec2:DescribeLaunchTemplates",
             "ec2:DescribeInstances",
             "ec2:DescribeInstanceTypes",
             "ec2:DescribeInstanceTypeOfferings",
             "ec2:DeleteLaunchTemplate",
             "ec2:CreateTags",
             "ec2:CreateLaunchTemplate",
             "ec2:CreateFleet",
             "ec2:DescribeSpotPriceHistory",
             "pricing:GetProducts"
         ],
         "Effect": "Allow",
         "Resource": "*",
         "Sid": "Karpenter"
     },
     {
         "Action": "ec2:TerminateInstances",
         "Condition": {
             "StringLike": {
                 "ec2:ResourceTag/karpenter.sh/nodepool": "*"
             }
         },
         "Effect": "Allow",
         "Resource": "*",
         "Sid": "ConditionalEC2Termination"
     },
     {
         "Effect": "Allow",
         "Action": "iam:PassRole",
         "Resource": "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}",
         "Sid": "PassNodeIAMRole"
     },
     {
         "Effect": "Allow",
         "Action": "eks:DescribeCluster",
         "Resource": "arn:${AWS_PARTITION}:eks:${AWS_REGION}:${AWS_ACCOUNT_ID}:cluster/${CLUSTER_NAME}",
         "Sid": "EKSClusterEndpointLookup"
     },
     {
         "Sid": "AllowScopedInstanceProfileCreationActions",
         "Effect": "Allow",
         "Resource": "*",
         "Action": [
         "iam:CreateInstanceProfile"
         ],
         "Condition": {
         "StringEquals": {
             "aws:RequestTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned",
             "aws:RequestTag/topology.kubernetes.io/region": "${AWS_REGION}"
         },
         "StringLike": {
             "aws:RequestTag/karpenter.k8s.aws/ec2nodeclass": "*"
         }
         }
     },
     {
         "Sid": "AllowScopedInstanceProfileTagActions",
         "Effect": "Allow",
         "Resource": "*",
         "Action": [
         "iam:TagInstanceProfile"
         ],
         "Condition": {
         "StringEquals": {
             "aws:ResourceTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned",
             "aws:ResourceTag/topology.kubernetes.io/region": "${AWS_REGION}",
             "aws:RequestTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned",
             "aws:RequestTag/topology.kubernetes.io/region": "${AWS_REGION}"
         },
         "StringLike": {
             "aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass": "*",
             "aws:RequestTag/karpenter.k8s.aws/ec2nodeclass": "*"
         }
         }
     },
     {
         "Sid": "AllowScopedInstanceProfileActions",
         "Effect": "Allow",
         "Resource": "*",
         "Action": [
         "iam:AddRoleToInstanceProfile",
         "iam:RemoveRoleFromInstanceProfile",
         "iam:DeleteInstanceProfile"
         ],
         "Condition": {
         "StringEquals": {
             "aws:ResourceTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned",
             "aws:ResourceTag/topology.kubernetes.io/region": "${AWS_REGION}"
         },
         "StringLike": {
             "aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass": "*"
         }
         }
     },
     {
         "Sid": "AllowInstanceProfileReadActions",
         "Effect": "Allow",
         "Resource": "*",
         "Action": "iam:GetInstanceProfile"
     }
 ],
 "Version": "2012-10-17"
}
EOF

aws iam put-role-policy --role-name "KarpenterControllerRole-${CLUSTER_NAME}" \
 --policy-name "KarpenterControllerPolicy-${CLUSTER_NAME}" \
 --policy-document file://controller-policy.json

This issue comment mentions some SQS permissions -
aws/karpenter-provider-aws#3185 (comment)

      - Effect: Allow
        Action:
          # Write Operations
          - sqs:DeleteMessage
          # Read Operations
          - sqs:GetQueueAttributes
          - sqs:GetQueueUrl
          - sqs:ReceiveMessage
        Resource: !GetAtt KarpenterInterruptionQueue.Arn
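
For comparison with the controller policy above, the same four actions can be written as a JSON statement in the style of controller-policy.json. This is only a sketch: it assumes the interruption queue is named after the cluster (as the `settings.interruptionQueue=${CLUSTER_NAME}` flag implies) and lives in the same account and region as the cluster. The file name `controller-sqs-policy.json`, the policy name, the `Sid`, and the default variable values are placeholders, not taken from the documentation.

```shell
# Placeholder defaults so the snippet runs standalone; override with real values.
: "${AWS_PARTITION:=aws}"
: "${AWS_REGION:=us-east-1}"
: "${AWS_ACCOUNT_ID:=111122223333}"
: "${CLUSTER_NAME:=my-cluster}"

# Assumption: the interruption queue is named after the cluster.
cat << EOF > controller-sqs-policy.json
{
 "Version": "2012-10-17",
 "Statement": [
     {
         "Sid": "AllowInterruptionQueueActions",
         "Effect": "Allow",
         "Action": [
             "sqs:DeleteMessage",
             "sqs:GetQueueAttributes",
             "sqs:GetQueueUrl",
             "sqs:ReceiveMessage"
         ],
         "Resource": "arn:${AWS_PARTITION}:sqs:${AWS_REGION}:${AWS_ACCOUNT_ID}:${CLUSTER_NAME}"
     }
 ]
}
EOF

# To attach it as a second inline policy on the controller role (not run here):
# aws iam put-role-policy --role-name "KarpenterControllerRole-${CLUSTER_NAME}" \
#  --policy-name "KarpenterInterruptionPolicy-${CLUSTER_NAME}" \
#  --policy-document file://controller-sqs-policy.json
```

Permissions alone do not resolve the panic in this report: the error says the queue itself does not exist, so the queue has to be created as well.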

Versions:

  • Chart Version:
  • Kubernetes Version (kubectl version): 1.29
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@ssarbadh ssarbadh added the kind/bug Categorizes issue or PR as related to a bug. label Nov 7, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 7, 2024
@ssarbadh
Author

ssarbadh commented Nov 7, 2024

I was able to fix it by piecing together information from other repos, issue comments, and the documentation:

The SQS stack needs to be present. Ref: https://github.com/aws/karpenter-provider-aws/blob/main/website/content/en/docs/getting-started/getting-started-with-karpenter/cloudformation.yaml

SQS permissions need to be granted to the Karpenter pod's service account.
Ref: aws/karpenter-provider-aws#3185 (comment)

It would help if this were added to the documentation, or if the --set "settings.interruptionQueue=${CLUSTER_NAME}" \ line were removed from it.
Thanks
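
For readers who are not using the CloudFormation template, the lookup that fails at startup can be reproduced, and the queue created, with the AWS CLI. This is a rough sketch, not the documented procedure: it assumes the queue is named after the cluster, and `run` is a hypothetical helper that only prints the commands unless `DRY_RUN=0` is set. The referenced CloudFormation template also defines a queue policy and EventBridge rules that deliver interruption events to the queue. This snippet does not create those, so on its own it should only be expected to get past the NonExistentQueue panic.

```shell
# Placeholder default; set CLUSTER_NAME to the real cluster name.
: "${CLUSTER_NAME:=my-cluster}"
: "${DRY_RUN:=1}"

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Check whether the queue the controller is pointed at exists.
run aws sqs get-queue-url --queue-name "${CLUSTER_NAME}"

# Create the queue; needed when the lookup above reports NonExistentQueue.
run aws sqs create-queue --queue-name "${CLUSTER_NAME}"
```

Running it with `DRY_RUN=0` executes both commands against the account and region configured for the AWS CLI.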

@prwnd9

prwnd9 commented Nov 11, 2024

I'm using ARM Spot instances and can confirm that the issue exists with the following versions:

  • EKS 1.31
  • terraform eks module version: v20.29.0
  • karpenter helm version: 1.0.7

@jigisha620
Contributor

@ssarbadh What version of karpenter chart is being used here?

@MNLOPS

MNLOPS commented Nov 13, 2024

I temporarily resolved the issue by removing the following lines:

- name: INTERRUPTION_QUEUE
  value: "<cluster-name>"

This prevented errors related to the non-existent interruption queue and allowed the pod to start as expected.
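
The same removal can be done without editing the manifest by hand. The sketch below assumes the Deployment is named `karpenter` (inferred from the chart labels earlier in this thread, not confirmed) and runs in `kube-system`; `remove_interruption_queue_env` is a hypothetical function name.

```shell
# Namespace taken from the ServiceAccount shown in the issue; override if different.
: "${KARPENTER_NAMESPACE:=kube-system}"

# Assumption: the Deployment is named "karpenter", matching the chart labels.
remove_interruption_queue_env() {
  # A trailing "-" tells `kubectl set env` to remove the variable.
  kubectl set env deployment/karpenter \
    --namespace "${KARPENTER_NAMESPACE}" INTERRUPTION_QUEUE-
}

# Call remove_interruption_queue_env once kubectl points at the affected cluster.
```

Two caveats. A later `helm upgrade` that still passes `--set "settings.interruptionQueue=..."` would presumably put the variable back. And without a queue configured, Karpenter's interruption handling (for example reacting to Spot interruption notices) stays disabled, which matters for Spot-heavy clusters.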
