EKS Pod Identity feature does not work #6418

Open
magzim21 opened this issue Dec 12, 2024 · 1 comment
Labels
bug Something isn't working

Comments

magzim21 commented Dec 12, 2024

Report

  1. Documentation.
    https://keda.sh/docs/2.16/concepts/authentication/#aws-pod-identity-webhook-for-aws (provider: aws)
    https://keda.sh/docs/2.16/concepts/authentication/#aws-eks-pod-identity-webhook (deprecated; provider: aws-eks)

Both sections say the provider "allows you to provide the role name using an annotation on a service account associated with your pod."
Providing a role through a service account annotation (IRSA) is the old way. The new way is EKS Pod Identity, which does not involve annotations. Why is the aws-eks provider marked as "deprecated"?

  2. The error itself.
    When using Pod Identity, this TriggerAuthentication:
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
spec:
  podIdentity:
    provider: aws-eks # <==
    identityOwner: workload

fails with "error parsing SQS queue metadata: awsAccessKeyID not found", while this one:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
spec:
  podIdentity:
    provider: aws # <==
    identityOwner: workload

fails with "error getting service account: 'human-risk-scores-scheduled-sqs-worker', error: annotation 'eks.amazonaws.com/role-arn' not found", which is expected, since Pod Identity does not use that annotation.
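
For context, the annotation named in that second error is the IRSA-style mapping that the quoted documentation describes. A minimal sketch of what the aws provider appears to look for on the workload's service account; the account ID and role name are hypothetical placeholders, not values from this cluster:

```yaml
# IRSA-style service account: the IAM role is named in an annotation.
# The role ARN below is a hypothetical placeholder.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: some
  namespace: human-risk
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-sqs-reader
```

With EKS Pod Identity the role is bound through an association on the cluster instead, so this annotation is absent by design.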

Expected Behavior

It should use the service account referenced in the ScaledJob template (identityOwner: workload).
That service account has a Pod Identity association attached, so the workload has the necessary SQS permissions.

Actual Behavior

Fails with "error parsing SQS queue metadata: awsAccessKeyID not found".
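
The parameter name in the error matches the one the aws-sqs-queue scaler reads for static credentials. That suggests (an assumption, not confirmed against the KEDA source) that the podIdentity setting was not honored and the scaler fell back to key-based authentication. For comparison, a sketch of the key-based TriggerAuthentication described in the scaler documentation, with a hypothetical Secret name and keys:

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: some-scheduled-sqs-worker-static   # hypothetical name
  namespace: human-risk
spec:
  secretTargetRef:
    - parameter: awsAccessKeyID            # the parameter named in the error
      name: example-aws-credentials        # hypothetical Secret
      key: AWS_ACCESS_KEY_ID
    - parameter: awsSecretAccessKey
      name: example-aws-credentials        # hypothetical Secret
      key: AWS_SECRET_ACCESS_KEY
```

This is not the intended setup here; it only shows where awsAccessKeyID would normally come from.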

Steps to Reproduce the Problem

Apply the following manifests:

---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: some-scheduled-sqs-worker
  namespace: "human-risk"
spec:
  podIdentity:
    provider: aws-eks
    identityOwner: workload
---
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: some-scheduled-sqs-worker
spec:
  jobTargetRef:
    parallelism: 1                            # [max number of desired pods](https://kubernetes.io/docs/concepts/workloads/controllers/job/#controlling-parallelism)
    completions: 1                            # [desired number of successfully finished pods](https://kubernetes.io/docs/concepts/workloads/controllers/job/#controlling-parallelism)
    activeDeadlineSeconds: 600                #  Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it; value must be positive integer
    backoffLimit: 6                           # Specifies the number of retries before marking this job failed. Defaults to 6
    template:
      # describes the [job template](https://kubernetes.io/docs/concepts/workloads/controllers/job)

      metadata:
      spec:
        containers:
          - name: scheduled-sqs-worker
            image: some
            command:
              - ./run_scheduled_worker.sh
            ports:
              - name: http
                containerPort: 8080
                protocol: TCP
            envFrom:
              - configMapRef:
                  name: some
            env:
              - name: POD_NAME
                valueFrom:
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.name
              - name: POD_NAMESPACE
                valueFrom:
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
              - name: POD_NODENAME
                valueFrom:
                  fieldRef:
                    apiVersion: v1
                    fieldPath: spec.nodeName
              - name: POD_SERVICEACCOUNTNAME
                valueFrom:
                  fieldRef:
                    apiVersion: v1
                    fieldPath: spec.serviceAccountName
              - name: CONTAINER_CPU_REQUEST
                valueFrom:
                  resourceFieldRef:
                    containerName: scheduled-sqs-worker
                    resource: requests.cpu
                    divisor: '0'
              - name: CONTAINER_CPU_LIMIT
                valueFrom:
                  resourceFieldRef:
                    containerName: scheduled-sqs-worker
                    resource: limits.cpu
                    divisor: '0'
              - name: CONTAINER_MEM_REQUEST
                valueFrom:
                  resourceFieldRef:
                    containerName: scheduled-sqs-worker
                    resource: requests.memory
                    divisor: '0'
              - name: CONTAINER_MEM_LIMIT
                valueFrom:
                  resourceFieldRef:
                    containerName: scheduled-sqs-worker
                    resource: limits.memory
                    divisor: '0'
              - name: JSM_API_KEY
                valueFrom:
                  secretKeyRef:
                    name: some
                    key: JSM_API_KEY
                    optional: false
              - name: KAFKA_BROKERS
                valueFrom:
                  secretKeyRef:
                    name: some
                    key: KAFKA_BROKERS
                    optional: false
            resources: {}
            livenessProbe:
              exec:
                command:
                  - cat
                  - /code/ready
              initialDelaySeconds: 60
              timeoutSeconds: 10
              periodSeconds: 30
              successThreshold: 1
              failureThreshold: 5
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            imagePullPolicy: IfNotPresent
            securityContext:
              capabilities:
                add:
                  - NET_BIND_SERVICE
                drop:
                  - ALL
              readOnlyRootFilesystem: false
              allowPrivilegeEscalation: false
        restartPolicy: Always
        terminationGracePeriodSeconds: 30
        dnsPolicy: ClusterFirst
        serviceAccountName: some
        serviceAccount: some
        securityContext:
          runAsUser: 1000
          runAsNonRoot: true
          fsGroup: 2000
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      app.kubernetes.io/instance: some-scheduled-sqs-worker
                      app.kubernetes.io/name: scheduled-sqs-worker
                  topologyKey: topology.kubernetes.io/zone
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      app.kubernetes.io/instance: some-scheduled-sqs-worker
                      app.kubernetes.io/name: scheduled-sqs-worker
                  topologyKey: kubernetes.io/hostname
  pollingInterval: 30                         # Optional. Default: 30 seconds
  successfulJobsHistoryLimit: 10              # Optional. Default: 100. How many completed jobs should be kept.
  failedJobsHistoryLimit: 10                  # Optional. Default: 100. How many failed jobs should be kept.
  # envSourceContainerName: {container-name}    # Optional. Default: .spec.JobTargetRef.template.spec.containers[0]
  minReplicaCount: 0                           # Optional. Default: 0
  maxReplicaCount: 100                        # Optional. Default: 100
  # rolloutStrategy: gradual                    # Deprecated: Use rollout.strategy instead (see below).
  rollout:
    strategy: gradual                         # Optional. Default: default. Which Rollout Strategy KEDA will use.
    propagationPolicy: background             # Optional. Default: background. Kubernetes propagation policy for cleaning up existing jobs during rollout.
  # scalingStrategy:
  #   strategy: "custom"                        # Optional. Default: default. Which Scaling Strategy to use. 
  #   customScalingQueueLengthDeduction: 1      # Optional. A parameter to optimize custom ScalingStrategy.
  #   customScalingRunningJobPercentage: "0.5"  # Optional. A parameter to optimize custom ScalingStrategy.
  #   pendingPodConditions:                     # Optional. A parameter to calculate pending job count per the specified pod conditions
  #     - "Ready"
  #     - "PodScheduled"
  #     - "AnyOtherCustomPodCondition"
    # multipleScalersCalculation : "max" # Optional. Default: max. Specifies how to calculate the target metrics when multiple scalers are defined.
  triggers:
  # https://keda.sh/docs/2.16/scalers/aws-sqs/
  - type: aws-sqs-queue
    authenticationRef:
      name: some-scheduled-sqs-worker
    metadata:
      queueLength: "1" # TODO: Change this to the actual queue length
      queueURLFromEnv: SCHEDULED_JOBS_SQS_URL
      awsRegion: eu-west-2
  1. Attach an EKS Pod Identity association to the service account named some.
  2. Observe "error parsing SQS queue metadata: awsAccessKeyID not found"
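
For step 1, one way to express the Pod Identity association is a CloudFormation resource. This is a sketch only: the logical ID, cluster name, and role ARN are placeholders, and the association in the affected cluster may have been created by other means (console, CLI, Terraform). It also assumes the EKS Pod Identity Agent add-on is running on the nodes.

```yaml
Resources:
  WorkerPodIdentityAssociation:          # hypothetical logical ID
    Type: AWS::EKS::PodIdentityAssociation
    Properties:
      ClusterName: example-cluster       # placeholder
      Namespace: human-risk
      ServiceAccount: some
      RoleArn: arn:aws:iam::111122223333:role/example-sqs-reader  # placeholder
```

Nothing is added to the ServiceAccount object itself, which is consistent with the eks.amazonaws.com/role-arn annotation lookup failing for provider: aws.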

Logs from KEDA operator

2024-12-12T06:20:06Z	ERROR	Reconciler error	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"human-risk-scores-scheduled-sqs-worker","namespace":"human-risk"}, "namespace": "human-risk", "name": "human-risk-scores-scheduled-sqs-worker", "reconcileID": "3e0173b5-eb2c-4032-91c1-7481abbc808c", "error": "error parsing SQS queue metadata: awsAccessKeyID not found"}

KEDA Version

2.16.0

Kubernetes Version

1.29

Platform

Amazon Web Services

Scaler Details

SQS

Anything else?

I am new to KEDA but an experienced Kubernetes user. Maybe I misunderstood something in the documentation, but the part I mentioned is really misleading.

magzim21 added the bug label Dec 12, 2024

magzim21 commented Dec 12, 2024

I also tried provider: aws-kiam, which is not mentioned in the latest docs. It fails with the same error: "error parsing SQS queue metadata: awsAccessKeyID not found".
