node tuning: failed to list *v1.Job: Unauthorized #2287

Open
adnankobir opened this issue Dec 19, 2024 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@adnankobir

What happened?

It appears that the service account tokens used by the cluster-node-setup daemonset are not refreshed after a certain period of time (106d in my case). I can see logs like the following:

I1219 16:21:32.848791       1 cache/reflector.go:325] Listing and watching *v1.Pod from k8s.io/client-go@<version>/tools/cache/reflector.go:229
W1219 16:21:32.907213       1 cache/reflector.go:539] k8s.io/client-go@<version>/tools/cache/reflector.go:229: failed to list *v1.Pod: Unauthorized
E1219 16:21:32.907247       1 cache/reflector.go:147] k8s.io/client-go@<version>/tools/cache/reflector.go:229: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized
I1219 16:21:37.125898       1 cache/reflector.go:325] Listing and watching *v1.DaemonSet from k8s.io/client-go@<version>/tools/cache/reflector.go:229
W1219 16:21:37.134328       1 cache/reflector.go:539] k8s.io/client-go@<version>/tools/cache/reflector.go:229: failed to list *v1.DaemonSet: Unauthorized
E1219 16:21:37.134376       1 cache/reflector.go:147] k8s.io/client-go@<version>/tools/cache/reflector.go:229: Failed to watch *v1.DaemonSet: failed to list *v1.DaemonSet: Unauthorized
I1219 16:21:37.198647       1 cache/reflector.go:325] Listing and watching *v1.Job from k8s.io/client-go@<version>/tools/cache/reflector.go:229
W1219 16:21:37.207259       1 cache/reflector.go:539] k8s.io/client-go@<version>/tools/cache/reflector.go:229: failed to list *v1.Job: Unauthorized
E1219 16:21:37.207289       1 cache/reflector.go:147] k8s.io/client-go@<version>/tools/cache/reflector.go:229: Failed to watch *v1.Job: failed to list *v1.Job: Unauthorized

I verified that the RBAC is set up correctly. A simple restart of the daemonset resolves the issue.
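One way to check whether an expired token is behind the Unauthorized responses is to read the `exp` claim of the token mounted in an affected pod. The following is a minimal diagnostic sketch, not code from Scylla Operator. It assumes the token is a standard three-part JWT, as projected service account tokens normally are, and it does not verify the signature. `tokenExpiry` is a hypothetical helper name.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// tokenExpiry returns the "exp" claim of a JWT without verifying its
// signature. Diagnostic helper only (hypothetical name, not an operator
// or client-go API).
func tokenExpiry(jwt string) (time.Time, error) {
	parts := strings.Split(strings.TrimSpace(jwt), ".")
	if len(parts) != 3 {
		return time.Time{}, errors.New("token is not a three-part JWT")
	}
	// JWT segments are base64url-encoded without padding.
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return time.Time{}, fmt.Errorf("decoding payload: %w", err)
	}
	var claims struct {
		Exp int64 `json:"exp"` // seconds since the Unix epoch
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return time.Time{}, fmt.Errorf("parsing claims: %w", err)
	}
	if claims.Exp == 0 {
		return time.Time{}, errors.New("token has no exp claim")
	}
	return time.Unix(claims.Exp, 0), nil
}
```

Passing it the contents of `/var/run/secrets/kubernetes.io/serviceaccount/token` from an affected pod would show whether the file on disk is still valid. If it is valid while the process keeps getting Unauthorized, the process is probably holding on to a token it read at startup.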

This is problematic because Scylla nodes fail to start up while the associated nodeconfig configmap stays blocked:

❯ k get cm -n scylla-aud-events  nodeconfig-podinfo-5e0c3810-ec68-46fb-9ed2-6ef7a8c5daa4 -o yaml
apiVersion: v1
data:
  ScyllaRuntimeConfig: '{"containerID":"containerd://645a7b7cfcb354927eff327cce12296bd32f2720412b8365e3ca69ab8e17fec7","matchingNodeConfigs":["cluster"],"blockingNodeConfigs":["cluster"]}'
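For anyone checking many pods at once, the `ScyllaRuntimeConfig` value can be parsed to see which NodeConfigs are still listed as blocking. This is a small sketch that mirrors only the three fields visible in the output above. The struct and function names are hypothetical, and the operator's real type may carry more fields.

```go
package main

import "encoding/json"

// runtimeConfig mirrors only the fields visible in the ConfigMap shown in
// this issue. Hypothetical type, not the operator's own definition.
type runtimeConfig struct {
	ContainerID         string   `json:"containerID"`
	MatchingNodeConfigs []string `json:"matchingNodeConfigs"`
	BlockingNodeConfigs []string `json:"blockingNodeConfigs"`
}

// blockingNodeConfigs parses a ScyllaRuntimeConfig JSON string and returns
// the NodeConfig names listed under blockingNodeConfigs.
func blockingNodeConfigs(raw string) ([]string, error) {
	var rc runtimeConfig
	if err := json.Unmarshal([]byte(raw), &rc); err != nil {
		return nil, err
	}
	return rc.BlockingNodeConfigs, nil
}
```

With the value shown above, the function returns a single entry, `cluster`, which matches the NodeConfig named in the reproduction steps.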

What did you expect to happen?

The pods of the cluster-node-setup daemonset should refresh their tokens correctly so that they can keep querying the Kubernetes API.

How can we reproduce it (as minimally and precisely as possible)?

Deploy a NodeConfig CR, for example:

apiVersion: scylla.scylladb.com/v1alpha1
kind: NodeConfig
metadata:
  name: cluster
spec:
  placement:
    nodeSelector:
      scylla.scylladb.com/node-type: scylla
    tolerations:
    - effect: NoSchedule
      key: role
      operator: Equal
      value: scylla

Then leave it running for 100d+.

Scylla Operator version

scylla-operator:1.13

Kubernetes platform name and version

kubectl version
Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.29.10-eks-7f9249a

Kubernetes platform info:
EKS

Please attach the must-gather archive.

NA

Anything else we need to know?

No response

@adnankobir adnankobir added the kind/bug Categorizes issue or PR as related to a bug. label Dec 19, 2024
@scylla-operator-bot scylla-operator-bot bot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Dec 19, 2024
@tnozicka
Contributor

Please attach the must-gather archive.

NA

The must-gather archive is a **mandatory** part of every bug report.
      See https://operator.docs.scylladb.com/stable/support/must-gather.html to learn how you can collect it.
      Do not edit the collected must-gather.

https://github.com/scylladb/scylla-operator/blob/b6e2ed7/.github/ISSUE_TEMPLATE/bug-report.yaml?plain=1#L57-L59

@tnozicka
Contributor

That said, we should likely double-check how we wire the token in case it gets rotated.
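As an illustration of what rotation-safe wiring can look like, here is a minimal sketch using only the Go standard library. It is not how Scylla Operator wires its clients, and all names in it (`tokenSource`, `fileTokenSource`, `authorizationHeader`, `rotatingTransport`) are hypothetical. The idea is to read the token file on every request instead of once at startup. For context, client-go reloads the token on its own when `rest.Config.BearerTokenFile` is set, as `rest.InClusterConfig` does; whether the node setup pods build their config that way is not established in this thread.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
)

// tokenSource returns the bearer token to use right now.
type tokenSource func() (string, error)

// fileTokenSource re-reads the token from path on every call, so a token
// rotated on disk is picked up without restarting the process.
func fileTokenSource(path string) tokenSource {
	return func() (string, error) {
		raw, err := os.ReadFile(path)
		if err != nil {
			return "", fmt.Errorf("reading token file %q: %w", path, err)
		}
		return strings.TrimSpace(string(raw)), nil
	}
}

// authorizationHeader builds the header value from the current token.
func authorizationHeader(src tokenSource) (string, error) {
	tok, err := src()
	if err != nil {
		return "", err
	}
	return "Bearer " + tok, nil
}

// rotatingTransport sets a freshly read Authorization header on each request.
type rotatingTransport struct {
	source tokenSource
	base   http.RoundTripper
}

func (t *rotatingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	header, err := authorizationHeader(t.source)
	if err != nil {
		return nil, err
	}
	// A RoundTripper must not modify the caller's request, so clone it first.
	clone := req.Clone(req.Context())
	clone.Header.Set("Authorization", header)
	return t.base.RoundTrip(clone)
}
```

A client would use it as `&http.Client{Transport: &rotatingTransport{source: fileTokenSource(path), base: http.DefaultTransport}}`. Reading the file per request keeps the sketch simple; a production version would more likely cache the token for a short interval.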

@tnozicka tnozicka added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Dec 20, 2024
@scylla-operator-bot scylla-operator-bot bot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Dec 20, 2024
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2025
@zimnx zimnx added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 21, 2025