
kong-controller stops fetching EndpointSlices and updating kong-gateways #6567

Open
lindeskar opened this issue Oct 25, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@lindeskar

lindeskar commented Oct 25, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

A few times per day we see the kong-controller enter a state where it stops fetching EndpointSlices and therefore stops updating the kong-gateways with new configuration. The bad state lasts for about 30 minutes before an unknown trigger makes everything go back to normal.

This affects traffic going through the kong-gateways if upstream changes happen during the bad kong-controller state, because the kong-gateways are never made aware of them.
The cluster where the issue occurs makes heavy use of spot Nodes, which leads to frequent changes in the Pods backing Services.

The issue also affects Kong itself if a kong-gateway Pod is replaced during the bad state. Logs show that the kong-controller is not aware of the new kong-gateway and still tries to reach the old kong-gateway Pod.
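
As far as I understand, the controller discovers the gateway Admin APIs by watching the EndpointSlices of the admin Service, so a quick way to see what it should be picking up is to compare those slices with the running gateway Pods. A rough sketch (the admin Service name, namespace, and Pod label are assumptions based on our release):

# EndpointSlices the controller should be consuming for gateway discovery
# ("kong-gateway-admin" is an assumed admin Service name; adjust to your release)
kubectl get endpointslices -n kong -l kubernetes.io/service-name=kong-gateway-admin

# Compare the endpoint IPs above with the kong-gateway Pods actually running
kubectl get pods -n kong -l app.kubernetes.io/name=gateway -o wide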

--

During the issue, two errors are constantly logged:

  • newly added kong-gateway Pods report: not ready for proxying: no configuration available (empty configuration present)
  • the kong-controller logs: Failed to fill in defaults for plugin, with a URL referencing a previously running kong-gateway Pod, not the newly added one

I think these errors are a symptom of a greater issue where something in the kong-controller gets stuck.
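
To double-check the first error from the gateway side, the Admin API status of a newly added kong-gateway Pod can be inspected directly. A minimal sketch, assuming the chart's default TLS admin port 8444 (the Pod name is a placeholder):

# Forward the admin port of the new gateway Pod and query its status
kubectl port-forward -n kong pod/<new-kong-gateway-pod> 8444:8444 &
curl -sk https://localhost:8444/status
# In DB-less mode the response includes "configuration_hash"; an all-zero
# placeholder hash would match the "no configuration available" error above.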

Debug logs show that Fetching EndpointSlices and Sending configuration to gateway clients messages stop entirely during the bad state (see the attached screenshots).
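
A rough way to see the same gap straight from the controller log, without screenshots (the Deployment and container names are assumptions based on our release):

# Dump the controller debug log
kubectl logs -n kong deploy/kong-controller -c ingress-controller --since=6h > kic-debug.log

# Print per-second counts for both message types; a ~30 minute hole marks the bad state
grep -E 'Fetching EndpointSlices|Sending configuration to gateway clients' kic-debug.log | cut -c1-20 | sort | uniq -c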

Expected Behavior

The kong-controller keeps fetching EndpointSlices and updates the kong-gateways.

Steps To Reproduce

Note: We have not been able to reproduce the issue in other Kubernetes clusters.

Values for the ingress chart:

controller:
  serviceMonitor:
    enabled: false # see https://github.com/Kong/charts/issues/1053 for more info
  podAnnotations: {} # disable kuma and other sidecar injection
  resources:
    requests:
      cpu: 50m
      memory: 128Mi
  extraObjects:
    - apiVersion: monitoring.coreos.com/v1
      kind: PodMonitor
      metadata:
        labels:
          app.kubernetes.io/component: app
          app.kubernetes.io/instance: kong
          app.kubernetes.io/name: controller
        name: kong-controller
        namespace: kong
      spec:
        podMetricsEndpoints:
          - path: /metrics
            targetPort: cmetrics
        selector:
          matchLabels:
            app.kubernetes.io/component: app
            app.kubernetes.io/instance: kong
            app.kubernetes.io/name: controller
  ingressController:
    customEnv:
      CONTROLLER_LOG_LEVEL: debug

gateway:
  serviceMonitor:
    enabled: true
  replicaCount: 3
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
  resources:
    requests:
      cpu: 10m
      memory: 240Mi
  deployment:
    prefixDir:
      sizeLimit: 2Gi
  proxy:
    externalTrafficPolicy: Local

  env:
    # HSTS, we use same values as in the default in ingress-nginx
    nginx_http_add_header: 'Strict-Transport-Security "max-age=15724800; includeSubDomains" always'

    # The client body memory buffer size: https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/nginx-configuration/annotations.md#client-body-buffer-size
    nginx_http_client_body_buffer_size: 50m

    # Don't pass on 'Server' header to downstream
    nginx_http_more_clear_headers: Server

    # Disable Kong headers to downstream
    headers: "off"

    # Enable Gzip compression, if requested by the client. Gzip types are influenced by the defaults in ingress-nginx (minus xml types)
    nginx_http_gzip: "on"
    nginx_http_gzip_types: "application/javascript application/x-javascript application/json application/vnd.ms-fontobject application/x-font-ttf application/x-web-app-manifest+json font/opentype text/css text/javascript text/plain text/html application/octet-stream"
    nginx_http_gzip_min_length: "500"
    nginx_http_gzip_comp_level: "6"
    nginx_http_gzip_http_version: "1.1"
    nginx_http_gzip_proxied: "any"
    nginx_http_gzip_vary: "on"
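
For completeness, this is roughly how the values are applied (release name, namespace, and values file name are assumptions):

# Kong's public chart repository and the ingress umbrella chart
helm repo add kong https://charts.konghq.com && helm repo update
helm upgrade --install kong kong/ingress -n kong --create-namespace -f values.yaml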

Kong Ingress Controller version

kong/kubernetes-ingress-controller:3.3 from the Helm chart (the digest matches 3.3.1)

Kubernetes version

v1.29.9-gke.1177000

Anything else?

Debug log filtered for kong-gateway Pod IPs:
kong-controller-debug-2.txt

  • 172.19.7.208 kong-gateway running
  • 172.19.0.152 kong-gateway running
  • 172.19.2.48 kong-gateway stopped 14:56
  • 172.19.1.164 kong-gateway started 14:56 and stopped 17:10
  • 172.19.0.154 kong-gateway started 17:10
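
The attached file was produced by filtering the full debug log on those Pod IPs, roughly like this (the source log file name is a placeholder):

grep -E '172\.19\.(7\.208|0\.152|2\.48|1\.164|0\.154)' kong-controller-full-debug.log > kong-controller-debug-2.txt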
@lindeskar lindeskar added the bug Something isn't working label Oct 25, 2024
@MarkusFlorian79

We face a similar issue on KIC 3.3. Kong somehow stops updating the Upstreams. Neither restarting the control plane nor the data plane helps.
Only downgrading seems to resolve the issue.

@lindeskar
Author

Hi @MarkusFlorian79, what version did you downgrade to?

@MarkusFlorian79

@lindeskar
That was KIC 3.2.2 and Kong 3.7.1.
The combination that is not working is KIC 3.3 and Kong 3.8.0.

@jjchambl

@MarkusFlorian79 any reason for KIC 3.2.2? Did the later patch versions have problems as well?

@azdobylak

I also spotted the issue with KIC 3.2.0 and Kong 3.7.0, and it persists after upgrading to 3.2.2. The problem comes up when a new gateway Pod is spawned: the controller keeps using the old IP. After recreating the controller Pod, all gateways receive a valid config immediately.
According to the logs, the controller is aware of the new address but keeps using the old one:

2024-11-14T14:56:50Z	debug	controllers.KongAdminAPIService	Notifying about newly detected Admin APIs	{"v": 1, "admin_apis": ["https://10.42.0.192:8444", "https://10.42.1.253:8444"]}
2024-11-14T14:56:50Z	debug	Received notification about Admin API addresses change	{"v": 1}
2024-11-14T14:56:50Z	debug	setup.readiness-checker	Checking readiness of pending client for "https://10.42.1.253:8444"	{"v": 1, "ok": true}
2024-11-14T14:56:50Z	debug	setup.readiness-checker	Checking readiness of already created client for "https://10.42.0.192:8444"	{"v": 1, "ok": true}
2024-11-14T14:56:50Z	debug	Notifying subscribers about gateway clients change	{"v": 1}
2024-11-14T14:56:52Z	error	Failed to fill in defaults for plugin	{"url": "https://10.42.2.245:8444", "plugin_name": "response-transformer", "error": "error retrieveing schema for plugin response-transformer: making HTTP request: Get \"https://10.42.2.245:8444/schemas/plugins/response-transformer\": dial tcp 10.42.2.245:8444: connect: no route to host"}

Then the Failed to fill in defaults error keeps recurring until the controller is restarted.
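
One way to confirm from inside the cluster that the address the controller keeps dialing is genuinely stale (the IPs and port are taken from the log above; the throwaway debug Pod and curl image are assumptions):

# Old Admin API address from the error: should fail with "no route to host"
kubectl run -n kong curl-debug --rm -it --restart=Never --image=curlimages/curl -- curl -sk -m 3 https://10.42.2.245:8444/status

# Current Admin API address from the discovery notification: should return status JSON
kubectl run -n kong curl-debug --rm -it --restart=Never --image=curlimages/curl -- curl -sk -m 3 https://10.42.1.253:8444/status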

@MarkusFlorian79

MarkusFlorian79 commented Nov 22, 2024

@jjchambl
any reason for KIC 3.2.2? Did the later patch versions have problems as well?
We just missed v3.2.4. But I did some tests now with v3.2.4, and it does not show the same problems as v3.3.

@joran-fonjallaz

Same issue for us, described in further detail here: Kong/gateway-operator#140 (comment)

@jjchambl

The problem comes up when new gateway pod is spawned

@azdobylak are you running 2 KIC replicas or just one? We had seen this behavior in Kong/KIC v2 with Gateway Discovery turned on and running KIC with 2 replicas. Haven't experienced this in v3 yet.
