
kong-controller stops fetching EndpointSlices and updating kong-gateways #6567

Open
lindeskar opened this issue Oct 25, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@lindeskar

lindeskar commented Oct 25, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

A few times per day we see the kong-controller enter a state where it stops fetching EndpointSlices and therefore stops updating the kong-gateways with new configuration. The bad state lasts for about 30 minutes before an unknown trigger makes everything go back to normal.

This affects traffic going through the kong-gateways if upstream changes happen during the bad kong-controller state, because the kong-gateways are never made aware of them.
The cluster where the issue occurs makes heavy use of spot Nodes, which leads to frequent changes in the Pods backing Services.

The issue also affects Kong itself if a kong-gateway Pod is replaced during the bad state. Logs show that the kong-controller is not aware of the new kong-gateway and still tries to reach the old kong-gateway Pod.
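
As far as I understand, the controller discovers the gateway Admin APIs by watching the EndpointSlices of the admin Service, so a quick way to see what it should be picking up is to compare those slices with the running gateway Pods. A rough sketch (the admin Service name, namespace, and Pod label are assumptions based on our release):

# EndpointSlices the controller should be consuming for gateway discovery
# ("kong-gateway-admin" is an assumed admin Service name; adjust to your release)
kubectl get endpointslices -n kong -l kubernetes.io/service-name=kong-gateway-admin

# Compare the endpoint IPs above with the kong-gateway Pods actually running
kubectl get pods -n kong -l app.kubernetes.io/name=gateway -o wide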

--

During the issue, two errors are constantly logged:

  • newly added kong-gateway Pods report: not ready for proxying: no configuration available (empty configuration present)
  • the kong-controller logs: Failed to fill in defaults for plugin, with a URL referencing a previously running kong-gateway Pod, not the newly added one

I think these errors are a symptom of a greater issue where something in the kong-controller gets stuck.
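
To double-check the first error from the gateway side, the Admin API status of a newly added kong-gateway Pod can be inspected directly. A minimal sketch, assuming the chart's default TLS admin port 8444 (the Pod name is a placeholder):

# Forward the admin port of the new gateway Pod and query its status
kubectl port-forward -n kong pod/<new-kong-gateway-pod> 8444:8444 &
curl -sk https://localhost:8444/status
# In DB-less mode the response includes "configuration_hash"; an all-zero
# placeholder hash would match the "no configuration available" error above.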

Debug logs show that Fetching EndpointSlices and Sending configuration to gateway clients messages stop entirely during the bad state (see the attached screenshots).
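
A rough way to see the same gap straight from the controller log, without screenshots (the Deployment and container names are assumptions based on our release):

# Dump the controller debug log
kubectl logs -n kong deploy/kong-controller -c ingress-controller --since=6h > kic-debug.log

# Print per-second counts for both message types; a ~30 minute hole marks the bad state
grep -E 'Fetching EndpointSlices|Sending configuration to gateway clients' kic-debug.log | cut -c1-20 | sort | uniq -c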

Expected Behavior

The kong-controller keeps fetching EndpointSlices and updates the kong-gateways.

Steps To Reproduce

Note: We have not been able to reproduce the issue in other Kubernetes clusters.

Values for the ingress chart:

controller:
  serviceMonitor:
    enabled: false # see https://github.com/Kong/charts/issues/1053 for more info
  podAnnotations: {} # disable kuma and other sidecar injection
  resources:
    requests:
      cpu: 50m
      memory: 128Mi
  extraObjects:
    - apiVersion: monitoring.coreos.com/v1
      kind: PodMonitor
      metadata:
        labels:
          app.kubernetes.io/component: app
          app.kubernetes.io/instance: kong
          app.kubernetes.io/name: controller
        name: kong-controller
        namespace: kong
      spec:
        podMetricsEndpoints:
          - path: /metrics
            targetPort: cmetrics
        selector:
          matchLabels:
            app.kubernetes.io/component: app
            app.kubernetes.io/instance: kong
            app.kubernetes.io/name: controller
  ingressController:
    customEnv:
      CONTROLLER_LOG_LEVEL: debug

gateway:
  serviceMonitor:
    enabled: true
  replicaCount: 3
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
  resources:
    requests:
      cpu: 10m
      memory: 240Mi
  deployment:
    prefixDir:
      sizeLimit: 2Gi
  proxy:
    externalTrafficPolicy: Local

  env:
    # HSTS, we use same values as in the default in ingress-nginx
    nginx_http_add_header: 'Strict-Transport-Security "max-age=15724800; includeSubDomains" always'

    # The client body memory buffer size: https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/nginx-configuration/annotations.md#client-body-buffer-size
    nginx_http_client_body_buffer_size: 50m

    # Don't pass on 'Server' header to downstream
    nginx_http_more_clear_headers: Server

    # Disable Kong headers to downstream
    headers: "off"

    # Enable Gzip compression, if requested by the client. Gzip types are influenced by the defaults in ingress-nginx (minus xml types)
    nginx_http_gzip: "on"
    nginx_http_gzip_types: "application/javascript application/x-javascript application/json application/vnd.ms-fontobject application/x-font-ttf application/x-web-app-manifest+json font/opentype text/css text/javascript text/plain text/html application/octet-stream"
    nginx_http_gzip_min_length: "500"
    nginx_http_gzip_comp_level: "6"
    nginx_http_gzip_http_version: "1.1"
    nginx_http_gzip_proxied: "any"
    nginx_http_gzip_vary: "on"
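
For completeness, this is roughly how the values are applied (release name, namespace, and values file name are assumptions):

# Kong's public chart repository and the ingress umbrella chart
helm repo add kong https://charts.konghq.com && helm repo update
helm upgrade --install kong kong/ingress -n kong --create-namespace -f values.yaml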

Kong Ingress Controller version

kong/kubernetes-ingress-controller:3.3 from the Helm chart (the digest matches 3.3.1)

Kubernetes version

v1.29.9-gke.1177000

Anything else?

Debug log filtered for kong-gateway Pod IPs:
kong-controller-debug-2.txt

  • 172.19.7.208 kong-gateway running
  • 172.19.0.152 kong-gateway running
  • 172.19.2.48 kong-gateway stopped 14:56
  • 172.19.1.164 kong-gateway started 14:56 and stopped 17:10
  • 172.19.0.154 kong-gateway started 17:10
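
The attached file was produced by filtering the full debug log on those Pod IPs, roughly like this (the source log file name is a placeholder):

grep -E '172\.19\.(7\.208|0\.152|2\.48|1\.164|0\.154)' kong-controller-full-debug.log > kong-controller-debug-2.txt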
@lindeskar lindeskar added the bug Something isn't working label Oct 25, 2024
@MarkusFlorian79

We face a similar issue on KIC 3.3. Kong somehow stops updating the Upstreams. Neither restarting the control plane nor the data plane helps.
Only downgrading seems to resolve the issue.

@lindeskar
Author

Hi @MarkusFlorian79, what version did you downgrade to?

@MarkusFlorian79

@lindeskar
That was KIC 3.2.2 and Kong 3.7.1.
The combination that is not working is KIC 3.3 and Kong 3.8.0.

@jjchambl

@MarkusFlorian79 any reason for KIC 3.2.2? Did the later patch versions have problems as well?

@azdobylak

I also spotted the issue with KIC 3.2.0 and Kong 3.7.0, and it persists after upgrading to 3.2.2. The problem comes up when a new gateway Pod is spawned: the controller keeps using the old IP. After recreating the controller Pod, all gateways receive a valid config immediately.
According to the logs, the controller is aware of the new address but keeps using the old one:

2024-11-14T14:56:50Z	debug	controllers.KongAdminAPIService	Notifying about newly detected Admin APIs	{"v": 1, "admin_apis": ["https://10.42.0.192:8444", "https://10.42.1.253:8444"]}
2024-11-14T14:56:50Z	debug	Received notification about Admin API addresses change	{"v": 1}
2024-11-14T14:56:50Z	debug	setup.readiness-checker	Checking readiness of pending client for "https://10.42.1.253:8444"	{"v": 1, "ok": true}
2024-11-14T14:56:50Z	debug	setup.readiness-checker	Checking readiness of already created client for "https://10.42.0.192:8444"	{"v": 1, "ok": true}
2024-11-14T14:56:50Z	debug	Notifying subscribers about gateway clients change	{"v": 1}
2024-11-14T14:56:52Z	error	Failed to fill in defaults for plugin	{"url": "https://10.42.2.245:8444", "plugin_name": "response-transformer", "error": "error retrieveing schema for plugin response-transformer: making HTTP request: Get \"https://10.42.2.245:8444/schemas/plugins/response-transformer\": dial tcp 10.42.2.245:8444: connect: no route to host"}

Then the Failed to fill in defaults error keeps recurring until the controller is restarted.
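
One way to confirm from inside the cluster that the address the controller keeps dialing is genuinely stale (the IPs and port are taken from the log above; the throwaway debug Pod and curl image are assumptions):

# Old Admin API address from the error: should fail with "no route to host"
kubectl run -n kong curl-debug --rm -it --restart=Never --image=curlimages/curl -- curl -sk -m 3 https://10.42.2.245:8444/status

# Current Admin API address from the discovery notification: should return status JSON
kubectl run -n kong curl-debug --rm -it --restart=Never --image=curlimages/curl -- curl -sk -m 3 https://10.42.1.253:8444/status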

@MarkusFlorian79

MarkusFlorian79 commented Nov 22, 2024

@jjchambl
any reason for KIC 3.2.2? Did the later patch versions have problems as well?
We just missed v3.2.4. But I did some tests now with v3.2.4, and it does not show the same problems as v3.3.

@joran-fonjallaz

Same issue for us, described in further detail here: Kong/gateway-operator#140 (comment)

@jjchambl

The problem comes up when new gateway pod is spawned

@azdobylak are you running 2 KIC replicas or just one? We had seen this behavior in Kong/KIC v2 with Gateway Discovery turned on and running KIC with 2 replicas. Haven't experienced this in v3 yet.
