How Kafka lag threshold is computed #6135

MakeshKathirvel · 2024-09-04T23:02:07Z

Hi Team,

We are using KEDA to scale Kubernetes replicas based on a Kafka lag threshold, configured through a ScaledObject resource.
We are using Kafka Lag Exporter to track the lag of the consumer group, with these metrics:

SUM of lag - sum(kafka_consumergroup_lag{cluster="$Cluster",consumergroup="${consumerGroup}",topic="${topic}"})
AVG of lag - avg(kafka_consumergroup_lag{cluster="$Cluster",consumergroup="${consumerGroup}",topic="${topic}"})

We have two problems:

1. Our microservice replicas flap frequently because of autoscaling. Each scaling event triggers a Kafka rebalance, which is followed by lag on certain partitions.
2. During periods of lag, the replicas do not reach the max replica count. In one case, lag was active for 1 hour and the replicas scaled up to 17 rather than 18 (the topic has 18 partitions).

As part of our analysis, we queried the endpoint below and saw a different value (neither the sum nor the average of the lag):
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/YOUR_NAMESPACE/YOUR_METRIC_NAME?labelSelector=scaledobject.keda.sh%2Fname%3D{SCALED_OBJECT_NAME}"

{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{},"items":[{"metricName":"s1-kafka-TEST","metricLabels":null,"timestamp":"2024-09-04T22:59:16Z","value":"678"}]}
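
When comparing this response with the Prometheus queries, it may help to pull out the value together with its timestamp so that both sides are read at the same instant. A minimal sketch, assuming the response body has been saved as a string and that `value` is a plain integer (Kubernetes quantities can also carry suffixes such as `m`, which this does not handle); `read_external_metric` is a hypothetical helper name:

```python
import json

# Response body copied from the post above.
BODY = '{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{},"items":[{"metricName":"s1-kafka-TEST","metricLabels":null,"timestamp":"2024-09-04T22:59:16Z","value":"678"}]}'


def read_external_metric(raw: str) -> list[tuple[str, str, int]]:
    """Return (metricName, timestamp, value) for each item in the list."""
    items = json.loads(raw)["items"]
    # int() assumes a plain integer string; suffixed quantities need extra parsing.
    return [(i["metricName"], i["timestamp"], int(i["value"])) for i in items]


for name, ts, value in read_external_metric(BODY):
    print(f"{name} at {ts}: {value}")
```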

We would like to understand how these values are computed so that we can set an appropriate threshold and understand the expected behavior.
Is it the average of all lag, the sum of all lag, or some other computation?

CHART        APP VERSION
keda-2.9.4   2.9.3

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    meta.helm.sh/release-name: <microservice_name>
    meta.helm.sh/release-namespace:
  finalizers:
  - finalizer.keda.sh
  generation: 8
  labels:
    app.kubernetes.io/managed-by: Helm
    scaledobject.keda.sh/name: <microservice_name>
  name: <microservice_name>
  namespace:
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          policies:
          - periodSeconds: 60
            type: Percent
            value: 10
          stabilizationWindowSeconds: 300
        scaleUp:
          policies:
          - periodSeconds: 60
            type: Percent
            value: 10
          stabilizationWindowSeconds: 0
  cooldownPeriod: 120
  maxReplicaCount: 18
  minReplicaCount: 6
  pollingInterval: 10
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <microservice_name>
  triggers:
  - metadata:
      metricName: cpu_average
      type: AverageValue
      value: "85"
    type: cpu
  - metadata:
      bootstrapServers:
      consumerGroup:
      lagThreshold: "200"
      topic:
    type: kafka
status:
  conditions:
  - message: ScaledObject is defined correctly and is ready for scaling
    reason: ScaledObjectReady
    status: "True"
    type: Ready
  - message: Scaling is performed because triggers are active
    reason: ScalerActive
    status: "True"
    type: Active
  - message: No fallbacks are active on this scaled object
    reason: NoFallbackFound
    status: "False"
    type: Fallback
  externalMetricNames:
  - s1-kafka-TEST
  health:
    s1-kafka-TEST:
      numberOfFailures: 0
      status: Happy
  hpaName: keda-hpa-<microservice_name>
  lastActiveTime: "2024-09-04T22:48:18Z"
  originalReplicaCount: 3
  resourceMetricNames:
  - cpu
  scaleTargetGVKR:
    group: apps
    kind: Deployment
    resource: deployments
    version: v1
  scaleTargetKind: apps/v1.Deployment
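
The arithmetic behind this question can be sketched roughly as follows. This is a simplified model, not KEDA's or Kubernetes' actual code. It assumes that the Kafka trigger reports the total consumer-group lag as an external metric with an `AverageValue` target equal to `lagThreshold`, that the scaler caps the reported lag at partitions × `lagThreshold` so the implied replica count does not exceed the partition count, and that the HPA applies its default 10% tolerance. The function names and the lag figure of 10,000 are hypothetical; the scale-up/scale-down policies and the CPU trigger are not modeled, and the behavior should be checked against the KEDA version in use.

```python
import math


def reported_lag(total_lag: int, partitions: int, lag_threshold: int) -> int:
    """Cap the lag at partitions * threshold (assumed scaler behavior)."""
    return min(total_lag, partitions * lag_threshold)


def desired_replicas(metric: int, target: int, current: int,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule for an external metric with an AverageValue target."""
    usage_ratio = metric / (target * current)
    if abs(1.0 - usage_ratio) <= tolerance:
        desired = current  # inside the tolerance band: keep the current count
    else:
        desired = math.ceil(metric / target)
    # Clamp to the minReplicaCount/maxReplicaCount bounds.
    return max(min_replicas, min(max_replicas, desired))


# Values from the ScaledObject above: lagThreshold 200, 18 partitions, 6..18 replicas.
capped = reported_lag(total_lag=10_000, partitions=18, lag_threshold=200)
print(capped)                                    # 3600
print(desired_replicas(capped, 200, 17, 6, 18))  # 17
print(desired_replicas(capped, 200, 10, 6, 18))  # 18
print(desired_replicas(678, 200, 6, 6, 18))      # 6
```

Under these assumptions, a capped lag of 3600 at 17 replicas gives a usage ratio of about 1.06, which is inside the tolerance band, so the model stays at 17 instead of moving to 18. That would be consistent with the behavior described above, although it does not prove that this is the cause.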

zroubalik (Maintainer) · 2024-12-04T10:21:50Z

Your KEDA version is way too old; please update.
