Strimzi-Kafka-Operator Bug - Missing Schema Validation #286

Closed
MarkintoshZ opened this issue Nov 23, 2023 · 14 comments

@MarkintoshZ
Collaborator

MarkintoshZ commented Nov 23, 2023

Part of the work in #274

Overview

The strimzi-cluster-operator throws a NullPointerException in a loop due to missing schema validation in the CRD.

According to the docs, checkIssuer is one of many optional configurations that can be applied when the authentication type is oauth. When checkIssuer is provided without the oauth authentication type (e.g. using custom instead), it is expected that the operator rejects the config through apiserver-side validation. In reality, however, the additional (optional) config causes the strimzi-cluster-operator to throw a NullPointerException in a loop, and the state never reconciles.

Steps to Reproduce

Deploy a Kafka cluster with one of the listeners' authentication set to the custom type, while setting checkIssuer to true:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: test-cluster
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    replicas: 1
    listeners:
    - authentication:
        type: custom
        checkIssuer: true
      name: plain
      port: 9092
      tls: false
      type: internal
    storage:
      type: ephemeral
  zookeeper:
    replicas: 1
    storage:
      type: ephemeral

Deploy with Strimzi operator version 0.38.0:

kubectl create namespace kafka
kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka'
kubectl apply -f cr.yaml -n kafka

Observe the exception logs from the strimzi-cluster-operator pod:

2023-11-22 23:34:24 INFO  AbstractOperator:511 - Reconciliation #1(watch) Kafka(kafka/test-cluster): Kafka test-cluster in namespace kafka was ADDED
2023-11-22 23:34:24 WARN  StatusUtils:159 - Reconciliation #1(watch) Kafka(kafka/test-cluster): Resource Kafka(kafka/test-cluster) contains object at path spec.kafka.listeners.auth with an unknown property: checkIssuer
2023-11-22 23:34:24 WARN  StatusUtils:159 - Reconciliation #1(watch) Kafka(kafka/test-cluster): Resource Kafka(kafka/test-cluster) contains object at path spec.kafka.listeners.auth with an unknown property: checkIssuer
2023-11-22 23:34:24 INFO  AbstractOperator:265 - Reconciliation #1(watch) Kafka(kafka/test-cluster): Kafka test-cluster will be checked for creation or modification
2023-11-22 23:34:24 INFO  CrdOperator:131 - Reconciliation #1(watch) Kafka(kafka/test-cluster): Status of Kafka test-cluster in namespace kafka has been updated
2023-11-22 23:34:24 INFO  AbstractOperator:511 - Reconciliation #2(watch) Kafka(kafka/test-cluster): Kafka test-cluster in namespace kafka was MODIFIED
2023-11-22 23:34:24 INFO  Ca:988 - Reconciliation #1(watch) Kafka(kafka/test-cluster): Generating CA with subject=Subject(organizationName='io.strimzi', commonName='cluster-ca v0', dnsNames=[], ipAddresses=[])
2023-11-22 23:34:24 INFO  Ca:988 - Reconciliation #1(watch) Kafka(kafka/test-cluster): Generating CA with subject=Subject(organizationName='io.strimzi', commonName='clients-ca v0', dnsNames=[], ipAddresses=[])
2023-11-22 23:34:25 INFO  Ca:417 - Reconciliation #1(watch) Kafka(kafka/test-cluster): Generating certificate Subject(organizationName='io.strimzi', commonName='cluster-operator', dnsNames=[], ipAddresses=[]), signed by CA cluster-ca
2023-11-22 23:34:25 INFO  Ca:417 - Reconciliation #1(watch) Kafka(kafka/test-cluster): Generating certificate Subject(organizationName='io.strimzi', commonName='test-cluster-zookeeper', dnsNames=[test-cluster-zookeeper-client.kafka, test-cluster-zookeeper-client.kafka.svc, *.test-cluster-zookeeper-nodes.kafka.svc, test-cluster-zookeeper-0.test-cluster-zookeeper-nodes.kafka.svc.cluster.local, *.test-cluster-zookeeper-nodes.kafka.svc.cluster.local, test-cluster-zookeeper-0.test-cluster-zookeeper-nodes.kafka.svc, test-cluster-zookeeper-client, test-cluster-zookeeper-0, test-cluster-zookeeper-client.kafka.svc.cluster.local, *.test-cluster-zookeeper-client.kafka.svc, *.test-cluster-zookeeper-client.kafka.svc.cluster.local], ipAddresses=[]), signed by CA cluster-ca
2023-11-22 23:34:26 INFO  StrimziPodSetController:378 - Reconciliation #3(watch) StrimziPodSet(kafka/test-cluster-zookeeper): StrimziPodSet will be reconciled
2023-11-22 23:34:26 INFO  StrimziPodSetController:414 - Reconciliation #3(watch) StrimziPodSet(kafka/test-cluster-zookeeper): reconciled
2023-11-22 23:34:26 INFO  StrimziPodSetController:378 - Reconciliation #4(watch) StrimziPodSet(kafka/test-cluster-zookeeper): StrimziPodSet will be reconciled
2023-11-22 23:34:26 INFO  StrimziPodSetController:414 - Reconciliation #4(watch) StrimziPodSet(kafka/test-cluster-zookeeper): reconciled
2023-11-22 23:34:26 INFO  StrimziPodSetController:378 - Reconciliation #5(watch) StrimziPodSet(kafka/test-cluster-zookeeper): StrimziPodSet will be reconciled
2023-11-22 23:34:26 INFO  StrimziPodSetController:414 - Reconciliation #5(watch) StrimziPodSet(kafka/test-cluster-zookeeper): reconciled
2023-11-22 23:34:30 INFO  StrimziPodSetController:378 - Reconciliation #6(watch) StrimziPodSet(kafka/test-cluster-zookeeper): StrimziPodSet will be reconciled
2023-11-22 23:34:30 INFO  StrimziPodSetController:414 - Reconciliation #6(watch) StrimziPodSet(kafka/test-cluster-zookeeper): reconciled
2023-11-22 23:34:30 INFO  StrimziPodSetController:378 - Reconciliation #7(watch) StrimziPodSet(kafka/test-cluster-zookeeper): StrimziPodSet will be reconciled
2023-11-22 23:34:30 INFO  StrimziPodSetController:414 - Reconciliation #7(watch) StrimziPodSet(kafka/test-cluster-zookeeper): reconciled
2023-11-22 23:34:53 INFO  StrimziPodSetController:378 - Reconciliation #8(watch) StrimziPodSet(kafka/test-cluster-zookeeper): StrimziPodSet will be reconciled
2023-11-22 23:34:53 INFO  StrimziPodSetController:414 - Reconciliation #8(watch) StrimziPodSet(kafka/test-cluster-zookeeper): reconciled
2023-11-22 23:35:10 INFO  StrimziPodSetController:378 - Reconciliation #9(watch) StrimziPodSet(kafka/test-cluster-zookeeper): StrimziPodSet will be reconciled
2023-11-22 23:35:10 INFO  StrimziPodSetController:414 - Reconciliation #9(watch) StrimziPodSet(kafka/test-cluster-zookeeper): reconciled
2023-11-22 23:35:10 INFO  StrimziPodSetController:378 - Reconciliation #10(watch) StrimziPodSet(kafka/test-cluster-zookeeper): StrimziPodSet will be reconciled
2023-11-22 23:35:10 INFO  StrimziPodSetController:414 - Reconciliation #10(watch) StrimziPodSet(kafka/test-cluster-zookeeper): reconciled
2023-11-22 23:35:12 INFO  Ca:417 - Reconciliation #1(watch) Kafka(kafka/test-cluster): Generating certificate Subject(organizationName='io.strimzi', commonName='test-cluster-kafka', dnsNames=[test-cluster-kafka-brokers.kafka.svc, test-cluster-kafka-brokers.kafka.svc.cluster.local, test-cluster-kafka-0.test-cluster-kafka-brokers.kafka.svc, test-cluster-kafka-brokers.kafka, test-cluster-kafka-brokers, test-cluster-kafka-bootstrap.kafka.svc, test-cluster-kafka-bootstrap.kafka.svc.cluster.local, test-cluster-kafka-bootstrap.kafka, test-cluster-kafka-0.test-cluster-kafka-brokers.kafka.svc.cluster.local, test-cluster-kafka-bootstrap], ipAddresses=[]), signed by CA cluster-ca
2023-11-22 23:35:12 ERROR AbstractOperator:284 - Reconciliation #1(watch) Kafka(kafka/test-cluster): createOrUpdate failed
java.lang.NullPointerException: Cannot invoke "java.util.Map.entrySet()" because the return value of "io.strimzi.api.kafka.model.listener.KafkaListenerAuthenticationCustom.getListenerConfig()" is null
        at io.strimzi.operator.cluster.model.KafkaBrokerConfigurationBuilder.configureAuthentication(KafkaBrokerConfigurationBuilder.java:498) ~[io.strimzi.cluster-operator-0.38.0.jar:0.38.0]
        at io.strimzi.operator.cluster.model.KafkaBrokerConfigurationBuilder.withListeners(KafkaBrokerConfigurationBuilder.java:294) ~[io.strimzi.cluster-operator-0.38.0.jar:0.38.0]
        at io.strimzi.operator.cluster.model.KafkaCluster.generatePerBrokerConfiguration(KafkaCluster.java:1619) ~[io.strimzi.cluster-operator-0.38.0.jar:0.38.0]
        at io.strimzi.operator.cluster.model.KafkaCluster.generatePerBrokerConfigurationConfigMaps(KafkaCluster.java:1662) ~[io.strimzi.cluster-operator-0.38.0.jar:0.38.0]
        at io.strimzi.operator.cluster.operator.assembly.KafkaReconciler.lambda$perBrokerKafkaConfiguration$42(KafkaReconciler.java:632) ~[io.strimzi.cluster-operator-0.38.0.jar:0.38.0]
        at io.vertx.core.impl.future.Composition.onSuccess(Composition.java:38) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
        at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:60) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
        at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:211) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
        at io.vertx.core.impl.future.PromiseImpl.tryComplete(PromiseImpl.java:23) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
        at io.vertx.core.impl.future.PromiseImpl.onSuccess(PromiseImpl.java:49) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
        at io.vertx.core.impl.future.FutureBase.lambda$emitSuccess$0(FutureBase.java:54) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
        at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569) ~[io.netty.netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
        at java.lang.Thread.run(Thread.java:840) ~[?:?]
2023-11-22 23:35:12 INFO  CrdOperator:131 - Reconciliation #1(watch) Kafka(kafka/test-cluster): Status of Kafka test-cluster in namespace kafka has been updated
2023-11-22 23:35:12 INFO  AbstractOperator:511 - Reconciliation #11(watch) Kafka(kafka/test-cluster): Kafka test-cluster in namespace kafka was MODIFIED
2023-11-22 23:35:12 WARN  StatusUtils:159 - Reconciliation #11(watch) Kafka(kafka/test-cluster): Resource Kafka(kafka/test-cluster) contains object at path spec.kafka.listeners.auth with an unknown property: checkIssuer
2023-11-22 23:35:12 WARN  StatusUtils:159 - Reconciliation #11(watch) Kafka(kafka/test-cluster): Resource Kafka(kafka/test-cluster) contains object at path spec.kafka.listeners.auth with an unknown property: checkIssuer
2023-11-22 23:35:12 INFO  AbstractOperator:265 - Reconciliation #11(watch) Kafka(kafka/test-cluster): Kafka test-cluster will be checked for creation or modification
2023-11-22 23:35:12 WARN  AbstractOperator:557 - Reconciliation #1(watch) Kafka(kafka/test-cluster): Failed to reconcile
java.lang.NullPointerException: Cannot invoke "java.util.Map.entrySet()" because the return value of "io.strimzi.api.kafka.model.listener.KafkaListenerAuthenticationCustom.getListenerConfig()" is null
        at io.strimzi.operator.cluster.model.KafkaBrokerConfigurationBuilder.configureAuthentication(KafkaBrokerConfigurationBuilder.java:498) ~[io.strimzi.cluster-operator-0.38.0.jar:0.38.0]
        at io.strimzi.operator.cluster.model.KafkaBrokerConfigurationBuilder.withListeners(KafkaBrokerConfigurationBuilder.java:294) ~[io.strimzi.cluster-operator-0.38.0.jar:0.38.0]
        at io.strimzi.operator.cluster.model.KafkaCluster.generatePerBrokerConfiguration(KafkaCluster.java:1619) ~[io.strimzi.cluster-operator-0.38.0.jar:0.38.0]
        at io.strimzi.operator.cluster.model.KafkaCluster.generatePerBrokerConfigurationConfigMaps(KafkaCluster.java:1662) ~[io.strimzi.cluster-operator-0.38.0.jar:0.38.0]
        at io.strimzi.operator.cluster.operator.assembly.KafkaReconciler.lambda$perBrokerKafkaConfiguration$42(KafkaReconciler.java:632) ~[io.strimzi.cluster-operator-0.38.0.jar:0.38.0]
        at io.vertx.core.impl.future.Composition.onSuccess(Composition.java:38) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
        at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:60) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
        at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:211) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
        at io.vertx.core.impl.future.PromiseImpl.tryComplete(PromiseImpl.java:23) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
        at io.vertx.core.impl.future.PromiseImpl.onSuccess(PromiseImpl.java:49) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
        at io.vertx.core.impl.future.FutureBase.lambda$emitSuccess$0(FutureBase.java:54) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
        at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569) ~[io.netty.netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
        at java.lang.Thread.run(Thread.java:840) ~[?:?]
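
The immediate cause of the NullPointerException is that configureAuthentication() iterates over the value returned by KafkaListenerAuthenticationCustom.getListenerConfig(), which is null when a custom listener defines no additional options. Independently of the schema-level fix proposed below, a null guard at that point would prevent the crash (although the listener would still be missing its custom options). A minimal, hypothetical sketch of the guarded pattern, with illustrative names rather than the actual Strimzi code:

import java.util.Map;

// Hypothetical sketch only: class, method, and parameter names are illustrative
// and do not match the actual Strimzi KafkaBrokerConfigurationBuilder code.
public class ListenerConfigGuardSketch {

    // Guarded rendering of custom listener options: when the CR sets
    // `type: custom` without any listener options, the config map is null,
    // so the loop is skipped instead of calling entrySet() on null.
    static String renderListenerConfig(String listenerName, Map<String, Object> listenerConfig) {
        StringBuilder sb = new StringBuilder();
        if (listenerConfig != null) {
            for (Map.Entry<String, Object> entry : listenerConfig.entrySet()) {
                sb.append(listenerName).append('.')
                  .append(entry.getKey()).append('=')
                  .append(entry.getValue()).append(System.lineSeparator());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // null config (as in the reproducer above): empty output instead of an NPE
        System.out.print(renderListenerConfig("listener.name.plain", null));
        // non-null config: options are rendered as before
        System.out.print(renderListenerConfig("listener.name.plain",
                Map.<String, Object>of("sasl.enabled.mechanisms", "PLAIN")));
    }
}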

Potential Fix

One potential fix is to use Kubernetes CRD validation rules to reject semantically incorrect configs, avoiding runtime errors caused by unexpected input configurations.

We have tried this fix and confirmed that it resolves the problem. We are happy to submit a PR for it.

For example:

authentication:
  type: object
  x-kubernetes-validations:
    - rule: "self.type == 'oauth' || !has(self.checkIssuer)"
      message: "checkIssuer property is only valid when authentication type is 'oauth'."
  properties:
    type:
      type: string
      enum:
      - tls
      - scram-sha-512
      - oauth
      - custom
    checkIssuer:
      type: boolean
      description: Enable or disable issuer checking. By default issuer is checked using the value configured by `validIssuerUri`. Default value is `true`.
    ...
@tianyin
Member

tianyin commented Nov 23, 2023

WoWWWW

@tylergu
Member

tylergu commented Nov 27, 2023

Great work @MarkintoshZ !

I am wondering if it is possible to minimize the CR that reproduces this bug. It would greatly help developers spot the triggering condition.

By the way, have you tried to validate whether the proposed fix actually resolves this bug?

@MarkintoshZ
Collaborator Author

MarkintoshZ commented Nov 27, 2023

Thanks @tylergu @tianyin!

I just updated the issue with the minimized CR and validated the proposed fix.

@tylergu
Member

tylergu commented Nov 27, 2023

@MarkintoshZ I made some small edits to the report. Feel free to report it to the developers.
It would be best if you could submit a fix PR after you report this issue.

@MarkintoshZ
Collaborator Author

Great! I'll post it now.

@tylergu
Member

tylergu commented Nov 27, 2023

It would be best if you can submit a fix PR after you report this issue. You can research a bit on how to add CRD schema validation for Java operators -- these CRDs are typically auto-generated in big operator repos, so you would likely need to modify some code to add this validation rule to the auto-generation process.

@tylergu
Member

tylergu commented Nov 27, 2023

For reference, Golang operators' validation rules are added by writing markers in the comments: https://book.kubebuilder.io/reference/markers.html

You can check whether the Java CRD generator has a similar mechanism.

@MarkintoshZ
Collaborator Author

MarkintoshZ commented Nov 27, 2023

Posted here. They are using an issue template, so the formatting and layout are a little different.

I'll look into adding schema validation support to the Java CRD generator.

@tianyin
Member

tianyin commented Nov 27, 2023

@MarkintoshZ I remember you mentioned that you found two bugs. How is the other bug coming along?

@tianyin
Member

tianyin commented Nov 27, 2023

Could you also update this file?
https://github.com/xlab-uiuc/acto/blob/main/bugs.md

@MarkintoshZ
Collaborator Author

@tianyin The other one is a false positive alarm due to premature identification of system convergence.

@tianyin
Member

tianyin commented Nov 27, 2023

The other one is a false positive alarm due to premature identification of system convergence.

Is it a bug in the Acto code?

@MarkintoshZ
Collaborator Author

No, it should be fixed by extending the timeout in the config file.

I think there might be more bugs in the operator that we could find by improving how we identify error states and how schema matching is currently done with the Java operator. I will write down the ideas in #274 and do another Acto run sometime soon.

@xlab-uiuc deleted a comment from tylergu Nov 27, 2023
@xlab-uiuc deleted a comment from MarkintoshZ Nov 27, 2023
@tianyin
Member

tianyin commented Nov 27, 2023

I am closing the issue as it has been reported. I will leave #274 open.

@tianyin closed this as completed Nov 27, 2023