When installing CSM 1.6, cray-shared-kafka-kafka-*
pods in the services namespace fail to come up which results in CFS-API
pods in CLBO state. This happens because of an issue with Zookeeper related to slow DNS.
Zookeeper fails to come up if the DNS is not set up for all hosts at startup. When this happens, the cluster gets stuck at Zookeeper pods running, but brokers not coming up.
When the issue occurs, the cray-shared-kafka-kafka-*
pods in the services namespace fail to come up and will not be present.
Also, CFS-API
pods will be in CLBO state.
The logs from strimzi-cluster-operator-*
pod in the operators namespace will be throwing errors as follows:
2024-10-04T22:16:54.899932465Z 2024-10-04 22:16:54 ERROR StaticHostProvider:148 - Unable to resolve address: cray-shared-kafka-zookeeper-0.cray-shared-kafka-zookeeper-nodes.services.svc/<unresolved>:2181
2024-10-04T22:16:54.899952739Z java.net.UnknownHostException: cray-shared-kafka-zookeeper-0.cray-shared-kafka-zookeeper-nodes.services.svc: Name or service not known
2024-10-04T22:21:54.061164856Z 2024-10-04 22:21:54 ERROR VertxUtil:127 - Reconciliation #1(watch) Kafka(services/cray-shared-kafka):Exceeded timeout of 300000ms while waiting for ZooKeeperAdmin connection to cray-shared-kafka-zookeeper-0.cray-shared-kafka-zookeeper-nodes.services.svc:2181,cray-shared-kafka-zookeeper-1.cray-shared-kafka-zookeeper-nodes.services.svc:2181,cray-shared-kafka-zookeeper-2.cray-shared-kafka-zookeeper-nodes.services.svc:2181 to be connected
2024-10-04T22:21:54.061644246Z 2024-10-04 22:21:54 WARN ZookeeperScaler:157 - Reconciliation #1(watch) Kafka(services/cray-shared-kafka): Failed to connect to Zookeeper cray-shared-kafka-zookeeper-0.cray-shared-kafka-zookeeper-nodes.services.svc:2181,cray-shared-kafka-zookeeper-1.cray-shared-kafka-zookeeper-nodes.services.svc:2181,cray-shared-kafka-zookeeper-2.cray-shared-kafka-zookeeper-nodes.services.svc:2181. Connection was not ready in 300000 ms.
2024-10-04T22:21:54.466771715Z 2024-10-04 22:21:54 WARN ZooKeeperReconciler:834 - Reconciliation #1(watch) Kafka(services/cray-shared-kafka): Failed to verify Zookeeper configuration
This problem can be triggered by events like:
- Slow DNS propagation to Kubernetes DNS subsystem
The workaround is to delete the zookeeper pods and let them be re-created by the Strimzi operator.
kubectl delete pods -n services -l strimzi.io/controller-name=cray-shared-kafka-zookeeper