I have a Grafana dashboard configured with the query below, but it constantly shows a 100% unhealthy status across most of the nodes. I'm not sure whether we are hitting the default 300 ms ping timeout. Can someone help explain why this occurs and how to triage further?
(Screenshot: 6hr metric view)
Grafana Query:
sum(increase(goldpinger_nodes_health_total{cluster="$cluster",goldpinger_instance="$instance",status="unhealthy"}[15m])) by (goldpinger_instance)
/
(
  sum(increase(goldpinger_nodes_health_total{cluster="$cluster",goldpinger_instance="$instance",status="healthy"}[15m])) by (goldpinger_instance)
  +
  sum(increase(goldpinger_nodes_health_total{cluster="$cluster",goldpinger_instance="$instance",status="unhealthy"}[15m])) by (goldpinger_instance)
)
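Before assuming a bug, it may help to chart the raw counters instead of the ratio, to see whether any healthy samples are being recorded at all. A minimal sanity-check query, using the same metric and labels as above:

sum(increase(goldpinger_nodes_health_total{cluster="$cluster",goldpinger_instance="$instance"}[15m])) by (goldpinger_instance, status)

If the healthy series stays flat at zero while unhealthy keeps climbing, every ping really is failing and the dashboard ratio is reporting correctly.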
Repeated warning message in the Goldpinger pod logs:
{"level":"warn","ts":1669303893.1442885,"caller":"goldpinger/pinger.go:151","msg":"Ping returned error","op":"pinger","name":"goldpinger","hostIP":"XX.XX.XX.XX","podIP":"XX.XX.XX.XX","responseTime":0.300629455,"error":"Get "http://XX.XX.XX.XX:8080/ping\": context deadline exceeded"}
surendarmsk1 changed the title from "prometheus metric of unhealthy node shows 100% unhealthy always" to "prometheus metric shows Node as 100% unhealthy always" on Nov 24, 2022.