This document provides links to troubleshooting information for services and functionality provided by CSM.
- Helpful tips for navigating the CSM repository
- Known issues
- Booting
- Configuration management
- ConMan
- Customer Management Network (CMN)
- Domain Name Service (DNS)
- Grafana dashboards
- Kubernetes
- MetalLB
- Node management
- Security and authentication
- Spire
- Utility storage
On the main repository landing page, change the branch to the CSM version in use on the system (for example, `release/1.0`, `release/1.2`, or `release/1.3`).
Use the pre-populated GitHub "Search or jump to..." function in the upper left-hand corner of the page and append keywords related to the problem being seen to the existing search. (The example searches for "ping" and "PXE" related troubleshooting resources on the "main" branch.)
- Follow any run-books, guides, or procedures which are directly related to the problem.
- Change the branch to `main` and search a second time to retrieve very recent or beta run-books and guides.
- Users can also expand the search beyond the "troubleshooting" section (instead of doing "path troubleshooting") and/or use more advanced GitHub searches such as "path configure" to find the right context.
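The search tips above can be sketched as a small helper that assembles a GitHub code-search URL. This is only an illustration: the repository name `Cray-HPE/docs-csm`, the helper name `build_search_url`, and the `repo:` / `path:` qualifier spelling are assumptions not stated in this document, so adjust them to match what the GitHub search box actually pre-populates.

```python
from urllib.parse import urlencode


def build_search_url(keywords, path="troubleshooting", repo="Cray-HPE/docs-csm"):
    """Build a GitHub code-search URL limited to one repository and path.

    The repo default and the qualifier syntax are assumptions made for
    illustration; pass path=None to search the whole repository.
    """
    terms = [f"repo:{repo}"]
    if path:
        terms.append(f"path:{path}")
    terms.extend(keywords)
    # urlencode percent-encodes ":" and "/" and joins the terms with "+".
    return "https://github.com/search?" + urlencode({"q": " ".join(terms), "type": "code"})


# Search the troubleshooting section for "ping" and "PXE".
print(build_search_url(["ping", "PXE"]))
# Widen the search to a different part of the repository.
print(build_search_url(["ping", "PXE"], path="configure"))
```

Passing a different `path` value mirrors the suggestion to look beyond the "troubleshooting" section when the first search does not surface a relevant run-book.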
- SAT/HSM/CAPMC/PCS Component Power State Mismatch
- HMS Discovery job not creating `RedfishEndpoint`s in Hardware State Manager
- SSL Certificate Validation Issues
- SLS Not Working During Node Rebuild
- Antero node NID allocation
- Software Management Services health check
- QLogic driver crash
- Nexus Fails Authentication with Keycloak Users
- Keycloak Error "Cannot read properties" in Web UI
- Gigabyte BMC Missing Redfish Data
- `admin-client-auth` Not Found
- Ceph OSD latency
- Cray CLI 403 Forbidden Errors
- Flags Set For Nodes In HSM
- Goss Test Fails with Connection Refused
- Helm Chart Deploy Timeouts
- HPE iLO dropping event subscriptions and not properly transitioning power state in CSM software
- IMS image creation failure
- `initrd.img.xz` Not Found
- NCN health checks known issues
- `kubectl logs -f` returns no space left on device
- Kubernetes Master or Worker node's root filesystem is out of space
- Mellanox `lacp-individual` Limitations
- NCN resource checks known issues
- Spire database connection pool configuration in an air-gapped environment
- Spire Database Cluster DNS Lookup Failure
- Postgres Database is in Recovery
- Test Failures Due To No Discovered Compute Nodes In HSM
- Velero Version Mismatch
- wait for unbound hang
- Product Catalog Upgrade Error
- Missing Binaries in aarch64 Images
- PCS and CAPMC Transaction Size Limitation
- Istio-Proxy failing with too many open files
- IMS image delete loses the `arch` information
- Spire pods stuck in `PodInitializing`
- CFS Component With Zero-Length ID
- Issues Related to Unified Extensible Firmware Interface (UEFI)
- Issues Related to Dynamic Host Configuration Protocol (DHCP)
- Issues Related to the Boot Script Service
- Issues Related to Trivial File Transfer Protocol (TFTP)
- Troubleshooting Using Kubernetes
- Log File Locations and Ports Used
- Console Services Troubleshooting Guide
- ConMan Blocking Access to a Node BMC
- ConMan Failing to Connect to a Console
- ConMan Asking for Password on SSH Connection
- Console Node Pod Stuck in Terminating State
- DHCP run book
- DNS run book
- General configuration and troubleshooting
- Troubleshoot CMN Issues
- Troubleshoot DHCP Issues
- Troubleshoot Common DNS Issues
- Troubleshoot PowerDNS Issues
- Troubleshoot Common DNS configuration Issues
- Troubleshoot External DNS Issues
- Troubleshoot BGP not accepting routes from MetalLB
- Troubleshoot BGP services without an allocated IP address
- Troubleshoot PXE boot
- General Kubernetes Commands for Troubleshooting
- Kubernetes Log File Locations
- Liveliness or Readiness Probe Failures
- Unresponsive `kubectl` Commands
- Kubernetes Node `NotReady`
- Kubernetes Pods not Starting
- Postgres Database
- Recover from Postgres WAL Event
- Restore Postgres
- Disaster Recovery for Postgres
- Postgres Database is in Recovery
- Issues with Redfish Endpoint `DiscoveryCheck` for Redfish Events from Nodes
- Interfaces with IP Address Issues
- Loss of Console Connections and Logs on Gigabyte Nodes