When a worker node goes down unexpectedly, a `cray-console-node` pod running on that worker may get stuck in a `Terminating` state that prevents it from moving to a healthy worker. This can leave consoles unmonitored and unavailable for interactive access.
Determine whether the `cray-console-node` pod is stuck in `Terminating` because the worker node it is running on has failed, or for some other reason. Apply this fix only when the worker node did not properly drain before being shut down.
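One way to check why the pod is stuck is to inspect its status and recent events; this is a minimal sketch, assuming the stuck pod is named `cray-console-node-0` as in the example output below:

```bash
# Show the pod's status, conditions, and recent events.
# Events indicating that the node is unreachable or NotReady point to the
# failed-worker scenario that this procedure addresses.
kubectl -n services describe pod cray-console-node-0
```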
- (`ncn-mw#`) Find which worker the `Terminating` pod is running on.

  ```bash
  kubectl -n services get pods -o wide | grep console
  ```

  Example output:

  ```text
  cray-console-data-6ff47b7454-ch5dj       2/2   Running       0   4d3h    ncn-w003
  cray-console-data-postgres-0             3/3   Running       0   5h50m   ncn-w002
  cray-console-data-postgres-1             3/3   Terminating   0   3d22h   ncn-w001
  cray-console-data-postgres-2             3/3   Running       0   4d1h    ncn-w003
  cray-console-node-0                      3/3   Terminating   0   16h     ncn-w001
  cray-console-node-1                      3/3   Running       0   4d22h   ncn-w002
  cray-console-operator-575d8b9f9d-s95v9   2/2   Running       0   30m     ncn-w002
  cray-console-operator-575d8b9f9d-x2bvm   2/2   Terminating   0   16h     ncn-w001
  ```
  In this example, `cray-console-node-0` is stuck in `Terminating` and running on `ncn-w001`.

  **NOTE:** Other pods are also stuck in `Terminating`, but only the `cray-console-node` pods need to be manually terminated and forced to a different worker node.
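  To confirm which node a specific pod is scheduled on without scanning the full listing, the node name can be printed directly; a minimal sketch using the pod name from the example above:

  ```bash
  # Print only the name of the node that the pod is scheduled on.
  kubectl -n services get pod cray-console-node-0 -o jsonpath='{.spec.nodeName}'
  ```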
- (`ncn-mw#`) Find the state of the worker node.

  ```bash
  kubectl get nodes
  ```

  Example output:

  ```text
  NAME       STATUS     ROLES                  AGE   VERSION
  ncn-m001   Ready      control-plane,master   34d   v1.21.12
  ncn-m002   Ready      control-plane,master   35d   v1.21.12
  ncn-m003   Ready      control-plane,master   35d   v1.21.12
  ncn-w001   NotReady   <none>                 35d   v1.21.12
  ncn-w002   Ready      <none>                 35d   v1.21.12
  ncn-w003   Ready      <none>                 35d   v1.21.12
  ```
  In this example, the worker node `ncn-w001` is not reporting to the cluster.
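  For more detail on why the node is `NotReady`, its conditions and recent events can be inspected; a minimal sketch using the node name from the example above:

  ```bash
  # Show node conditions (Ready, MemoryPressure, etc.) and recent events.
  # A Ready condition of "Unknown" typically means the kubelet has stopped
  # reporting, which matches an unexpected shutdown.
  kubectl describe node ncn-w001
  ```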
- Wait some time to see if the worker node rejoins the cluster.

  If the node rejoins the cluster, then the issue will resolve itself with no further manual intervention. If too much time has passed and the node is not recovering on its own, then perform the following steps to force the `cray-console-node` pod to move to a different worker.
  A force terminate removes the old pod so that it starts up on a healthy worker. There is a slight chance that, if the old pod is actually still working despite the node appearing unhealthy, the new and old pods will conflict. Only do the following if the worker node is down.
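  To watch for the node rejoining without rerunning the command by hand, the node status can be watched; a minimal sketch using the node name from the example above:

  ```bash
  # Block and print a new line each time the node's status changes;
  # press Ctrl+C to stop. The node has rejoined when STATUS returns to Ready.
  kubectl get node ncn-w001 --watch
  ```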
- (`ncn-mw#`) Force terminate the `cray-console-node` pod.

  In the example above, the `cray-console-node-0` pod was on the worker node that shut down unexpectedly, and the following command uses that pod name. Be sure to substitute the actual pod name before running it.

  ```bash
  kubectl -n services delete pod cray-console-node-0 --grace-period=0 --force
  ```
- (`ncn-mw#`) Wait for the new pod to restart on a healthy worker.

  ```bash
  kubectl -n services get pods -o wide | grep console
  ```

  Example output when healthy:

  ```text
  cray-console-data-6ff47b7454-ch5dj       2/2   Running       0   4d3h    ncn-w003
  cray-console-data-postgres-0             3/3   Running       0   5h50m   ncn-w002
  cray-console-data-postgres-1             3/3   Terminating   0   3d22h   ncn-w001
  cray-console-data-postgres-2             3/3   Running       0   4d1h    ncn-w003
  cray-console-node-0                      3/3   Running       0   2m      ncn-w003
  cray-console-node-1                      3/3   Running       0   4d22h   ncn-w002
  cray-console-operator-575d8b9f9d-s95v9   2/2   Running       0   30m     ncn-w002
  cray-console-operator-575d8b9f9d-x2bvm   2/2   Terminating   0   16h     ncn-w001
  ```

  It will take a few minutes for the new pod to resume console interactions.
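  Rather than polling, the command below blocks until the new pod reports `Ready`; a minimal sketch, assuming the replacement pod keeps the name `cray-console-node-0` from the example (pods recreated from the same controller typically reuse the name):

  ```bash
  # Wait up to 5 minutes for the recreated pod to become Ready on its new node.
  kubectl -n services wait pod/cray-console-node-0 --for=condition=Ready --timeout=300s
  ```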