Merge branch 'gpu-op-upgrade' into 'master'
[gpu-operator driver upgrade] minor doc improvements

See merge request nvidia/cloud-native/cnt-docs!324
tariq1890 committed Sep 22, 2023
2 parents 3415cc0 + d44c7f6 commit 868e4d9
20 changes: 10 additions & 10 deletions gpu-operator/gpu-driver-upgrades.rst
@@ -35,7 +35,7 @@ GPU Driver Upgrades
About Upgrading the GPU Driver
******************************

The NVIDIA driver daemon set requires special consideration for upgrades because the driver kernel modules must be unloaded and loaded again on each driver container restart.
Consequently, the following steps must occur across a driver upgrade:

#. Disable all clients to the GPU driver.
@@ -105,7 +105,7 @@ You can set the following fields in the cluster policy to configure the upgrade
upgradePolicy:
  # autoUpgrade (default=true): Switch which enables / disables the driver upgrade controller.
  # If set to false all other options are ignored.
  autoUpgrade: true
  # maxParallelUpgrades (default=1): Number of nodes that can be upgraded in parallel. 0 means infinite.
  maxParallelUpgrades: 1
@@ -116,7 +116,7 @@ You can set the following fields in the cluster policy to configure the upgrade
  # up. By default, a fixed value of 25% is used.
  maxUnavailable: 25%
  # waitForCompletion: Options for the 'wait-for-completion' state, which will wait for a user-defined group of pods
  # to complete before upgrading the driver on a node.
  waitForCompletion:
    # timeoutSeconds (default=0): The length of time to wait before giving up. 0 means infinite.
    timeoutSeconds: 0
@@ -126,26 +126,26 @@ You can set the following fields in the cluster policy to configure the upgrade
  # gpuPodDeletion: Options for the 'pod-deletion' state, which will evict all pods on the node allocated a GPU.
  gpuPodDeletion:
    # force (default=false): Delete pods even if they are not managed by a controller (e.g. ReplicationController, ReplicaSet,
    # Job, DaemonSet or StatefulSet).
    force: false
    # timeoutSeconds (default=300): The length of time to wait before giving up. 0 means infinite. When the timeout is met,
    # the GPU pod(s) will be forcefully deleted.
    timeoutSeconds: 300
    # deleteEmptyDir (default=false): Delete pods even if they are using emptyDir volumes (local data will be deleted).
    deleteEmptyDir: false
  # drain: Options for the 'drain' state, which will drain the node (i.e. 'kubectl drain'). This is only performed if
  # enabled and the 'pod-deletion' state cannot successfully remove all pods using GPU.
  drain:
    # enable (default=false): Switch for allowing node drain during the upgrade process
    enable: false
    # force (default=false): Delete pods even if they are not managed by a controller (e.g. ReplicationController, ReplicaSet,
    # Job, DaemonSet or StatefulSet).
    force: false
    # podSelector (default=""): The label selector to filter pods on the node. "" will drain all pods.
    podSelector: ""
    # timeoutSeconds (default=300): The length of time to wait before giving up. 0 means infinite. When the timeout is met,
    # the GPU pod(s) will be forcefully deleted.
    timeoutSeconds: 300
    # deleteEmptyDir (default=false): Delete pods even if they are using emptyDir volumes (local data will be deleted).
    deleteEmptyDir: false
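
In a typical installation, the ``upgradePolicy`` stanza shown above is nested under ``spec.driver`` of the ClusterPolicy resource. The following is a minimal sketch of that placement, assuming the default instance name ``cluster-policy`` created by the GPU Operator Helm chart:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  # Default name used by the GPU Operator Helm chart; verify the name in your cluster.
  name: cluster-policy
spec:
  driver:
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
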
@@ -155,7 +155,7 @@ the ``maxUnavailable`` value applies an additional constraint on the value of
``maxParallelUpgrades`` to ensure that the number of parallel upgrades does not
cause more than the intended number of nodes to become unavailable during the upgrade.
For example, if you specify ``maxUnavailable=100%`` and ``maxParallelUpgrades=1``,
one node is upgraded at a time.
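
As a further, hypothetical illustration of how the two settings interact, consider a cluster with 12 GPU nodes:

upgradePolicy:
  autoUpgrade: true
  # Allow up to 8 nodes to upgrade in parallel ...
  maxParallelUpgrades: 8
  # ... but cap unavailable nodes at 25%: 25% of 12 nodes is 3 nodes,
  # so at most 3 upgrades actually run concurrently.
  maxUnavailable: 25%

In effect, the stricter of the two limits determines how many nodes are upgraded concurrently.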

The ``maxUnavailable`` value also applies to the currently unavailable nodes in the cluster.
If you cordoned nodes in the cluster and the ``maxUnavailable`` value is already met by the number of cordoned nodes,
@@ -173,7 +173,7 @@ The set of possible states is:
* ``upgrade-required``: NVIDIA driver pod is not up-to-date and requires an upgrade. No actions are performed at this stage.
* ``cordon-required``: Node will be marked Unschedulable in preparation for the driver upgrade.
* ``wait-for-jobs-required``: Node will wait on the completion of a group of pods/jobs before proceeding.
* ``pod-deletion-required``: Pods allocated with GPUs are deleted from the node. If pod deletion fails, the node state is set to ``drain-required``
if drain is enabled in ClusterPolicy.
* ``drain-required``: Node will be drained. This state is skipped if all GPU pods are successfully deleted from the node.
* ``pod-restart-required``: The NVIDIA driver pod running on the node will be restarted and upgraded to the new version.