MLBatch is an evolution of the CodeFlare stack for managing AI/ML workloads on Kubernetes and of its workload dispatcher MCAD.
Like MCAD, MLBatch is designed to queue workloads and admit them for execution over time, accounting for quotas, priorities, and precedence. MLBatch relies on AppWrappers to bundle together all the components of a workload, such as pods, PyTorch jobs, Ray jobs, config maps, secrets, etc. AppWrappers in MLBatch offer improved mechanisms to automatically detect and retry failed workloads. MLBatch includes a backward-compatible pytorch-generator Helm template to facilitate the specification of PyTorch jobs.
In this document, we review the key innovations introduced by MLBatch and the differences from the earlier setup built around MCAD.
MLBatch replaces MCAD with Kueue to queue and admit jobs. Kueue introduces a new quota management system based on cluster queues. This quota system provides more flexibility to allocate compute resources (CPU, memory, and GPU quotas) than resource quotas in core Kubernetes. This system allows the borrowing of unused quota between cluster queues (see Priorities and Preemption below). Borrowing enables high overall cluster resource utilization while still ensuring that every team always has the ability to run jobs up to their allocated quotas. Kueue also enables teams to use priorities to order jobs within their own cluster queue without those priorities impacting the scheduling of jobs by other cluster queues.
Unlike MCAD, Kueue only considers quotas when admitting workloads. As a result, MLBatch must ensure that all resource-consuming workloads in user namespaces are managed by Kueue. This is accomplished by strictly limiting the Kinds of non-AppWrapper resources users are permitted to create.
For various reasons, workloads are not directly submitted to cluster queues but rather to namespaced local queues that feed into the cluster queues. By convention in MLBatch, each team is assigned a namespace and a cluster queue dedicated to the team. For example, the platform team is assigned to namespace `platform` and its associated cluster queue is named `platform-cluster-queue`. The local queue name in each namespace in MLBatch is always `default-queue`. Hence, the `default-queue` in namespace `platform` feeds into the `platform-cluster-queue`. In short, all workloads must be submitted to the local queue named `default-queue`, but to review quota allocation and usage, one has to query the cluster queues.
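For reference, the wiring between the namespaced local queue and its cluster queue is an ordinary Kueue `LocalQueue` object. The sketch below shows what this object could look like for the platform team; it is illustrative only, as these objects are typically created by cluster admins when onboarding a team, not by users:
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue                    # the one local queue every MLBatch namespace exposes
  namespace: platform                    # the team's namespace
spec:
  clusterQueue: platform-cluster-queue   # the team's dedicated cluster queue
```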
MLBatch offers a simple cluster-checker tool to get a bird’s-eye view of quotas on a cluster from a GPU perspective:
```sh
node checker.js
```
```
CLUSTER QUEUE            GPU QUOTA   GPU USAGE   ADMITTED WORKLOADS   PENDING WORKLOADS
code-cluster-queue       8           16          1                    0
platform-cluster-queue   8           4           4                    0

Total GPU count in cluster:        24
Unschedulable GPU count:        -   0
Schedulable GPU count:          =  24

Nominal GPU quota:                 16
Slack GPU quota:                +   8
Total GPU quota:                =  24

GPU usage by admitted workloads:   20
Borrowed GPU count:                 8
```
The tool lists the cluster queues defined on the cluster showing the GPU quota for each one as well as the number of GPUs in use by admitted workloads. The GPU usage may exceed the GPU quota for the cluster queue if this cluster queue is borrowing idle capacity.
The tool also reports the total GPU capacity, distinguishing healthy (i.e., schedulable, available for use) and unhealthy (i.e., unschedulable, unavailable) GPUs. The nominal GPU quota represents the cumulative GPU quota across all the teams. MLBatch recommends that cluster admins keep the nominal quota below the cluster capacity to avoid oversubscribing the GPUs. Typically, a small number of GPUs is not allocated to any team but retained as a slack quota that any team may borrow from. MLBatch automatically adjusts the slack quota to keep the total GPU quota (nominal plus slack) equal to the schedulable GPU count, unless the slack quota would have to become negative, in which case a cluster admin should decide how to reduce the nominal quota.
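To make the quota mechanics more concrete, the sketch below shows the general shape of a Kueue `ClusterQueue` with a nominal GPU quota. The cohort name, flavor name, preemption settings, and numbers are illustrative assumptions, not the exact objects MLBatch creates:
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: platform-cluster-queue
spec:
  namespaceSelector: {}          # namespaces allowed to submit to this cluster queue
  cohort: default-cohort         # cluster queues in the same cohort may borrow unused quota from each other
  preemption:
    reclaimWithinCohort: Any           # reclaim quota borrowed by other cluster queues when needed
    withinClusterQueue: LowerPriority  # preempt lower-priority workloads of this cluster queue
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 512Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8            # the GPU quota reported by the cluster-checker tool
```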
For more details about the cluster queues, run:
```sh
kubectl describe clusterqueues
```
MLBatch recommends submitting every workload as an AppWrapper. AppWrappers offer a number of checks, guarantees, and benefits over submitting unwrapped resources such as PyTorchJobs. In particular, the AppWrapper controller automatically injects:
- labels holding the name and id of the user submitting the AppWrapper,
- the `queueName` label required to queue the workload in the `default-queue`, and
- the `schedulerName` specification required to enable gang scheduling and packing on the GPU dimension to mitigate node fragmentation.
Moreover, the AppWrapper controller consistently handles cleanup and retries across all types of workloads:
- The resources, especially the GPUs, utilized by a failed workload are returned to the cluster in a timely manner, i.e., within minutes by default, with a configurable grace period to permit post-mortem debugging. Cluster admins can enforce an upper bound on this grace period to bound resource wastage.
- The Kubernetes objects associated with a completed workload, in particular the pods and their logs, are eventually disposed of, by default after a week.
- Failed workloads are automatically retried up to a configurable number of attempts.
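The retry limit and grace periods can be tuned per AppWrapper, within the bounds configured by cluster admins. As a sketch only, assuming the AppWrapper controller's annotation-based overrides (the annotation keys below are assumptions; verify them against the AppWrapper documentation for your deployment):
```yaml
metadata:
  name: wrapped-pod
  annotations:
    # assumed annotation keys; check the AppWrapper controller documentation
    workload.codeflare.dev.appwrapper/retryLimit: "2"
    workload.codeflare.dev.appwrapper/failureGracePeriodDuration: "1m"
```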
The AppWrapper specification has been greatly simplified for MLBatch. In most cases, an AppWrapper yaml adds a simple prefix to a workload yaml, for instance for a pod:
```yaml
# appwrapper prefix
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: wrapped-pod
spec:
  components:
  - template:
      # indented pod specification
      apiVersion: v1
      kind: Pod
      metadata:
        name: sample-pod
      spec:
        restartPolicy: Never
        containers:
        - name: busybox
          image: quay.io/project-codeflare/busybox:1.36
          command: ["sh", "-c", "sleep 5"]
          resources:
            requests:
              cpu: 1
```
To submit this workload to the cluster, save this yaml to `wrapped-pod.yaml` and run:
```sh
kubectl apply -f wrapped-pod.yaml
```
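Once applied, the AppWrapper and the wrapped pod can be inspected with the usual commands, for example:
```sh
kubectl get appwrapper wrapped-pod   # check the AppWrapper status
kubectl logs sample-pod              # view the logs of the wrapped pod once it is running
```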
MLBatch includes an appwrapper-packager tool to automate the addition of this prefix as well as the indentation of the workload specification. In addition, MLBatch includes a new implementation of the pytorch-generator tool to facilitate the configuration of PyTorch jobs, including the addition of the AppWrapper prefix.
As a result of the AppWrapper simplification for MLBatch, AppWrappers, which are now at version `v1beta2`, are not backward compatible with MCAD's `v1beta1` AppWrappers. The companion pytorch-generator tool for MCAD is not compatible with MLBatch. However, the pytorch-generator tool included in MLBatch is backward compatible with the input format of the legacy tool. In other words, simply rerun `helm template` on the input `value.yaml` files to generate proper `v1beta2` AppWrappers. Please note that existing fault-tolerance-related settings from these input files will be ignored and defaults will be used instead. Please refer to the tool documentation for how to override settings such as maximum retry counts.
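For example, assuming a local checkout of MLBatch with the pytorch-generator chart at `tools/pytorch-generator` (the chart path is an assumption; adjust it to your checkout), regenerating and submitting an AppWrapper from a legacy `value.yaml` looks like:
```sh
helm template -f value.yaml tools/pytorch-generator > wrapped-job.yaml
kubectl apply -f wrapped-job.yaml
```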
The list of all AppWrappers in a namespace is obtained by running:
```sh
kubectl get appwrappers
```
```
NAME          STATUS      QUOTA RESERVED   RESOURCES DEPLOYED   UNHEALTHY
wrapped-pod   Succeeded   False            True                 False
```
The status of an AppWrapper is one of:
- Suspended: the AppWrapper is queued,
- Resuming: the AppWrapper is transitioning to Running,
- Running: the AppWrapper is running,
- Succeeded: the execution completed successfully,
- Failed: the execution failed and will not be retried,
- Resetting: a failure has been detected during the current execution and the AppWrapper is preparing to retry,
- Suspending: the AppWrapper has been evicted by Kueue and is transitioning back to Suspended.
```mermaid
---
title: AppWrapper Lifecycle
---
stateDiagram-v2
    f  : Failed
    sp : Suspended
    ad : Admitted
    s  : Succeeded
    su : Suspending

    state ad {
        [*] --> rs
        rs --> rn
        rn --> rt
        rt --> rs

        rs : Resuming
        rn : Running
        rt : Resetting
    }

    [*] --> sp
    sp --> ad
    rn --> s
    ad --> su
    su --> sp
    ad --> f

    classDef admitted fill:lightblue
    class rs admitted
    class rn admitted
    class rt admitted

    classDef failed fill:pink
    class f failed

    classDef succeeded fill:lightgreen
    class s succeeded
```
In this diagram, the outer loop consisting of the `Suspended`, `Admitted`, and `Suspending` states is managed by Kueue, while the inner loop consisting of the `Resuming`, `Running`, and `Resetting` states is managed by the AppWrapper controller. In particular, the AppWrapper controller handles workload retries without releasing and reacquiring Kueue quotas, hence without moving retried workloads to the back of the cluster queue.
In addition, this AppWrapper table also reports:
- quota reserved: whether Kueue has reserved the quota requested by the AppWrapper,
- resources deployed: whether the resources wrapped by the AppWrapper, such as the `sample-pod` in this example, have been created on the cluster, and
- unhealthy: whether a failure has been detected during the current execution of the AppWrapper.
For example, a `Running` AppWrapper has both quota reserved and resources deployed. A `Succeeded` AppWrapper will no longer reserve quota, but the wrapped resources, such as terminated pods, will be preserved on the cluster for a period of time as discussed above to permit log collection. A `Failed` AppWrapper will transiently continue to reserve quota until the wrapped resources have been undeployed, so as to avoid oversubscribing GPUs during the cleanup of failed jobs.
More details about an AppWrapper condition may be obtained by describing the AppWrapper:
```sh
kubectl describe appwrapper wrapped-pod
```
Kueue creates and maintains a companion `Workload` object for each workload it manages. Further details about the AppWrapper condition, such as Kueue's rationale for evicting the workload, may be obtained by accessing this companion object:
```sh
kubectl get workloads
```
```
NAME                           QUEUE           RESERVED IN           ADMITTED   AGE
appwrapper-wrapped-pod-81d3e   default-queue   team1-cluster-queue   True       161m
```
```sh
kubectl describe workload appwrapper-wrapped-pod-81d3e
```
Workload objects are automatically deleted by Kueue when the workload itself, i.e., the AppWrapper, is deleted.
MLBatch supports the `high-priority`, `default-priority`, and `low-priority` priority classes. If you are using the pytorch-generator tool, you can override the default `default-priority` of a workload by setting the `priority` variable. If you are generating your yaml by other means, simply add a `priorityClassName` to the specification of the wrapped pod templates, for example:
```yaml
# appwrapper prefix
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: wrapped-pod
spec:
  components:
  - template:
      # indented pod specification
      apiVersion: v1
      kind: Pod
      metadata:
        name: sample-pod
      spec:
        priorityClassName: high-priority # workload priority
        restartPolicy: Never
        containers:
        - name: busybox
          image: quay.io/project-codeflare/busybox:1.36
          command: ["sh", "-c", "sleep 5"]
          resources:
            requests:
              cpu: 1
```
Workloads of equal priority are considered for admission by their cluster queue in submission order. Higher-priority workloads are considered for admission before lower-priority workloads irrespective of their submission time. However, workloads that cannot be admitted will not block the admission of newer and/or lower-priority workloads (if they fit within the nominal quota of the cluster queue).
To reduce workload churn, Kueue forbids workloads from simultaneously utilizing both preemption and borrowing to acquire the quota necessary to be admitted. Therefore, a workload that by itself exceeds the nominal quota of its cluster queue will never trigger preemption. Similarly, if the combined resources of (a) a pending workload and (b) all already admitted workloads with priority equal to or higher than the pending workload exceed the nominal quota of their cluster queue, Kueue will not preempt already admitted lower-priority workloads of that cluster queue to admit the pending workload. For example, with a nominal quota of 8 GPUs and 6 GPUs already in use by workloads of equal or higher priority, a pending 4-GPU workload will not preempt lower-priority workloads of the same cluster queue, since 6 + 4 exceeds the 8-GPU quota; it must instead borrow idle quota from other cluster queues or wait for quota to free up.
When a workload is pending on a cluster queue and admitting that workload would still leave the cluster queue at or below its nominal quota, Kueue may preempt one or more currently admitted workloads of other cluster queues to reclaim the necessary borrowed quota. When such preemption is necessary, the decision of which workload(s) to preempt is based solely on considering the currently admitted workloads of just those cluster queues that are exceeding their nominal quota. Workloads admitted by cluster queues that are currently at or below their nominal quota will not be preempted.
MLBatch allows users to directly create the following Kinds of compute resources:
- AppWrapper
- PyTorchJob (allowed, but wrapping it in an AppWrapper is recommended)
- RayJob (allowed, but wrapping it in an AppWrapper is recommended)
- RayCluster (allowed, but wrapping it in an AppWrapper is recommended)
MLBatch also allows users to directly create the following Kinds of non-compute resources:
- Service
- Secret
- ConfigMap
- PersistentVolumeClaim
- PodGroup (allowed, but wrapping it in an AppWrapper is recommended)
MLBatch allows users to wrap one or more resources of the following Kinds inside an AppWrapper:
- PyTorchJob
- RayJob
- RayCluster
- Deployment
- StatefulSet
- Pod
- Job
- ServiceAccount
- Service
- Secret
- ConfigMap
- PersistentVolumeClaim
- PodGroup