This repository describes the setup and use of the MLBatch queuing and quota management system on OpenShift and Kubernetes clusters. MLBatch leverages Kueue, the Kubeflow Training Operator, KubeRay, and the Codeflare Operator from Red Hat OpenShift AI. MLBatch enables AppWrappers and adds Coscheduler. MLBatch includes a number of configuration steps to help these components work in harmony and support large workloads on large clusters.
MLBatch handles the queuing and dispatching of batch workloads on OpenShift and Kubernetes clusters. It enforces team quotas at the namespace level. It automates the borrowing and reclamation of unused quotas across teams. Teams can use priorities within their namespaces without impact on other teams. Using AppWrappers to submit workloads activates a number of fault detection and recovery capabilities, including automatically detecting failed pods and automatically retrying failed workloads. Coscheduler supports gang scheduling and minimizes fragmentation by preferentially packing jobs requiring less than a full node's worth of GPUs together.
To learn how to setup MLBatch on a cluster and onboard teams see SETUP.md.
Quota maintenance is a key aspect of smoothly administering an MLBatch cluster. Cluster admins should carefully read QUOTA_MAINTENANCE.md.
To learn how to run workloads on an MLBatch cluster see USAGE.md or CODEFLARE.md if you are already familiar with the CodeFlare stack for managing AI/ML workloads on Kubernetes.
Properly configuring a distributed PyTorchJob
to make effective use of the
MLBatch system and hardware accelerators (GPUs, RoCE GDR) can be tedious. To
automate this process, we provide a Helm chart that captures best practices and
common configuration options. Using this Helm chart helps eliminate common
mistakes. Please see pytorchjob-generator for
detailed usage instructions.
If you will be contributing to the development of the MLBatch project, you must setup precommit hooks for your local clone of the repository. Do the following once, immediately after cloning this repo:
helm plugin install https://github.com/helm-unittest/helm-unittest.git
pre-commit install
Copyright 2024 IBM Corporation.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.