Repository showcasing ML Ops practices with kubeflow and mlflow
- MLOps Workflow: Recognizing Digits with Kubeflow
- Deploy kubeflow into an AKS cluster using default settings
- kubeflow - Minimum system requirements
- Deployment of Azure Kubernetes Service (AKS) clusters
- kubeflow operator or mlflow helm chart installations in deployed AKS clusters
- CD workflow for on-demand AKS deployments and kubeflow operator or mlflow helm chart installations
- CD workflow for on-demand deployments of an Azure Storage Account Container (for storing terraform state files)
- CD workflow for on-demand Azure Container Registry deployments in order to store internal Docker images.
- CI workflow for building internal Docker images and uploading them to an Azure Container Registry
- CD workflows for internal helm chart installations in deployed AKS clusters
- Added devcontainer.json with necessary tooling for local development
- Python (pytorch or tensorflow) application for ML training and inference purposes and Jupyter notebooks
- Simple feedforward neural network with MNIST dataset to map input images to their corresponding digit classes
- CNN architecture training and inference considering COCO dataset for image classification AI applications (NOTE: Compute and storage intensive. Read the "Download the COCO dataset images" comments on preferred hardware specs)
- (OPTIONAL) Transformer architecture training considering pre-trained models for chatbot AI applications
- Dockerizing Python (pytorch or tensorflow) applications for ML training and inference
- Helm charts with K8s manifests for ML jobs considering the Training Operator for CRDs
- Installation of the Training Operator for CRDs and applying sample TFJob and PyTorchJob k8s manifests
- Demonstration of model training and model deployment through automation workflows
- (OPTIONAL) mlflow experiments for the machine learning lifecycle
NOTE: Steps 4 to 7 in the digits-recognizer-kubeflow GH repository are not showcased here. These sections focus on saving the ML model in MinIO once the model is successfully built and trained. Furthermore, the trained model is served through KServe's inference HTTP service. The relevant files are:
- digits_recognizer_notebook.ipynb for model development and training, which also covers uploading the trained model to MinIO
- create_kserve_inference.yaml for spinning up the KServe inference HTTP service
- kserve_python_test.ipynb for testing the Inference KServe HTTP service
- digits_recognizer_pipeline.ipynb to set up the ML pipeline
GitHub workflows are used in this GitHub repository. Once the workflows described in the "Preconditions" and "Deploy an AKS cluster and install the kubeflow or mlflow components" sections have run successfully, all listed resource groups should be visible in the Azure Portal UI:
- Deploy an Azure Storage Account Service, including a container for terraform backends, through the terraform.yml workflow with the INFRASTRUCTURE_OPERATIONS option storage-account-backend-deploy
- Deploy an AKS cluster through the terraform.yml workflow with the INFRASTRUCTURE_OPERATIONS option k8s-service-deploy. An Azure Container Registry will be part of the deployment in order to store internal Docker images
- Optional: Install ml-ops tools to an existing kubernetes cluster through the terraform.yml workflow with the INFRASTRUCTURE_OPERATIONS option ml-ops-tools-install
NOTE:
- Set all the required GitHub secrets for the workflows above
- To access the deployed AKS cluster locally, launch the devcontainer and retrieve the necessary kube config as displayed in the GitHub workflow step titled Download the ~/.kube/config
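Once the kube config has been copied into the devcontainer, a quick sanity check can confirm that the cluster is reachable. The following is a minimal sketch, not part of the repository: the function name check_cluster_access is hypothetical, and the default path ~/.kube/config is assumed from the workflow step title.

```shell
# Hypothetical helper: confirm that a kube config exists and that the cluster answers.
# The default path is an assumption; pass another path as the first argument if needed.
check_cluster_access() {
  kubeconfig_path="${1:-$HOME/.kube/config}"
  if [ ! -f "$kubeconfig_path" ]; then
    echo "kube config not found at $kubeconfig_path" >&2
    return 1
  fi
  # List the cluster nodes using the given kube config
  kubectl --kubeconfig "$kubeconfig_path" get nodes
}
```

Calling check_cluster_access without arguments should list the AKS nodes if the retrieved kube config is valid.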
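To access the kubeflow dashboard after kustomize and the kubeflow components have been installed, execute the following commands. The first two show the general pattern (list the pods, then forward a port to one of them); the third forwards local port 8080 to the Istio ingress gateway that serves the dashboard: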
kubectl get pods -A
kubectl port-forward -n <namespace> <pod-name> <local-port>:<server-port>
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
and visit localhost:8080 in a browser of your choice.
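The port-forward only succeeds once the gateway is running, so it can help to wait for it first. The sketch below is one possible way to do that. The function name open_kubeflow_dashboard is hypothetical, and it assumes the gateway runs as a deployment named istio-ingressgateway in the istio-system namespace, matching the service used above.

```shell
# Hypothetical helper: wait for the Istio ingress gateway, then forward a local port to it.
# Assumes a deployment named istio-ingressgateway exists in the istio-system namespace.
open_kubeflow_dashboard() {
  local_port="${1:-8080}"
  # Block until the gateway deployment reports Available (or the timeout expires)
  kubectl wait --for=condition=Available deployment/istio-ingressgateway \
    -n istio-system --timeout=300s || return 1
  # Forward the chosen local port to port 80 of the gateway service
  kubectl port-forward svc/istio-ingressgateway -n istio-system "${local_port}:80"
}
```

For example, open_kubeflow_dashboard 9090 would expose the dashboard on localhost:9090 instead of the default 8080.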
When creating the Jupyter notebook instance consider the following data volume:
The volumes that were created appear as follows:
The Jupyter instance that was created appears as follows:
NOTE: You can check the status of the Jupyter instance pods:
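A minimal sketch of such a status check is shown here. The function name list_notebook_pods is hypothetical, and the default namespace kubeflow-user-example-com is an assumption (the example profile of a stock kubeflow installation); use the namespace in which the notebook was created.

```shell
# Hypothetical helper: list the pods in the namespace that hosts the Jupyter instance.
# The default namespace is an assumption; pass your own profile namespace as the first argument.
list_notebook_pods() {
  namespace="${1:-kubeflow-user-example-com}"
  # "-o wide" adds node and IP columns to the STATUS and RESTARTS overview
  kubectl get pods -n "$namespace" -o wide
}
```

The notebook is ready to connect to once its pod shows the status Running.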
Once CONNECTED to a Jupyter instance, clone this Git repository (HTTPS URL: https://github.com/MGTheTrain/ml-ops-ftw.git):
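From a terminal inside the Jupyter instance, the clone could look like the sketch below. The function name clone_repo and the default target directory ml-ops-ftw are hypothetical; only the repository URL comes from this README.

```shell
# Hypothetical helper: clone the repository unless the target directory already exists.
clone_repo() {
  repo_url="https://github.com/MGTheTrain/ml-ops-ftw.git"
  target_dir="${1:-ml-ops-ftw}"
  if [ -d "$target_dir" ]; then
    echo "$target_dir already exists, skipping clone"
    return 0
  fi
  git clone "$repo_url" "$target_dir"
}
```

Skipping an existing directory keeps the helper safe to re-run after the notebook server restarts.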
You then should have the repository cloned in your workspace:
Execute a Jupyter notebook to either train the model or perform inference (P.S. It's preferable to begin with the mnist-trainnig.ipynb. Others are either resource intensive or not yet implemented):
After successful installation of the Kubeflow Training Operator, apply some sample k8s ML training jobs, e.g. for PyTorch and for Tensorflow.
# Pytorch
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
# Tensorflow
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/tensorflow/simple.yaml
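After the sample manifests have been applied, the resulting jobs and their worker pods can be inspected. This is a minimal sketch: the function name show_training_jobs is hypothetical, and the default namespace kubeflow is an assumption, so use whichever namespace the applied manifests declare.

```shell
# Hypothetical helper: show the PyTorchJob and TFJob resources and the pods they created.
# The "kubeflow" default namespace is an assumption; check the manifests you applied.
show_training_jobs() {
  namespace="${1:-kubeflow}"
  # List both custom resource kinds provided by the Training Operator CRDs
  kubectl get pytorchjobs,tfjobs -n "$namespace"
  # List the pods, which include the master and worker pods of each job
  kubectl get pods -n "$namespace"
}
```

The logs of an individual worker can then be read with kubectl logs, using a pod name taken from the second listing.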
To access the MLflow dashboard following the installation of the MLflow Helm chart, execute the following command:
kubectl port-forward -n ml-ops-ftw <mlflow pod name> 5000:5000
and visit localhost:5000 in a browser of your choice.
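The pod name needed for the port-forward can be looked up rather than copied by hand. The sketch below assumes that the MLflow pod name contains the string mlflow, which depends on the Helm release name; the function name find_mlflow_pod is hypothetical.

```shell
# Hypothetical helper: print the first pod in the namespace whose name contains "mlflow".
# Assumes the Helm release produces pod names containing "mlflow".
find_mlflow_pod() {
  namespace="${1:-ml-ops-ftw}"
  # "-o name" prints one "pod/<name>" entry per line; strip the "pod/" prefix at the end
  kubectl get pods -n "$namespace" -o name | grep mlflow | head -n 1 | sed 's|^pod/||'
}
```

With this helper the port-forward becomes kubectl port-forward -n ml-ops-ftw "$(find_mlflow_pod)" 5000:5000.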
- Optional: Uninstall only the ML tools from an existing kubernetes cluster through the terraform.yml workflow with the INFRASTRUCTURE_OPERATIONS option ml-ops-tools-uninstall
- Destroy an AKS cluster through the terraform.yml workflow with the INFRASTRUCTURE_OPERATIONS option k8s-service-destroy