The big cloud providers offer a vast range of services and a lot of breadth in the MLOps space. These offerings can feel like a bundle of disconnected services until you see the approach that holds them together. At a high level the cloud providers' offerings can appear very similar, but in detail each provider has a different set of services and a different approach to stitching them together.
Google's approach with Vertex is differentiated by the way it is structured, using Vertex Pipelines as an orchestrator (Vertex Pipelines being a managed and integrated version of open source Kubeflow Pipelines). AWS SageMaker is differentiated primarily by the range of its services and how they relate to the rest of AWS. Microsoft's strategy with Azure centers on developer experience and quality of integrations.
Vertex is Google's newly-unified AI platform. The main ways in which it is unified are:
- Everything falls logically under Vertex headings in the Google Cloud console, and the APIs to the services are meant to be consistent. (Previously AutoML was visibly a separate function.)
- Pipelines can be used as an orchestrator for most of the workflow (including AutoML).
This idea of pipelines as an orchestrator across offerings is illustrated here (from TechCrunch):
This could be confusing to those familiar with Kubeflow Pipelines (which is what Vertex Pipelines are under the hood), as Kubeflow Pipelines started out as a distributed training system, with each step executing in a separate container, along with a UI to inspect runs and ways to resume from a failed step. Pipelines are still usable for distributed training, but they can also be used to perform tasks beyond training, as illustrated in the screenshot below:
Here there is a conditional deployment decision that determines whether the model is good enough to deploy. If it passes the test, the model is deployed from the pipeline.
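As a rough illustration of the pattern, here's a minimal sketch of a pipeline with a conditional deployment step, written with the open source KFP SDK that Vertex Pipelines runs (assuming a v2-style kfp release); the component bodies, pipeline name and 0.9 threshold are placeholders rather than Google's own example:
from kfp import compiler, dsl

@dsl.component
def train_and_evaluate() -> float:
    # Train a model, persist the artifact somewhere, and return an evaluation metric.
    accuracy = 0.93  # placeholder value
    return accuracy

@dsl.component
def deploy_model():
    # This is where you would call the Vertex AI SDK to upload and deploy the model.
    pass

@dsl.pipeline(name="conditional-deploy-demo")
def pipeline():
    train_task = train_and_evaluate()
    # Deploy only if the evaluation metric clears the threshold.
    with dsl.Condition(train_task.output >= 0.9):
        deploy_model()

# Compile to a job spec that can be submitted to Vertex Pipelines.
compiler.Compiler().compile(pipeline, package_path="pipeline.json")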
The AutoML offerings are now more consistent with the other parts of Google's AI stack. Essentially, different ways of inputting your data can lead to the same training path, and different training paths can lead to the same deployment path. Here's a diagram from Henry Tappen and Brian Kobashikawa (via Lak Lakshmanan):
There are still some differences, though, as can be seen in how datasets are handled. The dataset concept is split into managed datasets (which have specific metadata and live on specific Google data products) and unmanaged ones. Managed datasets are mostly just for AutoML for now.
We could think of Google as having three basic routes to training models - AutoML, pipelines (intended more as an orchestrator) and custom training jobs. Where pipelines use multiple containers, with each step in a different container, a custom training job is a single-container training function. Custom training jobs have automatic integration with TensorBoard for results. They can do distributed training via the underlying ML framework, provided the framework supports it. They support GPUs and you can watch/inspect runs, though inspecting runs looks a bit basic compared to pipelines.
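For a feel of the custom training route, here's a minimal sketch using the google-cloud-aiplatform SDK; the project, bucket, image and script names are illustrative placeholders:
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-staging-bucket")

# A single-container training job built from a local script and a pre-built training image.
job = aiplatform.CustomTrainingJob(
    display_name="my-custom-job",
    script_path="train.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-8:latest",
    requirements=["pandas"],
)

# Runs train.py on managed infrastructure; machine type, replicas and GPUs are configurable.
job.run(replica_count=1, machine_type="n1-standard-4")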
There's a facility native to custom training jobs for tuning hyperparameters. In addition to this, Google has launched Vizier. Rather than being native to training jobs, Vizier has its own API: you tell it what you've tried and it suggests what to try next. This may be less integrated, but Vizier can go deeper in what it is able to tune.
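As a sketch of the facility native to training jobs (not Vizier's suggest-and-report API), a hyperparameter tuning job can wrap a custom job like this; the metric name, parameter range and container image are placeholders, and the training code is assumed to report the metric:
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-staging-bucket")

# The base job each trial runs, with a different learning_rate passed in each time.
base_job = aiplatform.CustomJob(
    display_name="hp-base-job",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
    }],
)

tuning_job = aiplatform.HyperparameterTuningJob(
    display_name="hp-tuning",
    custom_job=base_job,
    metric_spec={"accuracy": "maximize"},  # the metric the training code reports
    parameter_spec={
        "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
    },
    max_trial_count=20,
    parallel_trial_count=4,
)
tuning_job.run()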
To deploy your models and get predictions, you have two options. If your model is built using a natively supported framework, you can tell Google to load your model (a serialized artifact) into a pre-built container image. If your model's framework is not supported, or you have custom logic, you can supply a custom image that meets the specification. Google then creates an endpoint for you to call to get predictions. You can set this up either through the API or using the web console wizard.
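A minimal sketch of the pre-built image route with the google-cloud-aiplatform SDK might look like this; the artifact location, serving image and request payload are placeholders:
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Register the serialized model artifact against a pre-built serving image.
model = aiplatform.Model.upload(
    display_name="my-model",
    artifact_uri="gs://my-bucket/model/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest",
)

# Creates an endpoint and deploys the model behind it.
endpoint = model.deploy(machine_type="n1-standard-2")

# Online predictions against the endpoint.
print(endpoint.predict(instances=[[1.0, 2.0, 3.0]]))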
Running models can be monitored for training/serving skew. For skew, you supply your training set when you create the monitoring job. The monitoring job stores the data going into your model in a storage bucket and uses it for comparison against the training data. You can get alerts based on configurable thresholds for individual features. Drift monitoring is similar but doesn't require training data (it monitors change over time instead). Feature distributions are shown in the console for any alerts.
SageMaker aims to be both comprehensive and integrated. It has services addressing all parts of the ML lifecycle and a variety of ways to interact with them, including its own dedicated web-based IDE (SageMaker Studio, based on JupyterLab).
The services marked as 'New' in the above were mostly announced at re:Invent in December 2020.
SageMaker Ground Truth is a labelling service, similar to Google's but with more automation features and less reliance on humans. SageMaker Ground Truth's automation makes it competitive with specialist labelling tools (a whole area in itself).
Data Wrangler allows data scientists to visualize, transform and analyze data from supported data sources from within SageMaker Studio:
Data Wrangler is also integrated with Clarify (which handles explainability), to highlight bias in data. This streamlines feature engineering and the resulting features can go directly to SageMaker Feature Store. Custom code can be added and SageMaker also separately has support for Spark processing jobs.
Once features are in the Feature Store, they are available to be searched for and used by other teams. They can also be used at the serving stage as well as the training stage.
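As a rough sketch of how features get into the Feature Store with the SageMaker Python SDK (the column names, bucket and tiny DataFrame are placeholders):
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# A toy feature set; a record identifier and an event time column are required.
df = pd.DataFrame({
    "customer_id": [1, 2],
    "spend_30d": [120.5, 40.0],
    "event_time": [1700000000.0, 1700000000.0],
})

fg = FeatureGroup(name="customers", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)  # infer feature types from the DataFrame
fg.create(
    s3_uri="s3://my-bucket/feature-store",
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,
)

# Creation is asynchronous; wait for the feature group before ingesting records.
while fg.describe().get("FeatureGroupStatus") == "Creating":
    time.sleep(5)
fg.ingest(data_frame=df, max_workers=1, wait=True)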
The 'Build' heading is for offerings that save time throughout the whole process. Autopilot is SageMaker's AutoML service, covering automated feature engineering, model building and selection. The various models it builds are all visible, so you can evaluate them and choose which to deploy.
JumpStart is a set of CloudFormation templates for common ML use cases.
Training with SageMaker is typically done from the Python SDK, which is used to invoke training jobs. A training job runs inside a container on an EC2 instance. You can use a pre-built Docker image if your training job is for a natively supported algorithm and framework; otherwise you can use your own Docker image that conforms to the requirements. The SDK can also be used for distributed training jobs, for frameworks that support distributed training.
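A minimal sketch of that flow with a built-in framework container (scikit-learn here); the role, S3 path and entry point script are placeholders:
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

role = sagemaker.get_execution_role()  # assumes you're running inside SageMaker

estimator = SKLearn(
    entry_point="train.py",        # your training script
    framework_version="0.23-1",
    instance_type="ml.m5.large",
    instance_count=1,
    role=role,
)

# Starts a training job in a managed container; data is passed as S3 channels.
estimator.fit({"train": "s3://my-bucket/path/to/train/"})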
Processing and training steps can be chained together in Pipelines. The resulting models can be registered in the model registry and you can track lineage on artifacts from steps. You can also view and execute pipelines from the SageMaker Studio UI.
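Continuing the estimator sketch above, here's a hedged sketch of chaining a processing step and a training step with SageMaker Pipelines; the processing image, scripts and S3 paths are placeholders:
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingOutput, ScriptProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

processor = ScriptProcessor(
    image_uri="my-processing-image",  # placeholder preprocessing container
    command=["python3"],
    role=role,
    instance_type="ml.m5.large",
    instance_count=1,
)

process_step = ProcessingStep(
    name="Preprocess",
    processor=processor,
    code="preprocess.py",
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
)

train_step = TrainingStep(
    name="Train",
    estimator=estimator,  # the estimator from the earlier sketch
    inputs={"train": TrainingInput(
        s3_data=process_step.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
    )},
)

pipeline = Pipeline(name="demo-pipeline", steps=[process_step, train_step])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
pipeline.start()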
The SageMaker SDK has a 'deploy' operation for which you specify what type of instance you want your model deployed to. As with training, this can use either a custom image or a built-in one. The expectation is that training and deployment will both happen within SageMaker, but this can be worked around if you want to deploy a model trained outside of SageMaker. Serving real-time HTTP requests is the typical case, but you can also perform batch predictions and chain inference steps in inference pipelines.
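Continuing the training sketch, deployment and a real-time prediction can be as little as this (instance type and payload are placeholders):
# Deploys the trained model behind a real-time HTTPS endpoint.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)

# Invoke the endpoint.
print(predictor.predict([[5.1, 3.5, 1.4, 0.2]]))

# Tear the endpoint down when finished to stop paying for it.
predictor.delete_endpoint()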
Deployed models get some monitoring by default, with integration with CloudWatch for basic invocation metrics. You can also set up scheduled monitoring jobs. SageMaker can be configured to capture request and response data and to perform various comparisons on that data, such as comparing against training data or triggering alerts based on constraints.
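A hedged sketch of that setup with the SageMaker Python SDK, reusing the role and predictor from the sketches above; the S3 URIs, schedule and baseline dataset are placeholders:
from sagemaker.model_monitor import (CronExpressionGenerator, DataCaptureConfig,
                                     DefaultModelMonitor)
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Pass this as data_capture_config to deploy() so requests/responses are logged to S3.
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/data-capture",
)

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Baseline statistics and constraints computed from the training data.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/baseline",
)

# Scheduled comparison of captured traffic against the baseline.
monitor.create_monitoring_schedule(
    monitor_schedule_name="hourly-monitoring",
    endpoint_input=predictor.endpoint_name,
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)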
The Azure Machine Learning offering is consciously pitched at multiple roles (especially data scientists and developers) and different skill levels. It aims to support team collaboration and to automate the key problems of MLOps across the whole ML lifecycle. This comes across in the prominence Azure gives to workspaces and git repos, and there's also increasing support for Azure ML in VS Code (along with the web-based GUI called Studio).
The cloud providers are all looking to leverage existing relationships in their MLOps offerings. For Microsoft this appears to be about developer relationships (with GitHub and VS Code) as well as its reputation as a compute provider. Microsoft seems keen on integrations with and references to open source tools, and integrations with the Databricks platform are prominent in the documentation.
With Azure Machine Learning everything belongs to a workspace by default and workspaces can be shared between users and teams. The assets under a workspace are shown in the studio web UI.
Let's walk through the key Azure ML concepts to get a feel for the platform.
Datasets are references to where data is stored. The data itself isn't in the workspace, but the dataset abstraction lets you work with the data through the workspace; only metadata is copied to the workspace. Datasets can be FileDatasets or TabularDatasets. The data can live on a range of supported storage types, including blob storage, databases or the Databricks file system.
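A minimal sketch of registering a TabularDataset that points at a CSV in the workspace's default blob datastore (the path and names are placeholders):
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Only metadata lands in the workspace; the data stays where it is.
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "data/customers.csv"))
dataset = dataset.register(workspace=ws, name="customers", create_new_version=True)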
An environment is a configuration with variables and library dependencies, used both for training and for serving models. It plays a similar role to pipenv but is instantiated through Docker under the hood.
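A minimal sketch of defining an Environment with some pip dependencies (the package choices are placeholders); under the hood Azure ML builds a Docker image for it:
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.environment import Environment

myenv = Environment(name="myenv")
myenv.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["scikit-learn", "pandas"],
)

# Optionally register it in the workspace so it can be reused for training and serving.
# myenv.register(workspace=ws)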
Experiments are groups of training runs. Each time we train a model with a set of parameters, that run falls under an experiment and can be automatically recorded. This allows us to review what was trained, when, and by whom. Here's a simple script to submit a training run with the Azure ML Python SDK:
from azureml.core import Experiment, ScriptRunConfig, Workspace
from azureml.core.environment import Environment
# Connect to the workspace (reads the workspace's config.json, present by default in Studio)
ws = Workspace.from_config()
# Name the experiment that the run will be grouped under
exp = Experiment(name="myexp", workspace=ws)
# Instantiate environment
myenv = Environment(name="myenv")
# Configure the ScriptRunConfig and specify the environment
src = ScriptRunConfig(source_directory=".", script="train.py", compute_target="local", environment=myenv)
# Submit run
run = exp.submit(src)
Here we're referring to another script called "train.py" that contains typical model training code, nothing Azure-specific. We name the experiment that will be used and also name the environment; both are instantiated automatically, and the submit operation runs the training job.
The above is run from the web studio, with the files already in the cloud. Training can also be run from a notebook, or locally by configuring the CLI and submitting a CLI command with a YAML specification that points to the code and specifies the environment image.
Training parameters and metrics can be automatically logged, as Azure integrates with MLflow's open source approach to tracking. If you submit a run from a directory under git, then git information is also tracked for the run.
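A minimal sketch of logging to the workspace's MLflow tracking endpoint (this assumes the azureml-mlflow package is installed; the metric names are placeholders):
import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment("myexp")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.93)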
Azure Machine Learning Pipelines are for training jobs that have multiple long-running steps. The steps can be chained to run in different containers, so that some can run in parallel, and if an individual step fails you can retry/resume from that step.
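A minimal sketch of a two-step pipeline where each step runs its own script (the compute cluster name and scripts are placeholders):
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

prep_step = PythonScriptStep(
    name="prep",
    script_name="prep.py",
    source_directory=".",
    compute_target="cpu-cluster",  # an existing compute cluster in the workspace
)
train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    source_directory=".",
    compute_target="cpu-cluster",
)
# Run the training step only after the prep step has finished.
train_step.run_after(prep_step)

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
Experiment(ws, "pipeline-demo").submit(pipeline)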
Models are either created in Azure through training runs or you can register models created elsewhere. A registered model can be deployed as an endpoint.
An endpoint sets up hosting so that you can make requests to your model in the cloud and get predictions. Endpoint hosting runs inside a container image - essentially an Environment, which could be the same image/Environment used for training. There are some prebuilt images available to use as a basis, or you can build an image from scratch that conforms to the requirements.
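A hedged sketch of registering a model and deploying it as a web service on Azure Container Instances, reusing an Environment like the one defined earlier; the model path and scoring script are placeholders:
from azureml.core import Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()

model = Model.register(workspace=ws, model_path="outputs/model.pkl", model_name="mymodel")

# score.py must implement init() and run() to load the model and answer requests.
inference_config = InferenceConfig(entry_script="score.py", environment=myenv)
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "my-endpoint", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)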
Azure ML's managed endpoints have traffic-splitting features for rollout and can work with GPUs. Inference can be real-time or batch. There's also integration to monitoring features. Managed endpoints and monitoring are both in Preview/Beta release at the time of writing.