This sample shows how to use DVC within the SageMaker environment. In particular, we look at how to build a custom image with the DVC libraries installed by default, providing a consistent environment to your data scientists. Furthermore, we show how you can integrate SageMaker Processing, SageMaker Training, and SageMaker Experiments with a DVC workflow.
For full details on how this works:
- Read the Blog post at: https://aws.amazon.com/blogs/machine-learning/track-your-ml-experiments-end-to-end-with-data-version-control-and-amazon-sagemaker-experiments/
To run this sample you need:
- An AWS account
- An IAM user with Admin-like permissions
If you do not have Admin-like permissions, we recommend having at least the following permissions:
- Administer Amazon ECR
- Administer a SageMaker Studio Domain
- Administer S3 (or at least any buckets with sagemaker in the bucket name)
- Create IAM Roles
- Create a Cloud9 environment
For the initial setup, we suggest using Cloud9 on a t3.large instance type.
This section explains how to create a custom image for Amazon SageMaker Studio that has DVC already installed. The advantage of creating an image and making it available to all SageMaker Studio users is that it provides a consistent environment for SageMaker Studio users, which they can also run locally.
This tutorial is heavily inspired by this example. Further information about custom images for SageMaker Studio can be found here.
This custom image sample demonstrates how to create a custom Conda environment in a Docker image and use it as a custom kernel in SageMaker Studio.
The Conda environment must have the appropriate kernel package installed, e.g., ipykernel for a Python kernel. This example creates a Conda environment called dvc with a few Python packages (see environment.yml) and the ipykernel package. SageMaker Studio automatically recognizes this Conda environment as a kernel named conda-env-dvc-py.
To get started, clone the repository in your Cloud9 environment and move to the custom image folder:
git clone https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo
cd ~/environment/amazon-sagemaker-experiments-dvc-demo/sagemaker-studio-dvc-image/
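# Resize the Cloud9 EBS volume (the argument is the target size in GB)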
./resize-cloud9.sh 20
Set some basic environment variables
sudo yum install jq -y
export REGION=$(curl -s 169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.region')
echo "export REGION=${REGION}" | tee -a ~/.bash_profile
export ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account')
echo "export ACCOUNT_ID=${ACCOUNT_ID}" | tee -a ~/.bash_profile
export IMAGE_NAME=conda-env-dvc-kernel
echo "export IMAGE_NAME=${IMAGE_NAME}" | tee -a ~/.bash_profile
Build the Docker image and push to Amazon ECR.
# Login to ECR
aws --region ${REGION} ecr get-login-password | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom
# Create the ECR repository
aws --region ${REGION} ecr create-repository --repository-name smstudio-custom
# Build the image - it might take a few minutes to complete this step
docker build . -t ${IMAGE_NAME} -t ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME}
# Push the image to ECR
docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME}
Step 1: Navigate to the cdk directory:
cd ~/environment/amazon-sagemaker-experiments-dvc-demo/sagemaker-studio-dvc-image/cdk
Step 2: Create a virtual environment:
python3 -m venv .cdk-venv
Step 3: Activate the virtual environment once it has been created:
source .cdk-venv/bin/activate
Step 4: Install the required dependencies:
pip3 install --upgrade pip
pip3 install -r requirements.txt
Step 5: Install and bootstrap CDK v2 (the latest CDK version tested was 2.27.0):
npm install -g aws-cdk@2.27.0 --force
cdk bootstrap
(Skip to Update an existing SageMaker Studio if you already have a SageMaker Studio domain.)
Step 6: Deploy the CDK stack (a stack named sagemakerStudioCDK, which you can verify in CloudFormation):
cdk deploy --require-approval never
CDK creates the following resources via CloudFormation:
- A new SageMaker Studio domain
- A SageMaker execution role, i.e., RoleForSagemakerStudioUsers, attached to the SageMaker Studio domain with the right permissions
- A SageMaker Image and a SageMaker Image Version from the Docker image conda-env-dvc-kernel we created earlier
- An AppImageConfig that specifies how the kernel gateway should be configured
- A SageMaker Studio user, i.e., data-scientist-dvc, with the correct SageMaker execution role and the custom SageMaker Studio image made available to it
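For orientation, here is a minimal, hypothetical sketch (CDK v2 in Python, not the actual stack in this repository) of how the image-related resources could be declared; the role ARN, ECR URI, and AppImageConfig name are placeholders:
from aws_cdk import Stack, aws_sagemaker as sagemaker
from constructs import Construct

class StudioImageSketch(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # SageMaker Image backed by the ECR repository we pushed to earlier
        image = sagemaker.CfnImage(
            self, "DvcImage",
            image_name="conda-env-dvc-kernel",
            image_role_arn="arn:aws:iam::123456789012:role/RoleForSagemakerStudioUsers",  # placeholder
        )
        # Image version pointing at the exact ECR tag built above
        image_version = sagemaker.CfnImageVersion(
            self, "DvcImageVersion",
            image_name="conda-env-dvc-kernel",
            base_image="123456789012.dkr.ecr.us-east-1.amazonaws.com/smstudio-custom:conda-env-dvc-kernel",  # placeholder
        )
        image_version.node.add_dependency(image)
        # AppImageConfig telling the kernel gateway which kernel to start
        sagemaker.CfnAppImageConfig(
            self, "DvcAppImageConfig",
            app_image_config_name="conda-env-dvc-kernel-config",  # hypothetical name
            kernel_gateway_image_config=sagemaker.CfnAppImageConfig.KernelGatewayImageConfigProperty(
                kernel_specs=[
                    sagemaker.CfnAppImageConfig.KernelSpecProperty(name="conda-env-dvc-py")
                ]
            ),
        )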
If you have an existing SageMaker Studio environment, you first need to retrieve the existing SageMaker Studio domain ID, deploy a "reduced" version of the CDK stack, and update the SageMaker Studio domain configuration.
Step 6: Set the DOMAIN_ID environment variable with your domain ID and save it to your bash_profile:
export DOMAIN_ID=$(aws sagemaker list-domains | jq -r '.Domains[0].DomainId')
echo "export DOMAIN_ID=${DOMAIN_ID}" | tee -a ~/.bash_profile
Step 7: Deploy the CDK stack (because the DOMAIN_ID environment variable is set, CDK deploys a stack named sagemakerUserCDK, which you can verify in CloudFormation):
cdk deploy --require-approval never
CDK creates the following resources via CloudFormation:
- A SageMaker execution role, i.e., RoleForSagemakerStudioUsers, attached to your existing SageMaker Studio domain with the right permissions
- A SageMaker Image and a SageMaker Image Version from the Docker image conda-env-dvc-kernel we created earlier
- An AppImageConfig that specifies how the kernel gateway should be configured
- A SageMaker Studio user, i.e., data-scientist-dvc, with the correct SageMaker execution role and the custom SageMaker Studio image made available to it
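If you want to confirm that the image and its AppImageConfig exist before attaching them to the domain, a quick boto3 check could look like the following sketch (the AppImageConfig name is an assumption and should match whatever name the stack used):
import boto3

sm = boto3.client("sagemaker")
# Confirm the SageMaker Image built from the ECR image is available
print(sm.describe_image(ImageName="conda-env-dvc-kernel")["ImageStatus"])
# Confirm the kernel gateway configuration exists (name is an assumption)
print(sm.describe_app_image_config(AppImageConfigName="conda-env-dvc-kernel-config")["AppImageConfigArn"])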
Step 8: Update the SageMaker Studio domain configuration:
# inject your DOMAIN_ID into the configuration file
sed -i 's/<your-sagemaker-studio-domain-id>/'"$DOMAIN_ID"'/' ../update-domain-input.json
# update the sagemaker studio domain
aws --region ${REGION} sagemaker update-domain --cli-input-json file://../update-domain-input.json
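The update-domain-input.json file essentially attaches the custom image to the domain's default user settings. As a rough illustration (a sketch, not the actual file; the AppImageConfig name is an assumption), the equivalent boto3 call would be:
import boto3

sm = boto3.client("sagemaker")
sm.update_domain(
    DomainId="d-xxxxxxxxxxxx",  # your SageMaker Studio domain ID
    DefaultUserSettings={
        "KernelGatewayAppSettings": {
            "CustomImages": [
                {
                    "ImageName": "conda-env-dvc-kernel",
                    "AppImageConfigName": "conda-env-dvc-kernel-config",  # assumed name
                }
            ]
        }
    },
)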
Open the newly created SageMaker Studio user, i.e., data-scientist-dvc. In the SageMaker Studio domain, launch Studio for the data-scientist-dvc user.
Open a terminal and clone the repository:
git clone https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo
Then open the dvc_sagemaker_script_mode.ipynb notebook. When prompted, make sure you select the custom image conda-env-dvc-kernel.
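As a quick sanity check (a sketch, assuming the dvc package exposes a __version__ attribute), you can run the following in a notebook cell on the conda-env-dvc-py kernel to confirm you are in the custom environment:
import sys
import dvc

print(sys.executable)   # should point inside the "dvc" Conda environment of the custom image
print(dvc.__version__)  # DVC comes pre-installed in the custom image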
We provide two sample notebooks that show how to use DVC in combination with SageMaker:
- dvc_sagemaker_script_mode.ipynb installs DVC in script mode by passing a requirements.txt file to both the processing job and the training job (see the sketch below);
- dvc_sagemaker_byoc.ipynb shows how to build your own container for the processing jobs, the training jobs, and inference.
Both notebooks are meant to be used within SageMaker Studio with the custom image created earlier.
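To illustrate the script-mode idea, here is a minimal, hypothetical sketch using the SageMaker Python SDK; the entry point name, source directory, framework version, and instance type are assumptions, and the actual notebook may differ:
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

role = sagemaker.get_execution_role()

# source_dir is assumed to contain train.py plus a requirements.txt listing e.g. "dvc[s3]",
# so the training container installs DVC before running the entry point
estimator = SKLearn(
    entry_point="train.py",
    source_dir="source_dir",
    framework_version="0.23-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
)

# No input channels: the training script is assumed to pull its data via DVC
estimator.fit()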
Before removing all the resources created, you need to make sure that all apps are deleted from the data-scientist-dvc user, i.e., all KernelGateway apps, as well as the default JupyterServer app.
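You can delete the apps from the SageMaker console, or with a short boto3 loop such as the following sketch (it assumes a single Studio domain in the account and region):
import boto3

sm = boto3.client("sagemaker")
# Assumes a single Studio domain in the account/region
domain_id = sm.list_domains()["Domains"][0]["DomainId"]

apps = sm.list_apps(DomainIdEquals=domain_id, UserProfileNameEquals="data-scientist-dvc")["Apps"]
for app in apps:
    if app["Status"] != "Deleted":
        sm.delete_app(
            DomainId=domain_id,
            UserProfileName="data-scientist-dvc",
            AppType=app["AppType"],
            AppName=app["AppName"],
        )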
Once done, you can destroy the CDK stack by running
cd ~/environment/amazon-sagemaker-experiments-dvc-demo/sagemaker-studio-dvc-image/cdk
cdk destroy
If you started off from an existing domain, please also execute the following commands:
# inject your DOMAIN_ID into the configuration file
sed -i 's/<your-sagemaker-studio-domain-id>/'"$DOMAIN_ID"'/' ../update-domain-no-custom-images.json
# update the sagemaker studio domain
aws --region ${REGION} sagemaker update-domain --cli-input-json file://../update-domain-no-custom-images.json
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.