Use a custom image to bring your own development environment to RStudio on Amazon SageMaker

RStudio on Amazon SageMaker is the industry’s first fully managed RStudio Workbench in the cloud. You can quickly launch the familiar RStudio integrated development environment (IDE), and dial up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale. RStudio on SageMaker already comes with a built-in image preconfigured with R programming and data science tools; however, you often need to customize your IDE environment. Starting today, you can bring your own custom image with packages and tools of your choice, and make them available to all the users of RStudio on SageMaker in a few clicks.

Bringing your own custom image has several benefits. You can standardize and simplify the getting started experience for data scientists and developers by providing a starter image, preconfigure the drivers required for connecting to data stores, or pre-install specialized data science software for your business domain. Furthermore, organizations that have previously hosted their own RStudio Workbench may have existing containerized environments that they want to continue to use in RStudio on SageMaker.

In this post, we share step-by-step instructions to create a custom image and bring it to RStudio on SageMaker using the AWS Management Console or AWS Command Line Interface (AWS CLI). You can get your first custom IDE environment up and running in a few simple steps. For more information on the content discussed in this post, refer to Bring your own RStudio image.

Solution overview

When a data scientist starts a new session in RStudio on SageMaker, a new on-demand ML compute instance is provisioned and a container image that defines the runtime environment (operating system, libraries, R versions, and so on) is run on the ML instance. You can provide your data scientists multiple choices for the runtime environment by creating custom container images and making them available on the RStudio Workbench launcher, as shown in the following screenshot.

The following diagram describes the process to bring your custom image. First, you build a custom container image from a Dockerfile and push it to a repository in Amazon Elastic Container Registry (Amazon ECR). Next, you create a SageMaker image that points to the container image in Amazon ECR, and attach that image to your SageMaker domain. This makes the custom image available for launching a new session in RStudio.

Prerequisites

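To implement this solution, you must have the following prerequisites:

  • An RStudio on SageMaker domain
  • IAM policies to interact with Amazon ECR
  • AWS CLI versions that meet the minimum requirements

We provide more details on each in this section.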

RStudio on SageMaker domain

If you have an existing SageMaker domain with RStudio enabled prior to April 7, 2022, you must delete and recreate the RStudioServerPro app under the user profile name domain-shared to get the latest updates for the bring-your-own custom image capability. The AWS CLI commands are as follows. Note that this action interrupts RStudio users on SageMaker.

aws sagemaker delete-app \
    --domain-id <domain-id> \
    --app-type RStudioServerPro \
    --app-name default \
    --user-profile-name domain-shared
aws sagemaker create-app \
    --domain-id <domain-id> \
    --app-type RStudioServerPro \
    --app-name default \
    --user-profile-name domain-shared

If this is your first time using RStudio on SageMaker, follow the step-by-step setup process described in Get started with RStudio on Amazon SageMaker, or run the following AWS CloudFormation template to set up your first RStudio on SageMaker domain. If you already have a working RStudio on SageMaker domain, you can skip this step.

The following RStudio on SageMaker CloudFormation template requires an RStudio license approved through AWS License Manager. For more about licensing, refer to RStudio license. Also note that only one SageMaker domain is permitted per AWS Region, so you’ll need to use an AWS account and Region that doesn’t have an existing domain.

  1. Choose Launch Stack.
    The link takes you to the us-east-1 Region, but you can change to your preferred Region.
  2. In the Specify template section, choose Next.
  3. In the Specify stack details section, for Stack name, enter a name.
  4. For Parameters, enter a SageMaker user profile name.
  5. Choose Next.
  6. In the Configure stack options section, choose Next.
  7. In the Review section, select I acknowledge that AWS CloudFormation might create IAM resources and choose Next.
  8. When the stack status changes to CREATE_COMPLETE, go to the Control Panel on the SageMaker console to find the domain and the new user.

IAM policies to interact with Amazon ECR

To interact with your private Amazon ECR repositories, you need the following IAM permissions in the IAM user or role you’ll use to build and push Docker images:

{ 
    "Version":"2012-10-17", 
    "Statement":[ 
        {
            "Sid": "VisualEditor0",
            "Effect":"Allow", 
            "Action":[ 
                "ecr:CreateRepository", 
                "ecr:BatchGetImage", 
                "ecr:CompleteLayerUpload", 
                "ecr:DescribeImages", 
                "ecr:DescribeRepositories", 
                "ecr:UploadLayerPart", 
                "ecr:ListImages", 
                "ecr:InitiateLayerUpload", 
                "ecr:BatchCheckLayerAvailability", 
                "ecr:PutImage" 
            ], 
            "Resource": "*" 
        }
    ]
}

To initially build from a public Amazon ECR image as shown in this post, you need to attach the AWS-managed AmazonElasticContainerRegistryPublicReadOnly policy to your IAM user or role as well.

To build a Docker container image, you can use either a local Docker client or the SageMaker Docker Build CLI tool from a terminal within RStudio on SageMaker. For the latter, follow the prerequisites in Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks to set up the IAM permissions and CLI tool.

AWS CLI versions

There are minimum version requirements for the AWS CLI tool to run the commands mentioned in this post. Make sure to upgrade AWS CLI on your terminal of choice:

  • AWS CLI v1 >= 1.23.6
  • AWS CLI v2 >= 2.6.2
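
As a quick way to check these minimums, the following is a minimal sketch rather than an official tool. The function names version_ge and awscli_ok are hypothetical, and the sketch assumes a `sort` that supports version ordering with `-V` and that `aws --version` prints a string beginning with `aws-cli/<version>`.

```shell
# Sketch: compare an AWS CLI version string against the minimums listed above.
# Assumes `sort -V` (version sort) is available on this system.

# version_ge A B: succeeds when version A is greater than or equal to version B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# awscli_ok "<output of aws --version>": succeeds when the version meets the
# minimum for its major line (1.23.6 for v1, 2.6.2 for v2).
awscli_ok() {
    local version
    # Expected input looks like "aws-cli/2.6.2 Python/3.9.11 ..."
    version="${1#aws-cli/}"      # drop the "aws-cli/" prefix
    version="${version%% *}"     # keep everything before the first space
    case "$version" in
        1.*) version_ge "$version" "1.23.6" ;;
        2.*) version_ge "$version" "2.6.2" ;;
        *)   return 1 ;;
    esac
}

# Usage:
#   awscli_ok "$(aws --version 2>&1)" || echo "Please upgrade the AWS CLI"
```

Passing the version string in as an argument, instead of calling the CLI inside the function, keeps the comparison easy to try out on any machine.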

Prepare a Dockerfile

You can customize your runtime environment in RStudio in a Dockerfile. Because the customization depends on your use case and requirements, we show you the essentials and the most common customizations in this example. You can download the full sample Dockerfile.

Install RStudio Workbench session components

The most important software to install in your custom container image is RStudio Workbench. We download it from the public S3 bucket hosted by RStudio PBC, which offers many version releases and OS distributions. The version you install needs to be compatible with the RStudio Workbench version used in RStudio on SageMaker, which is 1.4.1717-3 at the time of writing. The OS (argument OS in the following snippet) needs to match the base OS used in the container image. In our sample Dockerfile, the base image we use is Amazon Linux 2 from an AWS-managed public Amazon ECR repository. The compatible RStudio Workbench OS is centos7.

FROM public.ecr.aws/amazonlinux/amazonlinux
...
ARG RSW_VERSION=1.4.1717-3
ARG RSW_NAME=rstudio-workbench-rhel
ARG OS=centos7
ARG RSW_DOWNLOAD_URL=https://s3.amazonaws.com/rstudio-ide-build/server/${OS}/x86_64
RUN RSW_VERSION_URL=`echo -n "${RSW_VERSION}" | sed 's/+/-/g'` \
    && curl -o rstudio-workbench.rpm ${RSW_DOWNLOAD_URL}/${RSW_NAME}-${RSW_VERSION_URL}-x86_64.rpm \
    && yum install -y rstudio-workbench.rpm

You can find all the OS release options with the following command:

aws s3 ls s3://rstudio-ide-build/server/
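
Before building, it can help to confirm that the RPM for a chosen version and OS actually exists. The following sketch rebuilds the download URL with the same rule as the Dockerfile snippet above; the function name rsw_rpm_url is hypothetical, the defaults mirror the ARG values in this post, and the naming pattern is only assumed to hold for the RPM-based builds shown here.

```shell
# Sketch: rebuild the RPM URL that the Dockerfile snippet above downloads.
# Arguments (both optional): RStudio Workbench version, OS directory name.
rsw_rpm_url() {
    local version="${1:-1.4.1717-3}"
    local os="${2:-centos7}"
    local name="rstudio-workbench-rhel"
    local base="https://s3.amazonaws.com/rstudio-ide-build/server/${os}/x86_64"
    local version_url
    # Same substitution as the Dockerfile: every '+' in the version becomes '-'.
    version_url="$(printf '%s' "$version" | sed 's/+/-/g')"
    printf '%s/%s-%s-x86_64.rpm\n' "$base" "$name" "$version_url"
}

# Usage (requires network access):
#   curl -sfI "$(rsw_rpm_url)" > /dev/null && echo "RPM found"
```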

Install R (and versions of R)

The runtime for your custom RStudio container image needs at least one version of R. We can first install a version of R and make it the default R by creating soft links to /usr/local/bin/:

# Install main R version
ARG R_VERSION=4.1.3
RUN curl -O https://cdn.rstudio.com/r/centos-7/pkgs/R-${R_VERSION}-1-1.x86_64.rpm && \
    yum install -y R-${R_VERSION}-1-1.x86_64.rpm && \
    yum clean all && \
    rm -rf R-${R_VERSION}-1-1.x86_64.rpm

RUN ln -s /opt/R/${R_VERSION}/bin/R /usr/local/bin/R && \
    ln -s /opt/R/${R_VERSION}/bin/Rscript /usr/local/bin/Rscript

Data scientists often need multiple versions of R so that they can easily switch between projects and code bases. RStudio on SageMaker supports easy switching between R versions, as shown in the following screenshot.

RStudio on SageMaker automatically scans and discovers versions of R in the following directories:

/usr/lib/R
/usr/lib64/R
/usr/local/lib/R
/usr/local/lib64/R
/opt/local/lib/R
/opt/local/lib64/R
/opt/R/*
/opt/local/R/*
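
To see which R installations a container image exposes in these locations, you could run something like the following sketch inside the container. It only approximates the scan by looking for an executable bin/R under each directory; the discovery logic in RStudio itself may differ. The function name list_r_installs and its optional root-prefix argument are hypothetical.

```shell
# Sketch: list R installations under the directories named above.
# Optional argument: a root prefix (empty for the real filesystem), which
# makes it possible to point the scan at an unpacked image or a test tree.
list_r_installs() {
    local root="${1:-}"
    local dir
    for dir in \
        "$root/usr/lib/R" "$root/usr/lib64/R" \
        "$root/usr/local/lib/R" "$root/usr/local/lib64/R" \
        "$root/opt/local/lib/R" "$root/opt/local/lib64/R" \
        "$root"/opt/R/* "$root"/opt/local/R/*; do
        # An unmatched glob stays literal, so check for the R binary explicitly.
        if [ -x "$dir/bin/R" ]; then
            printf '%s\n' "$dir"
        fi
    done
}

# Usage inside the container:
#   list_r_installs
```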

We can install more versions in the container image, as shown in the following snippet. They will be installed in /opt/R/.

RUN curl -O https://cdn.rstudio.com/r/centos-7/pkgs/R-4.0.5-1-1.x86_64.rpm && \
    yum install -y R-4.0.5-1-1.x86_64.rpm && \
    yum clean all && \
    rm -rf R-4.0.5-1-1.x86_64.rpm

RUN curl -O https://cdn.rstudio.com/r/centos-7/pkgs/R-3.6.3-1-1.x86_64.rpm && \
    yum install -y R-3.6.3-1-1.x86_64.rpm && \
    yum clean all && \
    rm -rf R-3.6.3-1-1.x86_64.rpm

RUN curl -O https://cdn.rstudio.com/r/centos-7/pkgs/R-3.5.3-1-1.x86_64.rpm && \
    yum install -y R-3.5.3-1-1.x86_64.rpm && \
    yum clean all && \
    rm -rf R-3.5.3-1-1.x86_64.rpm

Install RStudio Professional Drivers

Data scientists often need to access data from sources such as Amazon Athena and Amazon Redshift within RStudio on SageMaker. You can do so using RStudio Professional Drivers and RStudio Connections. Make sure you install the relevant libraries and drivers as shown in the following snippet:

# Install RStudio Professional Drivers ----------------------------------------#
RUN yum update -y && \
    yum install -y unixODBC unixODBC-devel && \
    yum clean all

ARG DRIVERS_VERSION=2021.10.0-1
RUN curl -O https://drivers.rstudio.org/7C152C12/installer/rstudio-drivers-${DRIVERS_VERSION}.el7.x86_64.rpm && \
    yum install -y rstudio-drivers-${DRIVERS_VERSION}.el7.x86_64.rpm && \
    yum clean all && \
    rm -f rstudio-drivers-${DRIVERS_VERSION}.el7.x86_64.rpm && \
    cp /opt/rstudio-drivers/odbcinst.ini.sample /etc/odbcinst.ini

RUN /opt/R/${R_VERSION}/bin/R -e 'install.packages("odbc", repos="https://packagemanager.rstudio.com/cran/__linux__/centos7/latest")'

Install custom libraries

You can also install additional R and Python libraries so that data scientists don’t need to install them on the fly:

RUN /opt/R/${R_VERSION}/bin/R -e \
    "install.packages(c('reticulate', 'readr', 'curl', 'ggplot2', 'dplyr', 'stringr', 'fable', 'tsibble', 'feasts', 'remotes', 'urca', 'sodium', 'plumber', 'jsonlite'), repos='https://packagemanager.rstudio.com/cran/__linux__/centos7/latest')"

RUN /opt/python/${PYTHON_VERSION}/bin/pip install --upgrade \
        'boto3>1.0,<2.0' \
        'awscli>1.0,<2.0' \
        'sagemaker[local]<3' \
        'sagemaker-studio-image-build' \
        'numpy'

When you’ve finished your customization in a Dockerfile, it’s time to build a container image and push it to Amazon ECR.

Build and push to Amazon ECR

You can build a container image from the Dockerfile from a terminal where the Docker engine is installed, such as your local terminal or AWS Cloud9. If you’re building it from a terminal within RStudio on SageMaker, you can use SageMaker Studio Image Build. We demonstrate the steps for both approaches.

In a local terminal where the Docker engine is present, you can run the following commands from where the Dockerfile is. You can use the sample script create-and-update-image.sh.

IMAGE_NAME=r-4.1.3-rstudio-1.4.1717-3           # the name for SageMaker Image
REPO=rstudio-custom                             # ECR repository name
TAG=$IMAGE_NAME
# ACCOUNT_ID and REGION must be set; one option is to read them from your AWS CLI configuration
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)
# login to your Amazon ECR
aws ecr get-login-password | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com

# create a repo
aws ecr create-repository --repository-name ${REPO}

# build a docker image and push it to the repo
docker build . -t ${REPO}:${TAG} -t ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:${TAG}
docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:${TAG}

In a terminal on RStudio on SageMaker, run the following commands:

pip install sagemaker-studio-image-build
sm-docker build . --repository ${REPO}:${IMAGE_NAME}

After these commands, you have a repository and a Docker container image in Amazon ECR for our next step, in which we attach the container image for use in RStudio on SageMaker. Note the image URI in Amazon ECR, which has the form ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:${TAG}, for later use.
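
If you want to double-check the push before moving on, the following sketch composes the image URI from its parts and asks Amazon ECR whether the tag exists. The function names ecr_image_uri and image_pushed are hypothetical; the sketch assumes the AWS CLI is configured for the same account and Region as the repository.

```shell
# Sketch: compose the image URI and confirm that the tag exists in Amazon ECR.

# ecr_image_uri ACCOUNT_ID REGION REPO TAG: prints the full image URI.
ecr_image_uri() {
    printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "$4"
}

# image_pushed REPO TAG: succeeds when ECR can describe an image with this tag.
image_pushed() {
    aws ecr describe-images \
        --repository-name "$1" \
        --image-ids imageTag="$2" \
        --query 'imageDetails[0].imageDigest' \
        --output text > /dev/null 2>&1
}

# Usage:
#   image_pushed "${REPO}" "${TAG}" \
#       && ecr_image_uri "${ACCOUNT_ID}" "${REGION}" "${REPO}" "${TAG}"
```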

Update RStudio on SageMaker through the console

RStudio on SageMaker allows runtime customization through the use of a custom SageMaker image. A SageMaker image is a holder for a set of SageMaker image versions. Each image version represents a container image that is compatible with RStudio on SageMaker and stored in an Amazon ECR repository. To make a custom SageMaker image available to all RStudio users within a domain, you can attach the image to the domain following the steps in this section.

  1. On the SageMaker console, navigate to the Custom SageMaker Studio images attached to domain page, and choose Attach image.
  2. Select New image, and enter your Amazon ECR image URI.
  3. Choose Next.
  4. In the Image properties section, provide an Image name (required), Image display name (optional), Description (optional), IAM role, and tags.
    The image display name, if provided, is shown in the session launcher in RStudio on SageMaker. If the Image display name field is left empty, the image name is shown in RStudio on SageMaker instead.
  5. Leave EFS mount path and Advanced configuration (User ID and Group ID) as default because RStudio on SageMaker manages the configuration for us.
  6. In the Image type section, select RStudio image.
  7. Choose Submit.

You can now see a new entry in the list. It’s worth noting that, with the introduction of the support of custom RStudio images, you can see a new Usage type column in the table to denote whether an image is an RStudio image or an Amazon SageMaker Studio image.

It may take up to 5–10 minutes for the custom images to be available in the session launcher UI. You can then launch a new R session in RStudio on SageMaker with your custom images.

Over time, you may want to retire old and outdated images. To remove the custom images from the list of custom images in RStudio, select the images in the list and choose Detach.

Choose Detach again to confirm.

Update RStudio on SageMaker via the AWS CLI

The following sections describe the steps to create a SageMaker image and attach it for use in RStudio on SageMaker using the AWS CLI. You can use the sample script create-and-update-image.sh.

Create the SageMaker image and image version

The first step is to create a SageMaker image from the custom container image in Amazon ECR by running the following two commands:

ROLE_ARN="<role-arn>"                           # replace with your IAM role ARN
DISPLAY_NAME=RSession-r-4.1.3-rstudio-1.4.1717-3
aws sagemaker create-image \
    --image-name ${IMAGE_NAME} \
    --display-name ${DISPLAY_NAME} \
    --role-arn ${ROLE_ARN}

aws sagemaker create-image-version \
    --image-name ${IMAGE_NAME} \
    --base-image "${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:${TAG}"

Note that the custom image displayed in the session launcher in RStudio on SageMaker is determined by the input of --display-name. If the optional display name is not provided, the input of --image-name is used instead. Also note that the IAM role allows SageMaker to attach an Amazon ECR image to RStudio on SageMaker.
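
Creating an image version is not instantaneous, so a script may want to wait until the version is ready before attaching it. The following is a minimal polling sketch under that assumption. The function name wait_for_image_version and its default retry count and delay are hypothetical choices; it relies on aws sagemaker describe-image-version reporting an ImageVersionStatus for the latest version when no version number is given.

```shell
# Sketch: poll until the latest version of a SageMaker image reports CREATED.
# Arguments: image name, max attempts (default 30), delay in seconds (default 10).
wait_for_image_version() {
    local image_name="$1"
    local attempts="${2:-30}"
    local delay="${3:-10}"
    local status
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        status="$(aws sagemaker describe-image-version \
            --image-name "$image_name" \
            --query 'ImageVersionStatus' --output text)"
        case "$status" in
            CREATED)       return 0 ;;
            CREATE_FAILED) echo "image version creation failed" >&2; return 1 ;;
        esac
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out waiting for image version" >&2
    return 1
}

# Usage:
#   wait_for_image_version "${IMAGE_NAME}" && echo "image version is ready"
```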

Create an AppImageConfig

In addition to a SageMaker image, which captures the image URI from Amazon ECR, an app image configuration (AppImageConfig) is required for use in a SageMaker domain. We simplify the configuration for an RSessionApp image so we can just create a placeholder configuration with the following command:

IMAGE_CONFIG_NAME=r-4-1-3-rstudio-1-4-1717-3
aws sagemaker create-app-image-config \
    --app-image-config-name ${IMAGE_CONFIG_NAME}

Attach to a SageMaker domain

With the SageMaker image and the app image configuration created, we’re ready to attach the custom container image to the SageMaker domain. To make a custom SageMaker image available to all RStudio users within a domain, you attach the image to the domain as a default user setting. All existing users and any new users will be able to use the custom image.

For better readability, we place the following configuration into the JSON file default-user-settings.json:

{
    "DefaultUserSettings": {
        "RSessionAppSettings": {
           "CustomImages": [
                {
                     "ImageName": "r-4.1.3-rstudio-2022",
                     "AppImageConfigName": "r-4-1-3-rstudio-2022"
                },
                {
                     "ImageName": "r-4.1.3-rstudio-1.4.1717-3",
                     "AppImageConfigName": "r-4-1-3-rstudio-1-4-1717-3"
                }
            ]
        }
    }
}

In this file, we can specify the image and AppImageConfig name pairs in a list in DefaultUserSettings.RSessionAppSettings.CustomImages. The preceding snippet assumes that two custom images have been created.
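
When the list of images grows, writing this file by hand becomes error-prone. The following sketch generates it from a list of image names. It assumes that each AppImageConfig is named after its image with dots replaced by hyphens, which is the naming pattern used in this post but not a SageMaker requirement; the function names config_name_for and render_user_settings are hypothetical.

```shell
# Sketch: generate default-user-settings.json from a list of image names.

# config_name_for IMAGE_NAME: derive the AppImageConfig name by replacing
# dots with hyphens (an assumption based on the names used in this post).
config_name_for() {
    printf '%s' "$1" | tr '.' '-'
}

# render_user_settings IMAGE_NAME...: print the JSON document to stdout.
render_user_settings() {
    local sep=""
    local image
    printf '{\n    "DefaultUserSettings": {\n        "RSessionAppSettings": {\n            "CustomImages": ['
    for image in "$@"; do
        printf '%s\n                {"ImageName": "%s", "AppImageConfigName": "%s"}' \
            "$sep" "$image" "$(config_name_for "$image")"
        sep=","
    done
    printf '\n            ]\n        }\n    }\n}\n'
}

# Usage:
#   render_user_settings r-4.1.3-rstudio-2022 r-4.1.3-rstudio-1.4.1717-3 \
#       > default-user-settings.json
```

Because the file always lists the complete set of images, the same function can be rerun with a shorter list when an image should no longer be offered.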

Then run the following command to update the SageMaker domain:

aws sagemaker update-domain \
    --domain-id <domain-id> \
    --cli-input-json file://default-user-settings.json

After you update the domain, it may take up to 5–10 minutes for the custom images to be available in the session launcher UI. You can then launch a new R session in RStudio on SageMaker with your custom images.
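
To confirm that the domain settings were updated, you can read them back with the AWS CLI. The following sketch checks whether a given image name appears in the domain's default user settings. The function name image_attached is hypothetical, and a successful check only shows that the setting was stored, not that the image has already appeared in the session launcher.

```shell
# Sketch: check whether an image is listed in the domain's default user settings.
# Arguments: domain ID, SageMaker image name.
image_attached() {
    local domain_id="$1"
    local image_name="$2"
    # With --output text, the list of names is printed tab-separated on one line,
    # so split it into lines and look for an exact match.
    aws sagemaker describe-domain \
        --domain-id "$domain_id" \
        --query 'DefaultUserSettings.RSessionAppSettings.CustomImages[].ImageName' \
        --output text | tr '\t' '\n' | grep -Fxq -- "$image_name"
}

# Usage (the domain ID is a placeholder you must supply):
#   image_attached "<domain-id>" "r-4.1.3-rstudio-1.4.1717-3" && echo "attached"
```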

Detach images from a SageMaker domain

You can detach images simply by removing the ImageName and AppImageConfigName pairs from default-user-settings.json and updating the domain.

For example, updating the domain with the following default-user-settings.json removes r-4.1.3-rstudio-2022 from the R session launching UI and leaves r-4.1.3-rstudio-1.4.1717-3 as the only custom image available to all users in a domain:

{
    "DefaultUserSettings": {
        "RSessionAppSettings": {
           "CustomImages": [
                {
                     "ImageName": "r-4.1.3-rstudio-1.4.1717-3",
                     "AppImageConfigName": "r-4-1-3-rstudio-1-4-1717-3"
                }
            ]
        }
    }
}

Clean up

To safely remove images and resources in the SageMaker domain, complete the steps in Clean up image resources.

To safely remove RStudio on SageMaker and the SageMaker domain, complete the steps in Delete an Amazon SageMaker Domain to delete any RSessionGateway apps, the user profiles, and the domain.

To safely remove images and repositories in Amazon ECR, complete the steps in Deleting an image.

Finally, to delete the CloudFormation stack:

  1. On the AWS CloudFormation console, choose Stacks.
  2. Select the stack you deployed for this solution.
  3. Choose Delete.

Conclusion

RStudio on SageMaker makes it simple for data scientists to build ML and analytic solutions in R at scale, and for administrators to manage a robust data science environment for their developers. Data scientists want to customize the environment so that they can use the right libraries for the right job and achieve the desired reproducibility for each ML project. Administrators need to standardize the data science environment for regulatory and security reasons. You can now create custom container images that meet your organizational requirements and allow data scientists to use them in RStudio on SageMaker.

We encourage you to try it out. Happy developing!


About the Authors

Michael Hsieh is a Senior AI/ML Specialist Solutions Architect. He works with customers to advance their ML journey with a combination of AWS ML offerings and his ML domain knowledge. As a Seattle transplant, he loves exploring the great Mother Nature the city has to offer, such as the hiking trails, scenic kayaking in the SLU, and the sunset at Shilshole Bay.

Declan Kelly is a Software Engineer on the Amazon SageMaker Studio team. He has been working on Amazon SageMaker Studio since its launch at AWS re:Invent 2019. Outside of work, he enjoys hiking and climbing.

Sean Morgan is an AI/ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open-source contributor and maintainer, and is the special interest group lead for TensorFlow Add-ons.

View Original Source (aws.amazon.com) Here.

Shared by: AWS Machine Learning