Building an interactive and scalable ML research environment using AWS ParallelCluster
When it comes to running distributed machine learning (ML) workloads, AWS offers both managed and self-service options. Amazon SageMaker is a managed service that can help engineering, data science, and research teams save time and reduce operational overhead. AWS ParallelCluster is an open-source, self-service cluster management tool for customers who wish to maintain more direct control over their computing infrastructure. This post addresses the self-service approach to distributed ML on AWS. For more information about distributed training using Amazon SageMaker, see the following posts on launching TensorFlow distributed training with Horovod and multi-region serverless distributed training.
AWS ParallelCluster is an AWS-supported open-source cluster management tool that helps users deploy and manage high performance computing (HPC) clusters in the AWS Cloud. AWS ParallelCluster allows data scientists and researchers to reproduce a familiar working environment on elastically scaled AWS resources by automatically setting up the required compute resources and shared file system. Broadly supported data science and ML tools such as Jupyter, Conda, MXNet, PyTorch, and TensorFlow allow flexible, interactive development with low-overhead scaling. These features make AWS ParallelCluster environments ideally suited for ML research environments that support distributed model development and training.
AWS ParallelCluster enables a scalable research workflow built around on-demand allocation of compute resources. Rather than working with, and potentially underutilizing, a single high-powered GPU-enabled workstation, AWS ParallelCluster manages an on-demand fleet of GPU-enabled compute workers. This allows trivial scale-up for parallel training experiments and automatic scale-down when resources aren’t required, minimizing cost and (most importantly) saving researcher time. An attached Amazon FSx for Lustre file system provides a traditional high-performance file system during development and archives models and data to low-cost Amazon S3 for long-term storage.
The following graphic shows an AWS ParallelCluster-based research environment. Autoscaled Amazon EC2 resources access remote storage, with models and data archived to S3.
This post shows you how to set up, run, and tear down a complete AWS ParallelCluster environment implementing this architecture. The post runs two NLP tutorials, fine-tuning a BERT model on a paraphrasing task and training an English-German machine translation model. This includes the following steps:
- AWS ParallelCluster configuration and setup
- Conda-based installation of your ML and NLP packages
- Initial interactive model training
- Parallel model training and evaluation
- Data archiving and cluster teardown
The tutorial lays out a workflow using standard tools, and you can adapt it to your research requirements.
Prerequisites
This post uses a combination of m5 and p3 EC2 instances along with Amazon FSx and Amazon S3 storage. Because you are using GPU-enabled instances for training, this tutorial takes your account beyond the AWS Free Tier. Before you begin, complete the following prerequisites:
- Set up an AWS account and create an access key with administrator permissions.
- Request quota increases in your target AWS Region for at least one m5.xlarge, three p3.2xlarge, and three p3.8xlarge On-Demand Instances.
Setting up your client and cluster
Start with a one-time setup and configuration of your workstation with the aws-parallelcluster client in a dedicated Conda environment. You reuse this pattern later when setting up isolated environments for each subproject, each containing the precise set of dependencies required to reproduce your work.
Installing Conda
Perform a one-time installation of a base Miniconda environment and initialize your shell to enable Conda. This post works from a macOS workstation; use the download URL for your preferred platform. This configuration sets up a base environment and activates it in your interactive shell. See the following code:
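A minimal sketch of this step, assuming a macOS x86_64 workstation and an install prefix of ~/miniconda:

```bash
# Download the Miniconda installer (swap in the Linux installer URL on a Linux workstation).
curl -LO https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh

# Install non-interactively into ~/miniconda and enable `conda` in your shell.
bash Miniconda3-latest-MacOSX-x86_64.sh -b -p ~/miniconda
~/miniconda/bin/conda init bash

# Reload the shell so the base environment is active.
exec bash
```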
Setting up your client environment
Install AWS ParallelCluster and the AWS CLI tools using a Conda environment called pcluster_client. This environment provides separation between the client and your system environment. First, write an environment.yml file specifying the environment name and dependency versions, then call conda env update to download and install the libraries. See the following code:
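A sketch of the client environment; the Python version pin is illustrative, and aws-parallelcluster installs from pip:

```bash
# Write the environment spec for the client tools.
cat > environment.yml <<'EOF'
name: pcluster_client
channels:
  - conda-forge
dependencies:
  - python=3.7
  - awscli
  - pip
  - pip:
      - aws-parallelcluster
EOF

# Create (or update) the environment from the spec and activate it.
conda env update --file environment.yml
conda activate pcluster_client
```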
Configuring pcluster and creating storage
To configure AWS ParallelCluster, conda activate your pcluster_client environment and configure aws and pcluster via the default configuration flow. For more information, see Configuring AWS ParallelCluster.

During configuration, upload your id_rsa public key to AWS and store your private key locally; you use this key to access your pcluster instances. See the following code:
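A sketch of the configuration flow; the key pair name is illustrative:

```bash
conda activate pcluster_client

# Configure AWS credentials and a default Region for the CLI.
aws configure

# Upload your existing public key so cluster nodes accept your SSH key.
# (With AWS CLI v1, use file:// instead of fileb://.)
aws ec2 import-key-pair \
    --key-name "${USER}-pcluster" \
    --public-key-material fileb://$HOME/.ssh/id_rsa.pub

# Interactive ParallelCluster setup; writes ~/.parallelcluster/config.
pcluster configure
```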
After configuring AWS ParallelCluster, create an S3 bucket for persistent storage of your data and models with the following code:
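For example (the bucket name and Region are placeholders; bucket names must be globally unique):

```bash
# Create a bucket for archiving models and data from the cluster file system.
aws s3 mb s3://${USER}-pcluster-workspace --region us-west-2
```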
Add config entries for a GPU-enabled cluster and Amazon FSx file system with the following code:
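A sketch of the added sections using ParallelCluster 2.x configuration syntax; the key pair, VPC settings, bucket name, and storage size are placeholders to adjust for your account:

```bash
cat >> ~/.parallelcluster/config <<'EOF'

[cluster gpu]
key_name              = <your-key-pair>
base_os               = alinux2
scheduler             = slurm
master_instance_type  = m5.xlarge
compute_instance_type = p3.2xlarge
initial_queue_size    = 0
max_queue_size        = 3
cluster_type          = ondemand
vpc_settings          = default
fsx_settings          = workspace

[fsx workspace]
shared_dir       = /workspace
storage_capacity = 3600
import_path      = s3://<your-bucket>
export_path      = s3://<your-bucket>/export
EOF
```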
Creating and bootstrapping your cluster
After configuration, bring your cluster online. This command creates a persistent master instance, attaches an Amazon FSx file system, and sets up a p3-class Auto Scaling group. After cluster creation is complete, set up Miniconda again, this time installing it onto the /workspace file system accessible from all master and compute nodes. See the following code:
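A sketch of cluster creation and the shared Miniconda install; the cluster name is illustrative:

```bash
# Create a cluster from the [cluster gpu] template defined above.
pcluster create ml-cluster -t gpu

# Connect to the master node.
pcluster ssh ml-cluster -i ~/.ssh/id_rsa

# On the master node: install Miniconda onto the shared Amazon FSx file
# system so every master and compute node sees the same Conda environments.
curl -LO https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /workspace/miniconda
/workspace/miniconda/bin/conda init bash
exec bash
```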
Your cluster now contains a single m5-class master instance, with p3.2xlarge compute instances available on demand via the Slurm job manager. You can use an interactive salloc session to access your p3 resources via srun commands. An important implication of this autoscaled cluster strategy is that while all code and data are available across the cluster, access to attached GPUs is limited to compute nodes reached via srun. You can demonstrate this with calls to nvidia-smi, which reports the status of attached GPU devices. See the following code:
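For example, assuming the cluster above:

```bash
# On the m5 master node: no GPUs are attached, so nvidia-smi fails or
# reports no devices.
nvidia-smi

# On a p3 compute node, allocated on demand by Slurm: reports the attached
# V100 GPU (one GPU on a p3.2xlarge).
salloc --exclusive srun nvidia-smi
```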
AWS ParallelCluster automatically manages your compute Auto Scaling group: it keeps a compute node running and available for the lifetime of your salloc session and terminates the idle node several minutes after the job ends.
Model training
Initial GPU-enabled interactive training
For an initial research task, run a standard natural language processing (NLP) workflow: fine-tune a pre-trained BERT model on a specific subtask. Establish a working environment with your model dependencies, download the pre-trained model and training data, and run fine-tuning training on a GPU. For more information about PyTorch pre-trained BERT examples, see the GitHub repo.
First, run a one-time setup of your project: a Conda environment with library dependencies and a workspace with training data. Write an environment.yml specifying the dependencies for your project, call conda env update to create the environment and install its packages, and then conda activate it. Fetch your training data into /workspace/bert_tuning. See the following code:
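A sketch of the project setup; the package pins are illustrative, and download_glue_data.py is assumed to be the standard GLUE benchmark helper script, fetched separately:

```bash
mkdir -p /workspace/bert_tuning && cd /workspace/bert_tuning

# Project environment with PyTorch and the transformers library.
cat > environment.yml <<'EOF'
name: bert_tuning
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.7
  - pytorch
  - cudatoolkit=10.1
  - pip
  - pip:
      - transformers
EOF
conda env update --file environment.yml
conda activate bert_tuning

# Fetch the GLUE MRPC paraphrase corpus used for fine-tuning.
python download_glue_data.py --data_dir glue_data --tasks MRPC
```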
After downloading your dependencies, fetch the training script and run fine-tuning in an interactive session. The only difference from the documented non-cluster example is that you run your training via salloc --exclusive srun rather than invoking the training script directly. The /workspace Amazon FSx file system gives the compute node access to your Conda environment’s installed libraries as well as your model definition, training data, and model checkpoints. As before, the GPU-enabled node allocated for the training run terminates after your run is complete. See the following code:
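A sketch of the fine-tuning run; run_glue.py and its flags follow the transformers BERT examples of that era and may differ between releases:

```bash
cd /workspace/bert_tuning

# Run fine-tuning on an exclusively allocated GPU compute node; checkpoints
# are written to the shared /workspace file system.
salloc --exclusive srun python run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-cased \
    --task_name MRPC \
    --data_dir glue_data/MRPC \
    --do_train --do_eval \
    --max_seq_length 128 \
    --output_dir /workspace/bert_tuning/mrpc_output
```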
Multi-GPU training
Using salloc is useful for interactive model development, short training jobs, and testing. However, the majority of modern research requires multiple long-running training jobs for model development and tuning. To support more compute-intensive experimentation, update your cluster to multi-GPU compute instances and use sbatch for non-interactive training. Enqueue multiple training jobs for an experiment and let AWS ParallelCluster scale up your compute group for the run and scale it back down after the experiment is complete.
From your workstation, add configuration for a multi-GPU cluster, shut down any remaining single-GPU nodes, and update your cluster configuration to multi-GPU p3.8xlarge compute instances. See the following code:
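For example, editing the compute instance type in ~/.parallelcluster/config and applying the change in place (macOS sed syntax; the cluster name matches the earlier sketch):

```bash
# Switch the cluster template to 4-GPU instances.
sed -i '' \
    's/^compute_instance_type.*/compute_instance_type = p3.8xlarge/' \
    ~/.parallelcluster/config

# Apply the change; wait for any idle single-GPU compute nodes to scale
# down before new p3.8xlarge nodes are launched.
pcluster update ml-cluster
```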
This post retrains a transformer-based English-to-German translation model using the FairSeq NLP framework. As before, set up a new workspace and environment and download training data. See the following code:
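A sketch of the workspace setup, assuming the IWSLT'14 data-prep script shipped with fairseq; package pins and paths are illustrative:

```bash
mkdir -p /workspace/translation && cd /workspace/translation

cat > environment.yml <<'EOF'
name: translation
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.7
  - pytorch
  - cudatoolkit=10.1
  - pip
  - pip:
      - fairseq
EOF
conda env update --file environment.yml
conda activate translation

# Download and tokenize the IWSLT'14 corpus, then binarize it for fairseq.
git clone https://github.com/pytorch/fairseq.git
(cd fairseq/examples/translation && bash prepare-iwslt14.sh)
fairseq-preprocess --source-lang en --target-lang de \
    --trainpref fairseq/examples/translation/iwslt14.tokenized.de-en/train \
    --validpref fairseq/examples/translation/iwslt14.tokenized.de-en/valid \
    --testpref  fairseq/examples/translation/iwslt14.tokenized.de-en/test \
    --destdir data-bin/iwslt14.tokenized.en-de
```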
After downloading and preprocessing your training data, write your training script and launch a quick interactive run to confirm that the script launches and trains successfully for several epochs. Your first job is limited to a single GPU via CUDA_VISIBLE_DEVICES and should train at approximately 60 seconds per epoch; after an epoch or so, interrupt it with ctrl-C. Because your underlying model supports distributed data-parallel training, you can expect nearly linear performance scaling with additional GPUs on a single worker. A second job using all four devices should train at approximately 15–20 seconds per epoch, confirming effective multi-GPU scaling; interrupt it as well. See the following code:
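A sketch of the training script and smoke tests; the fairseq-train flags follow the transformer_iwslt_de_en example and may need adjustment for your fairseq version:

```bash
cd /workspace/translation

cat > train.sh <<'EOF'
#!/bin/bash
# English-to-German transformer training on the preprocessed IWSLT'14 data.
fairseq-train data-bin/iwslt14.tokenized.en-de \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --save-dir checkpoints
EOF
chmod +x train.sh

# Smoke test on a single GPU (~60 seconds/epoch); interrupt with ctrl-C.
salloc --exclusive srun bash -c 'CUDA_VISIBLE_DEVICES=0 ./train.sh'

# Smoke test on all four GPUs of a p3.8xlarge (~15-20 seconds/epoch); interrupt again.
salloc --exclusive srun ./train.sh
```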
After your initial validation, run sbatch to schedule your full training run. The sinfo command provides information about your running cluster, and squeue shows the status of your batch job. Running tail on the job log lets you monitor training progress, and ssh access to the compute node address reported by squeue lets you check resource utilization. As before, AWS ParallelCluster scales up your compute cluster for the batch training job and releases the GPU-enabled instances after batch training is complete. See the following code:
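For example:

```bash
# Submit the full training run as a batch job on one exclusive compute node.
sbatch --exclusive --output=train.log --wrap "srun ./train.sh"

sinfo               # partition and node state (scaling up, allocated, idle)
squeue              # job status and the assigned compute node
tail -f train.log   # follow training progress

# Check utilization directly on the compute node reported by squeue.
ssh <compute-node-address> nvidia-smi
```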
The job takes approximately 80–90 minutes to complete. You can now evaluate your model via interactive translation. See the following code:
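A sketch of interactive evaluation using fairseq-interactive with the best checkpoint; input is expected in the same tokenized form as the training data:

```bash
# Translate interactively from stdin on an allocated GPU node.
salloc --exclusive srun fairseq-interactive data-bin/iwslt14.tokenized.en-de \
    --path checkpoints/checkpoint_best.pt --beam 5 --remove-bpe
```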
Jupyter and other HTTP services
Interactive notebook-based development is frequently used for data exploration, model analysis, and prototyping. You can launch and access a notebook server running on your AWS ParallelCluster workers. Add jupyterlab to the project’s workspace environment and srun the notebook server. See the following code:
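For example, adding jupyterlab to the bert_tuning environment used earlier (any project environment works):

```bash
# Add JupyterLab to the project environment on the shared file system.
conda install --name bert_tuning --channel conda-forge --yes jupyterlab
conda activate bert_tuning

# Launch the notebook server on a GPU compute node; bind to all interfaces
# so it is reachable from the master node and your SSH tunnel.
salloc --exclusive srun jupyter lab --no-browser --ip=0.0.0.0 --port=8888
```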
In a separate terminal, set up a pcluster ssh tunnel to the notebook worker using the node address and access token reported by Jupyter, and open a local browser. See the following code:
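A sketch of the tunnel, assuming the compute node address and token reported in the Jupyter startup log:

```bash
# Forward local port 8888 through the master node to the compute node
# running Jupyter.
pcluster ssh ml-cluster -i ~/.ssh/id_rsa -L 8888:<compute-node-address>:8888

# Then open http://localhost:8888/?token=<token> in a local browser.
```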
You can use a similar approach to run tools such as tensorboard in your cluster environment.
Storage and cluster teardown
After completing model training and evaluation, you can archive your /workspace file system to Amazon S3 via Amazon FSx’s hierarchical storage management support. For more information, see Using Data Repositories. After the hsm_archive actions complete in approximately 60–90 minutes, verify the contents of your S3 export bucket via the AWS CLI with the following code:
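A sketch of the archive and verification steps, following the FSx for Lustre data repository (HSM) commands; the checked path and bucket are placeholders:

```bash
# On the master node: archive every file under /workspace to the linked
# S3 export path.
cd /workspace
find . -type f -print0 | xargs -0 -n 50 sudo lfs hsm_archive

# Spot-check archive state; "exists archived" indicates completion.
sudo lfs hsm_state ./translation/checkpoints/checkpoint_best.pt

# From your workstation: verify the exported objects.
aws s3 ls --recursive s3://<your-bucket>/export/ | head
```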
A later call to pcluster create with the same configuration restores your cluster, pre-populating /workspace from your S3 archive.
Multiple clusters
You can use AWS ParallelCluster to manage multiple concurrent compute clusters. For instance, you can use a mix of CPU and GPU clusters to support preprocessing or analysis tasks that involve significant CPU-bound processing. Additionally, this can provide independent clusters for multiple researchers in a single shared AWS workspace.
Adapting this workflow to a multi-cluster configuration is relatively simple. Set up a standalone Amazon FSx file system and manage its lifecycle via the existing CloudFormation templates in the amazon-fsx-workshop/lustre GitHub repo. Specify an export prefix and update ~/.parallelcluster/config with the following code:
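A sketch of the shared file-system reference, using the fsx_fs_id parameter from ParallelCluster 2.x; the file system ID is a placeholder taken from the CloudFormation stack output:

```bash
cat >> ~/.parallelcluster/config <<'EOF'

[fsx workspace]
shared_dir = /workspace
fsx_fs_id  = fs-0123456789abcdef0
EOF
```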
Multiple clusters now share a /workspace file system, decoupled from the lifetime of any individual cluster. You can use calls to lfs hsm_archive from any cluster to back up file system contents to S3, potentially via a nightly cron.
Capacity management
AWS ParallelCluster manages a compute cluster of EC2 instances via a standard Auto Scaling group, allowing you to use existing AWS-native tools for capacity management as you scale your clusters. AWS ParallelCluster has built-in support for using Spot Instances within compute fleets via the cluster_type configuration setting, and it uses Reserved Instance capacity if available. You can also use On-Demand Capacity Reservations so AWS ParallelCluster can rapidly scale to match your target compute fleet size.
Conclusion
If you wish to maintain more direct control over your computing infrastructure, an AWS ParallelCluster-based workflow provides an ideal working environment for applied machine learning research. Rapid cluster setup, scaling, and updates allow interactive exploration of a modeling task, including identification of a suitable instance type and multi-instance scaling for parallel training runs. Conda environments and a high-performance Amazon FSx file system provide a familiar file interface and handle the critical, but undifferentiated, heavy lifting of transparently archiving model artifacts to S3 so that your work remains reproducible.
For more information about configuring AWS ParallelCluster and building an interactive and scalable ML or HPC research environment, see the AWS ParallelCluster User Guide or the aws-parallelcluster GitHub repo.
About the author
Alex Ford is an Applied Scientist with AWS. He is passionate about emerging applications at the intersection of machine learning and the natural sciences. In his spare time, he explores the geography and geology of the Cascadia subduction zone, with deep affection for the Index batholith.