How to train procedurally generated game-like environments at scale with Amazon SageMaker RL

Gym is a toolkit for developing and comparing reinforcement learning algorithms. Procgen Benchmark is a suite of 16 procedurally generated Gym environments designed to benchmark both sample efficiency and generalization in reinforcement learning. These environments are associated with the paper Leveraging Procedural Generation to Benchmark Reinforcement Learning. Compared to Gym Retro, these environments have the following benefits:

  • Faster – Gym Retro environments are already fast, but Procgen environments can run over four times faster.
  • Non-deterministic – Gym Retro environments are always the same, so you can memorize a sequence of actions that gets the highest reward. Procgen environments are randomized so this isn’t possible.
  • Customizable – If you install from source, you can perform experiments where you change the environments, or build your own environments. The environment-specific code for each environment is often less than 300 lines. This is almost impossible with Gym Retro.

This post demonstrates how to use the Amazon SageMaker reinforcement learning starter kit for the NeurIPS 2020 – Procgen competition hosted on AIcrowd. The competition was held from June to November 2020, and results can be found here, but you can still try out the solution on your own. Our solution allows participants using AIcrowd’s existing neurips2020-procgen-starter-kit to get started with SageMaker seamlessly, without making any algorithmic changes. It also helps you reduce the time and effort required to build sample-efficient reinforcement learning solutions using homogeneous and heterogeneous scaling.

Finally, our solution uses Spot Instances to reduce cost. The cost savings with Spot Instances are approximately 70% for GPU instances such as ml.p3.2xlarge and ml.p3.8xlarge when training with a popular state-of-the-art reinforcement learning algorithm, Proximal Policy Optimization (PPO), and a multi-layer convolutional neural network as the agent’s policy.

Architecture

As part of the solution, we use the following services:

  • Amazon SageMaker reinforcement learning
  • Amazon Simple Storage Service (Amazon S3)
  • AWS CloudFormation
  • Amazon Virtual Private Cloud (Amazon VPC)

SageMaker reinforcement learning uses Ray and RLlib, just as the starter kit does. SageMaker supports distributed reinforcement learning in a single SageMaker ML instance with just a few lines of configuration by using the Ray RLlib library.

A typical SageMaker reinforcement learning job for an actor-critic algorithm uses GPU instances to learn a policy network and CPU instances to collect experiences, for faster training at optimized cost. SageMaker achieves this by spinning up two jobs within the same Amazon VPC, and communication between the instances is handled automatically.

Cost

You can contact AIcrowd to get credits to use any AWS service.

You’re responsible for the cost of the AWS services used while running this solution, and you should set up a budget alert to notify you when you’ve reached 90% of your allotted credits. For more information, see Amazon SageMaker Pricing.

As of September 1, 2020, SageMaker training costs (excluding notebook instances) are as follows:

  • ml.c5.4xlarge – $0.952 per hour (16 vCPUs)
  • ml.g4dn.4xlarge – $1.686 per hour (1 GPU, 16 vCPUs)
  • ml.p3.2xlarge – $4.284 per hour (1 GPU, 8 vCPUs)

Launching the solution

To launch the solution, complete the following steps:

  1. While signed in to your AWS account, choose the following link to create the AWS CloudFormation stack for the Region in which you want to run your notebook:

You’re redirected to the AWS CloudFormation console to create your stack.

  2. Acknowledge the use of the instance type for your SageMaker notebook and training instance.

Make sure that your AWS account has sufficient limits for the required instances. If you need to increase the limits for the instances you want to use, contact AWS Support.

  3. As the final parameter, provide the name of the S3 bucket for the solution.

The default name is neurips-2020. You should provide a unique name to make sure there are no conflicts with your existing S3 buckets. An S3 bucket name is globally unique, and the namespace is shared by all AWS accounts. This means that after a bucket is created, the name of that bucket can’t be used by another AWS account in any AWS Region until the bucket is deleted.

  4. Choose Create stack.

You can monitor the progress of your stack on the Events tab or by refreshing your screen. If you encounter an error during stack creation (for example, because your S3 bucket name isn’t unique), delete the stack and launch it again. When stack creation is complete, go to the SageMaker console. Your notebook should already be created, and its status should read InService.

You’re now ready to start training!

  5. On the SageMaker console, choose Notebook instances.
  6. Locate the rl-procgen-neurips instance and choose Open Jupyter or Open JupyterLab.
  7. Choose the notebook 1_train.ipynb.

You can use the cells following the training to run evaluations, do rollouts, and visualize your outcome.

Configuring algorithm parameters and the agent’s neural network model

To configure your RLlib algorithm parameters, go to your notebook folder and open source/train-sagemaker-distributed-{}.py. A subset of algorithm parameters is provided for PPO; for the full set of algorithm-specific parameters, see Proximal Policy Optimization (PPO). For the baselines provided in the starter kit, refer to the experiments{}.yaml files and copy any additional parameters into the RLlib configuration in source/train-sagemaker-distributed-{}.py.
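For reference, the following is a hedged sketch of what that PPO configuration subset might look like. The parameter names are standard RLlib PPO keys, but the values shown are illustrative defaults rather than the starter kit’s tuned settings; copy the actual numbers from the experiments{}.yaml baselines.

# Illustrative subset of RLlib PPO parameters; the values are examples only,
# not the starter kit's tuned settings. Merge your choices into the config
# dict in source/train-sagemaker-distributed-{}.py.
ppo_config = {
    "lambda": 0.95,              # GAE parameter
    "gamma": 0.999,              # discount factor
    "kl_coeff": 0.2,             # initial coefficient of the KL penalty
    "clip_param": 0.2,           # PPO surrogate objective clip range
    "entropy_coeff": 0.01,       # exploration bonus
    "lr": 5.0e-4,                # learning rate
    "num_sgd_iter": 3,           # SGD passes over each training batch
    "sgd_minibatch_size": 2048,  # minibatch size within each SGD pass
    "vf_loss_coeff": 0.5,        # weight of the value-function loss
}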

To check whether your model is using the correct parameters, go to the S3 bucket and navigate to the JSON file with the parameters, for example, {Amazon SageMaker training job}/output/intermediate/training/{PPO_procgen_env_wrapper_}/param.json.
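You can also pull the file down and inspect it programmatically. In the following sketch, the bucket name and object key are placeholders for your solution bucket and your training job’s output prefix.

# Download and inspect the parameter file written during training.
# The bucket name and object key below are placeholders; substitute your
# solution bucket and your training job's output prefix.
import json
import boto3

s3 = boto3.client("s3")
s3.download_file(
    "your-solution-bucket",
    "your-training-job/output/intermediate/training/PPO_procgen_env_wrapper_<trial-id>/param.json",
    "param.json",
)
with open("param.json") as f:
    params = json.load(f)
print(params.get("num_workers"), params.get("train_batch_size"))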

To add a custom model, create a file inside the models/ directory and name it models/my_vision_network.py.

For a working implementation of how to add a custom model, see the GitHub repo. You can set the custom_model field in the experiment .yaml file to my_vision_network to use that model.
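For illustration, a minimal models/my_vision_network.py might look like the following sketch, assuming TensorFlow 2 and RLlib’s TFModelV2 API; the layer sizes are arbitrary examples, not the starter kit’s ImpalaCNN architecture.

# A minimal sketch of models/my_vision_network.py, assuming TensorFlow 2 and
# RLlib's TFModelV2 API. Layer sizes are illustrative only.
import tensorflow as tf
from ray.rllib.models.tf.tf_modelv2 import TFModelV2

class MyVisionNetwork(TFModelV2):
    """Small convolutional policy and value network for image observations."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        inputs = tf.keras.layers.Input(shape=obs_space.shape, name="observations")
        x = tf.keras.layers.Conv2D(16, 8, strides=4, activation="relu")(inputs)
        x = tf.keras.layers.Conv2D(32, 4, strides=2, activation="relu")(x)
        x = tf.keras.layers.Flatten()(x)
        x = tf.keras.layers.Dense(256, activation="relu")(x)
        logits = tf.keras.layers.Dense(num_outputs, name="pi")(x)
        value = tf.keras.layers.Dense(1, name="vf")(x)
        self.base_model = tf.keras.Model(inputs, [logits, value])
        self.register_variables(self.base_model.variables)

    def forward(self, input_dict, state, seq_lens):
        logits, self._value_out = self.base_model(tf.cast(input_dict["obs"], tf.float32))
        return logits, state

    def value_function(self):
        return tf.reshape(self._value_out, [-1])

After defining the class, register it with ModelCatalog (as described next) under the same name you set in the custom_model field.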

Make sure that the model is registered. If you get an error that your model isn’t registered, go to train-sagemaker.py or train-sagemaker-distributed.py and edit def register_algorithms_and_preprocessors(self) by adding the following code:

# Registers the model class (here, the starter kit's ImpalaCNN) under the name referenced by custom_model in the config
ModelCatalog.register_custom_model("impala_cnn_tf", ImpalaCNN)

Distributed training with multiple instances

In addition to single-instance training, SageMaker supports distributed reinforcement learning across multiple SageMaker ML instances with just a few lines of configuration by using the Ray RLlib library.

In homogeneous scaling, you use multiple instances of the same type (typically CPU instances) for a single SageMaker job. A single CPU core is reserved for the driver, and you can use the remaining cores as rollout workers, which generate experiences through environment simulations. The number of available CPU cores increases with multiple instances. Homogeneous scaling is beneficial when experience collection is the bottleneck of the training workflow; for example, when your environment is computationally heavy.

With more rollout workers, neural network updates can often become the bottleneck. In this case, you can use heterogeneous scaling, in which you combine different instance types. A typical choice is to use GPU instances to perform network optimization and CPU instances to collect experiences, for faster training at optimized cost. SageMaker achieves this by spinning up two jobs within the same Amazon VPC, and communication between the instances is handled automatically.

To run distributed training with multiple instances, use 2_train-homo-distributed-cpu.ipynb or 3_train-homo-distributed-gpu.ipynb for homogeneous scaling, and train-hetero-distributed.ipynb for heterogeneous scaling. The configurable parameters for distributed training are stored in source/train-sagemaker-distributed.py. You don’t have to configure ray_num_cpus or ray_num_gpus.

Make sure you scale num_workers and train_batch_size to reflect the number of instances you set in the notebook. For example, if you set train_instance_count = 5 with ml.p3.2xlarge instances, the maximum number of workers is 39 (5 instances × 8 vCPUs, minus 1 for the driver). See the following code:

"num_workers": 8 * 5 - 1,  # adjust based on the total number of vCPUs available in the cluster; p3.2xlarge has 8 vCPUs and 1 vCPU is reserved for resource allocation
"num_gpus": 0.2,  # adjust based on the number of GPUs available in a single node; p3.2xlarge has 1 GPU
"num_gpus_per_worker": 0.1,  # adjust based on the number of GPUs; for p3.2xlarge, (1 GPU - num_gpus) / workers per node = 0.1
"rollout_fragment_length": 140,
"train_batch_size": 64 * (8 * 5 - 1),

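As a sanity check on the arithmetic above, the following small helper (illustrative only, not part of the starter kit) derives the same values from the cluster shape.

# A small helper (illustrative only, not part of the starter kit) that derives
# the resource settings above from the cluster shape.
def resource_config(num_instances, vcpus_per_instance=8, gpus_per_instance=1, driver_gpu_fraction=0.2):
    num_workers = vcpus_per_instance * num_instances - 1  # one vCPU reserved for the Ray driver
    workers_per_node = vcpus_per_instance                 # workers co-located with each node's GPU
    return {
        "num_workers": num_workers,
        "num_gpus": driver_gpu_fraction,
        "num_gpus_per_worker": round((gpus_per_instance - driver_gpu_fraction) / workers_per_node, 2),
        "train_batch_size": 64 * num_workers,
    }

print(resource_config(num_instances=5))  # p3.2xlarge example: 39 workers, 0.1 GPUs per worker, batch size 2496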
To use Spot Instances, set the flag train_use_spot_instances = True in the final cell of train-homo-distributed.ipynb or train-hetero-distributed.ipynb. You can also use the MaxWaitTimeInSeconds parameter to control the total duration of your training job (actual training time plus waiting time).
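These notebook flags correspond to the Spot settings on the underlying SageMaker estimator. The following is a minimal sketch using the SageMaker Python SDK with v1-style parameter names (newer SDK versions drop the train_ prefix); the role, toolkit version, and entry point are placeholders to adapt from the solution’s notebooks, not the exact notebook code.

# A minimal sketch (not the exact notebook code) of enabling Spot training with
# the SageMaker Python SDK, using v1-style parameter names.
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

estimator = RLEstimator(
    entry_point="train-sagemaker-distributed.py",  # training script in the solution's source/ directory
    source_dir="source",
    toolkit=RLToolkit.RAY,
    toolkit_version="0.8.5",                       # assumed Ray version; check the notebook
    framework=RLFramework.TENSORFLOW,
    role="your-sagemaker-execution-role",          # placeholder IAM role
    train_instance_type="ml.p3.2xlarge",
    train_instance_count=1,
    train_use_spot_instances=True,                 # request Spot capacity (roughly 70% savings)
    train_max_run=3600,                            # maximum training time in seconds
    train_max_wait=7200,                           # maps to MaxWaitTimeInSeconds (training plus Spot waiting time); must be >= train_max_run
)
estimator.fit()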

Summary

We compared our starter kit on three different GPU instance types (ml.g4dn.4xlarge, ml.p3.2xlarge, and ml.p3.8xlarge) using single and multiple instances. On all GPU instances, Spot Instance training provided an approximately 70% cost reduction. This means that you spend less than $1 with the starter kit hyperparameters for the competition’s benchmarking solution (8 million steps) using an ml.p3.2xlarge instance.

Our starter kit allows you to run multiple ml.p3.2xlarge instances with 1 GPU each versus a single ml.p3.8xlarge instance with 4 GPUs. We observed that running a single ml.p3.8xlarge instance with 4 GPUs is more cost-effective than running five ml.p3.2xlarge instances (5 GPUs in total) due to communication overhead. The single-instance training with ml.p3.8xlarge converges in 20 minutes, helping you iterate faster to meet the competition deadline.

Finally, we observed that ml.g4dn.4xlarge instances provide an additional 40% cost reduction on top of the 70% reduction from Spot Instances. However, training takes longer: 45 minutes with ml.p3.8xlarge versus 70 minutes with ml.g4dn.4xlarge.

To get started with this solution, sign in to your AWS account and choose the quick create link to launch the CloudFormation stack for the Region in which you want to run your notebook.


About the Authors

Jonathan Chung is an Applied Scientist in AWS. He works on applying deep learning to various applications, including games and document analysis. He enjoys cooking and visiting historical cities around the world.

Anna Luo is an Applied Scientist in AWS. She works on utilizing reinforcement learning techniques for different domains, including supply chain and recommender systems. Her current personal goal is to master snowboarding.

Sahika Genc is a Principal Applied Scientist in the AWS AI team. Her current research focus is deep reinforcement learning (RL) for smart automation and robotics. Previously, she was a senior research scientist in the Artificial Intelligence and Learning Laboratory at the General Electric (GE) Global Research Center, where she led science teams on healthcare analytics for patient monitoring.
