Optimize your inference jobs using dynamic batch inference with TorchServe on Amazon SageMaker

In deep learning, batch processing refers to feeding multiple inputs into a model at once. Although batching is essential during training, it can also help manage cost and optimize throughput at inference time. Hardware accelerators are optimized for parallelism, so batching helps saturate their compute capacity and often leads to higher throughput.

Batching can be helpful in several scenarios during model deployment in production. Here we broadly categorize them into two use cases:

Real-time applications where several inference requests are received from different clients and are dynamically batched and fed to the serving model. Latency is usually important in these use cases.
Offline applications where several inputs or requests are batched on the client side and sent to the serving model. Higher throughput is often the objective for these use cases, which helps manage the cost. Example use cases include video analysis and model evaluation.

Amazon SageMaker provides two popular options for your inference jobs. For real-time applications, SageMaker Hosting uses TorchServe as the backend serving library that handles the dynamic batching of the received requests. For offline applications, you can use SageMaker batch transform jobs. In this post, we go through an example of each option to help you get started.

Because TorchServe is natively integrated with SageMaker via the SageMaker PyTorch inference toolkit, you can easily deploy a PyTorch model onto TorchServe using SageMaker Hosting. There may also be times when you need to customize your environment further using custom Docker images. In this post, we first show how to deploy a real-time endpoint using the native SageMaker PyTorch inference toolkit and configure the batch size to optimize throughput. In the second example, we demonstrate how to use a custom Docker image to set advanced TorchServe configurations that aren’t available as environment variables to optimize your batch inference job.

Best practices for batch inference

Batch processing can increase throughput and make better use of your resources because it completes a larger number of inferences in a given amount of time, at the expense of latency. To optimize model deployment for higher throughput, the general guideline is to increase the batch size until throughput stops improving. This most often suits offline applications, where several inputs (such as video frames, images, or text) are batched to get prediction outputs.

For real-time applications, latency is often a main concern. A larger batch size raises throughput but also raises latency, so you may need to tune the batch size to meet your latency SLA. In terms of best practices on the cloud, the cost per a certain number of inferences is a helpful guideline in making an informed decision that meets your business needs. One contributing factor in managing the cost is choosing the right accelerator. For more information, see Choose the best AI accelerator and model compilation for computer vision inference with Amazon SageMaker.
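The guideline above can be sketched as a small selection routine: among the batch sizes whose batch latency stays within the SLA, pick the one with the highest throughput, then convert that throughput into a cost per million inferences. The measurements and the hourly price below are hypothetical placeholders rather than benchmark results, and the sketch ignores the time requests spend waiting for a batch to fill.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    batch_size: int
    batch_latency_ms: float  # time to run inference on one full batch


def throughput_per_second(m: Measurement) -> float:
    """Inferences completed per second at this batch size."""
    return m.batch_size / (m.batch_latency_ms / 1000.0)


def cost_per_million(m: Measurement, hourly_price_usd: float) -> float:
    """Instance cost of one million inferences at this batch size."""
    inferences_per_hour = throughput_per_second(m) * 3600
    return hourly_price_usd / inferences_per_hour * 1_000_000


def pick_batch_size(measurements, latency_sla_ms):
    """Return the highest-throughput measurement that meets the SLA, or None."""
    eligible = [m for m in measurements if m.batch_latency_ms <= latency_sla_ms]
    if not eligible:
        return None
    return max(eligible, key=throughput_per_second)


# Hypothetical measurements; replace them with numbers from your own load test.
measurements = [
    Measurement(1, 10.0),
    Measurement(2, 12.0),
    Measurement(4, 16.0),
    Measurement(8, 30.0),
    Measurement(16, 70.0),
]

best = pick_batch_size(measurements, latency_sla_ms=50.0)
print(best, cost_per_million(best, hourly_price_usd=1.0))
```

In this made-up data, throughput keeps rising up to a batch size of 8 and then drops at 16, which is the point where increasing the batch size stops paying off.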

TorchServe dynamic batching on SageMaker

TorchServe is the native PyTorch library for serving models in production at scale. It’s a joint development from Facebook and AWS. TorchServe lets you monitor models, add custom metrics, serve multiple models, scale the number of workers up and down through secure management APIs, and expose inference and explanation endpoints.

To support batch processing, TorchServe provides a dynamic batching feature. It aggregates the received requests within a specified time frame, batches them together, and sends the batch for inference. The received requests are processed through the handlers in TorchServe. TorchServe has several default handlers, and you’re welcome to author a custom handler if your use case isn’t covered. When using a custom handler, make sure that the batch inference logic has been implemented in the handler. An example of a custom handler with batch inference support is available on GitHub.
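To make the batch requirement concrete, the following is a minimal sketch of the shape such a handler takes. It deliberately doesn't import TorchServe, so it runs anywhere: the class name, the "text" field, and the word-count stand-in for a model are all hypothetical. What it illustrates is that the handler receives a list with one entry per batched request and must return a list of the same length, in the same order.

```python
import json


class BatchWordCountHandler:
    """Sketch of a handler with batch support (not a TorchServe base class)."""

    def preprocess(self, requests):
        # One entry per batched request; the payload sits under "data" or "body".
        texts = []
        for request in requests:
            payload = request.get("data") or request.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = payload.decode("utf-8")
            if isinstance(payload, str):
                payload = json.loads(payload)
            texts.append(payload["text"])  # "text" is a hypothetical field name
        return texts

    def inference(self, texts):
        # Stand-in for a single forward pass over the whole batch.
        return [len(text.split()) for text in texts]

    def postprocess(self, outputs):
        # One response per request, in the original order.
        return [{"num_words": n} for n in outputs]

    def handle(self, requests, context=None):
        return self.postprocess(self.inference(self.preprocess(requests)))


batch = [
    {"body": b'{"text": "hello world"}'},
    {"data": '{"text": "a b c"}'},
    {"body": {"text": "one"}},
]
print(BatchWordCountHandler().handle(batch))
```

A real handler would stack the preprocessed inputs into one tensor and call the model once for the whole batch instead of counting words.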

You can configure dynamic batching using two settings, batch_size and max_batch_delay, either through environment variables in SageMaker or through the config.properties file in TorchServe (if using a custom container). TorchServe sends a batch for inference as soon as either limit is reached: the batch contains the maximum number of requests (batch_size), or the time window to wait for requests (max_batch_delay) has elapsed.
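The interaction between the two settings can be illustrated with a simplified simulation. This is not TorchServe's implementation; it assumes the delay window starts when the first request of a batch arrives, and it uses milliseconds because the TorchServe documentation gives max_batch_delay in that unit. The function name and arrival times are hypothetical.

```python
def group_into_batches(arrival_times_ms, batch_size, max_batch_delay_ms):
    """Group request arrival times into batches, closing a batch when it is
    full or when the delay window opened by its first request has expired."""
    batches = []
    current = []
    window_start = None
    for t in sorted(arrival_times_ms):
        # The window expired before the batch filled up: send what we have.
        if current and t - window_start >= max_batch_delay_ms:
            batches.append(current)
            current = []
        if not current:
            window_start = t
        current.append(t)
        # The batch is full: send it without waiting for the window to end.
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0, 10, 20, 200, 210, 400]
print(group_into_batches(arrivals, batch_size=3, max_batch_delay_ms=100))
```

With these arrivals, the first batch closes because it reaches three requests, and the second closes with only two requests because the delay expires first.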

With the TorchServe integration in SageMaker, you can deploy PyTorch models natively by defining a SageMaker PyTorch model. You can add custom model loading, inference, and preprocessing and postprocessing logic in a script passed as an entry point to the SageMaker PyTorch model (see the following example code). Alternatively, you can use a custom container to deploy your models. For more information, see The SageMaker PyTorch Model Server.

You can set the batch size for PyTorch models on SageMaker through environment variables. If you choose to use a custom container, you can bundle settings in config.properties with your model when packaging your model in TorchServe. The following code snippet shows an example of how to set the batch size using environment variables and how to deploy a PyTorch model on SageMaker:

import sagemaker
from sagemaker.pytorch.model import PyTorchModel

# TorchServe batching settings, passed to the container as environment variables
env_variables_dict = {
    "SAGEMAKER_TS_BATCH_SIZE": "3",
    "SAGEMAKER_TS_MAX_BATCH_DELAY": "100000"
}

pytorch_model = PyTorchModel(
    model_data=model_artifact,  # S3 URI of the archived model files
    role=role,                  # IAM role that SageMaker assumes
    source_dir="code",
    framework_version="1.9",
    py_version="py38",          # assumed Python version for the PyTorch 1.9 container
    entry_point="inference.py",
    env=env_variables_dict
)

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.2xlarge",
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.BytesDeserializer()
)

In the code snippet, model_artifact refers to all the files required to load the trained model, archived in a .tar.gz file and uploaded to an Amazon Simple Storage Service (Amazon S3) bucket. The inference.py script is similar to a TorchServe custom handler; it has several functions that you can override to handle model initialization, preprocessing and postprocessing of received requests, and inference logic.
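As a rough illustration, an inference.py can override the four functions that the SageMaker PyTorch inference toolkit looks for: model_fn, input_fn, predict_fn, and output_fn. The sketch below keeps those function names but replaces the model with a hypothetical keyword scorer loaded from a weights.json file so that it runs without PyTorch; the "inputs" and "scores" field names are likewise assumptions. A real script would load a PyTorch model in model_fn and run a forward pass in predict_fn.

```python
import json
import os


def model_fn(model_dir):
    """Load the model from the directory where the artifact was extracted."""
    # Hypothetical artifact: a JSON file mapping words to weights.
    with open(os.path.join(model_dir, "weights.json")) as f:
        weights = json.load(f)

    def model(texts):
        return [sum(weights.get(w, 0.0) for w in t.lower().split()) for t in texts]

    return model


def input_fn(request_body, request_content_type):
    """Deserialize the request payload into model input."""
    if request_content_type != "application/json":
        raise ValueError("Unsupported content type: " + request_content_type)
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")
    return json.loads(request_body)["inputs"]


def predict_fn(input_data, model):
    """Run inference on the deserialized input."""
    return model(input_data)


def output_fn(prediction, response_content_type):
    """Serialize the prediction into the response payload."""
    return json.dumps({"scores": prediction})
```

Chaining the functions locally, as in output_fn(predict_fn(input_fn(body, "application/json"), model_fn(model_dir)), "application/json"), is a quick way to check the script before deploying it.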

The following notebook shows a full example of deploying a Hugging Face BERT model.

If you need a custom container, you can build a custom container image and push it to an Amazon Elastic Container Registry (Amazon ECR) repository. The model artifact in this case can be a TorchServe .mar file that bundles the model artifacts along with the handler. We demonstrate this in the next section, where we use a SageMaker batch transform job.

SageMaker batch transform job

For offline use cases where requests are batched from a data source such as a dataset, SageMaker provides batch transform jobs. These jobs enable you to read data from an S3 bucket and write the results to a target S3 bucket. For more information, see Use Batch Transform to Get Inferences from Large Datasets. A full example of batch inference using batch transform jobs can be found in the following notebook, where we use a machine translation model from the FLORES competition. In this example, we show how to use a custom container to score our model using SageMaker. Using a custom inference container allows you to further customize your TorchServe configuration. In this example, we want to change the batch size and disable JSON decoding, which we can do through the TorchServe config.properties file.

When using a custom handler for TorchServe, we need to make sure that the handler implements the batch inference logic. Each handler can have custom functions to perform preprocessing, inference, and postprocessing. An example of a custom handler with batch inference support is available on GitHub.

We use our custom container to bundle the model artifacts with the handler as we do in TorchServe (making a .mar file). We also need an entry point to the Docker container that starts TorchServe with the batch size and JSON decoding set in config.properties. We demonstrate this in the example notebook.
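For orientation, the following sketch renders the kind of config.properties such an entry point might write before starting TorchServe. The property names (decode_input_request, models, batchSize, maxBatchDelay, and so on) come from the TorchServe configuration documentation, but the model name, .mar file name, model store path, ports, and all values are assumptions chosen for illustration; the example notebook remains the reference for the actual settings.

```python
import json


def render_config_properties(model_name, mar_file, batch_size, max_batch_delay_ms):
    """Build the text of a TorchServe config.properties file (sketch)."""
    models = {
        model_name: {
            "1.0": {
                "defaultVersion": True,
                "marName": mar_file,
                "minWorkers": 1,
                "maxWorkers": 1,
                "batchSize": batch_size,
                "maxBatchDelay": max_batch_delay_ms,
                "responseTimeout": 120,
            }
        }
    }
    lines = [
        # SageMaker sends inference traffic to port 8080 of the container.
        "inference_address=http://0.0.0.0:8080",
        "management_address=http://0.0.0.0:8081",
        # Assumed location of the extracted model artifact in the container.
        "model_store=/opt/ml/model",
        "load_models=all",
        # Hand raw bytes to the handler instead of decoded JSON.
        "decode_input_request=false",
        "models=" + json.dumps(models),
    ]
    return "\n".join(lines) + "\n"


# "flores" and "flores.mar" are hypothetical names.
print(render_config_properties("flores", "flores.mar", batch_size=8, max_batch_delay_ms=100))
```

The entry point would write this text to a file and pass it to TorchServe through its --ts-config option when starting the server.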

The SageMaker batch transform job requires access to the input files in an S3 bucket; it divides the input files into mini batches and sends them for inference. Consider the following points when configuring the batch transform job:

Place the input files (such as a dataset) in an S3 bucket and set it as a data source in the job settings.
Assign an S3 bucket in which to save the results of the batch transform job.
Set BatchStrategy to MultiRecord and SplitType to Line if you need the batch transform job to make mini batches from the input file. If the job can’t automatically split the dataset into mini batches, you can divide it yourself by putting each batch in a separate input file, placed in the data source S3 bucket.
Make sure that the batch size fits into memory. SageMaker usually handles this automatically; however, when dividing batches manually, you need to tune the batch size based on the available memory.
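When dividing batches manually, one possible approach is to write each mini batch to its own JSON Lines file and then upload the files to the input prefix. The helper below is a hypothetical sketch of that splitting step only; the file naming scheme is an assumption, and the upload to Amazon S3 is left out.

```python
import json
from pathlib import Path


def write_mini_batches(records, out_dir, batch_size):
    """Write records to JSON Lines files holding at most batch_size records each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        # Hypothetical naming scheme: batch-00000.jsonl, batch-00001.jsonl, ...
        path = out_dir / f"batch-{start // batch_size:05d}.jsonl"
        path.write_text(
            "".join(json.dumps(record) + "\n" for record in chunk),
            encoding="utf-8",
        )
        paths.append(path)
    return paths
```

For example, write_mini_batches(records, "batches", batch_size=32) produces files that each fit a batch of 32; choose the size so that one file's worth of inputs fits into the memory of the instance.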

The following code is an example for a batch transform job:

import time

import boto3

sm = boto3.client("sagemaker")

# S3 bucket names must be lowercase
s3_bucket_name = 'sagemaker-us-west-2-XXXXXXXX'
batch_input = f"s3://{s3_bucket_name}/folder/jobname_TorchServe_SageMaker/"
batch_output = f"s3://{s3_bucket_name}/folder/jobname_TorchServe_SageMaker_output/"

# Append a timestamp so that each job name is unique
batch_job_name = 'job-batch-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

# model_name is the name of the SageMaker model created from the custom container
request = {
    "ModelClientConfig": {
        "InvocationsTimeoutInSeconds": 3600,
        "InvocationsMaxRetries": 1,
    },
    "TransformJobName": batch_job_name,
    "ModelName": model_name,
    "MaxConcurrentTransforms": 1,
    # Send several records per request, split from the input files line by line
    "BatchStrategy": "MultiRecord",
    "TransformOutput": {"S3OutputPath": batch_output, "AssembleWith": "Line", "Accept": "application/json"},
    "TransformInput": {
        "DataSource": {
            "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": batch_input}
        },
        "SplitType": "Line",
        "ContentType": "application/json",
    },
    "TransformResources": {"InstanceType": "ml.p2.xlarge", "InstanceCount": 1},
}

When we use the preceding settings and launch our transform job, it reads the input files from the source S3 bucket in batches and sends them for inference. The results are written to the S3 location specified for the output.

The following code snippet shows how to create and launch a job using the preceding settings:

# sm is a boto3 SageMaker client: boto3.client("sagemaker")
sm.create_transform_job(**request)

# Poll every 30 seconds until the job reaches a terminal state
while True:
    response = sm.describe_transform_job(TransformJobName=batch_job_name)
    status = response["TransformJobStatus"]
    if status in ("Completed", "Stopped"):
        print("Transform job ended with status: " + status)
        break
    if status == "Failed":
        message = response["FailureReason"]
        print("Transform failed with the following error: {}".format(message))
        raise Exception("Transform job failed")
    print("Transform job is still in status: " + status)
    time.sleep(30)

Conclusion

In this post, we reviewed the two modes SageMaker offers for online and offline inference. The former uses dynamic batching provided in TorchServe to batch the requests from multiple clients. The latter uses a SageMaker transform job to batch the requests from input files in an S3 bucket and run inference.

We also showed how to serve models on SageMaker using the native SageMaker PyTorch inference toolkit container images, and how to use custom containers for use cases that require advanced TorchServe configuration settings.

As TorchServe continues to evolve to address the needs of the PyTorch community, new features are integrated into SageMaker to provide performant ways for serving models in production. For more information, check out the TorchServe GitHub repo and the SageMaker examples.

About the Authors

Phi Nguyen is a solutions architect at AWS helping customers with their cloud journey with a special focus on data lakes, analytics, semantic technologies, and machine learning. In his spare time, you can find him biking to work, coaching his son’s soccer team, or enjoying nature walks with his family.

Nikhil Kulkarni is a software developer with AWS Machine Learning, focusing on making machine learning workloads more performant on the cloud, and is a co-creator of AWS Deep Learning Containers for training and inference. He’s passionate about distributed deep learning systems. Outside of work, he enjoys reading books, fiddling with the guitar, and making pizza.

Hamid Shojanazeri is a Partner Engineer at PyTorch working on OSS high-performance model optimization and serving. Hamid holds a Ph.D. in computer vision and worked as a researcher in multimedia labs in Australia and Malaysia and as an NLP lead at Opus.ai. He likes to find simpler solutions to hard problems and is an art enthusiast in his spare time.

Geeta Chauhan leads AI Partner Engineering at Meta AI with expertise in building resilient, anti-fragile, large-scale distributed platforms for startups and Fortune 500s. Her team works with strategic partners, machine learning leaders across the industry, and all major cloud service providers to build and launch new AI product services and experiences, and to take PyTorch models from research to production. She is a winner of Women in IT – Silicon Valley – CTO of the year 2019, an ACM Distinguished Speaker, and a thought leader on topics including ethics in AI, deep learning, blockchain, and IoT. She is passionate about promoting the use of AI for Good.