Calculate inference units for an Amazon Rekognition Custom Labels model

Amazon Rekognition Custom Labels allows you to extend the object and scene detection capabilities of Amazon Rekognition to extract information from images that is uniquely helpful to your business. For example, you can find your logo in social media posts, identify your products on store shelves, classify machine parts in an assembly line, distinguish between healthy and infected plants, or detect your animated characters in videos.

Amazon Rekognition Custom Labels provides a simple end-to-end experience where you start by labeling a dataset. Amazon Rekognition Custom Labels then builds a custom machine learning (ML) model for you by inspecting the data and selecting the right ML algorithm. After your model is trained, you can start using it immediately for image analysis. You start a model by calling the StartProjectVersion API and providing the minimum number of inference units (IUs) to use. A single IU represents one unit of compute power. The number of images you can process with a single IU in a certain time frame depends on many factors, such as the size of the images processed and the complexity of the custom model. As you start the model, you can provide a higher number of IUs to increase the transactions per second (TPS) throughput of your model. Amazon Rekognition Custom Labels then provisions multiple compute resources in parallel to process your images more quickly.

However, determining the right number of IUs for your workload is tricky: over-provisioning IUs incurs unnecessary cost, and under-provisioning causes requests to exceed the provisioned throughput. Because guidance on calculating the appropriate number of IUs is scarce, some customers over-provision IUs to ensure their workloads run without exception errors, which can be quite costly. Other customers spend a lot of time adding IUs until their workloads run smoothly. In this post, we show you how to calculate the IUs needed to meet your workload performance requirement at the lowest possible cost.

Understanding inference units

For this post, we use a commonly seen customer scenario to explain the concept of IUs.

For a specific use case in object/scene detection or classification, you train an Amazon Rekognition Custom Labels model. After the model is trained, you need to start the model for inference. Let’s assume you start your custom model at 2:00 PM, stop it at 5:00 PM, and provision 1 IU; your total inference hours billed is then 3 hours. Assuming that the model lets you analyze five images per second (5 TPS), you can process 54,000 (5 * 3,600 * 3) images in 3 hours.

Now let’s assume that you want to process twice the number of images (108,000) in 3 hours using the same model. You can start the model for the same duration of 3 hours and provision 2 IUs. Similarly, if you need to process 54,000 images but want to reduce the processing time from 3 hours to 1 hour, you can provision 3 IUs, which means processing 15 images per second and then stopping the model after 1 hour. Your total inference hours billed would be 6 hours (3 hours elapsed * 2 IUs) and 3 hours (1 hour elapsed * 3 IUs), respectively. Because your throughput and cost are based on the provisioned IUs per hour, it’s important to calculate the right IUs needed for your workload.
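The arithmetic in these scenarios can be sketched in a few lines of Python. The function names are hypothetical, the 5 TPS per IU figure is the illustrative value used above, and the sketch assumes throughput scales linearly with the number of IUs, as the scenarios do.

```python
def images_processed(tps_per_iu, inference_units, hours):
    """Images processed, assuming throughput scales linearly with IUs."""
    return tps_per_iu * inference_units * 3600 * hours


def billed_inference_hours(inference_units, hours):
    """Inference hours billed: elapsed hours multiplied by provisioned IUs."""
    return inference_units * hours


# The three scenarios from the text, assuming 5 TPS per IU (illustrative).
for ius, hours in [(1, 3), (2, 3), (3, 1)]:
    print(
        f"{ius} IU(s) for {hours} h: "
        f"{images_processed(5, ius, hours)} images, "
        f"{billed_inference_hours(ius, hours)} inference hours billed"
    )
```

Running the loop reproduces the figures above: 54,000 images for 3 billed hours, 108,000 images for 6 billed hours, and 54,000 images for 3 billed hours.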

After you train an Amazon Rekognition Custom Labels model, you can start the model with one or more IUs. If your request load is higher than the maximum TPS supported by the provisioned IUs, Amazon Rekognition Custom Labels returns a ProvisionedThroughputExceededException for all requests over the max TPS, which indicates that the model is fully utilized. In general, max TPS depends on the trained custom model, the input images, and the number of IUs provisioned. Therefore, you can determine the required IUs by measuring max TPS. To do this, start the model with 1 IU and progressively increase input requests until ProvisionedThroughputExceededException is raised. After you get the max TPS throughput of the model, you can use it to calculate the overall IUs needed for your workload. For example, if the max TPS throughput is 5 TPS and you need to process 15 images per second, you have to start the model with 3 IUs.
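As a minimal sketch of that last calculation, the required number of IUs is the target throughput divided by the measured max TPS of a single IU, rounded up. The function name is hypothetical, and the sketch assumes throughput scales linearly with IUs, which is the assumption the example above makes.

```python
import math


def required_inference_units(target_tps, max_tps_per_iu):
    """Smallest whole number of IUs whose combined max TPS covers target_tps."""
    if target_tps <= 0 or max_tps_per_iu <= 0:
        raise ValueError("target_tps and max_tps_per_iu must be positive")
    # Round up: a fractional IU cannot be provisioned.
    return math.ceil(target_tps / max_tps_per_iu)


print(required_inference_units(15, 5))  # 3, as in the example above
print(required_inference_units(16, 5))  # 4, because 3 IUs would fall short
```

Rounding up matters when the target is not an exact multiple of the measured max TPS: provisioning 3 IUs for 16 images per second would leave some requests throttled.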

Solution overview

In the following sections, we discuss how to calculate max TPS throughput of an Amazon Rekognition Custom Labels model. Then you can calculate the exact IUs needed as you start the model to process your images.

We walk through the following high-level steps:

Train a model using Amazon Rekognition Custom Labels.
Start your model.
Launch an Amazon Elastic Compute Cloud (Amazon EC2) instance and set up your test environment.
Create a test script.
Add sample image(s) and run the script.
Review the program output.

Train a model using Amazon Rekognition Custom Labels

You start by training a model in Amazon Rekognition Custom Labels for your use case. To learn more about how to create, train, evaluate, and use a model that detects objects, scenes, and concepts in images, refer to Getting started with Amazon Rekognition Custom Labels.

Start your model

After your model is trained, start the model with 1 IU. You can use the following command from the AWS Command Line Interface (AWS CLI) to start your model:

aws rekognition start-project-version \
  --project-version-arn <project-version-arn> \
  --min-inference-units 1

In addition to the AWS CLI, the following code snippet shows how you can also use the API to start your Amazon Rekognition Custom Labels model:

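import boto3

# Create a Rekognition client and start the model with 1 IU
client = boto3.client('rekognition')
response = client.start_project_version(
    ProjectVersionArn='string',  # replace with your model's project version ARN
    MinInferenceUnits=1,
)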
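Starting a model is asynchronous, so it has to reach the RUNNING status before the load test can call it. The following is a minimal sketch of polling for that status with the DescribeProjectVersions API. The function name, the polling interval, and the retry limit are assumptions for illustration, and the client is passed in so the function is not tied to a particular Region or set of credentials.

```python
import time


def wait_until_running(client, project_arn, version_name,
                       poll_seconds=30, max_polls=60):
    """Poll DescribeProjectVersions until the model version reports RUNNING.

    `client` is a boto3 Rekognition client, for example
    boto3.client('rekognition'). Raises RuntimeError if the version leaves
    the STARTING state without reaching RUNNING, and TimeoutError if it is
    still starting after max_polls checks.
    """
    for _ in range(max_polls):
        response = client.describe_project_versions(
            ProjectArn=project_arn,
            VersionNames=[version_name],
        )
        descriptions = response["ProjectVersionDescriptions"]
        if not descriptions:
            raise RuntimeError(f"No model version named {version_name!r}")
        status = descriptions[0]["Status"]
        if status == "RUNNING":
            return status
        if status != "STARTING":
            raise RuntimeError(f"Model entered unexpected status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("Model did not reach RUNNING in time")
```

You would call it after start_project_version, passing the same client along with your project ARN and model version name.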

Launch an EC2 instance and set up your test environment

Launch an EC2 instance to run a script that uses a sample image to call the model you started in the previous step. You can follow the steps in the quick start guide to launch an EC2 instance. Although the guide uses an instance type of t2.micro, you should use a compute-optimized instance type such as C5 to run this test.

After you connect to the EC2 instance, run the following commands from the terminal to install the required dependencies:

sudo yum install python3
sudo yum install gcc
sudo yum install python3-devel
sudo pip3 install locust
sudo pip3 install boto3

Create a test script

Create a Python file named tps.py with the following code:

from os import walk
import inspect
import time
import functools

import gevent.monkey

gevent.monkey.patch_all()

import argparse
from pathlib import Path
import boto3
from botocore.config import Config
from locust import task, constant_pacing, events, LoadTestShape
from locust.env import Environment
from locust.stats import stats_printer, stats_history
from locust.log import setup_logging
from locust import User

setup_logging("INFO", None)

'''
This script will:
1. Read list of images from a base path
2. Run step load
3. Print the stats
4. Stop runs when non-zero failure rate is observed
5. Use last run to calculate maximum TPS
'''

image_base_path = None
project_version_arn = None
aws_region = None


class WebserviceUser(User):

    wait_time = constant_pacing(0)

    def detection_tests(self, image_path):
        try:
            start_time = time.time()
            with open(image_path, 'rb') as image:
                r = self.client.detect_custom_labels(ProjectVersionArn=
                                                     project_version_arn, Image={'Bytes': image.read()})
        except Exception as exception:
            total_time = int((time.time() - start_time) * 1000)
            print(exception)
            self.environment.events.request_failure.fire(request_type="GET",
                                                         name=inspect.stack()[0][3], response_time=total_time,
                                                         response_length=0, exception=exception)
        else:
            total_time = int((time.time() - start_time) * 1000)
            self.environment.events.request_success.fire(request_type="GET", name=inspect.stack()[0][3],
                                                         response_time=total_time, response_length=0)

    def __init__(self, *args, **kwargs):
        for (dirpath, dirnames, filenames) in walk(image_base_path):
            self.images = [Path(image_base_path) / filename for filename in filenames]
            break
        super(WebserviceUser, self).__init__(*args, **kwargs)
        config = Config(
            retries={
                'max_attempts': 1,
                'mode': 'standard'
            }
        )
        self.client = boto3.client(
            'rekognition',
            aws_region,
            config=config
        )

        def detection_tests(image_path, arg):
            try:
                start_time = time.time()
                with open(image_path, 'rb') as image:
                    r = self.client.detect_custom_labels(ProjectVersionArn=
                                                         project_version_arn, Image={'Bytes': image.read()})
            except Exception as exception:
                total_time = int((time.time() - start_time) * 1000)
                print(f'image: {image_path}, exception: {exception}')
                self.environment.events.request_failure.fire(request_type="GET",
                                                             name=inspect.stack()[0][3], response_time=total_time,
                                                             response_length=0, exception=exception)
            else:
                total_time = int((time.time() - start_time) * 1000)
                self.environment.events.request_success.fire(request_type="GET", name=inspect.stack()[0][3],
                                                             response_time=total_time, response_length=0)

        self.tasks = [functools.partial(detection_tests, image) for image in self.images]

def run_load(user_count, spawn_rate):
    # setup Environment and Runner
    env = Environment(user_classes=[WebserviceUser])
    local_runner = env.create_local_runner()

    # start a WebUI instance
    env.create_web_ui("127.0.0.1", 8089)

    # start a greenlet that periodically outputs the current stats
    gevent.spawn(stats_printer(env.stats))

    # start a greenlet that save current stats to history
    gevent.spawn(stats_history, env.runner)

    # start the test
    env.runner.start(user_count, spawn_rate=spawn_rate)

    # in 30 seconds stop the runner
    gevent.spawn_later(30, lambda: env.runner.quit())

    # wait for the greenlets
    env.runner.greenlet.join()

    # stop the web server for good measures
    env.web_ui.stop()

    # Sleep so that history is up to date
    time.sleep(5)

    # NOTE: Max TPS calculated from last run. 
    last_stats = env.stats.history[-1]
    max_tps = last_stats['current_rps'] - last_stats['current_fail_per_sec']
    p95_latency = last_stats['response_time_percentile_95']
    failure_tps = last_stats['current_fail_per_sec']
    print(f'Max supported TPS: {max_tps}')
    print(f'95th percentile response time: {p95_latency}')
    return max_tps, p95_latency, failure_tps


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Script to find the max TPS supported by a project version')
    parser.add_argument('--images', type=str, help='path to folder with images', required=True)
    parser.add_argument('--project-version-arn', type=str, help='Project Version arn to run loadtest against', required=True)
    parser.add_argument('--region', type=str, help='AWS Region of the project version', required=True)
    args = parser.parse_args()
    image_base_path = args.images
    project_version_arn = args.project_version_arn
    aws_region = args.region

    user_count = 10
    failure_tps = 0
    # NOTE: If max TPS is not reached in 3 iterations the customer might be running with >1 IU
    max_iterations = 3
    # NOTE: Advanced users can replace with custom shape & LocalRunner to find maxima
    # https://docs.locust.io/en/stable/generating-custom-load-shape.html
    while failure_tps <= 0 and max_iterations > 0:
        max_tps, p95_latency, failure_tps = run_load(user_count, user_count/10)
        user_count *= 2
        max_iterations -= 1

Add sample image(s) and run the script

Create a folder named images and add at least one sample image. This folder of images is used for inference using the model trained with Amazon Rekognition Custom Labels.

Run the Python script to calculate model TPS throughput:

python3 ./tps.py --images ./images --project-version-arn <project-version-arn> --region <region>

Review the program output

This program gradually increases the load of requests using the representative images to maximally utilize the model. It runs multiple simulated users concurrently and adds new users over time. It runs for a few minutes and prints tabular statistics, including the number of requests, number of failed requests, latency statistics (average, min, max, median), average requests per second, and average failures per second. When it completes all the runs, it prints the max TPS (Max supported TPS) and the 95th percentile latency in milliseconds (95th percentile response time).
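The script counts every exception as a failure, so when you read the printed exceptions it helps to confirm that they are throttling errors and not something else, such as an unreadable image. The following sketch checks the error code carried on a botocore ClientError. The helper name is hypothetical, and it inspects only the `response` attribute, so any exception without one is treated as not throttling.

```python
THROTTLE_CODE = "ProvisionedThroughputExceededException"


def is_throughput_exceeded(exception):
    """Return True if the exception carries the throttling error code.

    botocore's ClientError stores the service error under
    exception.response['Error']['Code']; other exceptions have no
    `response` attribute and are reported as False.
    """
    response = getattr(exception, "response", None) or {}
    return response.get("Error", {}).get("Code") == THROTTLE_CODE
```

If the failures in the final run are not throttling errors, the reported max TPS likely reflects a problem with the test inputs or permissions and not the capacity of the model.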

Let’s assume that you get max TPS throughput of the model as 3, and you plan to process 12 images per second. In this case, you can start the model with 4 IUs to achieve the desired throughput.

Conclusion

In this post, we showed how to calculate the IUs needed to meet your workload performance requirement at the lowest possible cost. By right-sizing the IUs for your model, you can process images at the required throughput without paying for over-provisioned resources. To learn more about Amazon Rekognition Custom Labels use cases and other features, refer to Key features.

In addition to optimizing the IU, if your workload requires processing images in batches (such as once a day or week, or at scheduled times during the day), you can provision your custom model at scheduled times. The post Batch image processing with Amazon Rekognition Custom Labels shows you how to build a cost-optimal batch solution with Amazon Rekognition Custom Labels that provisions your custom model at scheduled times, processes all your images, and deprovisions your resources to avoid incurring extra cost.
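As a rough illustration of provisioning only for the duration of a batch, the following sketch wraps the start and stop calls in a context manager so the model is stopped even if processing fails. It is not the solution from the referenced post. The function name is hypothetical, the client is a boto3 Rekognition client passed in by the caller, and the sketch does not wait for the model to reach the RUNNING status before yielding, which a real batch job would need to do.

```python
from contextlib import contextmanager


@contextmanager
def provisioned_model(client, project_version_arn, min_inference_units=1):
    """Start the model on entry and always stop it on exit."""
    client.start_project_version(
        ProjectVersionArn=project_version_arn,
        MinInferenceUnits=min_inference_units,
    )
    try:
        # The caller processes its batch here (after the model is RUNNING).
        yield
    finally:
        # Stop the model so no further inference hours are billed.
        client.stop_project_version(ProjectVersionArn=project_version_arn)
```

Stopping the model in the `finally` clause keeps a failed batch from leaving IUs provisioned and accruing inference hours.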

About the Authors

Ditesh Kumar is a Software Developer working for Amazon Rekognition Custom Labels. He is focused on building scalable Computer Vision services and adding new features to enhance usability and adoption of Custom Labels. In his spare time Ditesh is a big fan of hiking and traveling and enjoys spending time with his family.

Kashif Imran is a Principal Solutions Architect at Amazon Web Services. He works with some of the largest AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement computer vision applications at scale. His expertise spans application architecture, serverless, containers, NoSQL, and machine learning.

Sherry Ding is a Senior AI/ML Specialist Solutions Architect. She has extensive experience in machine learning with a PhD degree in Computer Science. She mainly works with Public Sector customers on various AI/ML related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.