Model serving made easier with Deep Java Library and AWS Lambda
Developing and deploying a deep learning model involves many steps: gathering and cleansing data, designing the model, fine-tuning model parameters, evaluating the results, and going through it again until a desirable result is achieved. Then comes the final step: deploying the model.
AWS Lambda is one of the most cost-effective services that lets you run code without provisioning or managing servers. It offers many advantages when working with serverless infrastructure. When you break down the logic of your deep learning service into a single Lambda function for a single request, things become much simpler and easier to scale. You can forget about the resource handling needed for parallel requests coming into your model. If your usage is sparse and can tolerate higher latency, Lambda is a great choice among the various solutions.
Now, let’s say you’ve decided to use Lambda to deploy your model. You go through the process, but the various setup steps needed to run your models become confusing or complex. Namely, you face issues with the Lambda size limits and with managing the model dependencies inside the deployment package.
Deep Java Library (DJL) is a deep learning framework designed to make your life easier. DJL uses various deep learning backends (such as Apache MXNet, PyTorch, and TensorFlow) for your use case and is easy to set up and integrate within your Java application! Thanks to its excellent dependency management design, DJL makes it extremely simple to create a project that you can deploy on Lambda. DJL helps alleviate some of the problems we mentioned by downloading the prepackaged framework dependencies so you don’t have to package them yourself, and loads your models from a specified location such as Amazon Simple Storage Service (Amazon S3) so you don’t need to figure out how to push your models to Lambda.
This post covers how to get your models running on Lambda with DJL in 5 minutes.
About DJL
Deep Java Library (DJL) is a deep learning framework written in Java that supports both training and inference. DJL is built on top of modern deep learning engines (such as TensorFlow, PyTorch, and MXNet). You can easily use DJL to train your model or deploy your favorite models from a variety of engines without any additional conversion. It contains a powerful model zoo design that allows you to manage trained models and load them in a single line. The built-in model zoo currently supports more than 70 pre-trained and ready-to-use models from GluonCV, HuggingFace, TorchHub, and Keras.
Prerequisites
You need the following items to proceed:
- An AWS account with access to Lambda
- The AWS Command Line Interface (AWS CLI) installed on your system and configured with your credentials and Region
- A Java environment set up on your system
In this post, we follow along with the steps from the following GitHub repo.
Building and deploying on AWS
First, we need to make sure we’re in the correct code directory. We then create an S3 bucket for storage, an AWS CloudFormation stack, and the Lambda function.
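The deployment is driven by a Gradle task in the companion repository; a minimal sketch of the commands, assuming the demo project’s directory name and its Gradle wrapper, looks like this:

```bash
# Move into the example project (directory name assumed from the demo repository layout)
cd lambda-model-serving

# The deploy task packages the function, creates the S3 bucket, uploads the
# artifact, and deploys the CloudFormation stack
./gradlew deploy
```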
This creates the following:
- An S3 bucket with the name stored in bucket-name.txt
- A CloudFormation stack named djl-lambda and a template file named out.yml
- A Lambda function named DJL-Lambda
Now we have our model deployed on a serverless API. The next section invokes the Lambda function.
Invoking the Lambda function
We can invoke the Lambda function from the AWS CLI.
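A sketch of the invocation, assuming the payload key and sample image URL used by the demo repository (check its README for the exact request format):

```bash
aws lambda invoke \
    --function-name DJL-Lambda \
    --payload '{"inputImageUrl": "https://djl-ai.s3.amazonaws.com/resources/images/kitten.jpg"}' \
    build/output.json
# With AWS CLI v2, you may also need: --cli-binary-format raw-in-base64-out
```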
The output is stored in build/output.json.
Cleaning up
Use the cleanup scripts in the repository to remove the resources and tear down the services created in your AWS account.
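The exact scripts live in the demo repository; an equivalent manual teardown with the AWS CLI, assuming the stack and bucket names created during deployment, looks roughly like this:

```bash
# Delete the CloudFormation stack (this also removes the Lambda function)
aws cloudformation delete-stack --stack-name djl-lambda

# Delete the S3 bucket created during deployment, including its contents
aws s3 rb "s3://$(cat bucket-name.txt)" --force
```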
Cost analysis
What happens if we try to set this up on an Amazon Elastic Compute Cloud (Amazon EC2) instance and compare the cost to Lambda? An EC2 instance needs to run continuously so that it can receive requests at any time, which means you’re paying for the idle time when it isn’t in use. If we use a cheap t3.micro instance with 2 vCPUs and 1 GB of memory (knowing that some of this memory is used by the operating system and other tasks), the cost comes out to $7.48 a month, equivalent to about 1.6 million requests to Lambda. A more powerful instance such as a t3.medium, with 2 vCPUs and 4 GB of memory, comes out to $29.95 a month, equivalent to about 2.57 million requests to Lambda.
There are pros and cons to using either Lambda or Amazon EC2 for hosting, and the choice comes down to requirements and cost. Lambda is the ideal choice if your usage is sparse and your requirements tolerate higher latency: when the function isn’t used frequently, a cold start adds around 5 seconds to the first call, but Lambda is cheaper than Amazon EC2 at low request volumes. Subsequent requests are faster, but if the function sits idle for 30–45 minutes, it returns to cold-start mode.
Amazon EC2, on the other hand, is better if you require low-latency calls all the time, or if you’re making enough requests that the Lambda cost exceeds the EC2 cost (shown in the following chart).
Minimal package size
DJL automatically downloads the deep learning framework at runtime, allowing for a smaller package size. We rely on DJL’s MXNet auto-detection dependency for this.
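A sketch of the corresponding Gradle dependency block, assuming the MXNet auto-detection artifact; the BOM version shown is illustrative:

```groovy
dependencies {
    // DJL bill of materials keeps the DJL artifact versions aligned
    // (the version here is illustrative; use a current release)
    implementation platform("ai.djl:bom:0.8.0")

    // MXNet model zoo for loading pre-trained models
    implementation "ai.djl.mxnet:mxnet-model-zoo"

    // Auto-detection artifact: downloads the matching MXNet native library
    // at runtime instead of bundling it into the deployment package
    runtimeOnly "ai.djl.mxnet:mxnet-native-auto"
}
```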
This auto-detection dependency results in a .zip file of less than 3 MB. The downloaded MXNet native library is stored in the /tmp folder and takes up about 155 MB of space. We can further reduce this to 50 MB if we use a custom build of MXNet without MKL support.
The MXNet native library is stored in an S3 bucket, and the framework download latency is negligible when compared to the Lambda startup time.
Model loading
The DJL model zoo offers many easy options to deploy models (a short Java loading sketch follows this list):
- Bundling the model in a .zip file
- Loading models from a custom model zoo
- Loading models from an S3 bucket (supports Amazon SageMaker trained model .tar.gz format)
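To give a feel for the API, here is a minimal Java sketch of loading a classification model through DJL’s Criteria builder; the artifact ID, image URL, and exact builder calls are illustrative and may differ slightly across DJL versions:

```java
import ai.djl.Application;
import ai.djl.inference.Predictor;
import ai.djl.modality.Classifications;
import ai.djl.modality.cv.Image;
import ai.djl.modality.cv.ImageFactory;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

public class LoadModelSketch {

    public static void main(String[] args) throws Exception {
        // Describe the model we want: an image classification model picked
        // from the model zoo by artifact ID ("resnet" is illustrative).
        // To load from an S3 bucket instead, point the criteria at the model
        // with optModelUrls("s3://your-bucket/model.tar.gz") (requires the
        // DJL AWS extension on the classpath).
        Criteria<Image, Classifications> criteria = Criteria.builder()
                .setTypes(Image.class, Classifications.class)
                .optApplication(Application.CV.IMAGE_CLASSIFICATION)
                .optArtifactId("resnet")
                .build();

        // Load the model and run a single prediction on a sample image
        try (ZooModel<Image, Classifications> model = criteria.loadModel();
             Predictor<Image, Classifications> predictor = model.newPredictor()) {
            Image img = ImageFactory.getInstance()
                    .fromUrl("https://resources.djl.ai/images/kitten.jpg");
            Classifications result = predictor.predict(img);
            System.out.println(result);
        }
    }
}
```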
We use the MXNet model zoo to load the model. By default, because we didn’t specify any model, it uses the resnet-18 model, but you can change this by passing an artifactId parameter in the request.
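For example, a request along the following lines selects a different model from the zoo; the artifact ID and image URL are illustrative:

```bash
aws lambda invoke \
    --function-name DJL-Lambda \
    --payload '{"inputImageUrl": "https://djl-ai.s3.amazonaws.com/resources/images/kitten.jpg", "artifactId": "squeezenet"}' \
    build/output.json
```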
Limitations
There are certain limitations when using serverless APIs, specifically in AWS Lambda:
- GPU instances are not yet available, as of this writing
- Lambda has a 512 MB size limit for the /tmp folder
- If the endpoint isn’t frequently used, cold startup can be slow
As mentioned earlier, hosting your models on Lambda is ideal when requests are sparse and your requirements allow for higher-latency calls due to the Lambda cold startup. If you require low latency for all requests, we recommend using AWS Elastic Beanstalk with EC2 instances.
Conclusion
In this post, we demonstrated how to easily launch serverless APIs using DJL. To do so, we just need to run the Gradle deployment command, which creates the S3 bucket, CloudFormation stack, and Lambda function, and exposes an endpoint that accepts parameters to run your own deep learning models.
Deploying your models with DJL on Lambda is a great, cost-effective method when usage is sparse and your application can tolerate the higher latency of Lambda cold starts. Using DJL allows your team to focus more on designing, building, and improving your ML models, while keeping costs low and the deployment process easy and scalable.
For more information on DJL and its other features, see Deep Java Library.
Follow our GitHub repo, demo repository, Slack channel, and Twitter for more documentation and examples of DJL!
About the Author
Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.