Amazon EC2 DL2q instance for cost-efficient, high-performance AI inference is now generally available
This is a guest post by A.K Roy from Qualcomm AI.
Amazon Elastic Compute Cloud (Amazon EC2) DL2q instances, powered by Qualcomm AI 100 Standard accelerators, can be used to cost-efficiently deploy deep learning (DL) workloads in the cloud. They can also be used to develop and validate the performance and accuracy of DL workloads that will be deployed on Qualcomm devices. DL2q instances are the first instances to bring Qualcomm’s artificial intelligence (AI) technology to the cloud.
With eight Qualcomm AI 100 Standard accelerators and 128 GiB of total accelerator memory, customers can also use DL2q instances to run popular generative AI applications, such as content generation, text summarization, and virtual assistants, as well as classic AI applications for natural language processing and computer vision. Additionally, Qualcomm AI 100 accelerators feature the same AI technology used across smartphones, autonomous driving, personal computers, and extended reality headsets, so DL2q instances can be used to develop and validate these AI workloads before deployment.
New DL2q instance highlights
Each DL2q instance incorporates eight Qualcomm Cloud AI100 accelerators, with an aggregated performance of over 2.8 PetaOps of Int8 inference and 1.4 PetaFlops of FP16 inference. The instance has an aggregate of 112 AI cores, 128 GB of accelerator memory, and 1.1 TB per second of accelerator memory bandwidth.
Each DL2q instance has 96 vCPUs and 768 GB of system memory, and supports 100 Gbps of networking bandwidth as well as 19 Gbps of Amazon Elastic Block Store (Amazon EBS) bandwidth.
| Instance name | vCPUs | Cloud AI100 accelerators | Accelerator memory | Accelerator memory BW (aggregated) | Instance memory | Instance networking | Storage (Amazon EBS) bandwidth |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DL2q.24xlarge | 96 | 8 | 128 GB | 1.088 TB/s | 768 GB | 100 Gbps | 19 Gbps |
Qualcomm Cloud AI100 accelerator innovation
The Cloud AI100 accelerator system-on-chip (SoC) is a purpose-built, scalable multi-core architecture, supporting a wide range of deep-learning use-cases spanning from the datacenter to the edge. The SoC employs scalar, vector, and tensor compute cores with an industry-leading on-die SRAM capacity of 126 MB. The cores are interconnected with a high-bandwidth low-latency network-on-chip (NoC) mesh.
The AI100 accelerator supports a broad and comprehensive range of models and use-cases. The following table highlights the range of model support.
| Model category | Number of models | Examples |
| --- | --- | --- |
| NLP | 157 | BERT, BART, FasterTransformer, T5, Z-code MOE |
| Generative AI – NLP | 40 | LLaMA, CodeGen, GPT, OPT, BLOOM, Jais, Luminous, StarCoder, XGen |
| Generative AI – Image | 3 | Stable Diffusion v1.5 and v2.1, OpenAI CLIP |
| CV – Image classification | 45 | ViT, ResNet, ResNext, MobileNet, EfficientNet |
| CV – Object detection | 23 | YOLO v2, v3, v4, v5, and v7, SSD-ResNet, RetinaNet |
| CV – Other | 15 | LPRNet, Super-resolution/SRGAN, ByteTrack |
| Automotive networks* | 53 | Perception and LIDAR, pedestrian, lane, and traffic light detection |
| Total | >300 | |
* Most automotive networks are composite networks consisting of a fusion of individual networks.
The large on-die SRAM on the DL2q accelerator enables efficient implementation of advanced performance techniques such as MX6 micro-exponent precision for storing the weights and MX9 micro-exponent precision for accelerator-to-accelerator communication. The micro-exponent technology is described in the following Open Compute Project (OCP) industry announcement: AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI » Open Compute Project.
The instance user can adopt the following strategy to maximize performance per cost:
- Store weights using the MX6 micro-exponent precision in the on-accelerator DDR memory. Using the MX6 precision maximizes the utilization of the available memory capacity and memory bandwidth to deliver best-in-class throughput and latency.
- Compute in FP16 to deliver the required use-case accuracy, while using the large on-chip SRAM and spare TOPs on the card to implement high-performance, low-latency MX6-to-FP16 kernels.
- Use an optimized batching strategy and a higher batch size, enabled by the large on-chip SRAM, to maximize the reuse of weights while keeping as many activations on-chip as possible. An illustrative compile invocation that selects these precisions follows this list.
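Precision choices like these are made when the model is compiled. As a rough illustration (not the exact command used later in this post), a compile invocation that stores MatMul weights in MX6 and computes in FP16 might look like the following; the flag names (-mxfp6-matmul, -convert-to-fp16) and the qaic-exec install path are assumptions to confirm against the compiler help (qaic-exec -h) and the Key Performance Parameters documentation for your SDK release.

```python
import subprocess

# Illustrative only: compile an ONNX model so that weights are stored in MX6
# micro-exponent precision while compute runs in FP16. Flag names and the
# qaic-exec path are assumptions; verify them for your SDK release.
compile_cmd = [
    "/opt/qti-aic/exec/qaic-exec",   # assumed install path of the CLI compiler
    "-m=model.onnx",                 # input ONNX model (placeholder name)
    "-aic-hw",                       # target the Cloud AI 100 hardware
    "-convert-to-fp16",              # run compute in FP16
    "-mxfp6-matmul",                 # store MatMul weights in MX6 precision
    "-aic-num-cores=4",              # AI cores allotted to this model
    "-aic-binary-dir=model_qpc",     # output folder for the compiled binary
]
subprocess.run(compile_cmd, check=True)
```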
DL2q AI Stack and toolchain
The DL2q instance is accompanied by the Qualcomm AI Stack that delivers a consistent developer experience across Qualcomm AI in the cloud and other Qualcomm products. The same Qualcomm AI stack and base AI technology runs on the DL2q instances and Qualcomm edge devices, providing customers a consistent developer experience, with a unified API across their cloud, automotive, personal computer, extended reality, and smartphone development environments.
The toolchain enables the instance user to quickly onboard a previously trained model, compile and optimize the model for the instance capabilities, and subsequently deploy the compiled models for production inference use-cases in three steps shown in the following figure.
To learn more about tuning the performance of a model, see the Cloud AI 100 Key Performance Parameters Documentation.
Get started with DL2q instances
In this example, you compile and deploy a pre-trained BERT model from Hugging Face on an EC2 DL2q instance using a pre-built DL2q AMI, in four steps.
You can use either a pre-built Qualcomm DLAMI on the instance or start with an Amazon Linux 2 AMI and build your own DL2q AMI with the Cloud AI 100 Platform and Apps SDK available in this Amazon Simple Storage Service (Amazon S3) bucket: s3://ec2-linux-qualcomm-ai100-sdks/latest/.
The steps that follow use the pre-built DL2q AMI, Qualcomm Base AL2 DLAMI.
Use SSH to access your DL2q instance with the Qualcomm Base AL2 DLAMI and follow steps 1 through 4.
Step 1. Set up the environment and install required packages
- Install Python 3.8.
- Set up the Python 3.8 virtual environment.
- Activate the Python 3.8 virtual environment.
- Install the required packages, shown in the requirements.txt document available at the Qualcomm public Github site.
- Import the necessary libraries (a combined sketch of these setup steps follows this list).
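Taken together, these setup steps might look like the following minimal sketch. The shell commands in the comments and the exact set of imports are illustrative assumptions; defer to the requirements.txt file on the Qualcomm public GitHub site for the authoritative package list.

```python
# Environment setup (run in the shell before starting Python; commands are
# illustrative for an Amazon Linux 2 based DL2q AMI):
#   sudo amazon-linux-extras install python3.8
#   python3.8 -m venv qaic_env
#   source qaic_env/bin/activate
#   pip install -r requirements.txt   # from the Qualcomm public GitHub site

# Libraries used in the remaining steps (an assumed, typical set):
import numpy as np
import torch
import onnx
from transformers import AutoTokenizer, AutoModelForMaskedLM
```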
Step 2. Import the model
- Import and tokenize the model.
- Define a sample input and extract the inputIds and attentionMask.
- Convert the model to ONNX, which can then be passed to the compiler.
- You’ll run the model in FP16 precision, so you need to check whether the model contains any constants beyond the FP16 range. Pass the model to the fix_onnx_fp16 function to generate a new ONNX file with the required fixes (a condensed sketch of this step follows the list).
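The following condensed sketch of this step continues from the imports in step 1. The model name, paths, sequence length, and the fix_onnx_fp16 stand-in are illustrative; the helper in the Qualcomm tutorial code may have a different signature, so treat this as an outline rather than the exact tutorial code.

```python
import os
from onnx import numpy_helper

# Import and tokenize the model (return_dict=False keeps the traced graph simple).
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name, return_dict=False)
model.eval()

# Define a sample input and extract the inputIds and attentionMask.
sentence = "The dog [MASK] on the mat."
encodings = tokenizer(sentence, max_length=128, truncation=True,
                      padding="max_length", return_tensors="pt")
inputIds = encodings["input_ids"]
attentionMask = encodings["attention_mask"]

# Convert the model to ONNX, which can then be passed to the compiler.
gen_models_path = f"{model_name}/generatedModels"
os.makedirs(gen_models_path, exist_ok=True)
onnx_path = f"{gen_models_path}/{model_name}.onnx"
torch.onnx.export(model, (inputIds, attentionMask), onnx_path,
                  opset_version=13,
                  input_names=["input_ids", "attention_mask"],
                  output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                                "attention_mask": {0: "batch", 1: "sequence"}})

def fix_onnx_fp16(path: str) -> str:
    """Simplified stand-in for the tutorial helper: clamp FP32 initializers that
    fall outside the FP16 range and save a *_fix_outofrange_fp16.onnx copy."""
    onnx_model = onnx.load(path)
    fp16_max = float(np.finfo(np.float16).max)
    for tensor in onnx_model.graph.initializer:
        if tensor.data_type == onnx.TensorProto.FLOAT:
            values = numpy_helper.to_array(tensor)
            clipped = np.clip(values, -fp16_max, fp16_max).astype(np.float32)
            tensor.CopyFrom(numpy_helper.from_array(clipped, tensor.name))
    fixed_path = path.replace(".onnx", "_fix_outofrange_fp16.onnx")
    onnx.save(onnx_model, fixed_path)
    return fixed_path

fixed_onnx_path = fix_onnx_fp16(onnx_path)
```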
Step 3. Compile the model
The qaic-exec command line interface (CLI) compiler tool is used to compile the model. The input to this compiler is the ONNX file generated in step 2. The compiler produces a binary file (called a QPC, for Qualcomm program container) in the path defined by the -aic-binary-dir argument.
In the compile command below, you use four AI compute cores and a batch size of one to compile the model.
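A sketch of that compile invocation is shown below, driven from Python for consistency with the rest of the walkthrough. The flag spellings, the symbol names passed to -onnx-define-symbol, and the tool’s install path are assumptions to confirm with qaic-exec -h on your instance.

```python
import subprocess

onnx_model = "bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16.onnx"
qpc_dir = "bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc"

# Illustrative qaic-exec invocation: four AI compute cores, batch size of one, FP16 compute.
compile_cmd = [
    "/opt/qti-aic/exec/qaic-exec",       # assumed CLI compiler install path
    f"-m={onnx_model}",                  # ONNX file generated in step 2
    "-aic-hw",                           # target the Cloud AI 100 hardware
    "-convert-to-fp16",                  # compute in FP16
    "-aic-num-cores=4",                  # four AI compute cores
    "-onnx-define-symbol=batch,1",       # batch size of one
    "-onnx-define-symbol=sequence,128",  # fixed sequence length used at export time
    f"-aic-binary-dir={qpc_dir}",        # where the compiled QPC is written
    "-compile-only",
]
subprocess.run(compile_cmd, check=True)
```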
The QPC is generated in the bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc folder.
Step 4. Run the model
Set up a session to run inference on a Qualcomm Cloud AI100 accelerator in the DL2q instance.
The Qualcomm qaic Python library is a set of APIs that provides support for running inference on the Cloud AI100 accelerator.
- Use the Session API call to create an instance of a session. The Session API call is the entry point to using the qaic Python library.
- Run the inference by passing the input data to the session.
- Restructure the data from the output buffer with output_shape and output_type.
- Decode the output produced (a minimal sketch of this step follows the list).
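Here is a minimal sketch of this final step, continuing with the tokenizer, inputIds, and attentionMask from step 2. It assumes the qaic Python bindings installed with the Apps SDK; the Session arguments, the program container file name (programqpc.bin), and the buffer handling are assumptions to check against the Cloud AI100 Tutorial Documentation.

```python
import numpy as np
import qaic  # Python bindings installed with the Cloud AI 100 Apps SDK

# Create a session bound to the compiled QPC; the Session API is the entry point
# to the qaic library.
qpc_path = "bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc"
sess = qaic.Session(model_path=qpc_path + "/programqpc.bin", num_activations=1)
sess.setup()

# Query the expected input/output buffer layouts from the compiled program.
input_shape, input_type = sess.model_input_shape_dict["input_ids"]
output_shape, output_type = sess.model_output_shape_dict["logits"]

# Run the inference by passing the tokenized sample from step 2 to the session.
input_dict = {
    "input_ids": inputIds.numpy().astype(input_type),
    "attention_mask": attentionMask.numpy().astype(input_type),
}
output = sess.run(input_dict)

# Restructure the data from the output buffer with output_shape and output_type,
# then decode the highest-scoring token for the [MASK] position.
logits = np.frombuffer(output["logits"], dtype=output_type).reshape(output_shape)
mask_index = np.where(inputIds.numpy() == tokenizer.mask_token_id)[1][0]
predicted_id = int(logits[0, mask_index].argmax())
print(tokenizer.decode([predicted_id]))
```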
Here are the outputs for the input sentence “The dog [MASK] on the mat.”
That’s it. With just a few steps, you compiled and ran a PyTorch model on an Amazon EC2 DL2q instance. To learn more about onboarding and compiling models on the DL2q instance, see the Cloud AI100 Tutorial Documentation.
To learn more about which DL model architectures are a good fit for AWS DL2q instances and the current model support matrix, see the Qualcomm Cloud AI100 documentation.
Available now
You can launch DL2q instances today in the US West (Oregon) and Europe (Frankfurt) AWS Regions as On-Demand, Reserved, and Spot Instances, or as part of a Savings Plan. As usual with Amazon EC2, you pay only for what you use. For more information, see Amazon EC2 pricing.
DL2q instances can be deployed using AWS Deep Learning AMIs (DLAMI) and container images, and through managed services such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS ParallelCluster.
To learn more, visit the Amazon EC2 DL2q instance page, and send feedback to AWS re:Post for EC2 or through your usual AWS Support contacts.
About the authors
A.K Roy is a Director of Product Management at Qualcomm for Cloud and Datacenter AI products and solutions. He has over 20 years of experience in product strategy and development, with a current focus on best-in-class performance and performance-per-dollar end-to-end solutions for AI inference in the cloud, across a broad range of use cases, including GenAI, LLMs, automotive, and hybrid AI.
Jianying Lang is a Principal Solutions Architect at AWS Worldwide Specialist Organization (WWSO). She has over 15 years of experience in the HPC and AI fields. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining techniques from the HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.