Applying voice classification in an Amazon Connect telemedicine contact flow
Given the rising demand for fast and effective COVID-19 detection, customers are exploring the use of respiratory sound data, like coughing, breathing, and counting, to automatically diagnose COVID-19 using machine learning (ML) models. University of Cambridge researchers built a COVID-19 sound application and demonstrated that a simple binary ML classifier can classify healthy and COVID-19 coughs with over 80% area under the curve (AUC) for all tasks. Massachusetts Institute of Technology (MIT) researchers published a similar open voice model, and their Convolutional Neural Network (CNN) based binary classifier achieves COVID-19 sensitivity of 98.5% with a specificity of 94.2% (AUC 0.97). Carnegie Mellon University also built a COVID voice detector to develop an automated AI system to diagnose a COVID-19 infection based on the human voice. The promising results of these preliminary studies based on crowdsourced audio signals show the power of AI in the medical industry for disease diagnosis and detection.
Although the research has shown a lot of promise, it’s still difficult to create a scalable solution that takes advantage of these models. In this post, we demonstrate a smart call center application workflow that integrates a voice classification model to detect COVID-19 infections or other types of respiratory diseases in people calling in to the call center. For the purposes of creating an end-to-end workflow, we train the model on the open-source Coswara data, which relies on a variety of sounds like deep or shallow breathing, coughing, and counting to distinguish healthy from unhealthy sounds. You can replace this model and training data with any other model or datasets to approach the level of performance demonstrated in the research papers.
Overview of solution
This solution uses a contact flow in Amazon Connect, an easy-to-use omnichannel cloud contact center, to make real-time inferences against an ML model trained and deployed using Amazon SageMaker. The audio recordings are labeled as healthy (negative) or unhealthy (positive), where unhealthy means a COVID-19 infection or another respiratory illness. Because the distribution of positive and negative labels is highly imbalanced, we use the oversampling technique from the Python imbalanced-learn library to improve the ratio. We use a PyTorch acoustic classification model, which relies on a deep CNN, for this audio-based COVID-19 prediction. The trained CNN model is deployed to a SageMaker inference endpoint. An AWS Lambda function triggered by the Amazon Connect contact flow makes real-time inferences based on the audio streams that Amazon Connect captures from the phone call in Amazon Kinesis Video Streams.
The following is the architecture diagram for integrating online ML inference in a telemedicine contact flow via Amazon Connect.
Training and deploying a voice classification model using SageMaker
We first create a SageMaker notebook instance, on which we build a voice classification deep learning model to predict the likelihood of respiratory diseases using the open-source Coswara dataset. To deploy the AWS CloudFormation stack for the notebook instance, choose Launch Stack:
Feel free to change the notebook instance type if necessary. The deployment also clones the following two GitHub repositories:
- The GitHub repository for this project, including sample Jupyter notebooks
- The GitHub repository for Coswara data
Go to the Jupyter notebook coswara-audio-classification.ipynb under the
The notebook walks you through the following tasks:
- Preprocess the Coswara data, including uncompressing files and generating the metadata CSV files for each type of audio recording.
- Build and upload the Docker container image for SageMaker training and inference jobs to Amazon Elastic Container Registry (Amazon ECR).
- Upload Coswara data to an Amazon Simple Storage Service (Amazon S3) bucket for the SageMaker training job.
- Train a PyTorch CNN estimator for voice classification given the sample hyperparameters.
- Create a hyperparameter optimization (HPO) job (optional).
- Deploy the trained PyTorch estimator to the SageMaker inference endpoint.
- Test batch prediction and invoke the endpoint.
Because this dataset is highly imbalanced, we labeled healthy samples as negative and all non-healthy samples as positive, and oversampled the positive ones using the imbalanced-learn library in the train.py file under the notebook folder.
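To illustrate what oversampling does, the following is a minimal standard-library sketch of random oversampling: it duplicates randomly chosen minority-class samples until every class matches the size of the largest class. It is a stand-in for the kind of resampling that imbalanced-learn provides (for example, through its RandomOverSampler class), not the code from train.py. The function name random_oversample and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target_size = max(counts.values())

    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        # Indices of the original samples that belong to this class.
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Draw, with replacement, the extra copies needed to reach target_size.
        for i in rng.choices(idx, k=target_size - count):
            out_samples.append(samples[i])
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical toy data: 6 healthy (0) recordings and 2 unhealthy (1) ones.
features = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [1.0]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
balanced_x, balanced_y = random_oversample(features, labels)
print(Counter(balanced_y))
```

After resampling, both classes contain six samples each. Oversampling should be applied to the training split only, so that duplicated recordings do not leak into the evaluation data.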
In train.py, the data and target are torch tensors returned by the getitem function defined in the CoswareDataset class in the coswara_dataset.py file. The oversampling approach improved the prediction performance by approximately 40%. We implemented a very deep CNN for voice classification in the inference.py file, with the default number of classes set to two, and applied different metrics from the scikit-learn Python library to evaluate the prediction performance.
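As a rough illustration of this kind of evaluation, the sketch below computes precision, recall, and the F-beta score for a binary classifier directly from predicted and true labels. It uses only plain Python so that the arithmetic is visible; scikit-learn offers equivalent functions (precision_score, recall_score, and fbeta_score). The function name and the sample labels are hypothetical and not taken from the project.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Return (precision, recall, F-beta) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    b2 = beta ** 2
    denom = b2 * precision + recall
    fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


# Hypothetical labels: 1 = unhealthy (positive), 0 = healthy (negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
print(precision_recall_fbeta(y_true, y_pred, beta=1.0))
print(precision_recall_fbeta(y_true, y_pred, beta=2.0))
```

A beta greater than 1 weights recall more heavily than precision, which can be a reasonable choice when missing an unhealthy caller is costlier than a false alarm.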
The tuning job tries to maximize the F-beta score, which is the weighted harmonic mean of precision and recall. When you’re satisfied with the prediction performance of the training job, you can deploy a SageMaker inference endpoint.
After deploying the estimator for online prediction, take note of the inference endpoint name, which you use in the next step.
The inference endpoint can be invoked with two types of request body:
- A text string for the S3 object of the audio recording WAV file
- A pickled NumPy array
See the following code:
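As a minimal sketch of how an input handler might distinguish these two request bodies, the example below dispatches on the request content type. It is not the project's inference.py: the function name parse_request, the content-type strings, and the assumption that the text body is an s3://bucket/key URI are all hypothetical. A plain Python list stands in for the NumPy array so that the sketch runs without extra dependencies.

```python
import pickle
from urllib.parse import urlparse

# Hypothetical content-type names; the real handler may use different ones.
S3_URI_CONTENT_TYPE = "text/plain"
PICKLE_CONTENT_TYPE = "application/python-pickle"


def parse_request(request_body, content_type):
    """Turn a raw request body into a description of where the audio is."""
    if content_type == S3_URI_CONTENT_TYPE:
        if isinstance(request_body, bytes):
            request_body = request_body.decode("utf-8")
        # Assumes the string is an S3 URI such as s3://bucket/path/file.wav.
        parsed = urlparse(request_body.strip())
        if parsed.scheme != "s3" or not parsed.netloc:
            raise ValueError(f"Expected an s3:// URI, got: {request_body!r}")
        return {"source": "s3", "bucket": parsed.netloc,
                "key": parsed.path.lstrip("/")}
    if content_type == PICKLE_CONTENT_TYPE:
        # Only unpickle data from trusted callers: pickle can execute code.
        return {"source": "array", "array": pickle.loads(request_body)}
    raise ValueError(f"Unsupported content type: {content_type}")


# Hypothetical requests of both kinds.
print(parse_request("s3://my-bucket/recordings/call-123.wav", S3_URI_CONTENT_TYPE))
print(parse_request(pickle.dumps([0.0, 0.1, -0.2]), PICKLE_CONTENT_TYPE))
```

Passing an S3 reference keeps the request small, whereas sending the serialized array avoids a round trip to Amazon S3 when the caller already holds the audio samples in memory.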
The output is the probability of the positive class, from 0 to 1, which indicates how likely the voice is to be unhealthy in this use case. The output handling is also defined in inference.py.
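To make the meaning of this output concrete, the following sketch converts two raw class scores into a positive-class probability with a softmax and then applies a decision threshold. It assumes the model emits one score per class in the order (negative, positive); that ordering, the function names, and the 0.5 threshold are illustrative assumptions, not details taken from inference.py.

```python
import math


def positive_class_probability(scores):
    """Softmax over two class scores; returns P(positive), between 0 and 1."""
    negative, positive = scores
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(negative, positive)
    e_neg = math.exp(negative - m)
    e_pos = math.exp(positive - m)
    return e_pos / (e_neg + e_pos)


def needs_follow_up(probability, threshold=0.5):
    """Flag a call for follow-up when the probability reaches the threshold."""
    return probability >= threshold


# Hypothetical raw scores for one recording: (negative, positive).
prob = positive_class_probability([0.2, 1.3])
print(round(prob, 3), needs_follow_up(prob))
```

In practice, the threshold is a tuning choice: lowering it flags more callers for follow-up at the cost of more false positives.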
Deploying a CloudFormation template for Lambda functions for audio streaming inference
You can deploy the Lambda function with the following CloudFormation stack one-click deployment in the
You need to fill in the S3 bucket name for the audio recording and the SageMaker inference endpoint as parameters.
If you want to deploy this stack in AWS Regions other than us-east-1, or if you want to change the Lambda functions, go to the connect-audio-stream-solution folder and follow the steps to build and deploy the AWS Serverless Application Model (AWS SAM) stack. Take note of the CloudFormation stack outputs for the Lambda function ARNs, which you use in the next step.
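Inside a Lambda function, calling the SageMaker endpoint could look roughly like the sketch below. The invoke_endpoint call and its EndpointName, ContentType, and Body parameters belong to the boto3 sagemaker-runtime client. Everything else is an assumption for illustration: the function name classify_recording, the text/plain content type, the endpoint name, and the idea that the endpoint answers with a plain-text number. The client is passed in as an argument so that the sketch can run locally against a stand-in object.

```python
import io


def classify_recording(runtime_client, endpoint_name, s3_uri):
    """Send the S3 location of a WAV file to the endpoint and return the
    positive-class probability as a float (assumed plain-text response)."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/plain",  # assumed content type for the S3 string
        Body=s3_uri.encode("utf-8"),
    )
    # The response body is a stream that has to be read and decoded.
    return float(response["Body"].read().decode("utf-8"))


class StubRuntimeClient:
    """Local stand-in for boto3.client("sagemaker-runtime")."""

    def __init__(self, canned_response):
        self.canned_response = canned_response
        self.calls = []

    def invoke_endpoint(self, **kwargs):
        self.calls.append(kwargs)
        return {"Body": io.BytesIO(self.canned_response)}


# In AWS, you would pass boto3.client("sagemaker-runtime") instead of the stub.
client = StubRuntimeClient(b"0.87")
score = classify_recording(client, "my-voice-endpoint", "s3://my-bucket/call.wav")
print(score)
```

Reading the endpoint name and bucket from Lambda environment variables, which the CloudFormation parameters can populate, keeps these values out of the function code.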
Setting up an interactive voice response using Amazon Connect
Assuming you have an Amazon Connect instance ready to use, we use an Amazon Connect contact flow to trigger the Lambda functions created in the previous step, which process the audio captured in Kinesis Video Streams. For instructions on setting up an Amazon Connect instance, see Create an Amazon Connect instance. You also need to enable live audio streaming for your instance. Create your instance in the same AWS Region as your previous CloudFormation stack, because the video stream must be in the same Region for the Lambda functions to consume it.
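As a sketch of what a contact flow trigger function has to do first, the example below pulls the customer audio stream details out of the event that Amazon Connect passes to Lambda. The nesting under Details, ContactData, MediaStreams, Customer, and Audio follows the Amazon Connect event format as we understand it, so verify the field names against the current Amazon Connect documentation. The handler logic and return values are hypothetical and are not the project's Lambda code.

```python
def extract_audio_stream(event):
    """Return the Kinesis Video Streams details for the customer audio,
    or None when live media streaming has not started for this contact."""
    contact = event.get("Details", {}).get("ContactData", {})
    audio = (contact.get("MediaStreams", {})
                    .get("Customer", {})
                    .get("Audio") or {})
    if not audio.get("StreamARN"):
        return None
    return {
        "contact_id": contact.get("ContactId"),
        "stream_arn": audio["StreamARN"],
        "start_fragment": audio.get("StartFragmentNumber"),
    }


def lambda_handler(event, context):
    stream = extract_audio_stream(event)
    if stream is None:
        return {"lambdaResult": "Error"}
    # A real function would now read the audio from the stream and call the
    # SageMaker endpoint; this sketch only reports that the details were found.
    return {"lambdaResult": "Success"}


# Hypothetical event, trimmed to the fields used above.
sample_event = {"Details": {"ContactData": {
    "ContactId": "1234-abcd",
    "MediaStreams": {"Customer": {"Audio": {
        "StreamARN": "arn:aws:kinesisvideo:us-east-1:111122223333:stream/demo/1",
        "StartFragmentNumber": "9134385233318150666908441974200077706515712345",
    }}},
}}}
print(lambda_handler(sample_event, None))
```

The handler returns a small, flat set of string values, which keeps the result easy to read back as attributes in the contact flow.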
You can create a new inbound contact flow by importing the flow configuration file. You need to claim a phone number and associate it with the newly created contact flow. There are two Lambda functions to be configured here, using the ARNs (including ContactFlowlambdaTriggerArn) located on the Outputs tab of the CloudFormation stack you deployed in the previous step.
After changing the ARNs for the Lambda functions, save and publish the contact flow. Now you’re ready to test it by calling the associated phone number with this contact flow.
To avoid unexpected future charges, clean up your resources:
- Delete the SageMaker inference endpoint.
- Empty and delete the S3 bucket.
- Delete the CloudFormation stack for the SageMaker notebook instances and Lambda functions used by Amazon Connect.
This solution was inspired by and built upon the following GitHub repos:
- Audio Classification on AWS: Building PyTorch voice classification model using Amazon SageMaker
- Amazon Connect Real-time Transcription Lambda: Amazon Connect live audio streaming and real-time transcription using Amazon Transcribe
In this post, we demonstrated how to predict the likelihood of COVID-19 or other respiratory diseases based on voice classification alone. To further improve the ML prediction performance, you can incorporate other related information into the model, like age, gender, or existing symptoms. Audio data augmentation plus handcrafted features can help yield better prediction results, according to existing studies. You can use the audio-based diagnostic prediction in an Amazon Connect contact flow to triage the targeted group of incoming calls and escalate to a doctor for follow-up if necessary. The intelligence provided by the acoustic classification can be used by call center agents in conjunction with Contact Lens for Amazon Connect, which provides a turn-by-turn transcript, real-time alerts, automated call categorization based on keywords and phrases, sentiment analysis, issue detection (the reason the customer contacted the call center), and sensitive data redaction.
To find the latest developments to this solution, check out the GitHub repo.