Enhancing speech-to-text accuracy of COVID-19-related terms with Amazon Transcribe Medical

As the world responds to the ongoing pandemic, it’s more important than ever to accurately access, consume, and analyze information related to COVID-19. Topics about the healthcare crisis permeate many dimensions of our personal and professional lives, through channels as diverse as news reporting, social media, business meetings, radio and podcasts, customer support calls, and especially clinician-patient conversations. More builders of data analytics applications are seeking medical speech-recognition capabilities that help them efficiently and accurately transcribe video and audio containing COVID-19 terminology into text for downstream analytics. This post demonstrates how to use a custom vocabulary in Amazon Transcribe Medical to better recognize COVID-19 terms.

Amazon Transcribe Medical is a fully managed automatic speech recognition (ASR) service that makes it easy to add medical speech-to-text capabilities to your applications. Powered by deep learning, the service offers a ready-to-use medical speech-recognition model that you can integrate into a variety of voice applications in the healthcare and life science domain. You can now use the custom vocabulary feature to accurately transcribe more specific medical terminology, such as medicine names, product brands, medical procedures, or illnesses. You supply the terms you’d like to transcribe and associate each term with a corresponding pronunciation and display form. Custom vocabulary is available in all AWS Regions where Amazon Transcribe Medical is available.

Transcribing COVID-19-specific terms

The batch (asynchronous) transcription API and streaming (synchronous) transcription API both support custom vocabulary. This post uses the former to demonstrate the power of custom vocabulary.

For this use case, you use an audio file (covid-19.wav) stored in an Amazon Simple Storage Service (Amazon S3) bucket. For more information about using Amazon S3, see Getting started with Amazon Simple Storage Service. The following is the audio file’s transcript:

“Coronavirus disease 2019, also known as COVID-19, is an infectious disease caused by severe acute respiratory syndrome coronavirus 2. It is abbreviated as SARS-CoV-2. The disease was first identified in December 2019 in Wuhan, China. Symptoms include fever, cough, and shortness of breath. At the time of this recording, there is no vaccine or specific antiviral treatment for COVID-19.”

The transcript contains medical terms, abbreviations, and formatting that are specific to COVID-19.

To transcribe the audio without the support of custom vocabulary, complete the following steps:

1. On the Amazon Transcribe Medical console, choose Transcription jobs.
2. Choose Create job.
3. For Name, provide a name for your job.
4. For Audio input type, select Dictation.

The audio file for this post features a single speaker. For an audio file with multiple speakers, select Conversation.

5. For Input data, enter the location of your input file in Amazon S3.
6. For Output data, enter the name of the S3 bucket for your output.
7. Choose Next.
8. On the Configure job – optional page, don’t make any changes.
9. Choose Create.
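
The same job can also be submitted through the batch API. The following is a minimal sketch using the AWS SDK for Python (Boto3), not the procedure this post follows. The job name, the audio object key, and the output bucket are assumptions for illustration; substitute your own values.

```python
def build_medical_job_request(job_name, media_uri, output_bucket,
                              job_type="DICTATION"):
    """Build keyword arguments for start_medical_transcription_job."""
    if job_type not in ("DICTATION", "CONVERSATION"):
        raise ValueError(f"Unsupported job type: {job_type!r}")
    return {
        "MedicalTranscriptionJobName": job_name,
        "LanguageCode": "en-US",      # US English
        "Specialty": "PRIMARYCARE",   # primary care specialty
        "Type": job_type,             # DICTATION for a single speaker
        "Media": {"MediaFileUri": media_uri},
        "OutputBucketName": output_bucket,
    }


def start_job(request):
    """Submit the job. Requires boto3 and configured AWS credentials."""
    import boto3  # imported here so the builder above works without the SDK

    client = boto3.client("transcribe")
    return client.start_medical_transcription_job(**request)


# Hypothetical names: adjust the job name, bucket, and key to your setup.
baseline_request = build_medical_job_request(
    job_name="covid-19-baseline",
    media_uri="s3://my-bucket/covid-19.wav",
    output_bucket="my-bucket",
)
# start_job(baseline_request)  # uncomment to submit the job
```

Keeping the request builder separate from the API call makes it easy to inspect the parameters before anything is sent to AWS.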

The transcription results show that general medical terms (such as antiviral) were recognized accurately. However, several terms specific to the coronavirus were mis-transcribed or not recognized at all. The following text shows the transcription with highlighted terms that resulted in errors (wrong word, misspelling, wrong capitalization, incorrect formatting, or missing terms):

“Corona virus Disease, 2019 also known as Covad, 19 is an infectious disease caused by severe acute respiratory syndrome. Coronavirus two It is abbreviated as stars Cov two. The disease was first identified in December 2019 and full time China. Symptoms include fever, cough and shortness of breath. At the time of this recording there is no vaccine or specific antiviral treatment for cover 19.”

The resulting machine transcription is unsurprising. COVID-19-related terminology is specific to the recently emerged pandemic and is not part of the original Amazon Transcribe Medical lexicon. With the custom vocabulary feature, however, you can help Amazon Transcribe Medical better recognize these specific medical terms.

Creating a custom vocabulary

To create your custom vocabulary, complete the following steps:

1. In your preferred plain text editor, create a custom vocabulary file and populate it with a list of terms that relate to COVID-19.

You can use the following example file: covid-19-dictionary.txt. For instructions on creating your own, see Medical Custom Vocabularies.

When creating the custom vocabulary, enter each term (Phrase), your preferred output format (DisplayAs), and its corresponding pronunciation using the International Phonetic Alphabet (IPA).

[Screenshot: the vocabulary list in covid-19-dictionary.txt, with corresponding display formats and pronunciations.]

2. Save the file; for this use case, name it covid-19-dictionary.txt.
3. Upload it to your S3 bucket; for this use case, the full path is s3://my-bucket/covid-19-dictionary.txt.

Your path may vary depending on what you named your bucket.

4. On the Amazon Transcribe Medical console, choose Custom vocabulary.
5. Choose Create vocabulary.
6. For Name, enter a name for your vocabulary; for example, COVID-19-Dictionary.
7. For Vocabulary input file location in Amazon S3, enter the full path to the custom vocabulary file.
8. Choose Create vocabulary.

On the Custom vocabulary page, you can see your custom vocabulary listed.
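
The steps above can be sketched in code as follows. This is an illustration under stated assumptions, not the contents of the post's covid-19-dictionary.txt: the column order of the tab-separated table, the rule that a phrase contains no whitespace, and the sample entries are all assumptions to check against Medical Custom Vocabularies. The IPA column is left blank here; fill in pronunciations as that documentation describes. The object key assumes the uploaded object keeps the file name covid-19-dictionary.txt.

```python
# Assumed column order; confirm it against Medical Custom Vocabularies.
HEADER = ("Phrase", "IPA", "SoundsLike", "DisplayAs")

# Illustrative entries only: (phrase, IPA, display form). IPA left blank.
EXAMPLE_ENTRIES = [
    ("coronavirus", "", "coronavirus"),
    ("covid-nineteen", "", "COVID-19"),
    ("Wuhan", "", "Wuhan"),
]


def render_vocabulary_table(entries):
    """Render entries as a tab-separated table, one term per line."""
    rows = ["\t".join(HEADER)]
    for phrase, ipa, display_as in entries:
        # Assumption: phrases are non-empty and contain no whitespace.
        if not phrase or any(ch.isspace() for ch in phrase):
            raise ValueError(f"Invalid phrase: {phrase!r}")
        rows.append("\t".join((phrase, ipa, "", display_as)))
    return "\n".join(rows) + "\n"


def build_vocabulary_request(name, file_uri):
    """Build keyword arguments for create_medical_vocabulary."""
    return {
        "VocabularyName": name,
        "LanguageCode": "en-US",
        "VocabularyFileUri": file_uri,
    }


def create_vocabulary(request):
    """Create the vocabulary. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the helpers above work without the SDK

    return boto3.client("transcribe").create_medical_vocabulary(**request)


table = render_vocabulary_table(EXAMPLE_ENTRIES)
vocabulary_request = build_vocabulary_request(
    "COVID-19-Dictionary", "s3://my-bucket/covid-19-dictionary.txt"
)
# create_vocabulary(vocabulary_request)  # uncomment after uploading the file
```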

Using your custom vocabulary

To use your custom vocabulary, repeat the first seven steps from the Transcribing COVID-19-specific terms section (through Choose Next) to set up a new transcription job. Then complete the following steps:

1. On the Configure job – optional page, in the Customization section, select Custom vocabulary.
2. Choose the vocabulary you created earlier.
3. Choose Create.
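
For API users, the console selection above corresponds to naming the vocabulary in the job's Settings. The following is a rough sketch assuming Boto3; the job name, bucket, and audio key are hypothetical.

```python
def with_custom_vocabulary(request, vocabulary_name):
    """Return a copy of a job request that names a custom vocabulary."""
    updated = dict(request)
    settings = dict(updated.get("Settings", {}))
    settings["VocabularyName"] = vocabulary_name
    updated["Settings"] = settings
    return updated


# Hypothetical job parameters; adjust the names to your own setup.
base_request = {
    "MedicalTranscriptionJobName": "covid-19-with-vocabulary",
    "LanguageCode": "en-US",
    "Specialty": "PRIMARYCARE",
    "Type": "DICTATION",
    "Media": {"MediaFileUri": "s3://my-bucket/covid-19.wav"},
    "OutputBucketName": "my-bucket",
}

vocabulary_job_request = with_custom_vocabulary(base_request, "COVID-19-Dictionary")

# To submit (requires boto3 and AWS credentials):
# import boto3
# boto3.client("transcribe").start_medical_transcription_job(**vocabulary_job_request)
```

The helper copies the request rather than mutating it, so the same base parameters can be reused for a run without the vocabulary.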

Now you can run the transcription job and look at the new transcription output. In the following output, the highlighted words indicate the correct transcriptions that were originally missed.

“Coronavirus disease 2019, also known as COVID-19, is an infectious disease caused by severe acute respiratory syndrome coronavirus two. It is abbreviated as SARS-CoV-2. The disease was first identified in December 2019 in Wuhan China. Symptoms include fever, cough, and shortness of breath. At the time of this recording, there is no vaccine or specific antiviral treatment for COVID-19.”

With the custom vocabulary, the job correctly transcribes the terms coronavirus, COVID-19, SARS-CoV-2, and Wuhan.
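
One simple way to check this kind of improvement is to search each transcript for the target terms. The sketch below uses excerpts of the two outputs quoted in this post and an exact, case-sensitive match; it is an illustrative check, not an accuracy metric used by the service.

```python
TARGET_TERMS = ["coronavirus", "COVID-19", "SARS-CoV-2", "Wuhan"]

# Excerpts of the two transcription outputs quoted in this post.
BASELINE = (
    "Corona virus Disease, 2019 also known as Covad, 19 is an infectious "
    "disease caused by severe acute respiratory syndrome. Coronavirus two "
    "It is abbreviated as stars Cov two. The disease was first identified "
    "in December 2019 and full time China."
)
WITH_VOCABULARY = (
    "Coronavirus disease 2019, also known as COVID-19, is an infectious "
    "disease caused by severe acute respiratory syndrome coronavirus two. "
    "It is abbreviated as SARS-CoV-2. The disease was first identified "
    "in December 2019 in Wuhan China."
)


def found_terms(transcript, terms):
    """Return the terms that appear verbatim (case-sensitive) in the text."""
    return [term for term in terms if term in transcript]


baseline_hits = found_terms(BASELINE, TARGET_TERMS)
vocabulary_hits = found_terms(WITH_VOCABULARY, TARGET_TERMS)
print(f"Without custom vocabulary: {len(baseline_hits)} of {len(TARGET_TERMS)}")
print(f"With custom vocabulary: {len(vocabulary_hits)} of {len(TARGET_TERMS)}")
```

The match is deliberately strict, so a capitalization or formatting error such as “Corona virus” counts as a miss.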

Custom vocabulary is powerful, but you should use it in a targeted manner. To mitigate the occurrence of false positive transcriptions, don’t use a single vocabulary file with over 300 words. Additionally, the more specific a list of terms is, the better the transcription results.
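
A small pre-upload check can help keep a vocabulary within this guidance. The sketch below treats the 300-word figure as a count of distinct phrases, which is an interpretation for illustration rather than a rule stated by the service.

```python
RECOMMENDED_MAX_TERMS = 300  # guidance from this post


def check_vocabulary_size(phrases, limit=RECOMMENDED_MAX_TERMS):
    """Return (distinct_count, within_limit) for a list of phrases."""
    distinct = {phrase.strip() for phrase in phrases if phrase.strip()}
    return len(distinct), len(distinct) <= limit


# Illustrative phrases only.
count, within_limit = check_vocabulary_size(
    ["coronavirus", "Wuhan", "coronavirus", ""]
)
print(f"{count} distinct terms; within guidance: {within_limit}")
```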

Conclusion

This post demonstrated how to use custom vocabulary in Amazon Transcribe Medical. As we continue to work together to grapple with the coronavirus pandemic, voice applications and data analytics solutions can use such custom vocabularies to transcribe COVID-19-related terms for valuable analyses.

Amazon Transcribe Medical is available as both batch (asynchronous) and streaming (synchronous) public APIs. The service offers state-of-the-art medical transcription for both dictation and conversation, with support for US English in primary care (spanning Internal Medicine, Family Medicine, Pediatrics, and OB-GYN). Try creating your own custom medical vocabulary and transcribing medical speech on the service console today!

About the authors

Paul Zhao is a Product Manager at AWS Machine Learning. He manages the Amazon Transcribe service. Outside of work, Paul is a motorcycle enthusiast and avid woodworker.

Katrin Kirchhoff is a Senior Manager and Principal Scientist at AWS AI. She works on machine learning for several AWS language services. In her spare time she likes traveling and exploring new places.

Scott Seyfarth is a Data Scientist at AWS AI. He works on improving the Amazon Transcribe and Transcribe Medical services. Scott is also a phonetician and a linguist who has done research on Armenian, Javanese, and American English.