Turning Microsoft Word documents into audio playlists using Amazon Polly

Listening to your Microsoft Word documents as audio is a great way to save time or to be productive on a long commute. You can easily convert an entire block of text into MP3 format with Amazon Polly. But you can vastly improve your listening experience with just a few simple steps.

In this blog post, I show how you can use a serverless workflow to convert your word documents into MP3 playlists using AWS Lambda and Amazon Polly.

To review a Word document that I needed to listen to, I converted the whole document to one block of text, then converted it to MP3 using Amazon Polly. After listening, I realized that a long, single-voice MP3 file results in a monotonous stream of audio.

Next, I split the document into small parts and processed each part with a different voice and cadence. This process added audio cues to keep me engaged while listening. I came up with the following serverless architecture that takes in a Microsoft Word document and generates MP3 files and an ordered M3U playlist file. I can download my list and listen to the Word document as an audio playlist anywhere!

Solution overview

The following diagram shows the architecture of this solution.

The following steps generate the MP3 files and playlist:

Upload the Word document to the Project bucket at /src.
On upload, a PUT object event triggers the Word to SSML AWS Lambda function.
The Lambda function splits the document into multiple SSML files, assigns a VoiceId tag to each file, and saves them to the project bucket at /ssml.
Several PUT object events in the /ssml key trigger the Amazon Polly SSML to MP3 Lambda function, which starts an Amazon Polly task to convert the SSML document into an MP3 file. The Amazon Polly task then saves the MP3 file in Amazon S3 and the file metadata to the Mp3 metadata table in Amazon DynamoDB.
After Amazon Polly completes its tasks, invoke the m3u builder Lambda function to generate the m3u playlist file and save it to the Project bucket.

The following table shows the solution components and describes how they are used.

Resource	Type	Description
Project bucket	S3 bucket	S3 bucket used for storing the Word document before processing, the generated SSML files, the generated MP3 files, and the M3U playlist file. Event notifications on the bucket trigger various Lambda functions.
Word to SSML	Lambda function	A Lambda function that uses the Java 8 runtime to take in a Word document and split it into several SSML documents based on the contained sections, topics, and paragraphs in the document. The S3 bucket stores the SSML documents, with each file assigned a VoiceId tag used later by the Amazon Polly SSML to MP3 Lambda function.
Amazon Polly SSML to MP3	Lambda function	A Lambda function that takes one SSML file in S3 and converts it to MP3 using an Amazon Polly voice that matches the assigned VoiceId. It then stores the MP3 files in the Project bucket. It also saves the metadata of processed files and the corresponding Amazon Polly tasks to a DynamoDB table.
MP3 metadata	DynamoDB table	A DynamoDB table that stores the metadata of processed SSML files and corresponding Amazon Polly tasks.
M3U builder	Lambda function	A Lambda function that processes the metadata in the MP3 metadata table database, generates a correctly ordered M3U playlist file, and stores it in the Project bucket.

Building the Word to SSML Lambda function

I used Apache POI to read the Word document and split it into several small SSML files. I provide an extensible implementation that works for any three-level document that contains a set of sections, each containing a set of topics, and each of those topics containing a set of paragraphs.

I used the public Amazon Polly FAQs as an example document, which uses categories of the FAQ (for example, general, billing, data privacy) as the sections. Those sections divide into individual questions for the topics, and into individual answers for the paragraphs.

This same model generally applies to any three-level document: The user supplies a way to identify sections and topics. The default implementation extracts the sections from text with the Heading 1 Word style and identifies topics by recognizing the question mark character in the sentence.

Prerequisites

You need a few tools to follow the steps in this post:

OpenJDK 8 and Apache Maven 3.5: The Word to SSML Lambda function uses the Java 8 runtime and uses Apache Maven for packaging. Install OpenJDK version 8 or higher and Maven version 3.5 or higher. I tested this solution with Maven version 3.5.0 and OpenJDK Runtime Environment Corretto-8.202.08.2.
AWS Command Line Interface: Some of the instructions assume that you have a working AWS CLI version to execute the test steps.
S3 bucket: Lambda functions can only use artifacts from an S3 bucket in the Region in which you choose to deploy your solution. Choose a bucket to reuse, or create a bucket by running the following command:
```
aws s3 mb s3:// --region 
```

Deployment steps

Follow these steps to deploy your tool.

Clone the GitHub repository for the project.

git clone https://github.com/aws-samples/amazon-polly-mp3-for-microsoft-word.git

Export the AWS Region, project S3 bucket, and AWS CloudFormation stack name as environment variables for convenience.
```
export PROJECT_BUCKET=
export REGION= 
export STACK_NAME=polly-stack
```
Change to the project directory and execute the deploy_lambda_cloudformation.sh script to provide your chosen AWS Region, S3 bucket, and name for your CloudFormation stack. This script performs the following actions:
1. Packages the three Lambda functions and copies it to your S3 bucket.
2. Copies the CloudFormation template to your S3 bucket.
3. Deploys the stack with the chosen name.
4. Waits until the Lambda function successfully creates the stack. This should take approximately two minutes.
5. Updates the bucket notifications template (scripts/bucket_lambda_notification.json) with values from the stack output.
6. Adds event notifications to the S3 bucket.
```
cd Amazon-Polly-Microsoft-Word-to-MP3
bash scripts/deploy_lambda_cloudformation.sh $REGION $PROJECT_BUCKET $STACK_NAME
```
[Optional] After the script executes, in the AWS CloudFormation console, verify that the stack deployed and is in CREATE_COMPLETE status.
In the S3 console, verify that the bucket contains your event notifications. The first notification, on the polly-faq-reader/src/ path, invokes the Word to SSML Lambda function when a new DOCX file uploads to this path. This Lambda function generates several SSML text files and uploads them to the polly-faq-reader/ssml/ A notification set up on this path then invokes the Amazon Polly SSML to MP3 Lambda function. The following screenshot shows sample events.
Now you’re ready to test the MP3 conversion. Copy the demo/src/polly-faq.docx to the Project bucket at polly-faq-reader/src/. This triggers the Lambda functions to generate SSML and MP3 files.
```
aws s3 cp demo/src/polly-faq.docx s3://${PROJECT_BUCKET}/polly-faq-reader/src/
```

List the polly-faq-reader/ prefix in the S3 bucket and verify that it generates new SSML and MP3 directories.

aws s3 ls s3://$PROJECT_BUCKET/polly-faq-reader/
                           PRE mp3/
                           PRE src/
                           PRE ssml/

Wait about two minutes for the Amazon Polly tasks to complete. To verify when MP3 conversion completes, you can verify that the number of files in the /ssml directory matches the number of /mp3 files.
```
aws s3 ls s3://$PROJECT_BUCKET/polly-faq-reader/mp3/ | wc -l
     62
aws s3 ls s3://$PROJECT_BUCKET/polly-faq-reader/ssml/ | wc -l
     62
```
The tool builds an M3U playlist file to play all the generated MP3 files in the correct order. In your terminal, in the scripts directory, execute the invoke_m3u_builder.sh script providing your Region, bucket name, and name of your AWS CloudFormation stack.
```
bash scripts/invoke_m3u_builder.sh $REGION ${PROJECT_BUCKET} ${STACK_NAME}
```
Verify that a new polly-faq.m3u file is present in the S3 bucket at polly-faq-reader/mp3/.
```
aws s3 ls s3://$PROJECT_BUCKET/polly-faq-reader/mp3/polly-faq.m3u
```

Download the mp3 files and m3u playlist to your computer.

cd 
aws s3 sync s3://$PROJECT_BUCKET/polly-faq-reader/mp3/

Open the m3u playlist file in your preferred media player and listen to the files.

Clean up

To clean up the deployment and avoid incurring future costs, follow these steps:

In the S3 console, select your bucket and delete the two event notifications.
In the AWS CloudFormation console, and delete the polly-stack.

If you no longer need the SSML or MP3 files, delete them. Run the following commands:

aws s3 rm --recursive s3://$PROJECT_BUCKET/polly-faq-reader/ssml/
aws s3 rm --recursive s3://$PROJECT_BUCKET/polly-faq-reader/mp3/

Conclusion

In this post, I demonstrated a serverless workflow to convert Microsoft Word documents into an MP3 audio playlist using Amazon Polly and AWS Lambda.

To dig deeper into the code, check out the GitHub repository and create issues for providing feedback or suggesting enhancements. Open-source code contributions are welcome as pull requests.

About the Author

Vinod Shukla is a Partner Solutions Architect at Amazon Web Services. As part of the AWS Quick Starts team, he enjoys working with partners providing technical guidance and assistance in building gold-standard reference deployments.