Build a serverless meeting summarization backend with large language models on Amazon SageMaker JumpStart

AWS delivers services that meet customers’ artificial intelligence (AI) and machine learning (ML) needs with services ranging from custom hardware like AWS Trainium and AWS Inferentia to generative AI foundation models (FMs) on Amazon Bedrock. In February 2022, AWS and Hugging Face announced a collaboration to make generative AI more accessible and cost efficient.

Generative AI has grown at an accelerating rate from the largest pre-trained model in 2019 having 330 million parameters to more than 500 billion parameters today. The performance and quality of the models also improved drastically with the number of parameters. These models span tasks like text-to-text, text-to-image, text-to-embedding, and more. You can use large language models (LLMs), more specifically, for tasks including summarization, metadata extraction, and question answering.

Amazon SageMaker JumpStart is an ML hub that can helps you accelerate your ML journey. With JumpStart, you can access pre-trained models and foundation models from the Foundations Model Hub to perform tasks like article summarization and image generation. Pre-trained models are fully customizable for your use cases and can be easily deployed into production with the user interface or SDK. Most importantly, none of your data is used to train the underlying models. Because all data is encrypted and doesn’t leave the virtual private cloud (VPC), you can trust that your data will remain private and confidential.

This post focuses on building a serverless meeting summarization using Amazon Transcribe to transcribe meeting audio and the Flan-T5-XL model from Hugging Face (available on JumpStart) for summarization.

Solution overview

The Meeting Notes Generator Solution creates an automated serverless pipeline using AWS Lambda for transcribing and summarizing audio and video recordings of meetings. The solution can be deployed with other FMs available on JumpStart.

The solution includes the following components:

A shell script for creating a custom Lambda layer
A configurable AWS CloudFormation template for deploying the solution
Lambda function code for starting Amazon Transcribe transcription jobs
Lambda function code for invoking a SageMaker real-time endpoint hosting the Flan T5 XL model

The following diagram illustrates this architecture.

As shown in the architecture diagram, the meeting recordings, transcripts, and notes are stored in respective Amazon Simple Storage Service (Amazon S3) buckets. The solution takes an event-driven approach to transcribe and summarize upon S3 upload events. The events trigger Lambda functions to make API calls to Amazon Transcribe and invoke the real-time endpoint hosting the Flan T5 XL model.

The CloudFormation template and instructions for deploying the solution can be found in the GitHub repository.

Real-time inference with SageMaker

Real-time inference on SageMaker is designed for workloads with low latency requirements. SageMaker endpoints are fully managed and support multiple hosting options and auto scaling. Once created, the endpoint can be invoked with the InvokeEndpoint API. The provided CloudFormation template creates a real-time endpoint with the default instance count of 1, but it can be adjusted based on expected load on the endpoint and as the service quota for the instance type permits. You can request service quota increases on the Service Quotas page of the AWS Management Console.

The following snippet of the CloudFormation template defines the SageMaker model, endpoint configuration, and endpoint using the ModelData and ImageURI of the Flan T5 XL from JumpStart. You can explore more FMs on Getting started with Amazon SageMaker JumpStart. To deploy the solution with a different model, replace the ModelData and ImageURI parameters in the CloudFormation template with the desired model S3 artifact and container image URI, respectively. Check out the sample notebook on GitHub for sample code on how to retrieve the latest JumpStart model artifact on Amazon S3 and the corresponding public container image provided by SageMaker.

  # SageMaker Model
  SageMakerModel:
    Type: AWS::SageMaker::Model
    Properties:
      ModelName: !Sub ${AWS::StackName}-SageMakerModel
      Containers:
        - Image: !Ref ImageURI
          ModelDataUrl: !Ref ModelData
          Mode: SingleModel
          Environment: {
            "MODEL_CACHE_ROOT": "/opt/ml/model",
            "SAGEMAKER_ENV": "1",
            "SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600",
            "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
            "SAGEMAKER_PROGRAM": "inference.py",
            "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code/",
            "TS_DEFAULT_WORKERS_PER_MODEL": 1
          }
      EnableNetworkIsolation: true
      ExecutionRoleArn: !GetAtt SageMakerExecutionRole.Arn

  # SageMaker Endpoint Config
  SageMakerEndpointConfig:
    Type: AWS::SageMaker::EndpointConfig
    Properties:
      EndpointConfigName: !Sub ${AWS::StackName}-SageMakerEndpointConfig
      ProductionVariants:
        - ModelName: !GetAtt SageMakerModel.ModelName
          VariantName: !Sub ${SageMakerModel.ModelName}-1
          InitialInstanceCount: !Ref InstanceCount
          InstanceType: !Ref InstanceType
          InitialVariantWeight: 1.0
          VolumeSizeInGB: 40

  # SageMaker Endpoint
  SageMakerEndpoint:
    Type: AWS::SageMaker::Endpoint
    Properties:
      EndpointName: !Sub ${AWS::StackName}-SageMakerEndpoint
      EndpointConfigName: !GetAtt SageMakerEndpointConfig.EndpointConfigName

Deploy the solution

For detailed steps on deploying the solution, follow the Deployment with CloudFormation section of the GitHub repository.

If you want to use a different instance type or more instances for the endpoint, submit a quota increase request for the desired instance type on the AWS Service Quotas Dashboard.

To use a different FM for the endpoint, replace the ImageURI and ModelData parameters in the CloudFormation template for the corresponding FM.

Test the solution

After you deploy the solution using the Lambda layer creation script and the CloudFormation template, you can test the architecture by uploading an audio or video meeting recording in any of the media formats supported by Amazon Transcribe. Complete the following steps:

On the Amazon S3 console, choose Buckets in the navigation pane.
From the list of S3 buckets, choose the S3 bucket created by the CloudFormation template named meeting-note-generator-demo-bucket-.
Choose Create folder.
For Folder name, enter the S3 prefix specified in the S3RecordingsPrefix parameter of the CloudFormation template (recordings by default).
Choose Create folder.
In the newly created folder, choose Upload.
Choose Add files and choose the meeting recording file to upload.
Choose Upload.

Now we can check for a successful transcription.

On the Amazon Transcribe console, choose Transcription jobs in the navigation pane.
Check that a transcription job with a corresponding name to the uploaded meeting recording has the status In progress or Complete.
When the status is Complete, return to the Amazon S3 console and open the demo bucket.
In the S3 bucket, open the transcripts/ folder.
Download the generated text file to view the transcription.

We can also check the generated summary.

In the S3 bucket, open the notes/ folder.
Download the generated text file to view the generated summary.

Prompt engineering

Even though LLMs have improved in the last few years, the models can only take in finite inputs; therefore, inserting an entire transcript of a meeting may exceed the limit of the model and cause an error with the invocation. To design around this challenge, we can break down the context into manageable chunks by limiting the number of tokens in each invocation context. In this sample solution, the transcript is broken down into smaller chunks with a maximum limit on the number of tokens per chunk. Then each transcript chunk is summarized using the Flan T5 XL model. Finally, the chunk summaries are combined to form the context for the final combined summary, as shown in the following diagram.

The following code from the GenerateMeetingNotes Lambda function uses the Natural Language Toolkit (NLTK) library to tokenize the transcript, then it chunks the transcript into sections, each containing up to a certain number of tokens:

# Chunk transcript into chunks
transcript = contents['results']['transcripts'][0]['transcript']
transcript_tokens = word_tokenize(transcript)

num_chunks = int(math.ceil(len(transcript_tokens) / CHUNK_LENGTH))
transcript_chunks = []
for i in range(num_chunks):
    if i == num_chunks - 1:
        chunk = TreebankWordDetokenizer().detokenize(transcript_tokens[CHUNK_LENGTH * i:])
    else:
        chunk = TreebankWordDetokenizer().detokenize(transcript_tokens[CHUNK_LENGTH * i:CHUNK_LENGTH * (i + 1)])
    transcript_chunks.append(chunk)

After the transcript is broken up into smaller chunks, the following code invokes the SageMaker real-time inference endpoint to get summaries of each transcript chunk:

# Summarize each chunk
chunk_summaries = []
for i in range(len(transcript_chunks)):
    text_input = '{}n{}'.format(transcript_chunks[i], instruction)
    payload = {
        "text_inputs": text_input,
        "max_length": 100,
        "num_return_sequences": 1,
        "top_k": 50,
        "top_p": 0.95,
        "do_sample": True
    }
    query_response = query_endpoint_with_json_payload(json.dumps(payload).encode('utf-8'))
    generated_texts = parse_response_multiple_texts(query_response)
    chunk_summaries.append(generated_texts[0])
    print(generated_texts[0])

Finally, the following code snippet combines the chunk summaries as the context to generate a final summary:

# Create a combined summary
text_input = '{}n{}'.format(' '.join(chunk_summaries), instruction)
payload = {
    "text_inputs": text_input,
    "max_length": 100,
    "num_return_sequences": 1,
    "top_k": 50,
    "top_p": 0.95,
    "do_sample": True
}
query_response = query_endpoint_with_json_payload(json.dumps(payload).encode('utf-8'))
generated_texts = parse_response_multiple_texts(query_response)

results = {
    "summary": generated_texts,
    "chunk_summaries": chunk_summaries
}

The full GenerateMeetingNotes Lambda function can be found in the GitHub repository.

Clean up

To clean up the solution, complete the following steps:

Delete all objects in the demo S3 bucket and the logs S3 bucket.
Delete the CloudFormation stack.
Delete the Lambda layer.

Conclusion

This post demonstrated how to use FMs on JumpStart to quickly build a serverless meeting notes generator architecture with AWS CloudFormation. Combined with AWS AI services like Amazon Transcribe and serverless technologies like Lambda, you can use FMs on JumpStart and Amazon Bedrock to build applications for various generative AI use cases.

For additional posts on ML at AWS, visit the AWS ML Blog.

About the author

Eric Kim is a Solutions Architect (SA) at Amazon Web Services. He works with game developers and publishers to build scalable games and supporting services on AWS. He primarily focuses on applications of artificial intelligence and machine learning.