Generating searchable PDFs from scanned documents automatically with Amazon Textract
Amazon Textract is a machine learning service that makes it easy to extract text and data from virtually any document. Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables. This allows you to use Amazon Textract to instantly “read” virtually any type of document and accurately extract text and data without the need for any manual effort or custom code.
The blog post Automatically extract text and structured data from documents with Amazon Textract shows how to use Amazon Textract to automatically extract text and data from scanned documents without any machine learning (ML) experience. One of the use cases covered in the post is search and discovery. You can search through millions of documents by extracting text and structured data from documents with Amazon Textract and creating a smart index using Amazon ES.
This post demonstrates how to generate searchable PDF documents by extracting text from scanned documents using Amazon Textract. The solution allows you to download relevant documents, search within a document when it is stored offline, or select and copy text.
You can see an example of a searchable PDF document generated from a scanned document using Amazon Textract. While text is locked in images in the scanned document, you can select, copy, and search text in the searchable PDF document.
To generate a searchable PDF, use Amazon Textract to extract text from documents and add the extracted text as a layer to the image in the PDF document. Amazon Textract detects and analyzes text in input documents and returns information about detected items such as pages, words, lines, form data (key-value pairs), tables, and selection elements. It also provides bounding box information, which is an axis-aligned coarse representation of the location of the recognized item on the document page. You can use the detected text and its bounding box information to place text in the PDF page.
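To make the placement step concrete, the following is a minimal sketch of the coordinate conversion involved. It assumes the bounding box values are ratios of the page width and height measured from the top-left corner of the page, which is how Amazon Textract reports them, and that the PDF page uses the default coordinate system with its origin at the bottom-left corner. The names BoundingBoxMapper, PdfRect, and toPdfRect are hypothetical and are not part of the sample library.

```java
public class BoundingBoxMapper {

    /** Rectangle in PDF user space; (x, y) is the lower-left corner, in points. */
    public static final class PdfRect {
        public final double x;
        public final double y;
        public final double width;
        public final double height;

        PdfRect(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Converts a normalized bounding box (ratios in the range 0..1, origin at
     * the top-left of the page) into PDF user-space coordinates (points,
     * origin at the bottom-left of the page).
     */
    public static PdfRect toPdfRect(double left, double top, double width, double height,
                                    double pageWidth, double pageHeight) {
        double w = width * pageWidth;
        double h = height * pageHeight;
        double x = left * pageWidth;
        // The box top is measured down from the upper edge of the page,
        // while PDF measures y up from the lower edge, so flip the axis.
        double y = pageHeight - (top * pageHeight) - h;
        return new PdfRect(x, y, w, h);
    }

    public static void main(String[] args) {
        // Assumed page size for illustration: US Letter, 612 x 792 points.
        PdfRect r = toPdfRect(0.25, 0.5, 0.5, 0.125, 612, 792);
        System.out.println("x=" + r.x + " y=" + r.y + " w=" + r.width + " h=" + r.height);
    }
}
```

The resulting rectangle gives the position at which to start the text and the width and height the text should fit within, so the hidden text layer lines up with the words in the page image.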
PDFDocument is a sample library in the AWS Samples GitHub repo that provides the necessary logic to generate a searchable PDF document using Amazon Textract. It uses the open-source Java library Apache PDFBox to create PDF documents; similar PDF processing libraries are available in other programming languages.
You can use the sample library to generate a searchable PDF document from an image or from a PDF document, as described in the following sections. The code examples are available on the GitHub repo.
Generating a searchable PDF from an image document
This example takes an image document and generates a corresponding searchable PDF document. You extract the text using Amazon Textract and create a searchable PDF by adding the text as a layer with the image.
Generating a searchable PDF from a PDF document
This example takes an input PDF document from an Amazon S3 bucket and generates the corresponding searchable PDF document. You extract text from the PDF document using Amazon Textract, and create a searchable PDF by adding text as a layer with an image for each page.
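Because the text layer is added page by page, the detected lines for a multi-page PDF need to be grouped by page before they are written. The following is a minimal sketch of that grouping step, assuming each detected line carries the 1-based page number on which it was found. The PageGrouping and DetectedLine names are hypothetical and are used for illustration only; they are not taken from the sample library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PageGrouping {

    /** Hypothetical holder for one detected line of text and its page number. */
    public static final class DetectedLine {
        public final int page;
        public final String text;

        public DetectedLine(int page, String text) {
            this.page = page;
            this.text = text;
        }
    }

    /**
     * Groups line text by page number. Pages are sorted in ascending order,
     * and lines within a page keep the order in which they were received.
     */
    public static Map<Integer, List<String>> groupByPage(List<DetectedLine> lines) {
        Map<Integer, List<String>> pages = new TreeMap<>();
        for (DetectedLine line : lines) {
            pages.computeIfAbsent(line.page, p -> new ArrayList<>()).add(line.text);
        }
        return pages;
    }

    public static void main(String[] args) {
        // Made-up sample lines, deliberately out of page order.
        List<DetectedLine> lines = Arrays.asList(
                new DetectedLine(1, "Invoice"),
                new DetectedLine(2, "Terms and conditions"),
                new DetectedLine(1, "Total due"));
        groupByPage(lines).forEach((page, texts) ->
                System.out.println("Page " + page + ": " + texts));
    }
}
```

With the lines grouped this way, each PDF page can be built from its page image together with only the text that belongs on it.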
Running code on a local machine
To run the code on a local machine, complete the following steps. The code examples are available on the GitHub repo.
- Set up your AWS Account and AWS CLI.
For more information, see Getting Started with Amazon Textract.
- Download and unzip searchablepdf.zip from the GitHub repo.
- Install Apache Maven if it is not already installed.
- In the project directory, run mvn package.
- Run java -cp target/searchable-pdf-1.0.jar Demo.
This runs the Java project with Demo as the main class.
By default, only the first example, which creates a searchable PDF from an image on a local drive, is enabled. To run the other examples, uncomment the relevant lines in the Demo class.
Running code in Lambda
To run the code in Lambda, complete the following steps. The code examples are available on the GitHub repo.
- Download and unzip searchablepdf.zip from the GitHub repo.
- Install Apache Maven if it is not already installed.
- In the project directory, run mvn package.
The build creates a .jar in project-dir/target/searchable-pdf-1.0.jar, using information in the pom.xml to do the necessary transforms. This is a standalone .jar (.zip file) that includes all the dependencies. This is your deployment package that you can upload to Lambda to create a function. For more information, see AWS Lambda Deployment Package in Java. DemoLambda has all the necessary code to read S3 events and take action based on the type of input document.
- Create an S3 bucket.
- In the S3 bucket, create a folder labeled documents.
- Create a Lambda function with the Java 8 runtime and an IAM role that has read and write permissions to the S3 bucket you created earlier.
- Configure the IAM role to also have permissions to call Amazon Textract.
- Set the handler to DemoLambda::handleRequest.
- Increase the timeout to 5 minutes.
- Upload the .jar file you built earlier.
- Add a trigger to the Lambda function so that the function runs when an object is uploaded to the documents folder.
Make sure that you set the trigger for the documents folder only. If you add a trigger for the whole bucket, the function also runs every time an output PDF document is generated.
- Upload an image (.jpeg or .png) or PDF document to the documents folder in your S3 bucket.
In a few seconds, you should see the searchable PDF document in your S3 bucket.
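In addition to scoping the trigger to the documents folder, a function can check the object key itself before doing any work, so that an output PDF never causes the function to run again. The following is a minimal sketch of such a guard. The DocumentKeyFilter class and its shouldProcess method are hypothetical and are not taken from DemoLambda; accepting the .jpg extension alongside .jpeg is also an assumption.

```java
import java.util.Locale;

public class DocumentKeyFilter {

    /** Key prefix of the input folder created in the steps above. */
    static final String INPUT_PREFIX = "documents/";

    /**
     * Returns true only for objects inside the input folder that have a
     * supported image or PDF extension.
     */
    public static boolean shouldProcess(String key) {
        if (key == null || !key.startsWith(INPUT_PREFIX)) {
            return false;
        }
        String lower = key.toLowerCase(Locale.ROOT);
        return lower.endsWith(".pdf")
                || lower.endsWith(".png")
                || lower.endsWith(".jpeg")
                || lower.endsWith(".jpg"); // assumption: treat .jpg like .jpeg
    }

    public static void main(String[] args) {
        // Made-up object keys for illustration.
        String[] keys = {"documents/scan.png", "documents/notes.txt", "output/scan.pdf"};
        for (String key : keys) {
            System.out.println(key + " -> " + shouldProcess(key));
        }
    }
}
```

A check like this only protects against repeat invocations if the output PDF is written outside the documents folder.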
These steps show a simple S3 and Lambda integration. For large-scale document processing, see the reference architecture on the GitHub repo.
Conclusion
This post showed how to use Amazon Textract to generate searchable PDF documents automatically. You can search across millions of documents to find the relevant file by creating a smart search index using Amazon ES. Searchable PDF documents then allow you to select and copy text and search within a document after downloading it for offline use.
To learn more about different text and data extraction features of Amazon Textract, see How Amazon Textract Works.
About the Authors
Kashif Imran is a Solutions Architect at Amazon Web Services. He works with some of the largest strategic AWS customers to provide technical guidance and design advice. His expertise spans application architecture, serverless, containers, NoSQL and machine learning.