Redacting PII from application log output with Amazon Comprehend

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to find insights and relationships in text. The service can extract people, places, sentiments, and topics from unstructured data. You can now use Amazon Comprehend ML capabilities to detect and redact personally identifiable information (PII) in application logs, customer emails, support tickets, and more. No ML experience is required. Redacting PII entities helps you protect privacy and comply with local laws and regulations.

Use case: Applications printing PII data in log output

Some applications inadvertently print PII in their log output. Sometimes developers forget to remove debug statements before deploying the application to production; in other cases, legacy applications have been handed down and are difficult to update. PII can also get printed in stack traces. It’s generally a mistake to have PII present in such logs. Correlation IDs and primary keys are better identifiers than PII when debugging applications.

PII in application logs can quickly propagate to downstream systems, compounding security concerns. For example, it may get submitted to search and analytics systems, where it’s searchable and viewable by everyone. It may also be stored in object storage such as Amazon Simple Storage Service (Amazon S3) for analytics purposes. With the PII detection API of Amazon Comprehend, you can remove PII from application log output before the log statement is even printed.

In this post, I take the use case of a Java application that is generating log output with PII. The initial log output goes through filter-like processing that redacts PII before the log statement is output by the application. You can take a similar approach for other programming languages.

The application can be repackaged by changing its log format file, such as log4j.xml, and adding one Java class from this sample project, or adding this Java class as a dependency in the form of a .jar file.

The sample application is available in the following GitHub repo.

PII entity types

The following table lists some of the entity types Amazon Comprehend detects.

PII Entity Types	Description
EMAIL	An email address, such as marymajor@email.com.
NAME	An individual’s name. This entity type does not include titles, such as Mr., Mrs., Miss, or Dr. Amazon Comprehend does not apply this entity type to names that are part of organizations or addresses. For example, Amazon Comprehend recognizes the “John Doe Organization” as an organization, and it recognizes “Jane Doe Street” as an address.
PHONE	A phone number. This entity type also includes fax and pager numbers.
SSN	A Social Security Number (SSN) is a 9-digit number that is issued to US citizens, permanent residents, and temporary working residents. Amazon Comprehend also recognizes Social Security Numbers when only the last 4 digits are present.

For the full list, see Detect Personally Identifiable Information (PII).

The API response from Amazon Comprehend includes the entity type, its begin offset, end offset, and a confidence score. For this post, we use all of them.
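To make the role of the offsets concrete, here is a minimal sketch of applying detected entities to a log line. It does not call Amazon Comprehend: PiiSpan is a hypothetical stand-in that carries the same four fields as the response (type, begin offset, end offset, score), the offsets and scores are illustrative values for this sample line, and I assume the end offset is exclusive.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class OffsetRedactionSketch {

    /** Hypothetical stand-in for one detected entity; not an AWS SDK class. */
    static final class PiiSpan {
        final String type;
        final int beginOffset;
        final int endOffset; // assumption: exclusive end index
        final float score;

        PiiSpan(String type, int beginOffset, int endOffset, float score) {
            this.type = type;
            this.beginOffset = beginOffset;
            this.endOffset = endOffset;
            this.score = score;
        }
    }

    /** Replaces every span with its bracketed type name, for example [SSN]. */
    static String redact(String text, List<PiiSpan> spans) {
        List<PiiSpan> ordered = new ArrayList<>(spans);
        // Apply replacements from right to left so that the offsets of
        // spans earlier in the string remain valid after each edit.
        ordered.sort(Comparator.comparingInt((PiiSpan s) -> s.beginOffset).reversed());
        StringBuilder out = new StringBuilder(text);
        for (PiiSpan span : ordered) {
            out.replace(span.beginOffset, span.endOffset, "[" + span.type + "]");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "User Napoleon, SSN 366435036, opened an account";
        List<PiiSpan> spans = List.of(
                new PiiSpan("NAME", 5, 13, 0.99f),  // "Napoleon"
                new PiiSpan("SSN", 19, 28, 0.99f)); // "366435036"
        // prints: User [NAME], SSN [SSN], opened an account
        System.out.println(redact(line, spans));
    }
}
```

Working from the end of the string matters because a replacement such as [SSN] is usually a different length than the text it replaces, which would shift every offset that follows it.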

Application overview

Our example application is a simple application that simulates opening a bank account for a user. In its current form, the log output looks like the following. We can see this by making requests to the payment endpoint:

curl localhost:8080/payment
2020-09-29T10:29:04,115 INFO [http-nio-8080-exec-1] c.e.l.c.PaymentController: Processing user User(name=Terina, ssn=626031641, email=mel.swift@Taylor.com, description=Ea minima omnis autem illo.)
2020-09-29T10:29:04,711 INFO [http-nio-8080-exec-2] c.e.l.c.PaymentController: User Napoleon, SSN 366435036, opened an account
2020-09-29T10:29:05,253 INFO [http-nio-8080-exec-4] c.e.l.c.PaymentController: User Cristen, SSN 197961488, opened an account
2020-09-29T10:29:05,673 INFO [http-nio-8080-exec-5] c.e.l.c.PaymentController: Processing user User(name=Giuseppe, ssn=713425581, email=elijah.dach@Shawnna.com, description=Impedit asperiores in magnam exercitationem.)

The output prints Name, SSN, and Email. This PII data is being generated by the java-faker library, which is a Java port of the well-known Ruby gem. See the following code:

        
        <dependency>
            <groupId>com.github.javafaker</groupId>
            <artifactId>javafaker</artifactId>
            <version>1.0.2</version>
        </dependency>

Log4j 2

Log4j 2 is a common Java library used for logging. Appenders in Log4j are responsible for delivering log events to their destinations, which can be console, file, and more. Log4j also has a RewriteAppender that lets you rewrite the log message before it is output. RewriteAppender works in conjunction with a RewritePolicy that provides the implementation for changing the log output.

The sample application uses the following log4j.xml file for log configuration:

SensitiveDataPolicy

The Log4j RewritePolicy we created for this project is named SensitiveDataPolicy. It uses four parameters:

maskMode – This parameter has two modes:
- REPLACE – The policy replaces discovered entities with their type names. For example, in the case of Social Security numbers, the replaced string is [SSN].
- MASK – The policy replaces the discovered entity with a string consisting of the character provided as the mask parameter.
mask – The character used to replace the discovered entity. Only relevant if maskMode is MASK.
minScore – The minimum confidence score we accept for a detected entity.
entitiesToReplace – A comma-separated list of entity type names that we want to replace. For example, we’re choosing to replace social security number and email, so the string value we provide is SSN,EMAIL. Amazon Comprehend also detects NAME in our application, but it’s printed as is.

Choosing between redaction (REPLACE mode) and masking (MASK mode) is a matter of preference. Redaction is usually preferred when the context needs to be preserved, such as in natural text, whereas masking is best for maintaining text length and for structured data such as formatted files or key-value pairs.
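The following sketch shows one way the four parameters could interact. It is not the SensitiveDataPolicy source from the sample project: the class and method names are hypothetical, the 0.90 threshold is an illustrative value, and treating minScore as an inclusive lower bound is my assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class MaskModeSketch {

    enum MaskMode { REPLACE, MASK }

    /** Parses a comma-separated entitiesToReplace value such as "SSN,EMAIL". */
    static Set<String> parseEntities(String csv) {
        Set<String> types = new HashSet<>();
        for (String part : csv.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                types.add(trimmed);
            }
        }
        return types;
    }

    /** An entity is rewritten only if it is confident enough and its type is listed. */
    static boolean shouldRewrite(String type, float score, float minScore,
                                 Set<String> entitiesToReplace) {
        // Assumption: a score equal to minScore is accepted.
        return score >= minScore && entitiesToReplace.contains(type);
    }

    /** Builds the text that takes the place of one detected entity value. */
    static String replacement(MaskMode mode, char mask, String type, String value) {
        if (mode == MaskMode.REPLACE) {
            return "[" + type + "]";
        }
        // MASK keeps the original length, one mask character per original character.
        char[] filler = new char[value.length()];
        Arrays.fill(filler, mask);
        return new String(filler);
    }

    public static void main(String[] args) {
        Set<String> entities = parseEntities("SSN,EMAIL");
        // NAME is detected but not listed, so it is printed as is.
        System.out.println(shouldRewrite("NAME", 0.99f, 0.90f, entities)); // false
        System.out.println(replacement(MaskMode.REPLACE, '*', "SSN", "366435036")); // [SSN]
        System.out.println(replacement(MaskMode.MASK, '*', "SSN", "366435036"));    // *********
    }
}
```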

Detecting PII is as simple as making an API call to Amazon Comprehend using the AWS SDK and providing the text to analyze:

        DetectPiiEntitiesRequest piiEntitiesRequest =
                DetectPiiEntitiesRequest.builder()
                        .languageCode("en")
                        .text(msg.getFormattedMessage())
                        .build();

        DetectPiiEntitiesResponse piiEntitiesResponse = comprehendClient.detectPiiEntities(piiEntitiesRequest);

Asynchronous logging

Because our policy makes synchronous calls to Amazon Comprehend for PII detection, we want this processing to happen asynchronously, outside the customer request path, to avoid adding latency to requests. For instructions, see Asynchronous Loggers for Low-Latency Logging. We add the Disruptor library to our classpath by adding it to pom.xml:

        
        <dependency>
            <groupId>com.lmax</groupId>
            <artifactId>disruptor</artifactId>
            <version>3.4.2</version>
        </dependency>

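We also need to set a system property. After we package our application with mvn package, we can run it as in the following code. The -D option must come before -jar; anything placed after the .jar file name is passed to the application as a program argument rather than being set as a system property:

java -Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -jar target/comprehend-logging.jar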
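
To illustrate why asynchronous logging keeps the detection call out of the request path, here is a minimal sketch that uses only the JDK. It is not how Log4j async loggers or the Disruptor are implemented: AsyncRedactingLogger is a hypothetical class, a single-thread executor stands in for the async logger, and a regular expression with a short sleep stands in for the call to Amazon Comprehend.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

final class AsyncRedactingLogger implements AutoCloseable {

    // One worker thread keeps log lines in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final UnaryOperator<String> redactor;
    // Stand-in for the real destination (console, file, and so on).
    private final List<String> sink = new CopyOnWriteArrayList<>();

    AsyncRedactingLogger(UnaryOperator<String> redactor) {
        this.redactor = redactor;
    }

    /** Returns immediately; the slow redaction step runs on the worker thread. */
    void info(String message) {
        worker.execute(() -> sink.add(redactor.apply(message)));
    }

    List<String> written() {
        return sink;
    }

    /** Waits for queued messages to be processed before shutting down. */
    @Override
    public void close() {
        worker.shutdown();
        try {
            worker.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Hypothetical slow redactor: sleeps to mimic a network call,
        // then masks any run of nine digits.
        UnaryOperator<String> slowRedactor = msg -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return msg.replaceAll("\\d{9}", "*********");
        };
        AsyncRedactingLogger log = new AsyncRedactingLogger(slowRedactor);
        log.info("User Napoleon, SSN 366435036, opened an account");
        log.info("User Cristen, SSN 197961488, opened an account");
        log.close();
        log.written().forEach(System.out::println);
    }
}
```

The calling thread pays only the cost of enqueueing the message, so the latency of PII detection is absorbed by the background worker instead of the customer request.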

Updated log output

The log output from this application now looks like the following. We can see that SSN and Email are being suppressed.

2020-09-29T12:52:30,423 INFO  [http-nio-8080-exec-6] ?: User Willa, SSN *********, opened an account
2020-09-29T12:52:30,824 INFO  [http-nio-8080-exec-8] ?: User Vania, SSN *********, opened an account
2020-09-29T12:52:31,245 INFO  [http-nio-8080-exec-9] ?: Processing user User(name=Laronda, ssn=*********, email=******************************, description=Doloremque culpa iure dolore omnis.)
2020-09-29T12:52:31,637 INFO  [http-nio-8080-exec-1] ?: Processing user User(name=Tommye, ssn=*********, email=*************************, description=Corporis sed tempore.)

Conclusion

We learned how to use Amazon Comprehend to redact sensitive data within an application, before it is written to log output. For information about applying it as a postprocessing technique for logs in storage, see Detecting and redacting PII using Amazon Comprehend. The API gives you complete control over the entities that are important for your use case and lets you either mask or redact the information.

For more information about Amazon Comprehend availability and quotas, see Amazon Comprehend endpoints and quotas.

About the Author

Pradeep Singh is a Solutions Architect at Amazon Web Services. He helps AWS customers take advantage of AWS services to design scalable and secure applications. His expertise spans Application Architecture, Containers, Analytics and Machine Learning.