Redacting PII from application log output with Amazon Comprehend
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to find insights and relationships in text. The service can extract people, places, sentiments, and topics in unstructured data. You can now use Amazon Comprehend ML capabilities to detect and redact personally identifiable information (PII) in application logs, customer emails, support tickets, and more. No ML experience required. Redacting PII entities helps you protect privacy and comply with local laws and regulations.
Use case: Applications printing PII data in log output
Some applications print PII data in their log output inadvertently. In some cases, this may be due to developers forgetting to remove debug statements before deploying the application in production, and in other cases it may be due to legacy applications that are handed down and are difficult to update. PII can also get printed in stack traces. It’s generally a mistake to have PII present in such logs. Correlation IDs and primary keys are better identifiers than PII when debugging applications.
PII in application logs can quickly propagate to downstream systems, compounding security concerns. For example, it may get submitted to search and analytics systems where it’s searchable and viewable by everyone. It may also be stored in object storage such as Amazon Simple Storage Service (Amazon S3) for analytics purposes. With the PII detection API of Amazon Comprehend, you can remove PII from application log output before such a log statement even gets printed.
In this post, I take the use case of a Java application that is generating log output with PII. The initial log output goes through filter-like processing that redacts PII before the log statement is output by the application. You can take a similar approach for other programming languages.
The application can be repackaged by changing its log format file, such as log4j.xml, and adding one Java class from this sample project, or adding this Java class as a dependency in the form of a .jar file.
The sample application is available in the following GitHub repo.
PII entity types
The following table lists some of the entity types Amazon Comprehend detects.
PII Entity Type | Description |
EMAIL | An email address, such as marymajor@email.com. |
NAME | An individual’s name. This entity type does not include titles, such as Mr., Mrs., Miss, or Dr. Amazon Comprehend does not apply this entity type to names that are part of organizations or addresses. For example, Amazon Comprehend recognizes the “John Doe Organization” as an organization, and it recognizes “Jane Doe Street” as an address. |
PHONE | A phone number. This entity type also includes fax and pager numbers. |
SSN | A Social Security Number (SSN) is a 9-digit number that is issued to US citizens, permanent residents, and temporary working residents. Amazon Comprehend also recognizes Social Security Numbers when only the last 4 digits are present. |
For the full list, see Detect Personally Identifiable Information (PII).
The API response from Amazon Comprehend includes the entity type, its begin offset, end offset, and a confidence score. For this post, we use all of them.
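As a rough sketch of how these four fields fit together, the following Java snippet models one detected entity and pulls the matching substring out of the analyzed text. The `EntitySpans` class and its `DetectedEntity` record are hypothetical names used for illustration, not the SDK's types. The sketch assumes that offsets are zero-based character positions and that the end offset is exclusive. The score shown is a made-up sample value.

```java
final class EntitySpans {
    /** Hypothetical stand-in for one detected entity: type, offsets, and confidence score. */
    record DetectedEntity(String type, int beginOffset, int endOffset, double score) {}

    /** Returns the slice of text covered by the entity (assumes an exclusive end offset). */
    static String textOf(String text, DetectedEntity entity) {
        return text.substring(entity.beginOffset(), entity.endOffset());
    }

    public static void main(String[] args) {
        String text = "Email: marymajor@email.com";
        // Offsets 7..26 cover the address; 0.99 is an illustrative score.
        DetectedEntity email = new DetectedEntity("EMAIL", 7, 26, 0.99);
        System.out.println(email.type() + " -> " + textOf(text, email));
    }
}
```

Because the response carries offsets rather than the matched text itself, the caller needs to keep the original string around to know what was detected and where to rewrite it.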
Application overview
Our example application is a very simple application that simulates opening a bank account for a user. In its current form, the log output looks like the following code. We can see this by making requests to the endpoint payment:
The output prints Name, SSN, and Email. This PII data is being generated by the java-faker library, which is a Java port of the well-known Ruby gem. See the following code:
Log4j 2
Log4j 2 is a common Java library used for logging. Appenders in Log4j are responsible for delivering log events to their destinations, which can be console, file, and more. Log4j also has a RewriteAppender that lets you rewrite the log message before it is output. RewriteAppender works in conjunction with a RewritePolicy that provides the implementation for changing the log output.
The sample application uses the following log4j.xml file for log configuration:
SensitiveDataPolicy
The Log4j RewritePolicy we created for this project is named SensitiveDataPolicy. It uses four parameters:
- maskMode – This parameter has two modes:
  - REPLACE – The policy replaces discovered entities with their type names. For example, in case of social security numbers, the replaced string is [SSN].
  - MASK – The policy replaces the discovered entity with a string consisting of the character provided as the mask parameter.
- mask – The character used to replace the discovered entity. Only relevant if maskMode is MASK.
- minScore – The minimum confidence score acceptable to us.
- entitiesToReplace – A comma-separated list of entity type names that we want to replace. For example, we’re choosing to replace social security number and email, so the string value we provide is SSN,EMAIL. Amazon Comprehend also detects NAME in our application, but it’s printed as is.
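The way these four parameters interact can be sketched in plain Java. This is a minimal illustration, not the SensitiveDataPolicy class from the sample project. The `RedactionSketch` class, its `Entity` record, and the `redact` method are hypothetical names, and the offsets and scores in `main` are made-up sample values. The sketch assumes entities do not overlap and that end offsets are exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class RedactionSketch {
    enum MaskMode { REPLACE, MASK }

    /** Hypothetical stand-in for a detected entity; not the SDK's type. */
    record Entity(String type, int begin, int end, double score) {}

    /** Rewrites text according to maskMode, mask, minScore, and entitiesToReplace. */
    static String redact(String text, List<Entity> entities, MaskMode maskMode,
                         char mask, double minScore, Set<String> entitiesToReplace) {
        // Apply edits from right to left so earlier offsets stay valid.
        List<Entity> sorted = new ArrayList<>(entities);
        sorted.sort((a, b) -> Integer.compare(b.begin(), a.begin()));
        StringBuilder out = new StringBuilder(text);
        for (Entity e : sorted) {
            // Skip low-confidence hits and types we were not asked to replace.
            if (e.score() < minScore || !entitiesToReplace.contains(e.type())) {
                continue;
            }
            String replacement = maskMode == MaskMode.REPLACE
                    ? "[" + e.type() + "]"                               // e.g. [SSN]
                    : String.valueOf(mask).repeat(e.end() - e.begin());  // same length as the entity
            out.replace(e.begin(), e.end(), replacement);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Name: Jane Doe, SSN: 123-45-6789";
        List<Entity> found = List.of(
                new Entity("NAME", 6, 14, 0.99),
                new Entity("SSN", 21, 32, 0.98));
        Set<String> types = Set.of("SSN", "EMAIL");
        System.out.println(redact(text, found, MaskMode.REPLACE, '*', 0.5, types));
        System.out.println(redact(text, found, MaskMode.MASK, '*', 0.5, types));
    }
}
```

With these sample inputs, REPLACE yields `Name: Jane Doe, SSN: [SSN]` and MASK yields `Name: Jane Doe, SSN: ***********`. The name is left untouched in both cases because NAME is not in the list of types to replace.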
Choosing redaction vs. masking is a matter of preference. Redaction is usually preferred when the context needs to be preserved, such as in natural text, whereas masking is best for maintaining text length as well as structured data such as formatted files or key-value pairs.
Detecting PII is as simple as making an API call to Amazon Comprehend using the AWS SDK and providing the text to analyze:
Asynchronous logging
Because our policy makes synchronous calls to Amazon Comprehend for PII detection, we want this processing to happen asynchronously, outside the customer request loop, to avoid introducing latency. For instructions, see Asynchronous Loggers for Low-Latency Logging. We add the Disruptor library to our classpath by adding it to pom.xml:
We also need to set a system property. After we package our application with mvn package, we can run it as in the following code:
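Apart from the Log4j setup, the idea behind asynchronous logging can be illustrated with plain Java. The sketch below is not how Log4j's asynchronous loggers work internally (they rely on the Disruptor library). It only shows the general pattern of handing a message to a background thread so that a slow redaction step does not block the caller. The `AsyncRedactingLog` class is a hypothetical name, and the regex-based redactor in `main` is a stand-in for a call to Amazon Comprehend.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

final class AsyncRedactingLog implements AutoCloseable {
    // A single worker thread keeps log lines in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final UnaryOperator<String> redactor;
    private final List<String> sink = new CopyOnWriteArrayList<>();

    AsyncRedactingLog(UnaryOperator<String> redactor) {
        this.redactor = redactor;
    }

    /** Returns immediately; the (possibly slow) redaction runs on the worker thread. */
    void log(String message) {
        worker.execute(() -> sink.add(redactor.apply(message)));
    }

    /** Lines written so far, after redaction. */
    List<String> lines() {
        return sink;
    }

    /** Stops accepting messages and waits briefly for queued ones to be written. */
    @Override
    public void close() {
        worker.shutdown();
        try {
            worker.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Stand-in redactor: a simple pattern match instead of a remote PII detection call.
        AsyncRedactingLog log = new AsyncRedactingLog(
                m -> m.replaceAll("\\d{3}-\\d{2}-\\d{4}", "[SSN]"));
        log.log("Opening account, SSN: 123-45-6789");
        log.close();
        System.out.println(log.lines());
    }
}
```

The trade-off is the usual one for asynchronous logging: request threads stay fast, but log lines appear slightly later, and messages still queued when the process exits abruptly can be lost.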
Updated log output
The log output from this application now looks like the following. We can see that SSN and Email are being suppressed.
Conclusion
We learned how to use Amazon Comprehend to redact sensitive data natively within next-generation applications. For information about applying it as a postprocessing technique for logs in storage, see Detecting and redacting PII using Amazon Comprehend. The API lets you have complete control over the entities that are important for your use case and lets you either mask or redact the information.
For more information about Amazon Comprehend availability and quotas, see Amazon Comprehend endpoints and quotas.
About the Author
Pradeep Singh is a Solutions Architect at Amazon Web Services. He helps AWS customers take advantage of AWS services to design scalable and secure applications. His expertise spans Application Architecture, Containers, Analytics and Machine Learning.