Create high-quality data for ML models with Amazon SageMaker Ground Truth
Machine learning (ML) has improved business across industries in recent years—from the recommendation system on your Prime Video account, to document summarization and efficient search with Alexa’s voice assistance. However, the question remains of how to incorporate this technology into your business. Unlike traditional rule-based methods, ML automatically infers patterns from data in order to perform your task of interest. Although this bypasses the need to curate rules for automation, it also means that ML models can only be as good as the data on which they’re trained. Creating that data, however, is often a challenging task. At the Amazon Machine Learning Solutions Lab, we’ve repeatedly encountered this problem and want to ease the journey for our customers. If you want to offload this process, you can use Amazon SageMaker Ground Truth Plus.
By the end of this post, you’ll be able to achieve the following:
- Understand the business processes involved in setting up a data acquisition pipeline
- Identify AWS Cloud services for supporting and expediting your data labeling pipeline
- Run a data acquisition and labeling task for custom use cases
- Create high-quality data following business and technical best practices
Throughout this post, we focus on the data creation process and rely on AWS services to handle the infrastructure and process components. Namely, we use Amazon SageMaker Ground Truth to handle the labeling infrastructure pipeline and user interface. This service uses a point-and-go approach to collect your data from Amazon Simple Storage Service (Amazon S3) and set up a labeling workflow. For labeling, it provides you with the built-in flexibility to acquire data labels using your private team, an Amazon Mechanical Turk workforce, or your preferred labeling vendor from AWS Marketplace. Lastly, you can use AWS Lambda and Amazon SageMaker notebooks to process, visualize, or quality control the data—either pre- or post-labeling.
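To make this concrete, the following Python sketch assembles a `create_labeling_job` request for a bounding box task. The bucket name, IAM role, and workteam ARN are hypothetical placeholders you would replace with your own; the AWS-managed pre-annotation and consolidation Lambda ARNs shown are the ones documented for built-in bounding box tasks in us-east-1.

```python
import json

# Hypothetical names -- replace with your own bucket, role, and workteam ARNs.
BUCKET = "my-labeling-bucket"
ROLE_ARN = "arn:aws:iam::111122223333:role/GroundTruthExecutionRole"
WORKTEAM_ARN = "arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/my-team"

def build_labeling_job_request(job_name: str) -> dict:
    """Assemble the request for sagemaker.create_labeling_job.

    The input manifest is a JSON Lines file in S3 where each line points
    at one item to label, e.g. {"source-ref": "s3://bucket/img001.jpg"}.
    """
    return {
        "LabelingJobName": job_name,
        "LabelAttributeName": "label",
        "InputConfig": {
            "DataSource": {
                "S3DataSource": {
                    "ManifestS3Uri": f"s3://{BUCKET}/manifests/input.manifest"
                }
            }
        },
        "OutputConfig": {"S3OutputPath": f"s3://{BUCKET}/output/"},
        "RoleArn": ROLE_ARN,
        "HumanTaskConfig": {
            "WorkteamArn": WORKTEAM_ARN,
            "UiConfig": {
                "UiTemplateS3Uri": f"s3://{BUCKET}/templates/template.liquid"
            },
            # AWS-managed Lambdas for built-in bounding box tasks (us-east-1)
            "PreHumanTaskLambdaArn": (
                "arn:aws:lambda:us-east-1:432418664414:function:PRE-BoundingBox"
            ),
            "TaskTitle": "Draw bounding boxes",
            "TaskDescription": "Draw a box around each object of interest",
            "NumberOfHumanWorkersPerDataObject": 3,  # redundancy for quality control
            "TaskTimeLimitInSeconds": 600,
            "AnnotationConsolidationConfig": {
                "AnnotationConsolidationLambdaArn": (
                    "arn:aws:lambda:us-east-1:432418664414:function:ACS-BoundingBox"
                )
            },
        },
    }

request = build_labeling_job_request("pilot-bounding-box-job")
# A real run would then call: boto3.client("sagemaker").create_labeling_job(**request)
print(json.dumps(request["InputConfig"], indent=2))
```

Setting `NumberOfHumanWorkersPerDataObject` above 1 is what enables the inter-annotator quality checks discussed later in this post.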
Now that all of the pieces have been laid down, let’s start the process!
The data creation process
Contrary to common intuition, the first step for data creation is not data collection. Working backward from the users to articulate the problem is crucial. For example, what do users care about in the final artifact? Where do experts believe the signals relevant to the use case reside in the data? What information about the use case environment could be provided to the model? If you don’t know the answers to those questions, don’t worry. Give yourself some time to talk with users and field experts to understand the nuances. This initial understanding will orient you in the right direction and set you up for success.
For this post, we assume that you have covered this initial process of user requirement specification. The next three sections walk you through the subsequent process of creating quality data: planning, source data creation, and data annotation. Piloting loops at the data creation and annotation steps are vital for ensuring the efficient creation of labeled data. This involves iterating between data creation, annotation, quality assurance, and updating the pipeline as necessary.
The following figure provides an overview of the steps required in a typical data creation pipeline. You can work backward from the use case to identify the data that you need (Requirements Specification), build a process to obtain the data (Planning), implement the actual data acquisition process (Data Collection and Annotation), and assess the results. Pilot runs, highlighted with dashed lines, let you iterate on the process until a high-quality data acquisition pipeline has been developed.
A standard data creation process can be time-consuming and a waste of valuable human resources if conducted inefficiently. Why would it be time-consuming? To answer this question, we must understand the scope of the data creation process. To assist you, we have collected a high-level checklist and description of the key components and stakeholders that you must consider. Answering these questions can be difficult at first, and depending on your use case, only some of them may be applicable.
- Identify the legal point of contact for required approvals – Using data for your application can require license or vendor contract review to ensure compliance with company policies and use cases. It’s important to identify your legal support throughout the data acquisition and annotation steps of the process.
- Identify the security point of contact for data handling – Leakage of purchased data might result in serious fines and repercussions for your company. It’s important to identify your security support throughout the data acquisition and annotation steps to ensure secure practices.
- Detail use case requirements and define source data and annotation guidelines – Creating and annotating data is difficult due to the high specificity required. Stakeholders, including data generators and annotators, must be completely aligned to avoid wasting resources. To this end, it’s common practice to use a guidelines document that specifies every aspect of the annotation task: exact instructions, edge cases, an example walkthrough, and so on.
- Align on expectations for collecting your source data – Consider the following:
- Conduct research on potential data sources – For example, public datasets, existing datasets from other internal teams, self-collected, or purchased data from vendors.
- Perform quality assessment – Create an analysis pipeline in relation to the final use case.
- Align on expectations for creating data annotations – Consider the following:
- Identify the technical stakeholders – This is usually an individual or team in your company capable of using the technical documentation regarding Ground Truth to implement an annotation pipeline. These stakeholders are also responsible for quality assessment of the annotated data to make sure that it meets the needs of your downstream ML application.
- Identify the data annotators – These individuals use predetermined instructions to add labels to your source data within Ground Truth. They may need to possess domain knowledge depending on your use case and annotation guidelines. You can use a workforce internal to your company, or pay for a workforce managed by an external vendor.
- Ensure oversight of the data creation process – As you can see from the preceding points, data creation is a detailed process that involves numerous specialized stakeholders. Therefore, it’s crucial to monitor it end to end toward the desired outcome. Having a dedicated person or team oversee the process can help you ensure a cohesive, efficient data creation process.
Depending on the route that you decide to take, you must also consider the following:
- Create the source dataset – This refers to instances when existing data isn’t suitable for the task at hand, or legal constraints prevent you from using it. In this case, you must create the source data yourself, using either internal teams or external vendors (see the next point). This is often true for highly specialized domains or areas with little public research, such as a physician’s common questions, garment lay-down, or sports expertise.
- Research vendors and conduct an onboarding process – When external vendors are used, a contracting and onboarding process must be set in place between both entities.
In this section, we reviewed the components and stakeholders that we must consider. However, what does the actual process look like? In the following figure, we outline a process workflow for data creation and annotation. The iterative approach uses small batches of data called pilots to decrease turnaround time, detect errors early on, and avoid wasting resources in the creation of low-quality data. We describe these pilot rounds later in this post. We also cover some best practices for data creation, annotation, and quality control.
The following figure illustrates the iterative development of a data creation pipeline. Vertically, we find the data sourcing block (green) and the annotation block (blue). Both blocks have independent pilot rounds (Data creation/Annotation, QAQC, and Update). Increasingly higher-quality source data is created and can be used to construct increasingly higher-quality annotations.
Source data creation
The input creation process revolves around staging your items of interest, which depend on your task type. These could be images (newspaper scans), videos (traffic scenes), 3D point clouds (medical scans), or simply text (subtitle tracks, transcriptions). In general, when staging your task-related items, make sure of the following:
- Reflect the real-world use case for the eventual AI/ML system – The setup for collecting images or videos for your training data should closely match the setup for your input data in the real-world application. This means having consistent placement surfaces, lighting sources, or camera angles.
- Account for and minimize variability sources – Consider the following:
- Develop best practices for maintaining data collection standards – Depending on the granularity of your use case, you may need to specify requirements to guarantee consistency among your data points. For example, if you’re collecting image or video data from single camera points, you may need to make sure of the consistent placement of your objects of interest, or require a quality check for the camera before a data capture round. This can avoid issues like camera tilt or blur, and minimize downstream overheads like removing out-of-frame or blurry images, as well as needing to manually center the image frame on your area of interest.
- Pre-empt test time sources of variability – If you anticipate variability in any of the attributes mentioned so far during test time, make sure that you can capture those variability sources during training data creation. For example, if you expect your ML application to work in multiple different light settings, you should aim to create training images and videos at various light settings. Depending on the use case, variability in camera positioning can also influence the quality of your labels.
- Incorporate prior domain knowledge when available – Consider the following:
- Inputs on sources of error – Domain practitioners can provide insights into sources of error based on their years of experience. They can provide feedback on the best practices for the previous two points: What settings reflect the real-world use case best? What are the possible sources of variability during data collection, or at the time of use?
- Domain-specific data collection best practices – Although your technical stakeholders may already have a good idea of the technical aspects to focus on in the images or videos collected, domain practitioners can provide feedback on how best to stage or collect the data such that these needs are met.
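The camera quality check described above can be partially automated. The following is a minimal sketch, using only NumPy, that computes the variance of the image Laplacian, a common blur proxy; the threshold of 100 is an illustrative assumption that you would need to calibrate on images your annotators judged acceptably sharp.

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour Laplacian of a grayscale image.

    Blurry images have few sharp edges, so this value is low for them.
    """
    lap = (
        -4.0 * gray[1:-1, 1:-1]
        + gray[:-2, 1:-1] + gray[2:, 1:-1]
        + gray[1:-1, :-2] + gray[1:-1, 2:]
    )
    return float(lap.var())

def passes_blur_check(gray: np.ndarray, threshold: float = 100.0) -> bool:
    # The threshold is use-case specific: calibrate it against a handful
    # of images that your annotators judged acceptably sharp.
    return laplacian_variance(gray) >= threshold

# Sanity check on synthetic images: a sharp checkerboard vs. a flat frame.
sharp = (np.indices((64, 64)).sum(axis=0) % 2) * 255.0
flat = np.full((64, 64), 128.0)
print(passes_blur_check(sharp), passes_blur_check(flat))  # True False
```

Running a check like this at capture time lets you reject blurry frames immediately, instead of discovering them during annotation.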
Quality control and quality assurance of the created data
Now that you have set up the data collection pipeline, it might be tempting to go ahead and collect as much data as possible. Wait a minute! We must first check if the data collected through the setup is suitable for your real-world use case. We can use some initial samples and iteratively improve the setup through the insights gained from analyzing that sample data. Work closely with your technical, business, and annotation stakeholders during the pilot process. This will make sure that your resultant pipeline is meeting business needs while generating ML-ready labeled data with minimal overheads.
Data annotation
The annotation of inputs is where we add the magic touch to our data—the labels! Depending on your task type and data creation process, you may need manual annotators, or you can use off-the-shelf automated methods. The data annotation pipeline itself can be a technically challenging task. Ground Truth eases this journey for your technical stakeholders with its built-in repertoire of labeling workflows for common data sources. With a few additional steps, it also enables you to build custom labeling workflows beyond preconfigured options.
Ask yourself the following questions when developing a suitable annotation workflow:
- Do I need a manual annotation process for my data? In some cases, automated labeling services may be sufficient for the task at hand. Reviewing the documentation and available tools can help you identify if manual annotation is necessary for your use case (for more information, see What is data labeling?). The data creation process can allow for varying levels of control regarding the granularity of your data annotation. Depending on this process, you can also sometimes bypass the need for manual annotation. For more information, refer to Build a custom Q&A dataset using Amazon SageMaker Ground Truth to train a Hugging Face Q&A NLU model.
- What forms my ground truth? In most cases, the ground truth will come from your annotation process—that’s the whole point! In others, the user may have access to ground truth labels. This can significantly speed up your quality assurance process, or reduce the overhead required for multiple manual annotations.
- What is the upper bound for acceptable deviation from my ground truth? Work with your end-users to understand the typical errors around these labels, the sources of such errors, and the desired reduction in errors. This will help you identify which aspects of the labeling task are most challenging or are likely to have annotation errors.
- Are there preexisting rules used by the users or field practitioners to label these items? Use and refine these guidelines to build a set of instructions for your manual annotators.
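For the question about acceptable deviation, a simple automated check can compare annotations against reference labels where those exist. The following sketch measures per-keypoint pixel error against a tolerance; the 5-pixel budget and the coordinates are hypothetical and should come from your agreement with end-users.

```python
import numpy as np

def keypoint_errors(annotated: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Euclidean distance, in pixels, between each annotated keypoint and
    its ground truth reference. Both arrays have shape (n_points, 2)."""
    return np.linalg.norm(annotated - reference, axis=1)

def within_tolerance(annotated, reference, tol_px: float = 5.0) -> np.ndarray:
    """Flag which keypoints stay inside the agreed error budget (tol_px)."""
    errs = keypoint_errors(np.asarray(annotated, float), np.asarray(reference, float))
    return errs <= tol_px

reference = [[10.0, 10.0], [50.0, 40.0]]
annotated = [[11.0, 13.0], [70.0, 40.0]]  # second point is 20 px off
print(within_tolerance(annotated, reference))  # [ True False]
```

Labels that repeatedly fall outside the budget point at either unclear instructions or genuinely hard regions of the task, both of which are worth surfacing during the pilot.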
Piloting the input annotation process
When piloting the input annotation process, consider the following:
- Review the instructions with the annotators and field practitioners – Instructions should be concise and specific. Ask for feedback from your users (Are the instructions accurate? Can we revise any instructions to make sure that they are understandable by non-field practitioners?) and annotators (Is everything understandable? Is the task clear?). If possible, add an example of good and bad labeled data to help your annotators identify what is expected, and what common labeling errors might look like.
- Collect data for annotations – Review the data with your customer to make sure that it meets the expected standards, and to align on expected outcomes from the manual annotation.
- Provide examples to your pool of manual annotators as a test run – What is the typical variance among the annotators in this set of examples? Study the variance for each annotation within a given image to identify the consistency trends among annotators. Then compare the variances across the images or video frames to identify which labels are challenging to place.
Quality control of the annotations
Annotation quality control has two main components: assessing consistency between the annotators, and assessing the quality of the annotations themselves.
You can assign multiple annotators to the same task (for example, three annotators label the key points on the same image), and measure the average value alongside the standard deviation of these labels among the annotators. Doing so helps you identify any outlier annotations (incorrect label used, or label far away from the average annotation), which can guide actionable outcomes, such as refining your instructions or providing further training to certain annotators.
Assessing the quality of the annotations themselves is tied to annotator variability and, when available, to domain experts or ground truth information. Are there certain labels (across all of your images) where the average variance between annotators is consistently high? Are any labels far off from your expectations of where they should be, or what they should look like?
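The inter-annotator check can be sketched in a few lines of NumPy. In this hypothetical pilot, three annotators each place the same four keypoint labels on one image, and we flag labels whose average disagreement from the consensus exceeds an assumed 5-pixel budget.

```python
import numpy as np

# Hypothetical pilot results: shape is (annotators, labels, xy) in pixels.
annotations = np.array([
    [[10, 10], [50, 40], [30, 80], [90, 15]],  # annotator A
    [[11, 12], [49, 41], [31, 79], [88, 16]],  # annotator B
    [[10, 11], [51, 39], [30, 81], [60, 60]],  # annotator C disagrees on label 4
], dtype=float)

consensus = annotations.mean(axis=0)                         # per-label average position
deviation = np.linalg.norm(annotations - consensus, axis=2)  # per annotator, per label
per_label_spread = deviation.mean(axis=0)                    # average disagreement per label

# Labels whose spread exceeds the (assumed) 5-pixel budget need attention:
hard_labels = np.where(per_label_spread > 5.0)[0]
print(hard_labels)  # label index 3 stands out
```

Repeating this per image, and then comparing spreads across images, separates labels that are intrinsically hard from annotators who need further training.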
Based on our experience, a typical quality control loop for data annotation can look like this:
- Iterate on the instructions or image staging based on results from the test run – Are any objects occluded, or does image staging not match the expectations of annotators or users? Are the instructions misleading, or did you miss any labels or common errors in your exemplar images? Can you refine the instructions for your annotators?
- If you are satisfied that you have addressed any issues from the test run, do a batch of annotations – For testing the results from the batch, follow the same quality assessment approach of assessing inter-annotator and inter-image label variabilities.
Conclusion
This post serves as a guide for business stakeholders to understand the complexities of data creation for AI/ML applications. The processes described also serve as a guide for technical practitioners to generate quality data while optimizing business constraints such as personnel and costs. If not done well, a data creation and labeling pipeline can take upwards of 4–6 months.
With the guidelines and suggestions outlined in this post, you can preempt roadblocks, reduce time to completion, and minimize the costs in your journey toward creating high-quality data.
About the authors
Jasleen Grewal is an Applied Scientist at Amazon Web Services, where she works with AWS customers to solve real world problems using machine learning, with special focus on precision medicine and genomics. She has a strong background in bioinformatics, oncology, and clinical genomics. She is passionate about using AI/ML and cloud services to improve patient care.
Boris Aronchik is a Manager in the Amazon AI Machine Learning Solutions Lab, where he leads a team of ML scientists and engineers to help AWS customers realize business goals by leveraging AI/ML solutions.
Miguel Romero Calvo is an Applied Scientist at the Amazon ML Solutions Lab where he partners with AWS internal teams and strategic customers to accelerate their business through ML and cloud adoption.
Lin Lee Cheong is a Senior Scientist and Manager with the Amazon ML Solutions Lab team at Amazon Web Services. She works with strategic AWS customers to explore and apply artificial intelligence and machine learning to discover new insights and solve complex problems.