Prepare data from Snowflake for machine learning with Amazon SageMaker Data Wrangler
Data preparation remains a major challenge in the machine learning (ML) space. Data scientists and engineers need to write queries and code to get data from source data stores, and then write more queries to transform that data and create the features used in model development and training. None of this pipeline development work goes into building the ML models themselves; it goes into building the data pipelines needed to make the data available to those models. Amazon SageMaker Data Wrangler makes it easier for data scientists and engineers to prepare data in the early phase of developing ML applications by using a visual interface.
Data Wrangler simplifies the process of data preparation and feature engineering using a single visual interface. Data Wrangler comes with over 300 built-in data transformations to help normalize, transform, and combine features without writing any code. You can now use Snowflake as a data source in Data Wrangler to easily prepare data in Snowflake for ML.
In this post, we use a simulated dataset, provided by Snowflake, that represents loans from a financial services provider. The dataset contains lender data about loans granted to individuals. We use Data Wrangler to transform and prepare the data for later use in ML models, first building a data flow in Data Wrangler and then exporting it to Amazon SageMaker Pipelines. We walk through setting up Snowflake as the data source, then explore and transform the data using Data Wrangler.
Prerequisites
This post assumes you have the following:
- A Snowflake account with permissions to create storage integrations
- Data in a table in Snowflake
- An AWS account with permissions to create AWS Identity and Access Management (IAM) policies and roles
- An Amazon Simple Storage Service (Amazon S3) bucket that Data Wrangler can use for outputting transformed data
Set up permissions for Data Wrangler
In this section, we cover the permissions required to set up Snowflake as a data source for Data Wrangler. This section requires you to perform steps in both the AWS Management Console and Snowflake. The user in each environment should have permission to create policies, roles, and secrets in AWS, and the ability to create storage integrations in Snowflake.
All permissions for AWS resources are managed via your IAM role attached to your Amazon SageMaker Studio instance. Snowflake-specific permissions are managed by the Snowflake admin; they can grant granular permissions and privileges to each Snowflake user. This includes databases, schemas, tables, warehouses, and storage integration objects. Make sure that the correct permissions are set up outside of Data Wrangler.
AWS access requirements
Snowflake requires the following permissions on your output S3 bucket and prefix to be able to access objects in the prefix:
- s3:GetObject
- s3:GetObjectVersion
- s3:ListBucket
You can add a bucket policy to ensure that Snowflake only communicates with your bucket over HTTPS. For instructions, see What S3 bucket policy should I use to comply with the AWS Config rule s3-bucket-ssl-requests-only?
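If you want a concrete starting point, the following is a minimal sketch of such a bucket policy; it denies any request that does not use SSL. The bucket name is a placeholder to replace with your own, and you should confirm the policy against the linked guidance before applying it.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyNonSSLRequests",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::<your-bucket-name>",
        "arn:aws:s3:::<your-bucket-name>/*"
      ],
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    }
  ]
}
```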
Create an IAM policy allowing Amazon S3 access
In this section, we cover creating the policy required for Snowflake to access data in an S3 bucket of your choosing. If you already have a policy and role that allows access to the S3 bucket you plan to use for the Data Wrangler output, you can skip this section and the next section, and start creating your storage integration in Snowflake.
- On the IAM console, choose Policies in the navigation pane.
- Choose Create policy.
- On the JSON tab, enter a policy like the example shown after these steps, substituting your bucket and prefix name for the placeholders.
- Choose Next: Tags.
- Choose Next: Review.
- For Name, enter a name for your policy (for example, snowflake_datawrangler_s3_access).
- Choose Create policy.
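The following is a minimal policy sketch covering the three S3 permissions listed earlier; the bucket name and prefix are placeholders. Depending on how your Snowflake stage unloads query results to the bucket, you may also need write permissions (for example, s3:PutObject), so treat this as a starting point rather than a definitive policy.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowObjectAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": "arn:aws:s3:::<your-bucket-name>/<your-prefix>/*"
    },
    {
      "Sid": "AllowBucketListing",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::<your-bucket-name>",
      "Condition": {
        "StringLike": {
          "s3:prefix": ["<your-prefix>/*"]
        }
      }
    }
  ]
}
```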
Create an IAM role
In this section, we create an IAM role and attach it to the policy we created.
- On the IAM console, choose Roles in the navigation pane.
- Choose Create role.
- Select Another AWS account as the trusted entity type.
- For Account ID, enter your own AWS account ID.
You modify the trusted relationship and grant access to Snowflake later.
- Select the Require external ID option.
- For External ID, enter a dummy ID such as your own account ID.
Later, we modify the trust relationship and specify the external ID for your Snowflake stage. An external ID is required to grant access to your AWS resources (such as Amazon S3) to a third party (Snowflake).
- Choose Next.
- Locate the policy you created previously for the S3 bucket and choose this policy.
- Choose Next.
- Enter a name and description for the role, then choose Create role.
You now have an IAM policy created for an IAM role, and the policy is attached to the role.
- Record the role ARN value located on the role summary page.
In the next step, you create a Snowflake integration that references this role.
Create a storage integration in Snowflake
A storage integration in Snowflake stores a generated IAM entity for external cloud storage (in this case, Amazon S3), along with an optional set of allowed or blocked storage locations. An AWS administrator in your organization grants permissions on the storage location to the generated IAM entity. With this feature, users don’t need to supply credentials when creating stages or when loading or unloading data.
Create the storage integration with the following code:
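The following is a sketch of the command, assuming an integration named snowflake_datawrangler_integration; the role ARN, bucket, and prefix are placeholders for the values from your own account.

```sql
-- Create a storage integration that delegates S3 access to the IAM role created earlier.
-- The integration name, role ARN, bucket, and prefix below are placeholders.
CREATE STORAGE INTEGRATION snowflake_datawrangler_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = '<iam_role_arn>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<your-bucket-name>/<your-prefix>/');
```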
Retrieve the IAM user for your Snowflake account
Run the following DESCRIBE INTEGRATION command to retrieve the ARN for the IAM user that was created automatically for your Snowflake account:
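For example, assuming the integration name used earlier:

```sql
-- Describe the integration to retrieve the IAM user ARN and external ID that Snowflake generated.
DESC INTEGRATION snowflake_datawrangler_integration;
```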
Record the following values from the output:
- STORAGE_AWS_IAM_USER_ARN – The IAM user created for your Snowflake account
- STORAGE_AWS_EXTERNAL_ID – The external ID needed to establish a trust relationship
Update the IAM role trust policy
Now we update the trust policy.
- On the IAM console, choose Roles in the navigation pane.
- Choose the role you created.
- On the Trust relationship tab, choose Edit trust relationship.
- Modify the policy document with the DESC STORAGE INTEGRATION output values you recorded in the previous step, as shown in the example after these steps.
- Choose Update trust policy.
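The following is a sketch of the updated trust policy; the two placeholders correspond to the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values from the DESC INTEGRATION output.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<STORAGE_AWS_IAM_USER_ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID>"
        }
      }
    }
  ]
}
```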
Create an external stage in Snowflake
We use an external stage within Snowflake for loading data from an S3 bucket in your own account into Snowflake. In this step, we create an external (Amazon S3) stage that references the storage integration you created. For more information, see Creating an S3 Stage.
This requires a role that has the CREATE STAGE privilege on the schema, as well as the USAGE privilege on the storage integration. You can grant these privileges to the role as shown in the code in the next step.
Create the stage using the CREATE STAGE command, with placeholders for the external stage name and the S3 bucket and prefix. The stage also references a named file format object called my_csv_format:
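The following is a sketch of the grants and the stage definition, assuming the integration name used earlier; the role, database, schema, stage, bucket, and prefix are placeholders, and the my_csv_format file format object is assumed to already exist.

```sql
-- Grant the privileges needed to create a stage that uses the storage integration.
-- The role, database, and schema names below are placeholders for your own objects.
GRANT CREATE STAGE ON SCHEMA <database>.<schema> TO ROLE <your_role>;
GRANT USAGE ON INTEGRATION snowflake_datawrangler_integration TO ROLE <your_role>;

-- Create the external stage that points at the Data Wrangler output location in S3.
-- This assumes a named file format object called my_csv_format already exists.
CREATE STAGE <database>.<schema>.<your_external_stage_name>
  STORAGE_INTEGRATION = snowflake_datawrangler_integration
  URL = 's3://<your-bucket-name>/<your-prefix>/'
  FILE_FORMAT = my_csv_format;
```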
Create a secret for Snowflake credentials (Optional)
Data Wrangler can access Snowflake using either the ARN of an AWS Secrets Manager secret or a Snowflake account name, user name, and password. If you intend to use the account name, user name, and password option, skip to the next section, which covers adding the data source; with that option, Data Wrangler creates a Secrets Manager secret on your behalf.
To create a Secrets Manager secret manually, complete the following steps:
- On the Secrets Manager console, choose Store a new secret.
- For Select secret type, select Other types of secrets.
- Specify the details of your secret as key-value pairs.
The key names are case-sensitive and must be lowercase. If you enter any of these incorrectly, Data Wrangler raises an error.
If you prefer, you can use the plaintext option and enter the secret values as JSON:
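For example, a plaintext sketch might look like the following; the lowercase key names (username, password, and accountid) are an assumption about the format Data Wrangler expects, so verify them against the Data Wrangler documentation.

```json
{
  "username": "<snowflake_username>",
  "password": "<snowflake_password>",
  "accountid": "<snowflake_account_id>"
}
```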
- Choose Next.
- For Secret name, add the prefix AmazonSageMaker (for example, our secret is AmazonSageMaker-DataWranglerSnowflakeCreds).
- In the Tags section, add a tag with the key SageMaker and value true.
- Choose Next.
- The rest of the fields are optional; choose Next until you have the option to choose Store to store the secret.
After you store the secret, you’re returned to the Secrets Manager console.
- Choose the secret you just created, then retrieve the secret ARN.
- Store this in the text editor of your choice for use later when you create the Data Wrangler data source.
Set up the data source in Data Wrangler
In this section, we cover setting up Snowflake as a data source in Data Wrangler. This post assumes that you have access to SageMaker, an instance of Studio, and a user for Studio. For more information about prerequisites, see Get Started with Data Wrangler.
Create a new data flow
To create your data flow, complete the following steps:
- On the SageMaker console, choose Amazon SageMaker Studio in the navigation pane.
- Choose Open Studio.
- In the Launcher, choose New data flow.
Alternatively, on the File drop-down, choose New, then choose Data Wrangler Flow.
Creating a new flow can take a few minutes. After the flow is created, you see the Import data page.
Add Snowflake as a data source in Data Wrangler
Next, we add Snowflake as a data source.
- On the Add data source menu, choose Snowflake.
- Add your Snowflake connection details.
Data Wrangler uses HTTPS to connect to Snowflake.
- If you created a Secrets Manager secret manually, choose the Authentication method drop-down menu and choose ARN.
- Choose Connect.
You’re redirected to the import menu.
Run a query
Now that Snowflake is set up as a data source, you can access your data in Snowflake directly from the Data Wrangler query editor. The query we write in the editor is what Data Wrangler uses to import data from Snowflake to start our data flow.
- On the drop-down menus, choose the data warehouse, database, and schema you want to use for your query.
For this post, our dataset is in the database FIN_LOANS, the schema is DEV, and the table is LOAN_INT_HV. Our data warehouse is called MOONMAXW_DEV_WH; depending on your setup, these names will likely differ.
Alternatively, you can specify the full path to the dataset in the query editor. Make sure you still choose the database and schema on the drop-down menus.
- In the query editor, enter a query and preview the results.
For this post, we retrieve all columns from 1,000 rows, as in the example query that follows.
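A sketch of such a query, using the database, schema, and table names from this post, might look like the following:

```sql
-- Pull all columns from the first 1,000 rows of the loans table to seed the data flow.
SELECT *
FROM FIN_LOANS.DEV.LOAN_INT_HV
LIMIT 1000;
```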
- Choose Import.
- Enter a dataset name when prompted (for this post, we use snowflake_loan_int_hv).
- Choose Add.
You’re taken to the Prepare page, where you can add transformations and analyses to the data.
Add transformations to the data
Data Wrangler has over 300 built-in transformations. In this section, we use some of these transformations to prepare the dataset for an ML model.
On the Data Wrangler flow page, make sure you have chosen the Prepare tab. If you’re following the steps in the post, you’re directed here automatically after adding your dataset.
Convert data types
The first step we want to perform is to check that the correct data type was inferred on ingest for each column.
- Next to Data types, choose the plus sign.
- Choose Edit data types.
Looking through the columns, we identify that MNTHS_SINCE_LAST_DELINQ and MNTHS_SINCE_LAST_RECORD should most likely be represented as a number type rather than a string.
- On the right-hand menu, scroll down until you find MNTHS_SINCE_LAST_DELINQ and MNTHS_SINCE_LAST_RECORD.
. - On the drop-down menu, choose Float.
Looking through the dataset, we can confirm that the rest of the columns appear to have been correctly inferred.
- Choose Preview to preview the changes.
- Choose Apply to apply the changes.
- Choose Back to data flow to see the current state of the flow.
Manage columns
The dataset we’re using has several columns that likely aren’t beneficial to future models, so we start our transformation process by dropping the columns that aren’t useful.
- Next to Data types, choose the plus sign.
- Choose Add transformation.
The transformation console opens. Here you can preview your dataset, select from the available transformations, and preview the transformations.
Looking through the data, we can see that the fields EMP_TITLE, URL, DESCRIPTION, and TITLE will likely not provide value to our model in our use case, so we drop them.
- On the Transform menu, choose Manage columns.
- On the Transform drop-down menu, leave Drop column selected.
- For Column to drop, enter EMP_TITLE.
- Choose Preview to review the changes.
- Choose Add to add the step.
- If you want to see the step you added and previous steps, choose Previous steps on the Transform menu.
- Repeat these steps for the remaining columns (URL, DESCRIPTION, and TITLE).
- Choose Back to data flow to see the current state of the flow.
In the data flow view, we can see that this node in the flow has four steps, which represent the four columns we’re dropping for this part of the flow.
Format string
Next, we look for string columns that can be formatted to be more useful later. Looking through our dataset, we can see that INT_RATE might be useful in a future model as a float, but it has a trailing % character. Before we can use another built-in transformation (parse as type) to convert it to a float, we must strip the trailing character.
- Next to Steps, choose the plus sign.
- Choose Add transform.
- Choose Format string.
- On the Transform drop-down, choose Remove Symbols.
- On the Input column drop-down, choose the INT_RATE column.
- For Symbols, enter %.
- Optionally, in the Output field, enter the name of a column that this data is written to.
For this post, we keep the original column and set the output column to INT_RATE_PERCENTAGE to denote to future users of this data that the column is the interest rate as a percentage. Later, we convert this to a float.
- Choose Preview.
When Data Wrangler adds a new column, it’s automatically added as the rightmost column.
- Review the change to ensure accuracy.
- Choose Add.
Parse column as type
Continuing with the preceding example, we’ve identified that INT_RATE_PERCENTAGE should be converted to a float type.
- Next to Steps, choose the plus sign.
- Choose Add transform.
- Choose Parse Column as Type.
- On the Column drop-down, choose INT_RATE_PERCENTAGE.
The From field is automatically populated.
- On the To drop-down, choose Float.
- Choose Preview.
- Choose Add.
- Choose Back to data flow.
We now have six steps in this portion of the flow: four that represent columns being dropped, one for string formatting, and one for parsing a column as a type.
Encode categorical data
Next, we look for categorical data in our dataset. Data Wrangler has built-in functionality to encode categorical data using both ordinal and one-hot encodings. Looking at our dataset, we can see that the TERM, HOME_OWNERSHIP, and PURPOSE columns all appear to be categorical in nature.
- Next to Steps, choose the plus sign.
- Choose Add transform.
The first column in our list, TERM, has two possible values: 60 months and 36 months. Perhaps our future model would benefit from having these values one-hot encoded and placed into new columns.
- Choose Encode Categorical.
- On the Transform drop-down, choose One-hot encode.
- For Input column, choose TERM.
- On the Output style drop-down, choose Columns.
- Leave all other fields and check boxes as is.
- Choose Preview.
We can now see that two columns, TERM_36 months and TERM_60 months, are one-hot encoded to represent the corresponding value in the TERM column.
- Choose Add.
The HOME_OWNERSHIP column has four possible values: RENT, MORTGAGE, OWN, and other.
- Repeat the preceding steps to apply a one-hot encoding approach on these values.
Lastly, the PURPOSE column has several possible values. For this data, we use a one-hot encoding approach as well, but we set the output to a vector rather than columns.
- On the Transform drop-down, choose One-hot encode.
- For Input column, choose PURPOSE.
- On the Output style drop-down, choose Vector.
- For Output Column, enter PURPOSE_VCTR.
This keeps the original PURPOSE column, in case we decide to use it later.
- Leave all other fields and check boxes as is.
- Choose Preview.
- Choose Add.
- Choose Back to data flow.
We can now see nine different transformations in this flow, and we still haven’t written a single line of code.
Handle outliers
As our last step in this flow, we want to handle outliers in our dataset. As part of the data exploration process, we can create an analysis (which we cover in the next section). In the following example scatter plot, we look for insights in the relationship between annual income, interest rate, and employment length. On the graph, the loan receivers’ INT_RATE_PERCENTAGE is on the X axis, ANNUAL_INC is on the Y axis, and the data is color-coded by EMP_LENGTH. The dataset has some outliers that might skew the results of our model later. To address this, we use Data Wrangler’s built-in transformation for handling outliers.
- Next to Steps, choose the plus sign.
- Choose Add transform.
- Choose Handle outliers.
- On the Transform drop-down, choose Standard deviation numeric outliers.
- For Input column, enter ANNUAL_INC.
- For Output column, enter ANNUAL_INC_NO_OUTLIERS.
This is optional, but it’s good practice to note that a column has been transformed for later consumers.
- On the Fix method drop-down, leave Clip selected.
This option automatically clips values to the corresponding outlier detection bound, which we set next.
- For Standard deviations, leave the default of 4 to start.
This allows values within four standard deviations of the mean to be considered valid (and therefore not clipped). Values outside of this bound are clipped.
- Choose Preview.
- Choose Add.
The output column is created with an object type. We need to convert it to a float for it to be usable in our dataset and visualizations.
- Follow the same steps as when parsing a column as a type, this time using the ANNUAL_INC_NO_OUTLIERS column.
- Choose Back to data flow to see the current state of the flow.
Add analyses to the data
In this section, we walk through adding analyses to the dataset. We focus on visualizations, but there are several other options, including detecting target leakage, generating a bias report, or adding your own custom visualizations using the Altair library.
Scatter plot
To create a scatter plot, complete the following steps:
- On the data flow page, next to Steps, choose the plus sign.
- Choose Add analysis.
- For Analysis type, choose Scatter plot.
- Using the preceding example, we name this analysis EmpLengthAnnualIncIntRate.
- For X Axis, enter INT_RATE_PERCENTAGE.
- For Y Axis, enter ANNUAL_INC_NO_OUTLIERS.
- For Color by, enter EMP_LENGTH.
- Choose Preview.
The following screenshot shows our scatter plot.
We can compare this to the earlier version, before the outliers were handled.
So far this is looking good, but let’s add a facet to break out each category in the GRADE column into its own graph.
- For Facet by, choose GRADE.
- Choose Preview.
The following screenshot has been trimmed down for display purposes. The Y axis still represents ANNUAL_INC; for faceted plots, this is displayed on the bottommost plot.
- Choose Save to save the analysis.
Export the data flow
Finally, we export this whole data flow as a pipeline, which creates a Jupyter notebook with the code pre-populated. With Data Wrangler, you can also export your data flow to a Jupyter notebook as a SageMaker processing job, export to SageMaker Feature Store, or export directly to Python code.
- On the Data Flow console, choose the Export tab.
- Choose the steps to export. For our use case, we choose each box that represents a step.
- Choose Export step, then choose Pipeline.
The pre-populated Jupyter notebook loads and opens automatically, displaying all the generated steps and code for your data flow. The following screenshot shows the input section that defines the data source.
Clean up
If your work with Data Wrangler is complete, shut down your Data Wrangler instance to avoid incurring additional fees.
Conclusion
In this post, we covered setting up Snowflake as a data source for Data Wrangler, adding transformations and analyses to a dataset, then exporting to the data flow for further use in a Jupyter notebook. We further improved our data flow after visualizing our dataset using the Data Wrangler built-in analysis functionality. Most notably, we built a data preparation pipeline without having to write a single line of code.
To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler, and see the latest information on the Data Wrangler product page.
Data Wrangler makes it easy to ingest data and perform data preparation tasks such as exploratory data analysis, feature selection, and feature engineering. We’ve only covered a few of Data Wrangler’s data preparation capabilities in this post; you can also use Data Wrangler for more advanced analysis, such as feature importance, target leakage, and model explainability, through an easy and intuitive user interface.
About the Authors
Maxwell Moon is a Senior Solutions Architect at AWS working with Independent Software Vendors (ISVs) to design and scale their applications on AWS. Outside of work, Maxwell is a dad to two cats, is an avid supporter of the Wolverhampton Wanderers Football Club, and tries to spend as much time playing music as possible.
Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS with over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions, and has led engineering teams in designing and implementing data analytics platforms and data products.