Prepare training and validation dataset for facies classification using Snowflake integration and train using Amazon SageMaker Canvas
This post is co-written with Thatcher Thornberry from bpx energy.
Facies classification is the process of segmenting lithologic formations from geologic data at the wellbore location. During drilling, wireline logs are obtained, which have depth-dependent geologic information. Geologists are deployed to analyze this log data and determine depth ranges for potential facies of interest from the different types of log data. Accurately classifying these regions is critical for the drilling processes that follow.
Facies classification using AI and machine learning (ML) has become an increasingly popular area of investigation for many oil majors. Many data scientists and business analysts at large oil companies don’t have the necessary skillset to run advanced ML experiments on important tasks such as facies classification. To address this, we show you how to easily prepare and train a best-in-class ML classification model on this problem.
In this post, aimed primarily at those who are already using Snowflake, we explain how you can import both training and validation data for a facies classification task from Snowflake into Amazon SageMaker Canvas and subsequently train the model using a 3+ category prediction model.
Solution overview
Our solution consists of the following steps:
- Upload facies CSV data from your local machine to Snowflake. For this post, we use data from the following open-source GitHub repo.
- Configure AWS Identity and Access Management (IAM) roles for Snowflake and create a Snowflake integration.
- Create a secret for Snowflake credentials (optional, but advised).
- Import Snowflake directly into Canvas.
- Build a facies classification model.
- Analyze the model.
- Run batch and single predictions using the multi-class model.
- Share the trained model to Amazon SageMaker Studio.
Prerequisites
Prerequisites for this post include the following:
- An AWS account.
- Canvas set up, with an Amazon SageMaker user profile associated with it.
- A Snowflake account. For steps to create a Snowflake account, refer to How to: Create a Snowflake Free Trial Account
- The Snowflake CLI. For steps to connect to Snowflake by CLI, refer to SnowSQL, the command line interface for connecting to Snowflake. For steps to connect to to Snowflake by CLI, refer to Snowflake SnowSQL: Command Line Tool to access Snowflake.
- An existing database within Snowflake.
Upload facies CSV data to Snowflake
In this section, we take two open-source datasets and upload them directly from our local machine to a Snowflake database. From there, we set up an integration layer between Snowflake and Canvas.
- Download the training_data.csv and validation_data_nofacies.csv files to your local machine. Make note of where you saved them.
- Ensuring that you have the correct Snowflake credentials and have installed the Snowflake CLI desktop app, you can federate in. For more information, refer to Log into SnowSQL.
- Select the appropriate Snowflake warehouse to work within, which in our case is
COMPUTE_WH
:
- Choose a database to use for the remainder of the walkthrough:
- Create a named file format that will describe a set of staged data to access or load into Snowflake tables.
This can be run either in the Snowflake CLI or in a Snowflake worksheet on the web application. For this post, we run a SnowSQL query in the web application. See Getting Started With Worksheets for instructions to create a worksheet on the Snowflake web application.
- Create a table in Snowflake using the CREATE statement.
The following statement creates a new table in the current or specified schema (or replaces an existing table).
It’s important that the data types and the order in which they appear are correct, and align with what is found in the CSV files that we previously downloaded. If they’re inconsistent, we’ll run into issues later when we try to copy the data across.
- Do the same for the validation database.
Note that the schema is a little different to the training data. Again, ensure that the data types and column or feature orders are correct.
- Load the CSV data file from your local system into the Snowflake staging environment:
- The following is the syntax of the statement for Windows OS:
- The following is the syntax of the statement for Mac OS:
The following screenshot shows an example command and output from within the SnowSQL CLI.
- Copy the data into the target Snowflake table.
Here, we load the training CSV data to the target table, which we created earlier. Note that you have to do this for both the training and validation CSV files, copying them into the training and validation tables, respectively.
- Verify that the data has been loaded into the target table by running a SELECT query (you can do this for both the training and validation data):
Configure Snowflake IAM roles and create the Snowflake integration
As a prerequisite for this section, please follow the official Snowflake documentation on how to configure a Snowflake Storage Integration to Access Amazon S3.
Retrieve the IAM user for your Snowflake account
Once you have successfully configured your Snowflake storage integration, run the following DESCRIBE INTEGRATION
command to retrieve the ARN for the IAM user that was created automatically for your Snowflake account:
Record the following values from the output:
- STORAGE_AWS_IAM_USER_ARN – The IAM user created for your Snowflake account
- STORAGE_AWS_EXTERNAL_ID – The external ID needed to establish a trust relationship
Update the IAM role trust policy
Now we update the trust policy:
- On the IAM console, choose Roles in the navigation pane.
- Choose the role you created.
- On the Trust relationship tab, choose Edit trust relationship.
- Modify the policy document as shown in the following code with the DESC STORAGE INTEGRATION output values you recorded in the previous step.
- Choose Update trust policy.
Create an external stage in Snowflake
We use an external stage within Snowflake for loading data from an S3 bucket in your own account into Snowflake. In this step, we create an external (Amazon S3) stage that references the storage integration you created. For more information, see Creating an S3 Stage.
This requires a role that has the CREATE_STAGE
privilege for the schema as well as the USAGE privilege on the storage integration. You can grant these privileges to the role as shown in the code in the next step.
Create the stage using the CREATE_STAGE
command with placeholders for the external stage and S3 bucket and prefix. The stage also references a named file format object called my_csv_format
:
Create a secret for Snowflake credentials
Canvas allows you to use the ARN of an AWS Secrets Manager secret or a Snowflake account name, user name, and password to access Snowflake. If you intend to use the Snowflake account name, user name, and password option, skip to the next section, which covers adding the data source.
To create a Secrets Manager secret manually, complete the following steps:
- On the Secrets Manager console, choose Store a new secret.
- For Select secret type¸ select Other types of secrets.
- Specify the details of your secret as key-value pairs.
The names of the key are case-sensitive and must be lowercase.
If you prefer, you can use the plaintext option and enter the secret values as JSON:
- Choose Next.
- For Secret name, add the prefix
AmazonSageMaker
(for example, our secret isAmazonSageMaker-CanvasSnowflakeCreds
). - In the Tags section, add a tag with the key SageMaker and value true.
- Choose Next.
- The rest of the fields are optional; choose Next until you have the option to choose Store to store the secret.
- After you store the secret, you’re returned to the Secrets Manager console.
- Choose the secret you just created, then retrieve the secret ARN.
- Store this in your preferred text editor for use later when you create the Canvas data source.
Import Snowflake directly into Canvas
To import your facies dataset directly into Canvas, complete the following steps:
- On the SageMaker console, choose Amazon SageMaker Canvas in the navigation pane.
- Choose your user profile and choose Open Canvas.
- On the Canvas landing page, choose Datasets in the navigation pane.
- Choose Import.
- Click on Snowflake in the below image and then immediately “Add Connection”.
- Enter the ARN of the Snowflake secret that we previously created, the storage integration name (
SAGEMAKER_CANVAS_INTEGRATION
), and a unique connection name of your choosing. - Choose Add connection.
If all the entries are valid, you should see all the databases associated with the connection in the navigation pane (see the following example for NICK_FACIES
).
- Choose the
TRAINING_DATA
table, then choose Preview dataset.
If you’re happy with the data, you can edit the custom SQL in the data visualizer.
- Choose Edit in SQL.
- Run the following SQL command before importing into Canvas. (This assumes that the database is called
NICK_FACIES
. Replace this value with your database name.)
Something similar to the following screenshot should appear in the Import preview section.
- If you’re happy with the preview, choose Import data.
- Choose an appropriate data name, ensuring that it’s unique and fewer than 32 characters long.
- Use the following command to import the validation dataset, using the same method as earlier:
Build a facies classification model
To build your facies classification model, complete the following steps:
- Choose Models in the navigation pane, then choose New Model.
- Give your model a suitable name.
- On the Select tab, choose the recently imported training dataset, then choose Select dataset.
- On the Build tab, drop the
WELL_NAME
column.
We do this because the well names themselves aren’t useful information for the ML model. They are merely arbitrary names that we find useful to distinguish between the wells themselves. The name we give a particular well is irrelevant to the ML model.
- Choose FACIES as the target column.
- Leave Model type as 3+ category prediction.
- Validate the data.
- Choose Standard build.
Your page should look similar to the following screenshot just before building your model.
After you choose Standard build, the model enters the analyze stage. You’re provided an expected build time. You can now close this window, log out of Canvas (in order to avoid charges), and return to Canvas at a later time.
Analyze the facies classification model
To analyze the model, complete the following steps:
- Federate back into Canvas.
- Locate your previously created model, choose View, then choose Analyze.
- On the Overview tab, you can see the impact that individual features are having on the model output.
- In the right pane, you can visualize the impact that a given feature (X axis) is having on the prediction of each facies class (Y axis).
These visualizations will change accordingly depending on the feature you select. We encourage you to explore this page by cycling through all 9 classes and 10 features.
- On the Scoring tab, we can see the predicted vs. actual facies classification.
- Choose Advanced metrics to view F1 scores, average accuracy, precision, recall, and AUC.
- Again, we encourage viewing all the different classes.
- Choose Download to download an image to your local machine.
In the following image, we can see a number of different advanced metrics, such as the F1 score. In statistical analysis, the F1 score conveys the balance between the precision and the recall of a classification model, and is computed using the following equation: 2*((Precision * Recall)/ (Precision + Recall))
.
Run batch and single prediction using the multi-class facies classification model
To run a prediction, complete the following steps:
- Choose Single prediction to modify the feature values as needed, and get a facies classification returned on the right of the page.
You can then copy the prediction chart image to your clipboard, and also download the predictions into a CSV file.
- Choose Batch prediction and then choose Select dataset to choose the validation dataset you previously imported.
- Choose Generate predictions.
You’re redirected to the Predict page, where the Status will read Generating predictions for a few seconds.
After the predictions are returned, you can preview, download, or delete the predictions by choosing the options menu (three vertical dots) next to the predictions.
The following is an example of a predictions preview.
Share a trained model in Studio
You can now share the latest version of the model with another Studio user. This allows data scientists to review the model in detail, test it, make any changes that may improve accuracy, and share the updated model back with you.
The ability to share your work with a more technical user within Studio is a key feature of Canvas, given the key distinction between ML personas’ workflows. Note the strong focus here on collaboration between cross-functional teams with differing technical abilities.
- Choose Share to share the model.
- Choose which model version to share.
- Enter the Studio user to share the model with.
- Add an optional note.
- Choose Share.
Conclusion
In this post, we showed how with just a few clicks in Amazon SageMaker Canvas you can prepare and import your data from Snowflake, join your datasets, analyze estimated accuracy, verify which columns are impactful, train the best performing model, and generate new individual or batch predictions. We’re excited to hear your feedback and help you solve even more business problems with ML. To build your own models, see Getting started with using Amazon SageMaker Canvas.
About the Authors
Nick McCarthy is a Machine Learning Engineer in the AWS Professional Services team. He has worked with AWS clients across various industries including healthcare, finance, sports, telecoms and energy to accelerate their business outcomes through the use of AI/ML. Working with the bpx data science team, Nick recently finished building bpx’s Machine Learning platform on Amazon SageMaker.
Thatcher Thornberry is a Machine Learning Engineer at bpx Energy. He supports bpx’s data scientists by developing and maintaining the company’s core Data Science platform in Amazon SageMaker. In his free time he loves to hack on personal coding projects and spend time outdoors with his wife.
Leave a Reply