Creating high-quality machine learning models for financial services using Amazon SageMaker Autopilot

Machine learning (ML) is used throughout the financial services industry to perform a wide variety of tasks, such as fraud detection, market surveillance, portfolio optimization, loan solvency prediction, direct marketing, and many others. This breadth of use cases has created a need for lines of business to quickly generate high-quality and performant models that can be produced with little to no code. This reduces the long cycles for taking use cases from concept to production and generates business value. In this post, we explore how to use Amazon SageMaker Autopilot for some common use cases in the financial services industry.

Autopilot automatically generates pipelines, trains and tunes the best ML models for classification or regression tasks on tabular data, while allowing you to maintain full control and visibility. Autopilot enables automatic creation of ML models without requiring any ML experience. Autopilot automatically analyzes the dataset, processes the data into features, and trains multiple optimized ML models.

Data scientists in financial services often work on tasks where the datasets are highly imbalanced (heavily skewed towards examples of one class). Examples of such tasks include credit card fraud (where a very small fraction of the transactions are actually fraudulent) or bankruptcy (only a few corporations file for bankruptcy). We demonstrate how Autopilot automatically handles class imbalance without requiring any additional inputs from the user.

Autopilot recently added the ability to tune models using the area under the curve (AUC) metric, more specifically the area under the receiver operating characteristic (ROC) curve, as the objective metric, in addition to F1 (the default objective for binary classification tasks). In this post, we show how using the AUC as the model evaluation metric for highly imbalanced data allows Autopilot to generate high-quality models.

Our first use case is to detect credit card fraud based on various anonymized attributes. The dataset is highly imbalanced, with over 99% of the transactions being non-fraudulent. Our second use case is to predict bankruptcy of Polish companies [2]. Here, bankruptcy is similarly a binary response variable (will bankrupt = 1, will not bankrupt = 0), with 96% of the companies not becoming bankrupt.


To reproduce these steps in your own environment, you must complete the following prerequisites:

Credit card fraud detection

In fraud detection tasks, companies are interested in maintaining a very low false positive rate while correctly identifying the fraudulent transactions to the greatest extent possible. A false positive can lead to a company canceling or placing a hold on a customer’s card over a legitimate transaction, which leads to a poor customer experience. As a result, accuracy is not the best metric to consider for this problem; better metrics are the AUC and the F1 score.
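To make the point about accuracy concrete, here is a minimal sketch on synthetic labels (the data and the trivial "never flag fraud" model are hypothetical, not part of the original walkthrough). It uses scikit-learn to compare accuracy, F1, and AUC for a model that has learned nothing.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Synthetic labels: 1,000 transactions, 5 of them fraudulent (hypothetical data).
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1

# A trivial "model" that never flags fraud and scores every transaction the same.
y_pred = np.zeros(1000, dtype=int)
y_score = np.zeros(1000)

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
auc = roc_auc_score(y_true, y_score)

print(f"accuracy={accuracy:.3f}  F1={f1:.3f}  AUC={auc:.3f}")
```

The accuracy looks excellent even though no fraud is caught, while F1 and AUC expose the model as uninformative.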

The following code shows data for a credit card fraud task:

import pandas as pd
fraud_df = pd.read_csv('creditcard.csv')
fraud_df.head()


Class 0 and class 1 correspond to No Fraud and Fraud, respectively. As we can see, all columns other than Amount are anonymized. A key differentiator of Autopilot is its ability to process raw data directly, without the need for data processing on the part of data scientists. For example, Autopilot automatically converts categorical features into numerical values, handles missing values (as we show in the second example), and performs simple text preprocessing.

Using the AWS boto3 API or the AWS Command Line Interface (AWS CLI), we upload the data to Amazon S3 in CSV format:

import boto3
s3 = boto3.client('s3')
# file_name: local CSV path; bucket: target S3 bucket; object_name: S3 key for the uploaded object
s3.upload_file(file_name, bucket, object_name)

fraud_df = pd.read_csv()

Now, we select all columns except Class as features and Class as target:

X = fraud_df.drop(columns=['Class'])
y = fraud_df['Class']
print(y.value_counts())
0    284315
1       492

The binary label column Class is highly imbalanced, which is a typical occurrence in financial use cases. We can verify how well Autopilot handles this highly imbalanced data.
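The configuration code that follows refers to X_train, X_test, y_train, and y_test, which are not defined in the snippets shown here. One plausible way to produce them is a stratified split with scikit-learn. The sketch below uses a small synthetic frame in place of fraud_df, and the test_size and random_state values are assumptions rather than the settings used for the results in this post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for fraud_df: two feature columns and a rare positive class.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'V1': rng.normal(size=1000),
    'Amount': rng.uniform(0, 500, size=1000),
    'Class': [1] * 10 + [0] * 990,
})

X = demo_df.drop(columns=['Class'])
y = demo_df['Class']

# stratify=y keeps the fraud rate the same in both splits;
# test_size and random_state are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())
```

Stratifying matters with so few positive examples: a purely random split could leave the test set with almost no fraud cases to evaluate against.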

In the following code, we demonstrate how to configure Autopilot in a Jupyter notebook. We provide the training and test files and set TargetAttributeName to Class, the target column (the column we predict):

auto_ml_job_name = 'automl-creditcard-fraud'
import boto3
sm = boto3.client('sagemaker')
import sagemaker  
session = sagemaker.Session()

prefix = 'sagemaker/' + auto_ml_job_name
bucket = session.default_bucket()
training_data = pd.DataFrame(X_train)
training_data['Class'] = list(y_train)
test_data = pd.DataFrame(X_test)

train_file = 'train_data.csv'
training_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print('Train data uploaded to: ' + train_data_s3_path)

test_file = 'test_data.csv'
test_data.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print('Test data uploaded to: ' + test_data_s3_path)
input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': 's3://{}/{}/train'.format(bucket, prefix)
        }
      },
      'TargetAttributeName': 'Class'
    }
]

Next, we create the Autopilot job. For this post, we set ProblemType='BinaryClassification' and job_objective='AUC'. If you don’t set these fields, Autopilot automatically determines the type of supervised learning problem by analyzing the data and uses the default metric for that problem type. The default metric for binary classification is F1. We explicitly set these parameters because we want to optimize AUC.

from sagemaker.automl.automl import AutoML
from time import gmtime, strftime, sleep
from sagemaker import get_execution_role

timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
base_job_name = 'automl-card-fraud' 

target_attribute_name = 'Class'
role = get_execution_role()
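automl = AutoML(role=role,
                target_attribute_name=target_attribute_name,
                base_job_name=base_job_name,
                sagemaker_session=session,
                problem_type='BinaryClassification',
                job_objective={'MetricName': 'AUC'})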

For more information about the parameters for job configuration, see create-auto-ml-job.
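For readers who prefer the low-level API, the same settings map onto the request for the boto3 SageMaker client's create_auto_ml_job call. The following is a minimal sketch that only assembles the request dictionary. The helper name build_auto_ml_job_request, the bucket, and the role ARN are hypothetical placeholders, and the actual API call is left commented out because it requires AWS credentials.

```python
def build_auto_ml_job_request(job_name, train_s3_uri, output_s3_uri, role_arn,
                              target='Class', metric='AUC'):
    """Assemble keyword arguments for the boto3 SageMaker client's create_auto_ml_job call."""
    return {
        'AutoMLJobName': job_name,
        'InputDataConfig': [{
            'DataSource': {
                'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': train_s3_uri}
            },
            'TargetAttributeName': target,
        }],
        'OutputDataConfig': {'S3OutputPath': output_s3_uri},
        'ProblemType': 'BinaryClassification',
        'AutoMLJobObjective': {'MetricName': metric},
        'RoleArn': role_arn,
    }

# Placeholder values for illustration only.
request = build_auto_ml_job_request(
    job_name='automl-creditcard-fraud',
    train_s3_uri='s3://example-bucket/sagemaker/automl-creditcard-fraud/train',
    output_s3_uri='s3://example-bucket/sagemaker/automl-creditcard-fraud/output',
    role_arn='arn:aws:iam::123456789012:role/ExampleSageMakerRole',
)
print(request['AutoMLJobObjective'])

# With AWS credentials configured, the job could then be started with:
# boto3.client('sagemaker').create_auto_ml_job(**request)
```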

After the Autopilot job is created, we call the fit() function to run it and then poll until the job finishes:

# inputs is assumed to be the training data uploaded earlier
automl.fit(inputs=train_data_s3_path, job_name=base_job_name, wait=False, logs=False)
describe_response = automl.describe_auto_ml_job()
print(describe_response)
job_run_status = describe_response['AutoMLJobStatus']
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    sleep(30)  # wait between status checks instead of polling continuously
    describe_response = automl.describe_auto_ml_job()
    job_run_status = describe_response['AutoMLJobStatus']
    print(job_run_status)
print('completed')

When the job is complete, we can select the best candidate based on the AUC objective metric:

best_candidate = automl.describe_auto_ml_job()['BestCandidate']
best_candidate_name = best_candidate['CandidateName']
print("CandidateName: " + best_candidate_name)
print("FinalAutoMLJobObjectiveMetricName: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print("FinalAutoMLJobObjectiveMetricValue: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))
CandidateName: tuning-job-1-7e8f6c9dffe840a0bf-009-636d28c2
FinalAutoMLJobObjectiveMetricName: validation:auc
FinalAutoMLJobObjectiveMetricValue: 0.9890000224113464

We now create the Autopilot model object using the model artifacts from the Autopilot job in Amazon S3, and the inference container from the best candidate after running the tuning job. In addition to the predicted label, we’re interested in the probability of the prediction—we use this probability later to plot the AUC and precision and recall graphs.

model_name = 'automl-cardfraud-model-' + timestamp_suffix
inference_response_keys = ['predicted_label', 'probability']
model = automl.create_model(name=best_candidate_name,
                            candidate=best_candidate,
                            inference_response_keys=inference_response_keys)

After the model is created, we can generate inferences for the test set using the following code. During inference time, Autopilot orchestrates deployment of the inference pipeline, including feature engineering and the ML algorithm on the inference machine.

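s3_transform_output_path = 's3://{}/{}/inference-results/'.format(bucket, prefix)
output_path = s3_transform_output_path + best_candidate['CandidateName'] + '/'
# The transformer is not defined in the snippets shown; the instance count and type here are assumptions
transformer = model.transformer(instance_count=1, instance_type='ml.m5.xlarge', assemble_with='Line', output_path=output_path)
transformer.transform(data=test_data_s3_path, split_type='Line', content_type='text/csv', wait=False)
transform_job_name = transformer.latest_transform_job.name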

describe_response = sm.describe_transform_job(TransformJobName = transform_job_name)
job_run_status = describe_response['TransformJobStatus']
print (job_run_status)

while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    sleep(30)  # wait between status checks instead of polling continuously
    describe_response = sm.describe_transform_job(TransformJobName = transform_job_name)
    job_run_status = describe_response['TransformJobStatus']
    print (describe_response)
print ('transform job completed with status : ' + job_run_status)

Finally, we read the inference and predicted data into a dataframe:

import json
import io
from urllib.parse import urlparse

def get_csv_from_s3(s3uri, file_name):
    parsed_url = urlparse(s3uri)
    bucket_name = parsed_url.netloc
    prefix = parsed_url.path[1:].strip('/')
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, '{}/{}'.format(prefix, file_name))
    return obj.get()["Body"].read().decode('utf-8')    
pred_csv = get_csv_from_s3(transformer.output_path, '{}.out'.format(test_file))
data_auc=pd.read_csv(io.StringIO(pred_csv), header=None)
data_auc.columns= ['label', 'proba']

Model metrics

Common metrics to compare classifiers are the ROC curve and the precision-recall curve. The ROC curve is a plot of the true positive rate against the false positive rate for various thresholds. The higher the prediction quality of the classification model, the more the ROC curve is skewed toward the top left.

The precision-recall curve demonstrates the trade-off between precision and recall, with the best models having a precision-recall curve that is flat initially and drops steeply as the recall approaches 1. The higher the precision and recall, the more the curve is skewed towards the upper right.
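As a small illustration of that trade-off, the sketch below applies two different thresholds to a handful of hypothetical predicted probabilities (the labels and scores are made up for this example) and computes precision and recall at each.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted fraud probabilities for ten transactions.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
proba = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.55, 0.60, 0.80, 0.95])

results = {}
for threshold in (0.5, 0.7):
    # Flag a transaction as fraud when its score reaches the threshold.
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = (precision_score(y_true, y_pred), recall_score(y_true, y_pred))
    print(threshold, results[threshold])
```

Raising the threshold from 0.5 to 0.7 removes the one false positive (precision rises from 0.75 to 1.0) but misses one of the three fraud cases (recall falls from 1.0 to about 0.67). Each threshold contributes one point to the precision-recall curve.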

To optimize for the F1 score, we repeat the earlier steps with job_objective={'MetricName': 'F1'} and rerun the Autopilot job. Because the steps are identical, we don’t repeat them in this section. The plotting code expects that job’s predictions in a second dataframe, data_f1, built the same way as data_auc. Note that F1 is the default objective for binary classification problems. The following code plots the ROC curve:

import matplotlib.pyplot as plt
from sklearn import metrics

colors = ['blue', 'green']
model_names = ['Objective : AUC', 'Objective : F1']
models = [data_auc, data_f1]

# Plot one ROC curve per Autopilot model, with its AUC in the legend
for i in range(len(models)):
    fpr, tpr, _ = metrics.roc_curve(y_test, models[i]['proba'])
    auc_score = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, label='Autopilot {:.2f} {}'.format(auc_score, model_names[i]), color=colors[i])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.title('ROC Curve')

The following plot shows the results.

In the preceding AUC ROC plot, Autopilot models provide high AUC when optimizing both objective metrics. We also didn’t select any specific model or tune any hyperparameters; Autopilot did all that heavy lifting for us.

Finally, we plot the precision-recall curves for the trained Autopilot model:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score, precision_score, recall_score

colors = ['blue', 'green']
model_names = ['Objective : AUC', 'Objective : F1']
models = [data_auc, data_f1]

print('model ', 'F1 ', 'precision ', 'recall ')
for i in range(len(models)):
    # Curve over all thresholds, plus point metrics at the predicted labels
    precision, recall, _ = precision_recall_curve(y_test, models[i]['proba'])
    print(model_names[i], f1_score(y_test, np.array(models[i]['label'])), precision_score(y_test, models[i]['label']), recall_score(y_test, models[i]['label']))
    plt.plot(recall, precision, label=model_names[i], color=colors[i])

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='upper right')

                    F1          precision      recall 
Objective : AUC 0.8164          0.872          0.7676
Objective : F1  0.7968          0.8947         0.7183

The following plot shows the results.

As we can see from the plot, Autopilot models provide good precision and recall, because the graph is heavily skewed toward the top-right corner.
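The F1 values in the table above can be cross-checked from the reported precision and recall, because F1 is their harmonic mean. The short sketch below does that arithmetic; the helper name f1_from is introduced here only for illustration.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall values taken from the table above.
f1_auc_objective = f1_from(0.872, 0.7676)
f1_f1_objective = f1_from(0.8947, 0.7183)
print(round(f1_auc_objective, 3), round(f1_f1_objective, 3))
```

Both results agree with the tabulated F1 scores to within rounding of the inputs.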

Autopilot outputs

In addition to handling the heavy lifting of building and training the models, Autopilot provides visibility into the steps taken to build the models by generating two notebooks: CandidateDefinitionNotebook and DataExplorationNotebook.

You can use the candidate definition notebook to interactively step through the steps taken by Autopilot to arrive at the best candidate. You can also use this notebook to override various runtime parameters like parallelism, hardware used, algorithms explored, feature engineering scripts, hyperparameter tuning ranges, and more.

You can download the notebook from the following Amazon S3 location:
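The S3 locations of both generated notebooks are reported under AutoMLJobArtifacts in the DescribeAutoMLJob response. The sketch below pulls them out of a response dictionary. The helper name notebook_locations and the example URIs are hypothetical; in practice the response would come from automl.describe_auto_ml_job().

```python
def notebook_locations(describe_response):
    """Extract the two generated notebook S3 URIs from a DescribeAutoMLJob response."""
    artifacts = describe_response.get('AutoMLJobArtifacts', {})
    return {
        'candidate_definition': artifacts.get('CandidateDefinitionNotebookLocation'),
        'data_exploration': artifacts.get('DataExplorationNotebookLocation'),
    }

# Hypothetical response fragment with placeholder URIs.
example_response = {
    'AutoMLJobArtifacts': {
        'CandidateDefinitionNotebookLocation': 's3://example-bucket/example-job/notebooks/candidate-definition.ipynb',
        'DataExplorationNotebookLocation': 's3://example-bucket/example-job/notebooks/data-exploration.ipynb',
    }
}
locations = notebook_locations(example_response)
print(locations['candidate_definition'])
```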


The notebook also outlines the various feature engineering steps taken to build the models. The models are indexed by their model type and the feature engineering pipeline. For example, as shown in the Tuning Job Result Overview, the winning model corresponds to the pipeline dpp1-xgboost:

best_candidate_name = best_candidate['CandidateName']
print(best_candidate)
print(describe_response)

If we search the printed output for ModelDataUrl, we can see that Autopilot used dpp1-xgboost: 'ModelDataUrl': 's3://sagemaker-us-east-1-/automl-card-fraud-7/tuning/automl-car-dpp1-xgb/tuning-job-1-7e8f6c9dffe840a0bf-009-636d28c2/output/model.tar.gz'.

dpp1-xgboost is a data transformation strategy that transforms numeric features using RobustImputer. It merges all the generated features and applies RobustPCA followed by RobustStandardScaler. The transformed data is used to tune an XGBoost model.
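To give a feel for the shape of such a pipeline, here is a rough analogue built from standard scikit-learn components. This is not Autopilot's generated code: SimpleImputer, PCA, and StandardScaler are swapped in for RobustImputer, RobustPCA, and RobustStandardScaler, and the imputation strategy, number of components, and synthetic data are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-ins: SimpleImputer, PCA, and StandardScaler replace
# RobustImputer, RobustPCA, and RobustStandardScaler.
preprocess = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('pca', PCA(n_components=3)),
    ('scale', StandardScaler()),
])

# Synthetic numeric data with a few missing values.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
features[::25, 2] = np.nan

transformed = preprocess.fit_transform(features)
print(transformed.shape)
```

The output of a pipeline like this is what would be passed on to the XGBoost tuning step.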

From the candidate definition notebook, we can also see that Autopilot automatically applied up-weighting to the minority class using scale_pos_weight. This improves prediction quality for imbalanced datasets where the model doesn’t see many examples of the minority class during training. You can change the scale_pos_weight to a different value:

    'xgboost': {
        'objective': 'binary:logistic',
        'scale_pos_weight': 568.6114285714285,
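As a rough guide to where a weight of this magnitude comes from, the XGBoost documentation suggests the ratio of negative to positive examples as a typical value for scale_pos_weight. The sketch below applies that heuristic to the class counts printed earlier for the full dataset. How Autopilot derives its exact value is not shown in this post, so the difference from the figure in the snippet is expected.

```python
# Class counts printed earlier for the full dataset.
negatives = 284315
positives = 492

# Common heuristic: weight positives by the negative-to-positive ratio.
ratio = negatives / positives
print(round(ratio, 2))
```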

The data exploration notebook generates a report that provides insights about the input dataset, such as the missing values or the data types for the different features:


Having described in detail the use of Autopilot to detect credit card fraud, we now briefly discuss a second task: predicting the bankruptcy of companies.

Predicting bankruptcy of Polish companies

For this post, we explore the various economic attributes in the Polish companies bankruptcy dataset. There are 64 features and a target attribute, class. We rename the column class to bankrupt (not bankrupt = 0, bankrupt = 1) for clarity. As noted before, this dataset is also highly imbalanced, with 96% of the data in the non-bankrupt category.


We followed the same process for running and configuring Autopilot as in the credit card fraud use case. However, unlike the credit card fraud dataset, this dataset contains missing values. Because Autopilot automatically handles missing values, we simply pass the raw data to Autopilot.

We don’t repeat the code steps in this section; we show only the ROC and precision-recall curves. Autopilot again yields high-quality models, as evidenced by the AUC and by the ROC and precision-recall curves. For bankruptcy prediction, false negatives (bankrupt companies predicted to remain solvent) can lead to poor investment decisions, and false positives (solvent companies predicted to go bankrupt) might lead to missed opportunities.

To boost model performance, Autopilot also automatically up-weights the minority class label, penalizing the model for mis-classifying the minority class during training. The following plot shows the precision-recall curve.

The following plot shows the ROC curve.

As we can see from these plots, for bankruptcy, the AUC objective is slightly better than F1. Autopilot can generate accurate predictions for a complex event like bankruptcy without any specialized manual feature-engineering steps.

Cleaning up

The Autopilot job creates many underlying artifacts, such as dataset splits, preprocessing scripts, and preprocessed data. To avoid incurring costs, delete these resources using the following code:

#s3 = boto3.resource('s3')
#bucket = s3.Bucket(bucket)
#job_outputs_prefix = '{}/output/{}'.format(prefix,auto_ml_job_name)
#bucket.objects.filter(Prefix=job_outputs_prefix).delete()


In this post, we demonstrated how to create ML models without any prior knowledge of algorithms using Autopilot. For imbalanced data, which is common in financial services use cases, we showed that using objective metrics such as AUC and F1 along with the automatic minority class up-weighting can lead to high-quality models. Autopilot provides the flexibility of AutoML with the control and detail of a do-it-yourself approach by exposing the underlying metadata and the code used to preprocess the data and train the models. Importantly, Autopilot works on datasets of all sizes, ranging from a few MB to hundreds of GB, without you having to set up the underlying infrastructure. Finally, note that Amazon SageMaker Studio provides a UI for you to build, train, and deploy models using Autopilot with little to no code. For more information about tuning, training, and deploying Autopilot models, see Create a machine learning model automatically with Amazon SageMaker Autopilot.


[1] Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.

[2] Zieba, M., Tomczak, S. K., & Tomczak, J. M. (2016). Ensemble Boosted Trees with Synthetic Features Generation in Application to Bankruptcy Prediction. Expert Systems with Applications.

About the Authors

Sumon Samanta is a Senior Specialist Architect for Global Financial Services at AWS. Previously, he worked as a Quantitative Developer at several investment banks to develop pricing and risk systems.




Stefan Natu is a Sr. Machine Learning Specialist at Amazon Web Services. He is focused on helping financial services customers build end-to-end machine learning solutions on AWS. In his spare time, he enjoys reading machine learning blogs, playing the guitar, and exploring the food scene in New York City.



Ilya Epshteyn is a solutions architect with AWS. He helps customers to innovate on AWS by building highly available, scalable, and secure architectures. He enjoys spending time outdoors and building Lego creations with his kids.




Miroslav Miladinovic is a Software Development Manager at Amazon SageMaker.





Jean Baptiste Faddoul is an Applied Science Manager working on SageMaker Autopilot and Automatic Model Tuning.




Yotam Elor is a Senior Applied Scientist at AWS SageMaker. He works on SageMaker Autopilot – AWS’s AutoML solution.


Shared by: AWS Machine Learning