Transfer learning for TensorFlow text classification models in Amazon SageMaker

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, and text.

This post is the third in a series on the new built-in algorithms in SageMaker. In the first post, we showed how SageMaker provides a built-in algorithm for image classification. In the second post, we showed how SageMaker provides a built-in algorithm for object detection. Today, we announce that SageMaker provides a new built-in algorithm for text classification using TensorFlow. This supervised learning algorithm supports transfer learning for many pre-trained models available in TensorFlow hub. It takes a piece of text as input and outputs the probability for each of the class labels. You can fine-tune these pre-trained models using transfer learning even when a large corpus of text isn’t available. It’s available through the SageMaker built-in algorithms, as well as through the SageMaker JumpStart UI in Amazon SageMaker Studio. For more information, refer to Text Classification and the example notebook Introduction to JumpStart – Text Classification.

Text Classification with TensorFlow in SageMaker provides transfer learning on many pre-trained models available in the TensorFlow Hub. According to the number of class labels in the training data, a classification layer is attached to the pre-trained TensorFlow hub model. The classification layer consists of a dropout layer and a dense layer, fully connected layer, with 2-norm regularizer, which is initialized with random weights. The model training has hyper-parameters for the dropout rate of dropout layer, and L2 regularization factor for the dense layer. Then, either the whole network, including the pre-trained model, or only the top classification layer can be fine-tuned on the new training data. In this transfer learning mode, training can be achieved even with a smaller dataset.

How to use the new TensorFlow text classification algorithm

This section describes how to use the TensorFlow text classification algorithm with the SageMaker Python SDK. For information on how to use it from the Studio UI, see SageMaker JumpStart.

The algorithm supports transfer learning for the pre-trained models listed in Tensorflow models. Each model is identified by a unique model_id. The following code shows how to fine-tune BERT base model identified by model_id tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2 on a custom training dataset. For each model_id, to launch a SageMaker training job through the Estimator class of the SageMaker Python SDK, you must fetch the Docker image URI, training script URI, and pre-trained model URI through the utility functions provided in SageMaker. The training script URI contains all of the necessary code for data processing, loading the pre-trained model, model training, and saving the trained model for inference. The pre-trained model URI contains the pre-trained model architecture definition and the model parameters. The pre-trained model URI is specific to the particular model. The pre-trained model tarballs have been pre-downloaded from TensorFlow and saved with the appropriate model signature in Amazon Simple Storage Service (Amazon S3) buckets, so that the training job runs in network isolation. See the following code:

from sagemaker import image_uris, model_uris, script_urisfrom sagemaker.estimator import Estimator

model_id, model_version = "tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2", "*"
training_instance_type = "ml.p3.2xlarge"
# Retrieve the docker image
train_image_uri = image_uris.retrieve(model_id=model_id,model_version=model_version,image_scope="training",instance_type=training_instance_type,region=None,framework=None)# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="training")# Retrieve the pre-trained model tarball for transfer learning
train_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="training")

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tensorflow-tc-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

With these model-specific training artifacts, you can construct an object of the Estimator class:

# Create SageMaker Estimator instance
tf_tc_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,)

Next, for transfer learning on your custom dataset, you might need to change the default values of the training hyperparameters, which are listed in Hyperparameters. You can fetch a Python dictionary of these hyperparameters with their default values by calling hyperparameters.retrieve_default, update them as needed, and then pass them to the Estimator class. Note that the default values of some of the hyperparameters are different for different models. For large models, the default batch size is smaller and the train_only_top_layer hyperparameter is set to True. The hyperparameter Train_only_top_layer defines which model parameters change during the fine-tuning process. If train_only_top_layer is True, then parameters of the classification layers change and the rest of the parameters remain constant during the fine-tuning process. On the other hand, if train_only_top_layer is False, then all of the parameters of the model are fine-tuned. See the following code:

from sagemaker import hyperparameters# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"

We provide the SST2 as a default dataset for fine-tuning the models. The dataset contains positive and negative movie reviews. It has been downloaded from TensorFlow under Apache 2.0 License. The following code provides the default training dataset hosted in S3 buckets.

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/SST2/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

Finally, to launch the SageMaker training job for fine-tuning the model, call .fit on the object of the Estimator class, while passing the Amazon S3 location of the training dataset:

# Launch a SageMaker Training job by passing s3 path of the training data
tf_od_estimator.fit({"training": training_dataset_s3_path}, logs=True)

For more information about how to use the new SageMaker TensorFlow text classification algorithm for transfer learning on a custom dataset, deploy the fine-tuned model, run inference on the deployed model, and deploy the pre-trained model as is without first fine-tuning on a custom dataset, see the following example notebook: Introduction to JumpStart – Text Classification.

Input/output interface for the TensorFlow text classification algorithm

You can fine-tune each of the pre-trained models listed in TensorFlow Models to any given dataset made up of text sentences with any number of classes. The pre-trained model attaches a classification layer to the Text Embedding model and initializes the layer parameters to random values. The output dimension of the classification layer is determined based on the number of classes detected in the input data. The objective is to minimize classification errors on the input data. The model returned by fine-tuning can be further deployed for inference.

The following instructions describe how the training data should be formatted for input to the model:

Input – A directory containing a data.csv file. Each row of the first column should have integer class labels between 0 and the number of classes. Each row of the second column should have the corresponding text data.
Output – A fine-tuned model that can be deployed for inference or further trained using incremental training. A file mapping class indexes to class labels is saved along with the models.

The following is an example of an input CSV file. Note that the file should not have any header. The file should be hosted in an S3 bucket with a path similar to the following: s3://bucket_name/input_directory/. Note that the trailing / is required.

|0 |hide new secretions from the parental units|
|0 |contains no wit , only labored gags|
|1 |that loves its characters and communicates something rather beautiful about human nature|
|...|...|

Inference with the TensorFlow text classification algorithm

The generated models can be hosted for inference and support text as the application/x-text content type. The output contains the probability values, class labels for all of the classes, and the predicted label corresponding to the class index with the highest probability encoded in the JSON format. The model processes a single string per request and outputs only one line. The following is an example of a JSON format response:

accept: application/json;verbose
{"probabilities": [prob_0, prob_1, prob_2, ...],
 "labels": [label_0, label_1, label_2, ...],
 "predicted_label": predicted_label}

If accept is set to application/json, then the model only outputs probabilities. For more details on training and inference, see the sample notebook Introduction to Introduction to JumpStart – Text Classification.

Use SageMaker built-in algorithms through the JumpStart UI

You can also use SageMaker TensorFlow text classification and any of the other built-in algorithms with a few clicks via the JumpStart UI. JumpStart is a SageMaker feature that lets you train and deploy built-in algorithms and pre-trained models from various ML frameworks and model hubs through a graphical interface. Furthermore, it lets you deploy fully-fledged ML solutions that string together ML models and various other AWS services to solve a targeted use case.

Following are two videos that show how you can replicate the same fine-tuning and deployment process we just went through with a few clicks via the JumpStart UI.

Fine-tune the pre-trained model

Here is the process to fine-tune the same pre-trained text classification model.

Deploy the finetuned model

After model training is finished, you can directly deploy the model to a persistent, real-time endpoint with one click.

Conclusion

In this post, we announced the launch of the SageMaker TensorFlow text classification built-in algorithm. We provided example code for how to do transfer learning on a custom dataset using a pre-trained model from TensorFlow hub using this algorithm.

For more information, check out the documentation and the example notebook Introduction to JumpStart – Text Classification.

About the authors

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS and SODA conferences.

João Moura is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is mostly focused on NLP use-cases and helping customers optimize deep learning model training and deployment. He is also an active proponent of low-code ML solutions and ML-specialized hardware.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana Champaign. He is an active researcher in machine learning and statistical inference and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.