Text classification for online conversations with machine learning on AWS

Online conversations are ubiquitous in modern life, spanning industries from video games to telecommunications. This has led to rapid growth in the amount of online conversation data, which has helped in the development of state-of-the-art natural language processing (NLP) systems like chatbots and natural language generation (NLG) models. As NLP techniques for text analysis have evolved, so has the need for a fully managed service that can be integrated into applications through API calls without extensive machine learning (ML) expertise. AWS offers pre-trained AI services like Amazon Comprehend, which can effectively handle NLP use cases involving classification, text summarization, entity recognition, and more to gather insights from text.

Additionally, online conversations have led to widespread non-traditional usage of language. Traditional NLP techniques often perform poorly on this text data because of the constantly evolving, domain-specific vocabularies of different platforms, and because words often deviate significantly from standard English spelling, either by accident or intentionally as a form of adversarial attack.

In this post, we describe multiple ML approaches for text classification of online conversations with tools and services available on AWS.

Prerequisites

Before diving deep into this use case, please complete the following prerequisites:

Set up an AWS account and create an IAM user.
Set up the AWS CLI and AWS SDKs.
(Optional) Set up your Cloud9 IDE environment.

Dataset

For this post, we use the Jigsaw Unintended Bias in Toxicity Classification dataset, a benchmark for toxicity classification in online conversations. The dataset provides toxicity labels as well as several toxicity subtype attributes such as obscene, identity attack, insult, threat, and sexually explicit. Labels are provided as fractional values that represent the proportion of human annotators who believed the attribute applied to a given piece of text; annotators are rarely unanimous. To generate binary labels (for example, toxic or non-toxic), a threshold of 0.5 is applied to the fractional values, and comments with values greater than or equal to the threshold are treated as the positive class for that label.
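As a minimal sketch of this thresholding step, the following snippet maps fractional annotator scores to binary labels using the label>=0.5 convention that the evaluation section of this post applies. The function name and the example scores are hypothetical and chosen only for illustration.

```python
from typing import Dict


def binarize_labels(fractions: Dict[str, float], threshold: float = 0.5) -> Dict[str, int]:
    """Map each fractional annotator score to 1 (positive) or 0 (negative)."""
    return {name: int(score >= threshold) for name, score in fractions.items()}


# Hypothetical comment: 60% of annotators marked it toxic, 20% marked it an insult.
example = {"toxicity": 0.6, "insult": 0.2, "threat": 0.0}
binary = binarize_labels(example)
print(binary)
```

Keeping the fractional values alongside the binary ones makes it possible to compare both labeling schemes later.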

Subword embedding and RNNs

For our first modeling approach, we use a combination of subword embedding and recurrent neural networks (RNNs) to train text classification models. Subword embeddings were introduced by Bojanowski et al. in 2017 as an improvement upon previous word-level embedding methods. Traditional Word2Vec skip-gram models are trained to learn a static vector representation of a target word that optimally predicts that word’s context. Subword models, on the other hand, represent each target word as a bag of the character n-grams that make up the word, where an n-gram is a sequence of n consecutive characters. This method allows the embedding model to better represent the underlying morphology of related words in the corpus and to compute embeddings for novel, out-of-vocabulary (OOV) words. This is particularly important in the context of online conversations, a problem space in which users often misspell words (sometimes intentionally to evade detection) and also use a unique, constantly evolving vocabulary that might not be captured by a general training corpus.
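To make the bag-of-n-grams idea concrete, here is a small sketch of character n-gram extraction. It wraps the word in the boundary markers < and > and defaults to n-gram lengths 3 through 6, as fastText does; the function name is hypothetical, and real fastText additionally keeps the whole word as its own entry and hashes n-grams into buckets, which this sketch omits.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Return all character n-grams of the boundary-wrapped word."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams


trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)

# A misspelling still shares several n-grams with the intended word.
shared = set(char_ngrams("toxic", 3, 3)) & set(char_ngrams("toxiic", 3, 3))
print(sorted(shared))
```

Because the misspelled word shares most of its n-grams with the correct spelling, its embedding ends up close to the original even if the misspelling never appeared in the training corpus.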

Amazon SageMaker makes it easy to train and optimize an unsupervised subword embedding model on your own corpus of domain-specific text data with the built-in BlazingText algorithm. We can also download existing general-purpose models trained on large datasets of online text, such as the following English language models available directly from fastText. From your SageMaker notebook instance, simply run the following to download a pretrained fastText model:

!wget -O vectors.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip

Whether you’ve trained your own embeddings with BlazingText or downloaded a pretrained model, the result is a zipped model binary that you can use with the gensim library to embed a given target word as a vector based on its constituent subwords:

# Imports
import os
from zipfile import ZipFile
from gensim.models.fasttext import load_facebook_vectors

# Unzip the model binary into 'dir_path' (a local directory of your choice)
with ZipFile('vectors.zip', 'r') as zipObj:
    zipObj.extractall(path=dir_path)

# Load embedding model into memory
embed_model = load_facebook_vectors(os.path.join(dir_path, 'vectors.bin'))

# Compute embedding vector for 'word'
word_embedding = embed_model[word]

After we preprocess a given segment of text, we can use this approach to generate a vector representation for each of the constituent words (as separated by spaces). We then use SageMaker and a deep learning framework such as PyTorch to train a customized RNN with a binary or multilabel classification objective to predict whether the text is toxic or not and the specific sub-type of toxicity based on labeled training examples.

To upload your preprocessed text to Amazon Simple Storage Service (Amazon S3), use the following code:

import boto3
s3 = boto3.client('s3')

bucket = 
prefix = 

s3.upload_file('train.pkl', bucket, os.path.join(prefix, 'train/train.pkl'))
s3.upload_file('valid.pkl', bucket, os.path.join(prefix, 'valid/valid.pkl'))
s3.upload_file('test.pkl', bucket, os.path.join(prefix, 'test/test.pkl'))

To initiate scalable, multi-GPU model training with SageMaker, enter the following code:

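import boto3
import sagemaker
sess = sagemaker.Session()

# Look up the ARN of the IAM role that SageMaker assumes for the training job
iam = boto3.client('iam')
role = iam.get_role(RoleName='AmazonSageMakerFullAccess')['Role']['Arn']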

from sagemaker.pytorch import PyTorch

# hyperparameters, which are passed into the training job
hyperparameters = {
    'epochs': 20, # Maximum number of epochs to train model
    'train-batch-size': 128, # Training batch size (No. sentences)
    'eval-batch-size': 1024, # Evaluation batch size (No. sentences)
    'embed-size': 300, # Vector dimension of word embeddings (Must match embedding model)
    'lstm-hidden-size': 200, # Number of neurons in LSTM hidden layer
    'lstm-num-layers': 2, # Number of stacked LSTM layers
    'proj-size': 100, # Number of neurons in intermediate projection layer
    'num-targets': len(), # Number of targets for classification
    'class-weight': ' '.join([str(c) for c in ]), # Weight to apply to each target during training
    'total-length':,
    'metric-for-best-model': 'ap_score_weighted', # Metric on which to select the best model
}

# create the Estimator
pytorch_estimator = PyTorch(
    entry_point='train.py',
    source_dir=,
    instance_type=,
    volume_size=200,
    instance_count=1,
    role=role,
    framework_version='1.6.0',
    py_version='py36',
    hyperparameters=hyperparameters,
    metric_definitions=[
        {'Name': 'validation:accuracy', 'Regex': 'eval_accuracy = (.*?);'},
        {'Name': 'validation:f1-micro', 'Regex': 'eval_f1_score_micro = (.*?);'},
        {'Name': 'validation:f1-macro', 'Regex': 'eval_f1_score_macro = (.*?);'},
        {'Name': 'validation:f1-weighted', 'Regex': 'eval_f1_score_weighted = (.*?);'},
        {'Name': 'validation:ap-micro', 'Regex': 'eval_ap_score_micro = (.*?);'},
        {'Name': 'validation:ap-macro', 'Regex': 'eval_ap_score_macro = (.*?);'},
        {'Name': 'validation:ap-weighted', 'Regex': 'eval_ap_score_weighted = (.*?);'},
        {'Name': 'validation:auc-micro', 'Regex': 'eval_auc_score_micro = (.*?);'},
        {'Name': 'validation:auc-macro', 'Regex': 'eval_auc_score_macro = (.*?);'},
        {'Name': 'validation:auc-weighted', 'Regex': 'eval_auc_score_weighted = (.*?);'}
    ]
)

pytorch_estimator.fit(
    {
        'train': 's3:////train',
        'valid': 's3:////valid',
        'test': 's3:////test'
    }
)
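SageMaker passes each entry of the hyperparameters dictionary to the entry point script as a command-line argument (for example, --epochs 20). The contents of train.py are not shown in this post, so the following is only a sketch of how such a script might parse a few of these values, including the space-separated class-weight string; the function names and defaults are assumptions.

```python
import argparse
from typing import List, Optional


def parse_weights(value: str) -> List[float]:
    """Turn a space-separated string such as '1.0 12.5' into a list of floats."""
    return [float(v) for v in value.split()]


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse a subset of the hyperparameters (hypothetical defaults)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--train-batch-size", type=int, default=128)
    parser.add_argument("--class-weight", type=parse_weights, default=None)
    return parser.parse_args(argv)


# Simulate the arguments a training job would receive.
args = parse_args(["--epochs", "5", "--class-weight", "1.0 12.5"])
print(args.epochs, args.train_batch_size, args.class_weight)
```

Note that argparse converts hyphens in option names to underscores, so 'train-batch-size' is read as args.train_batch_size.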

Within , we define a PyTorch Dataset that is used by train.py to prepare the text data for training and evaluation of the model:

from typing import Sequence, Tuple

import numpy as np
import torch
from torch.utils.data import Dataset


def pad_matrix(m: np.ndarray, max_len: int = 100) -> Tuple[np.ndarray, np.ndarray]:
    """Pads or truncates an embedding matrix to max_len rows and returns it with a matching mask."""
    if m.ndim == 1:
        m = m.reshape(1, -1)
    mask = np.ones_like(m)
    if m.shape[0] > max_len:
        m = m[:max_len, :]
        mask = mask[:max_len, :]
    else:
        m = np.pad(m, ((0, max_len - m.shape[0]), (0, 0)))
        mask = np.pad(mask, ((0, max_len - mask.shape[0]), (0, 0)))
    return m, mask


class EmbeddingDataset(Dataset):
    """PyTorch dataset representing pretrained sentence embeddings, masks, and labels."""
    def __init__(self, text: Sequence[str], labels: Sequence, max_len: int = 100):
        self.text = text
        self.labels = labels
        self.max_len = max_len

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> dict:   
        e = embed_line(self.text[idx])
        length = e.shape[0]
        m, mask = pad_matrix(e, max_len=self.max_len)
        
        item = {}
        item['embeddings'] = torch.from_numpy(m)
        item['mask'] = torch.from_numpy(mask)
        item['labels'] = torch.tensor(self.labels[idx])
        if length > self.max_len:
            item['lengths'] = torch.tensor(self.max_len)
        else:
            item['lengths'] = torch.tensor(length)
        
        return item

Note that this code anticipates that the vectors.zip file containing your fastText or BlazingText embeddings will be stored in .

Additionally, you can easily deploy pretrained fastText models on their own to live SageMaker endpoints to compute embedding vectors on the fly for use in relevant word-level tasks. See the following GitHub example for more details.

Transformers with Hugging Face

For our second modeling approach, we transition to the usage of Transformers, introduced in the paper Attention Is All You Need. Transformers are deep learning models designed to deliberately avoid the pitfalls of RNNs by relying on a self-attention mechanism to draw global dependencies between input and output. The Transformer model architecture allows for significantly better parallelization and can achieve high performance in relatively short training time.
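The self-attention mechanism can be illustrated with a short, dependency-free sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, as defined in Attention Is All You Need. This is a simplified illustration only: it omits the learned query, key, and value projections and the multiple heads of a real Transformer layer, and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Each output row is a weighted average of the rows of v."""
    d_k = len(k[0])
    output = []
    for q_row in q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        weights = softmax(scores)
        output.append([sum(w * v_row[c] for w, v_row in zip(weights, v)) for c in range(len(v[0]))])
    return output


# A zero query scores all keys equally, so the output is the mean of the values.
out = self_attention(q=[[0.0, 0.0]], k=[[1.0, 0.0], [0.0, 1.0]], v=[[2.0, 0.0], [0.0, 4.0]])
print(out)
```

Every position attends to every other position in a single step, so all rows can be computed in parallel rather than sequentially as in an RNN.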

Built on the success of Transformers, BERT, introduced in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, added bidirectional pre-training for language representation. Inspired by the Cloze task, BERT is pre-trained with masked language modeling (MLM), in which the model learns to recover the original words for randomly masked tokens. The BERT model is also pretrained on the next sentence prediction (NSP) task to predict if two sentences are in correct reading order. Since its advent in 2018, BERT and its variations have been widely used in text classification tasks.
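The MLM objective can be sketched in a few lines. The snippet below selects each token for masking with probability 0.15, the rate used in the BERT paper, and records the original token as the prediction target. It is a simplification: BERT replaces only 80% of the selected tokens with [MASK], swaps 10% for a random token, and leaves 10% unchanged, and it operates on subword tokens instead of the whitespace-split words assumed here. The function name is hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Return the masked sequence and, per position, the token to recover (or None)."""
    rng = random.Random(seed)
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(token)  # the model must predict this token
        else:
            masked.append(token)
            targets.append(None)  # position is not scored
    return masked, targets


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_prob=0.15, seed=1)
print(masked)
print(targets)
```

The loss is computed only at positions whose target is not None, which is what lets the model use context on both sides of the masked token.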

Our solution uses a variant of BERT known as RoBERTa, which was introduced in the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach. RoBERTa further improves BERT performance on a variety of natural language tasks through optimized model training, including training models longer on a corpus 10 times larger, using optimized hyperparameters, applying dynamic random masking, removing the NSP task, and more.

Our RoBERTa-based models use the Hugging Face Transformers library, which is a popular open-source Python framework that provides high-quality implementations of all kinds of state-of-the-art Transformer models for a variety of NLP tasks. Hugging Face has partnered with AWS to enable you to easily train and deploy Transformer models on SageMaker. This functionality is available through Hugging Face AWS Deep Learning Container images, which include the Transformers, Tokenizers, and Datasets libraries, and optimized integration with SageMaker for model training and inference.

In our implementation, we inherit the RoBERTa architecture backbone from the Hugging Face Transformers framework and use SageMaker to train and deploy our own text classification model, which we call RoBERTox. RoBERTox uses byte pair encoding (BPE), introduced in Neural Machine Translation of Rare Words with Subword Units, to tokenize input text into subword representations. We can then train our models and tokenizers on the Jigsaw data or any large domain-specific corpus (such as the chat logs from a specific game) and use them for customized text classification. We define our custom classification model class in the following code:
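To illustrate how BPE builds subword units, the following sketch learns merge rules from a toy word-frequency table by repeatedly merging the most frequent adjacent symbol pair, following the procedure described in Neural Machine Translation of Rare Words with Subword Units. It is not the RoBERTox tokenizer: RoBERTa uses a byte-level variant of BPE, whereas this sketch works on characters with an end-of-word marker, and all function names are hypothetical.

```python
import collections
from typing import Dict, List, Tuple

Vocab = Dict[Tuple[str, ...], int]


def count_pairs(vocab: Vocab) -> collections.Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair: Tuple[str, str], vocab: Vocab) -> Vocab:
    """Replace every occurrence of the pair with a single merged symbol."""
    merged: Vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn an ordered list of merge rules from word frequencies."""
    vocab: Vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties resolve to the first pair seen
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)
```

Frequent character sequences such as the suffix "est" become single tokens, while rare or misspelled words fall back to smaller pieces instead of an unknown token.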

class RoBERToxForSequenceClassification(CustomLossMixIn, RobertaPreTrainedModel):
    _keys_to_ignore_on_load_missing = [r"position_ids"]

    def __init__(self, config: PretrainedConfig, *inputs, **kwargs):
        """Initialize the RoBERToxForSequenceClassification instance

        Parameters
        ----------
        config : PretrainedConfig
        num_labels : Optional[int]
            if not None, overwrite the default classification head in pretrained model.
        mode : Optional[str]
            'MULTI_CLASS', 'MULTI_LABEL' or "REGRESSION". Used to determine loss
        class_weight : Optional[List[float]]
            If not None, add class weight to BCEWithLogitsLoss or CrossEntropyLoss
        """
        super().__init__(config, *inputs, **kwargs)
        # Define model architecture
        self.roberta = RobertaModel(self.config, add_pooling_layer=False)
        self.classifier = RobertaClassificationHead(self.config)
        self.init_weights()

    @modeling_roberta.add_start_docstrings_to_model_forward(
        modeling_roberta.ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length")
    )
    @modeling_roberta.add_code_sample_docstrings(
        tokenizer_class=modeling_roberta._TOKENIZER_FOR_DOC,
        checkpoint=modeling_roberta._CHECKPOINT_FOR_DOC,
        output_type=SequenceClassifierOutput,
        config_class=modeling_roberta._CONFIG_FOR_DOC,
    )
    def forward(
            self,
            input_ids: Optional[torch.Tensor] = None,
            attention_mask: Optional[torch.Tensor] = None,
            token_type_ids: Optional[torch.Tensor] = None,
            position_ids: Optional[torch.Tensor] = None,
            head_mask: Optional[torch.Tensor] = None,
            inputs_embeds: Optional[torch.Tensor] = None,
            labels: Optional[torch.Tensor] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            return_dict: Optional[bool] = None,
            sample_weights: Optional[torch.Tensor] = None,
    ) -> SequenceClassifierOutput:
        """Forward pass to return loss, logits, ...

        Returns
        --------
        output : SequenceClassifierOutput
            has those keys: loss, logits, hidden states, attentions
        """
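        # Fall back to the config default only when return_dict is not given (an explicit False is respected)
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict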

        outputs = self.roberta(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = outputs[0]  # last-layer hidden states; the classification head pools the first (<s>) token
        logits = self.classifier(sequence_output)
        loss = self.compute_loss(logits, labels, sample_weights=sample_weights)

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    def compute_loss(self, logits: torch.Tensor, labels: torch.Tensor, sample_weights: Optional[torch.Tensor] = None) -> torch.FloatTensor:
        return super().compute_loss(logits, labels, sample_weights)

Before training, we prepare our text data and labels using Hugging Face’s datasets library and upload the result to Amazon S3:

from datasets import Dataset
import multiprocessing

data_train = Dataset.from_pandas(df_train)
…

tokenizer = 

def preprocess_function(examples: dict) -> dict:
    result = tokenizer(examples["text"], padding="max_length", max_length=128, truncation=True)
    return result

num_proc = multiprocessing.cpu_count()
print("Number of CPUs =", num_proc)

data_train = data_train.map(
    preprocess_function,
    batched=True,
    load_from_cache_file=False,
    num_proc=num_proc
)
…

import botocore
from datasets.filesystems import S3FileSystem

s3_session = botocore.session.Session()

# create S3FileSystem instance with s3_session
s3 = S3FileSystem(session=s3_session)  

# saves encoded_dataset to your s3 bucket
data_train.save_to_disk(f's3:////train', fs=s3)
…

We initiate training of the model in a similar fashion to the RNN:

import sagemaker
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters = {
    'model-name': ,
    'epochs': 10,
    'train-batch-size': 32,
    'eval-batch-size': 64,
    'num-labels': len(),
    'class-weight': ' '.join([str(c) for c in ]),
    'metric-for-best-model': 'ap_score_weighted',
    'save-total-limit': 1,
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir=,
    instance_type=,
    instance_count=1,
    role=role,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters,
    metric_definitions=[
        {'Name': 'validation:accuracy', 'Regex': 'eval_accuracy = (.*?);'},
        {'Name': 'validation:f1-micro', 'Regex': 'eval_f1_score_micro = (.*?);'},
        {'Name': 'validation:f1-macro', 'Regex': 'eval_f1_score_macro = (.*?);'},
        {'Name': 'validation:f1-weighted', 'Regex': 'eval_f1_score_weighted = (.*?);'},
        {'Name': 'validation:ap-micro', 'Regex': 'eval_ap_score_micro = (.*?);'},
        {'Name': 'validation:ap-macro', 'Regex': 'eval_ap_score_macro = (.*?);'},
        {'Name': 'validation:ap-weighted', 'Regex': 'eval_ap_score_weighted = (.*?);'},
        {'Name': 'validation:auc-micro', 'Regex': 'eval_auc_score_micro = (.*?);'},
        {'Name': 'validation:auc-macro', 'Regex': 'eval_auc_score_macro = (.*?);'},
        {'Name': 'validation:auc-weighted', 'Regex': 'eval_auc_score_weighted = (.*?);'}
    ]
)

huggingface_estimator.fit(
    {
        'train': 's3:////train',
        'valid': 's3:////valid',
        'test': 's3:////test'
    }
)
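Each entry in metric_definitions pairs a metric name with a regular expression that SageMaker applies to the training job's log output; the first capture group is recorded as the metric value. The sketch below applies two of the patterns from the estimator to a log line. The log line itself is hypothetical: it assumes train.py prints metrics in the form 'name = value;', which the regexes imply but this post does not show.

```python
import re
from typing import Dict, List

metric_definitions = [
    {"Name": "validation:accuracy", "Regex": "eval_accuracy = (.*?);"},
    {"Name": "validation:ap-weighted", "Regex": "eval_ap_score_weighted = (.*?);"},
]


def extract_metrics(log_line: str, definitions: List[Dict[str, str]]) -> Dict[str, float]:
    """Return every metric whose regex matches the log line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


# Hypothetical line as train.py might log it after an evaluation pass.
line = "eval_accuracy = 0.95; eval_ap_score_weighted = 0.74;"
metrics = extract_metrics(line, metric_definitions)
print(metrics)
```

Because the capture group is non-greedy and terminated by a semicolon, a log line that omits the trailing semicolon produces no match, which is a common reason for metrics failing to appear.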

Finally, the following Python code snippet illustrates the process of serving RoBERTox via a live SageMaker endpoint for real-time text classification for a JSON request:

from sagemaker.huggingface import HuggingFaceModel
from sagemaker import get_execution_role
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

class Classifier(Predictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super().__init__(endpoint_name, sagemaker_session,
                         serializer=JSONSerializer(),
                         deserializer=JSONDeserializer())


hf_model = HuggingFaceModel(
    role=get_execution_role(),
    model_data=,
    entry_point="inference.py",
    transformers_version="4.6.1",
    pytorch_version="1.7.1",
    py_version="py36",
    predictor_cls=Classifier
)

predictor = hf_model.deploy(instance_type=, initial_instance_count=1)
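The contents of inference.py are not shown in this post. As one possible sketch of the post-processing such a script might perform for a multilabel model, the following converts raw logits into per-label probabilities with a sigmoid and serializes them as a JSON response. The response schema, the function names, and the 0.5 decision threshold are all assumptions made for illustration; the label names follow the six labels listed in the evaluation section.

```python
import json
import math
from typing import List

# Label order is an assumption; it must match the order used during training.
LABELS = ["toxicity", "obscene", "threat", "insult", "identity_attack", "sexual_explicit"]


def sigmoid(x: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def format_response(logits: List[float], threshold: float = 0.5) -> str:
    """Build a hypothetical JSON response body from one row of logits."""
    scores = {label: sigmoid(z) for label, z in zip(LABELS, logits)}
    flagged = [label for label, score in scores.items() if score >= threshold]
    return json.dumps({"scores": scores, "flagged": flagged})


response = json.loads(format_response([2.0, -2.0, -3.0, 0.0, -5.0, -4.0]))
print(response["flagged"])
```

Returning the raw scores along with the flagged labels lets each client application choose its own decision threshold.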

Evaluation of model performance: Jigsaw unintended bias dataset

The following table contains performance metrics for models trained and evaluated on data from the Jigsaw Unintended Bias in Toxicity Classification Kaggle competition. We trained models for three different but interrelated tasks:

Binary case – The model was trained on the full training dataset to predict the toxicity label only
Fine-grained case – The subset of the training data for which toxicity>=0.5 was used to predict other toxicity sub-type labels (obscene, threat, insult, identity_attack, sexual_explicit)
Multitask case – The full training dataset was used to predict all six labels simultaneously

We trained RNN and RoBERTa models for each of these three tasks using the Jigsaw-provided fractional labels, which correspond to the proportion of annotators who thought the label was appropriate for the text, as well as with binary labels combined with class weights in the network loss function. In the binary labeling scheme, the proportions were thresholded at 0.5 for each available label (1 if label>=0.5, 0 otherwise), and the model loss functions were weighted based on the relative proportions of each binary label in the training dataset. In all cases, we found that using the fractional labels directly resulted in the best performance, indicating the added value of the information inherent in the degree of agreement between annotators.
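The exact weighting formula is not stated here. One common choice, shown below purely as an assumption, is to weight the positive class of each label by the ratio of negative to positive examples, which is the convention typically used for the pos_weight argument of PyTorch's BCEWithLogitsLoss. The function name and the toy label matrix are hypothetical.

```python
from typing import List


def positive_class_weights(label_rows: List[List[int]]) -> List[float]:
    """Per-label weight = (number of negatives) / (number of positives)."""
    n_rows = len(label_rows)
    n_labels = len(label_rows[0])
    weights = []
    for j in range(n_labels):
        positives = sum(row[j] for row in label_rows)
        negatives = n_rows - positives
        # Fall back to 1.0 when a label has no positive examples
        weights.append(negatives / positives if positives else 1.0)
    return weights


# Toy data: 8 comments, 2 binary labels (2 positives and 1 positive respectively).
rows = [[1, 0], [0, 0], [0, 1], [0, 0], [1, 0], [0, 0], [0, 0], [0, 0]]
weights = positive_class_weights(rows)

# Serialize in the space-separated form used by the 'class-weight' hyperparameter.
class_weight_arg = " ".join(str(w) for w in weights)
print(class_weight_arg)
```

Rarer labels receive larger weights, so errors on them contribute more to the loss.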

We display two model metrics: the average precision (AP), which summarizes the precision-recall curve as the weighted mean of the precision achieved at each classification threshold, with the increase in recall from the previous threshold used as the weight, and the area under the receiver operating characteristic curve (AUC), which aggregates model performance across classification thresholds with respect to the true positive rate and false positive rate. Note that the true class for a given text instance in the test set corresponds to whether the true proportion is greater than or equal to 0.5 (1 if label>=0.5, 0 otherwise).
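Both metrics can be computed by hand on a small example. The sketch below is a dependency-free illustration, not the evaluation code used for the results in this post: AP is accumulated as precision multiplied by the recall increment at each positive example (assuming no tied scores), and AUC is computed as the fraction of positive-negative pairs that the model ranks correctly, with ties counted as half. The function names and toy scores are hypothetical.

```python
from typing import List


def average_precision(y_true: List[int], y_score: List[float]) -> float:
    """Sum of precision x recall increment over examples ranked by descending score."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    n_pos = sum(y_true)
    true_positives = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            ap += (true_positives / rank) * (1 / n_pos)
    return ap


def roc_auc(y_true: List[int], y_score: List[float]) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# True classes come from thresholding the annotator proportions at 0.5.
proportions = [0.9, 0.1, 0.6, 0.4]
y_true = [int(p >= 0.5) for p in proportions]
y_score = [0.8, 0.7, 0.6, 0.2]  # hypothetical model outputs

ap = average_precision(y_true, y_score)
auc = roc_auc(y_true, y_score)
print(round(ap, 4), auc)
```

AP depends on the proportion of positives in the data, while AUC does not, which is why the two metrics can differ substantially for rare labels.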

Task	Subword Embedding + RNN (fractional labels)	Subword Embedding + RNN (binary labels + class weighting)	RoBERTa (fractional labels)	RoBERTa (binary labels + class weighting)
Binary	AP=0.746, AUC=0.966	AP=0.730, AUC=0.963	AP=0.758, AUC=0.966	AP=0.747, AUC=0.963
Fine-grained	AP=0.906, AUC=0.909	AP=0.850, AUC=0.851	AP=0.913, AUC=0.913	AP=0.911, AUC=0.912
Multitask	AP=0.721, AUC=0.972	AP=0.535, AUC=0.907	AP=0.740, AUC=0.972	AP=0.711, AUC=0.961

Conclusion

In this post, we presented two text classification approaches for online conversations using AWS ML services. You can generalize these solutions across online communication platforms, with industries such as gaming particularly likely to benefit from improved ability to detect harmful content. In future posts, we plan to further discuss an end-to-end architecture for seamless deployment of models into your AWS account.

If you’d like help accelerating your use of ML in your products and processes, please contact the Amazon ML Solutions Lab.

About the Authors

Ryan Brand is a Data Scientist in the Amazon Machine Learning Solutions Lab. He has specific experience in applying machine learning to problems in healthcare and the life sciences, and in his free time he enjoys reading history and science fiction.

Sourav Bhabesh is a Data Scientist at the Amazon ML Solutions Lab. He develops AI/ML solutions for AWS customers across various industries. He specializes in Natural Language Processing (NLP) and is passionate about deep learning. Outside of work he enjoys reading books and traveling.

Liutong Zhou is an Applied Scientist at the Amazon ML Solutions Lab. He builds bespoke AI/ML solutions for AWS customers across various industries. He specializes in Natural Language Processing (NLP) and is passionate about multi-modal deep learning. He is a lyric tenor and enjoys singing operas outside of work.

Sia Gholami is a Senior Data Scientist at the Amazon ML Solutions Lab, where he builds AI/ML solutions for customers across various industries. He is passionate about natural language processing (NLP) and deep learning. Outside of work, Sia enjoys spending time in nature and playing tennis.

Daniel Horowitz is an Applied AI Science Manager. He leads a team of scientists in the Amazon ML Solutions Lab working to solve customer problems and drive cloud adoption with ML.