Deploy large language models for a healthtech use case on Amazon SageM …

In 2021, the pharmaceutical industry generated $550 billion in US revenue. Pharmaceutical companies sell a variety of different, often novel, drugs on the market, where sometimes unintended but serious adverse events can occur.
These events can be reported from anywhere, whether a hospital or a patient's home, and must be responsibly and efficiently monitored. Traditional manual processing of adverse events is made challenging by the increasing amount of health data and rising costs. Overall, $384 billion is projected as the cost of pharmacovigilance activities to the overall healthcare industry by 2022. To support overarching pharmacovigilance activities, our pharmaceutical customers want to use the power of machine learning (ML) to automate adverse event detection from various data sources, such as social media feeds, phone calls, emails, and handwritten notes, and trigger appropriate actions.
In this post, we show how to develop an ML-driven solution using Amazon SageMaker for detecting adverse events using the publicly available Adverse Drug Reaction Dataset on Hugging Face. In this solution, we fine-tune a variety of Hugging Face models that were pre-trained on medical data and select the BioBERT model, which was pre-trained on the PubMed dataset and performed the best of those we tried.
We implemented the solution using the AWS Cloud Development Kit (AWS CDK). However, we don’t cover the specifics of building the solution in this post. For more information on the implementation of this solution, refer to Build a system for catching adverse events in real-time using Amazon SageMaker and Amazon QuickSight.
This post delves into several key areas, providing a comprehensive exploration of the following topics:

The data challenges encountered by AWS Professional Services
The landscape and application of large language models (LLMs):

Transformers, BERT, and GPT
Hugging Face

The fine-tuned LLM solution and its components:

Data preparation
Model training

Data challenge
Data skew (class imbalance) is a common problem in classification tasks, and this use case is no exception: ideally, you would have a balanced dataset.
We address this skew with generative AI models (Falcon-7B and Falcon-40B), which we prompted with five examples from the training set to generate new event samples, increasing both the semantic diversity and the number of labeled adverse events. The Falcon models are advantageous here because, unlike some LLMs on Hugging Face, their training dataset is made available, so you can verify that none of your test set examples are contained in it and avoid data contamination.
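The few-shot prompting step described above can be sketched as follows. This is a minimal illustration, not the prompt used in the solution: the function name `build_few_shot_prompt`, the prompt wording, and the placeholder sentences are all assumptions made for the example.

```python
import random


def build_few_shot_prompt(adverse_event_texts, n_examples=5, seed=0):
    """Build a prompt that shows n_examples labeled samples and asks for one more."""
    rng = random.Random(seed)
    # Sample without replacement so the five examples are distinct.
    examples = rng.sample(adverse_event_texts, n_examples)
    lines = ["Below are sentences that each describe an adverse drug event.", ""]
    for i, text in enumerate(examples, start=1):
        lines.append(f"Example {i}: {text}")
    lines.append("")
    # Leave the last slot open for the generative model to complete.
    lines.append(f"Example {n_examples + 1}:")
    return "\n".join(lines)


# Hypothetical placeholder sentences, not taken from the dataset.
training_adverse_events = [
    "The patient developed a rash after starting drug A.",
    "Severe nausea was reported following the second dose of drug B.",
    "Drug C was associated with dizziness in two patients.",
    "Liver enzymes rose sharply after treatment with drug D.",
    "The patient experienced insomnia while taking drug E.",
    "Headache and fatigue occurred after drug F was introduced.",
]

prompt = build_few_shot_prompt(training_adverse_events)
print(prompt)
```

The resulting string would be sent to a text generation model such as Falcon, and the completion kept as a new synthetic sample labeled as an adverse event.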
The other data challenge for healthcare customers is meeting HIPAA compliance requirements. Encryption at rest and in transit has to be incorporated into the solution to meet these requirements.
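As one illustration of where encryption settings enter a training job, the SageMaker Python SDK estimators accept the keyword arguments shown below. This is a sketch only: the helper name `encryption_kwargs` and the key ARN are hypothetical, and these settings alone do not make a solution HIPAA compliant.

```python
def encryption_kwargs(kms_key_arn):
    """Return estimator keyword arguments that enable encryption for a training job."""
    return {
        # Encrypt the storage volume attached to the training instances (at rest).
        "volume_kms_key": kms_key_arn,
        # Encrypt the model artifacts written to Amazon S3 (at rest).
        "output_kms_key": kms_key_arn,
        # Encrypt traffic between containers in distributed training (in transit).
        "encrypt_inter_container_traffic": True,
    }


# Hypothetical AWS KMS key ARN used only for illustration.
kwargs = encryption_kwargs("arn:aws:kms:us-east-1:111122223333:key/EXAMPLE")
print(kwargs)
```

The dictionary could then be unpacked into an estimator constructor, for example `PyTorch(..., **kwargs)`.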
Transformers, BERT, and GPT
The transformer is a neural network architecture used for natural language processing (NLP) tasks. It was first introduced in the paper “Attention Is All You Need” by Vaswani et al. (2017) and is based on the attention mechanism, which lets the model attend to specific words in the input sequence when producing each output. This allows the model to learn long-range dependencies between words, which is essential for many NLP tasks, such as machine translation and text summarization. As laid out in the original paper, a transformer consists of two main components, both of which use attention: the encoder, which turns the input sequence into a sequence of hidden states, and the decoder, which takes these hidden states as input and produces the output sequence.
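To make the attention mechanism concrete, the following is a minimal pure-Python sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, as defined in the paper. It omits the learned projections, multiple heads, and batching that a real transformer uses, and the toy input vectors are made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Three toy token vectors; self-attention uses the same vectors as Q, K, and V.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = attention(x, x, x)
print(context)
```

Because every position scores every other position directly, the distance between two words in the sequence does not limit how strongly they can influence each other.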
One of the more popular and useful of the transformer architectures, Bidirectional Encoder Representations from Transformers (BERT), is a language representation model that was introduced in 2018. BERT is trained on sequences where some of the words in a sentence are masked, and it has to fill in those words taking into account both the words before and after the masked words. BERT can be fine-tuned for a variety of NLP tasks, including question answering, natural language inference, and sentiment analysis.
The other popular transformer architecture that has taken the world by storm is the Generative Pre-trained Transformer (GPT). The first GPT model was introduced in 2018 by OpenAI. It is trained to predict the next word in a sequence using only the context that comes before that word. GPT models are trained on a massive dataset of text and code, and they can be fine-tuned for a range of NLP tasks, including text generation, question answering, and summarization.
In general, BERT is better at tasks that require deeper understanding of the context of words, whereas GPT is better suited for tasks that require generating text.
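The difference in context between the two model families can be illustrated with attention masks. In this sketch, a 1 at row i, column j means position i may attend to position j. The function names and the example sentence are assumptions made for illustration.

```python
def bidirectional_mask(n):
    """BERT-style visibility: every position sees every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """GPT-style visibility: each position sees only itself and earlier positions."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "drug", "caused", "nausea"]
bert_view = bidirectional_mask(len(tokens))
gpt_view = causal_mask(len(tokens))

for token, bert_row, gpt_row in zip(tokens, bert_view, gpt_view):
    print(f"{token:>8}  BERT {bert_row}  GPT {gpt_row}")
```

A classifier built on the bidirectional view can use words on both sides of a drug name when deciding whether a sentence describes an adverse event, which is consistent with choosing a BERT-family model for this task.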
Hugging Face
Hugging Face is an artificial intelligence company that specializes in NLP. It provides a platform with tools and resources that enable developers to build, train, and deploy ML models focused on NLP tasks. One of the key offerings of Hugging Face is its library, Transformers, which includes pre-trained models that can be fine-tuned for various language tasks such as text classification, translation, summarization, and question answering.
Hugging Face integrates seamlessly with SageMaker, which is a fully managed service that enables developers and data scientists to build, train, and deploy ML models at scale. This synergy benefits users by providing a robust and scalable infrastructure to handle NLP tasks with the state-of-the-art models that Hugging Face offers, combined with the powerful and flexible ML services from AWS. You can also access Hugging Face models directly from Amazon SageMaker JumpStart, making it convenient to start with pre-built solutions.
Solution overview
We used the Hugging Face Transformers library to fine-tune transformer models on SageMaker for the task of adverse event classification. The training job is built using the SageMaker PyTorch estimator. SageMaker JumpStart also has some complementary integrations with Hugging Face that make this straightforward to implement. In this section, we describe the major steps involved in data preparation and model training.
Data preparation
We used the Adverse Drug Reaction Data (ade_corpus_v2) dataset from Hugging Face with an 80/20 training/test split. The required data structure for our model training and inference has two columns:

One column for text content as model input data.
Another column for the label class. We have two possible classes for a text: Not_AE and Adverse_Event.
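The two-column structure and the 80/20 split can be sketched in plain Python as follows. The records here are placeholders rather than rows from ade_corpus_v2, and the helper name `train_test_split` and the fixed seed are assumptions made for the example.

```python
import random

LABELS = ("Not_AE", "Adverse_Event")


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split off test_fraction of them for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records with one text column and one label column.
records = [
    {"text": f"placeholder sentence {i}", "label": LABELS[i % 2]}
    for i in range(10)
]

train, test = train_test_split(records)
print(len(train), len(test))
```

With skewed classes, a stratified split that preserves the label ratio in both subsets may be preferable to the simple shuffle shown here.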

Model training and experimentation
To efficiently explore the space of possible Hugging Face models to fine-tune on our combined data of adverse events, we constructed a SageMaker hyperparameter optimization (HPO) job and passed in different Hugging Face models as a hyperparameter, along with other important hyperparameters such as training batch size, sequence length, and learning rate. The training jobs used an ml.p3dn.24xlarge instance and took an average of 30 minutes per job with that instance type. Training metrics were captured through the Amazon SageMaker Experiments tool, and each training job ran through 10 epochs.
We specify the following in our code:

Training batch size – Number of samples that are processed together before the model weights are updated
Sequence length – Maximum length of the input sequence that BERT can process
Learning rate – How quickly the model updates its weights during training
Models – Hugging Face pretrained models

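# We use the SageMaker hyperparameter tuner
import sagemaker
from sagemaker.tuner import IntegerParameter, ContinuousParameter, CategoricalParameter
tuning_job_name = 'ade-hpo'
# Define exploration boundaries
hyperparameter_ranges = {
    'learning_rate': ContinuousParameter(5e-6, 5e-4),
    'max_seq_length': CategoricalParameter(['16', '32', '64', '128', '256']),
    'train_batch_size': CategoricalParameter(['16', '32', '64', '128', '256']),
    'model_name': CategoricalParameter(["emilyalsentzer/Bio_ClinicalBERT",
        "dmis-lab/biobert-base-cased-v1.2", "monologg/biobert_v1.1_pubmed", "pritamdeka/BioBert-PubMed200kRCT", "saidhr20/pubmed-biobert-text-classification"]),
}

# Create the optimizer. `estimator` (the SageMaker PyTorch estimator) and
# `inputs_data` (the training input location) are assumed to be defined earlier;
# the objective settings other than metric_definitions are also assumptions.
Optimizer = sagemaker.tuner.HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='f1',
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=[
        {'Name': 'f1',
         'Regex': "f1: ([0-9.]+).*$"}],
    objective_type='Maximize',
    base_tuning_job_name=tuning_job_name,
)
# Launch the tuning job without blocking
Optimizer.fit({'training': inputs_data}, wait=False)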
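The metric definition above tells SageMaker to read the F1 score from the training logs with a regular expression. The following sketch shows how an F1 score is computed from classification counts and how that regex would extract it from a log line. The log line format and the counts are assumptions; the training script must print a line the regex matches.

```python
import re


def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Same pattern as the 'Regex' entry in the metric definition.
METRIC_REGEX = r"f1: ([0-9.]+).*$"

# Made-up counts: true positives, false positives, false negatives.
score = f1_score(tp=90, fp=10, fn=10)

# Hypothetical log line emitted by a training script at the end of an epoch.
log_line = f"epoch 10 f1: {score:.4f}"

match = re.search(METRIC_REGEX, log_line)
parsed = float(match.group(1))
print(log_line, "->", parsed)
```

F1 is a reasonable tuning objective for skewed data because, unlike accuracy, it does not reward a model for simply predicting the majority class.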

The model that performed best in our use case was monologg/biobert_v1.1_pubmed, hosted on Hugging Face. It is a version of the BERT architecture that has been pre-trained on biomedical text from PubMed. Pre-training BERT on this data gives the model extra expertise when it comes to identifying context around medically related scientific terms. This boosts the model’s performance on the adverse event detection task because it has been pre-trained on medically specific language that shows up often in our dataset.
The following table summarizes our evaluation metrics.




BioBERT with HPO
BioBERT with HPO and synthetically generated adverse events

Although these are relatively small and incremental improvements over the base BERT model, this nevertheless demonstrates some viable strategies to improve model performance through these methods. Synthetic data generation with Falcon seems to hold a lot of promise and potential for performance improvements, especially as these generative AI models get better over time.
Clean up
To avoid incurring future charges, delete the resources you created, such as the model and model endpoints, with the following code:

# Delete resources
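A minimal sketch of such a cleanup step is shown here, assuming the resources are a SageMaker endpoint, its endpoint configuration, and a model. The helper name and the resource names are hypothetical. In practice, `sm_client` would be `boto3.client("sagemaker")`; a recording stand-in is used so the sketch runs without AWS credentials.

```python
def delete_inference_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete an endpoint, then its configuration, then the model behind it."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)


class RecordingClient:
    """Stand-in for a SageMaker client that records calls instead of making them."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any method looked up on the stand-in becomes a call recorder.
        def record(**kwargs):
            self.calls.append((name, kwargs))
        return record


# Hypothetical resource names used only for illustration.
dry_run = RecordingClient()
delete_inference_resources(dry_run, "ade-endpoint", "ade-endpoint-config", "ade-model")
for name, kwargs in dry_run.calls:
    print(name, kwargs)
```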

Many pharmaceutical companies today would like to automate the process of identifying adverse events from their customer interactions in a systematic way in order to help improve customer safety and outcomes. As we showed in this post, the fine-tuned LLM BioBERT with synthetically generated adverse events added to the data classifies the adverse events with high F1 scores and can be used to build a HIPAA-compliant solution for our customers.
As always, AWS welcomes your feedback. Please leave your thoughts and questions in the comments section.

About the authors
Zack Peterson is a data scientist in AWS Professional Services. He has been hands-on delivering machine learning solutions to customers for many years and has a master’s degree in Economics.
Dr. Adewale Akinfaderin is a senior data scientist in Healthcare and Life Sciences at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global healthcare customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in Physics and a doctorate degree in Engineering.
Ekta Walia Bhullar, PhD, is a senior AI/ML consultant with the AWS Healthcare and Life Sciences (HCLS) Professional Services business unit. She has extensive experience in the application of AI/ML within the healthcare domain, especially in radiology. Outside of work, when not discussing AI in radiology, she likes to run and hike.
Han Man is a Senior Data Science & Machine Learning Manager with AWS Professional Services based in San Diego, CA. He has a PhD in Engineering from Northwestern University and has several years of experience as a management consultant advising clients in manufacturing, financial services, and energy. Today, he is passionately working with key customers from a variety of industry verticals to develop and implement ML and generative AI solutions on AWS.