Fine-tune Llama 2 using QLoRA and Deploy it on Amazon SageMaker with A …

In this post, we showcase fine-tuning a Llama 2 model using a Parameter-Efficient Fine-Tuning (PEFT) method and deploying the fine-tuned model on AWS Inferentia2. We use the AWS Neuron software development kit (SDK) to access the AWS Inferentia2 device and benefit from its high performance. We then use a large model inference container powered by Deep Java Library (DJLServing) as our model serving solution.
Solution overview
Efficient fine-tuning of Llama 2 using QLoRA
The Llama 2 family of large language models (LLMs) is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Llama 2 was pre-trained on 2 trillion tokens of data from publicly available sources. AWS customers sometimes choose to fine-tune Llama 2 models on their own data to achieve better performance on downstream tasks. However, because of the large number of parameters in a Llama 2 model, full fine-tuning can be prohibitively expensive and time consuming. The Parameter-Efficient Fine-Tuning (PEFT) approach addresses this problem by fine-tuning only a small number of extra model parameters while freezing most parameters of the pre-trained model. For more information on PEFT, refer to this post. In this post, we use QLoRA to fine-tune a Llama 2 7B model.
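To make "a small number of extra model parameters" concrete, the following sketch counts the weights that a low-rank adapter adds to a single linear layer. It assumes the standard LoRA formulation, in which a frozen weight matrix is augmented with two trainable low-rank matrices. The 4096 x 4096 layer size and the rank of 64 are illustrative assumptions, and the function name is hypothetical.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (frozen, trainable) parameter counts for one linear layer with a LoRA adapter."""
    frozen = d_in * d_out              # original weight matrix, kept frozen
    trainable = rank * (d_in + d_out)  # A is (rank x d_in), B is (d_out x rank)
    return frozen, trainable


# Illustrative sizes (assumptions): a square 4096 x 4096 projection and rank 64.
frozen, trainable = lora_param_counts(d_in=4096, d_out=4096, rank=64)
ratio = trainable / frozen
print(f"frozen={frozen:,} trainable={trainable:,} ratio={ratio:.4f}")
```

Under these assumptions the adapter trains roughly 3% as many weights as the layer it is attached to, which is why the optimizer state and gradients stay small.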
Deploy a fine-tuned Model on Inf2 using Amazon SageMaker
AWS Inferentia2 is a purpose-built machine learning (ML) accelerator designed for inference workloads. It delivers high performance at up to 40% lower cost for generative AI and LLM workloads compared with other inference-optimized instances on AWS. In this post, we use an Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instance featuring AWS Inferentia2, the second-generation Inferentia accelerator. Each accelerator contains two NeuronCores-v2. Each NeuronCore-v2 is an independent, heterogeneous compute unit with four main engines: Tensor, Vector, Scalar, and GPSIMD. It also includes on-chip, software-managed SRAM for maximizing data locality. Because several posts on Inf2 have already been published, refer to this post and our documentation for more information on Inf2.
To deploy models on Inf2, we need the AWS Neuron SDK as the software layer running on top of the Inf2 hardware. AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances. It supports the end-to-end ML development lifecycle: building new models, training and optimizing them, and deploying them to production. AWS Neuron includes a deep learning compiler, runtime, and tools that are natively integrated with popular frameworks like TensorFlow and PyTorch. In this post, we use transformers-neuronx, which is the part of the AWS Neuron SDK for transformer decoder inference workflows. It supports a range of popular models, including Llama 2.
To deploy models on Amazon SageMaker, we usually use a container that contains the required libraries, such as the Neuron SDK and transformers-neuronx, as well as the model serving component. Amazon SageMaker maintains deep learning containers (DLCs) with popular open source libraries for hosting large models. In this post, we use the Large Model Inference (LMI) Container for Neuron. This container has everything you need to deploy your Llama 2 model on Inf2. For resources to get started with LMI on Amazon SageMaker, refer to our existing posts on this topic (blog 1, blog 2, blog 3). In short, you can run the container without writing any additional code. You can use the default handler for a seamless user experience and pass in one of the supported model names and any load-time configurable parameters. The container then compiles and serves the LLM on an Inf2 instance. For example, to deploy OpenAssistant/llama2-13b-orca-8k-3319, you can provide the following serving.properties configuration (as a file). In serving.properties, we specify the model type as llama2-13b-orca-8k-3319, the batch size as 4, and the tensor parallel degree as 2, and that is it. For the full list of configurable parameters, refer to All DJL configuration options.

# Engine to use: MXNet, PyTorch, TensorFlow, ONNX, PaddlePaddle, DeepSpeed, etc.
engine = Python
# default handler for model serving
option.entryPoint = djl_python.transformers_neuronx
# The Hugging Face ID of a model or the s3 url of the model artifacts.
option.model_id = meta-llama/Llama-2-7b-chat-hf
#the dynamic batch size, default is 1.
# This option specifies number of tensor parallel partitions performed on the model.
# The input sequence length
#Enable iteration level batching using one of “auto”, “scheduler”, “lmi-dist”
# The data type to which you plan to cast the model default
# worker load model timeout
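As a rough illustration of how such a configuration can be produced programmatically, the following sketch renders a dictionary of options as key=value lines. The model name, batch size, and tensor parallel degree come from the description above. The helper name is hypothetical, and the option.batch_size key name is an assumption that should be checked against the DJL configuration reference.

```python
def render_serving_properties(options: dict[str, object]) -> str:
    """Render option key/value pairs as DJL-style 'key=value' lines."""
    lines = ["engine=Python"]
    lines += [f"option.{key}={value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


# Values taken from the prose above; key names other than entryPoint, model_id,
# and tensor_parallel_degree are assumptions.
example = render_serving_properties(
    {
        "entryPoint": "djl_python.transformers_neuronx",
        "model_id": "OpenAssistant/llama2-13b-orca-8k-3319",
        "batch_size": 4,
        "tensor_parallel_degree": 2,
    }
)
print(example)
```

Generating the file from a dictionary keeps the settings next to the deployment code, so a change to the batch size or parallel degree is made in one place.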

Alternatively, you can write your own model handler file as shown in this example, but that requires implementing the model loading and inference methods to serve as a bridge between the DJLServing APIs and, in this case, the transformers-neuronx APIs.
Prerequisites
The following list outlines the prerequisites for deploying the model described in this post. You can implement the solution either from the AWS Management Console or using the latest version of the AWS Command Line Interface (AWS CLI).

Amazon SageMaker
Amazon SageMaker Domain
Amazon SageMaker Python SDK

In the following sections, we’ll walk through the code in two parts:

Fine-tune a Llama2-7b model and upload the model artifacts to a specified Amazon S3 bucket location.
Deploy the model to an Inferentia2 instance using the DJL serving container hosted in Amazon SageMaker.

The complete code samples with instructions can be found in this GitHub repository.
Part 1: Fine-tune a Llama2-7b model using PEFT
We use the method introduced in the paper QLoRA: Efficient Finetuning of Quantized LLMs by Tim Dettmers et al. QLoRA is a technique that reduces the memory footprint of large language models during fine-tuning without sacrificing performance.
Note: The fine-tuning of the llama2-7b model shown in the following sections was tested on an Amazon SageMaker Studio notebook with the PyTorch 2.0 GPU Optimized kernel using an ml.g5.2xlarge instance type. As a best practice, we recommend using an Amazon SageMaker Studio Integrated Development Environment (IDE) launched in your own Amazon Virtual Private Cloud (Amazon VPC). This allows you to control, monitor, and inspect network traffic within and outside your VPC using standard AWS networking and security capabilities. For more information, see Securing Amazon SageMaker Studio connectivity using a private VPC.
Quantize the base model
We first load the base model with 4-bit quantization using the Hugging Face transformers library as follows:

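import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# The base pretrained model for fine-tuning
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Activate 4-bit precision base model loading
use_4bit = True
bnb_4bit_compute_dtype = "float16"  # compute dtype for the 4-bit base model
bnb_4bit_quant_type = "nf4"  # quantization type
use_nested_quant = False  # nested (double) quantization

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

# Assumption: the arguments are inferred from the variables defined above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0},  # assumption: place the whole model on GPU 0
)
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)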
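To see why 4-bit loading matters on a single GPU, the following back-of-the-envelope sketch estimates the memory needed for the weights alone. It assumes a nominal 7 billion parameters and ignores quantization constants, activations, gradients, and optimizer state, so the real footprint is higher. The function name is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


n_params = 7e9  # nominal size of a "7B" model (assumption: the exact count differs slightly)
fp16_gib = weight_memory_gib(n_params, 16)
nf4_gib = weight_memory_gib(n_params, 4)
print(f"fp16 weights: {fp16_gib:.1f} GiB, 4-bit weights: {nf4_gib:.1f} GiB")
```

The weights shrink by a factor of four, which leaves headroom on a 24 GB accelerator for the adapter, activations, and optimizer state.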

Load training dataset
Next, we load the dataset used for the fine-tuning step, as follows:

from datasets import load_dataset

# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

Attach an adapter layer
Here we attach a small, trainable adapter layer, configured through the LoraConfig class defined in the Hugging Face peft library.

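from peft import LoraConfig

# Names of the linear layers to apply LoRA to.
# find_all_linear_names is a helper function defined elsewhere in the notebook.
modules = find_all_linear_names(model)

# Set up the LoRA configuration
# LoRA attention dimension (rank)
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

# Assumption: bias and task_type are typical choices for a causal language model
peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=modules,
    bias="none",
    task_type="CAUSAL_LM",
)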
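The snippet above calls find_all_linear_names without showing its body. The following is a hypothetical reconstruction of what such a helper commonly looks like, not necessarily the authors' version. It assumes the base model was loaded in 4-bit with bitsandbytes, so the layers of interest are instances of bnb.nn.Linear4bit. The linear_cls parameter is an addition that makes the function usable with other layer types.

```python
def find_all_linear_names(model, linear_cls=None):
    """Collect the leaf names of all modules of type linear_cls, excluding the output head."""
    if linear_cls is None:
        # Assumption: the base model was loaded in 4-bit with bitsandbytes.
        import bitsandbytes as bnb

        linear_cls = bnb.nn.Linear4bit

    names = set()
    for full_name, module in model.named_modules():
        if isinstance(module, linear_cls):
            # "model.layers.0.self_attn.q_proj" -> "q_proj"
            names.add(full_name.split(".")[-1])

    names.discard("lm_head")  # the output head is usually left without an adapter
    return sorted(names)
```

Targeting every linear projection, rather than only the attention query and value projections, follows the recommendation in the QLoRA paper; it increases the adapter size but usually improves quality.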

Train a model
Using the LoRA configuration shown above, we fine-tune the Llama 2 model with a set of training hyperparameters. The following code snippet trains the model:

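from trl import SFTTrainer
from transformers import TrainingArguments

# Set training parameters
training_arguments = TrainingArguments(…)

# Assumption: all arguments except peft_config are inferred from earlier snippets
trainer = SFTTrainer(
    model=model, train_dataset=dataset, dataset_text_field="text",
    peft_config=peft_config,  # LoRA config
    tokenizer=tokenizer, args=training_arguments,
)

# Train model
trainer.train()

# Save trained model (new_model is the adapter output directory)
trainer.model.save_pretrained(new_model)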

Merge model weights
The fine-tuning run above produced a new model containing the trained LoRA adapter weights. In the following code snippet, we merge the adapter with the base model so that we can use the fine-tuned model for inference.

from peft import PeftModel

# Reload model in FP16 and merge it with LoRA weights
# Assumption: low_cpu_mem_usage is a typical choice; only FP16 is stated
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

save_dir = "merged_model"
model.save_pretrained(save_dir, safe_serialization=True, max_shard_size="2GB")

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.save_pretrained(save_dir)

Upload model weights to Amazon S3
In the final step of part 1, we save the merged model weights to a specified Amazon S3 location. The model weights will be used by a model serving container in Amazon SageMaker to host the model on an Inferentia2 instance.

model_data_s3_location = "s3://<bucket_name>/<prefix>/"
!cd {save_dir} && aws s3 cp --recursive . {model_data_s3_location}
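If you prefer the AWS SDK for Python to the CLI, the same recursive copy can be sketched as follows. The function names are hypothetical. plan_uploads only computes which local file maps to which bucket and key, so it can be inspected without AWS access, and upload_directory then passes each triple to the boto3 upload_file call.

```python
from pathlib import Path
from urllib.parse import urlparse


def plan_uploads(local_dir: str, s3_uri: str) -> list[tuple[str, str, str]]:
    """Map every file under local_dir to a (local_path, bucket, key) triple."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    prefix = parsed.path.strip("/")
    root = Path(local_dir)
    plan = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root).as_posix()
        key = f"{prefix}/{relative}" if prefix else relative
        plan.append((str(path), parsed.netloc, key))
    return plan


def upload_directory(local_dir: str, s3_uri: str) -> None:
    """Upload the planned files with boto3 (requires AWS credentials)."""
    import boto3

    s3 = boto3.client("s3")
    for local_path, bucket, key in plan_uploads(local_dir, s3_uri):
        s3.upload_file(local_path, bucket, key)
```

Separating the plan from the upload makes it easy to print and review the target keys before any data leaves the notebook.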

Part 2: Host QLoRA model for inference with AWS Inf2 using SageMaker LMI Container
In this section, we’ll walk through the steps of deploying a QLoRA fine-tuned model into an Amazon SageMaker hosting environment. We’ll use a DJL serving container from SageMaker DLC, which integrates with the transformers-neuronx library to host this model. The setup facilitates the loading of models onto AWS Inferentia2 accelerators, parallelizes the model across multiple NeuronCores, and enables serving via HTTP endpoints.
Prepare model artifacts
DJL Serving supports many deep learning optimization libraries, including DeepSpeed, FasterTransformer and more. For model-specific configurations, we provide a serving.properties file with key parameters, such as tensor_parallel_degree and model_id, to define the model loading options. The model_id can be a Hugging Face model ID or an Amazon S3 path where the model weights are stored. In our example, we provide the Amazon S3 location of our fine-tuned model. The following code snippet shows the properties used for the model serving:

option.model_id=<model data s3 location>

Refer to this documentation for more information about the configurable options available via serving.properties. Note that we use option.n_positions=512 in this post for faster AWS Neuron compilation. If you want to try a larger input token length, we recommend pre-compiling the model ahead of time (see AOT Pre-Compile Model on EC2). Otherwise, you might run into a timeout error if compilation takes too long.
After the serving.properties file is defined, we package it into a tar.gz archive, as follows:

mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel
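The same packaging step can also be done from Python without creating a temporary directory. The following is a minimal sketch using only the standard library. The function name is hypothetical, and it assumes the archive should contain a single mymodel/serving.properties entry, matching the shell commands above.

```python
import io
import tarfile


def package_serving_properties(properties_text: str, archive_path: str = "mymodel.tar.gz") -> str:
    """Write properties_text as mymodel/serving.properties inside a gzip-compressed tarball."""
    payload = properties_text.encode("utf-8")
    info = tarfile.TarInfo(name="mymodel/serving.properties")
    info.size = len(payload)  # the size must be set before adding an in-memory file
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.addfile(info, io.BytesIO(payload))
    return archive_path
```

Building the archive in memory avoids leaving a stale mymodel directory behind if the notebook cell is interrupted.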

Then, we’ll upload the tar.gz to an Amazon S3 bucket location:

import sagemaker

sess = sagemaker.session.Session()  # SageMaker session used for uploads and deployment

s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --> {code_artifact}")

Create an Amazon SageMaker model endpoint
To use an Inf2 instance for serving, we use an Amazon SageMaker LMI container with DJL neuronX support. Please refer to this post for more information about using a DJL NeuronX container for inference. The following code shows how to deploy a model using Amazon SageMaker Python SDK:

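import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

# Retrieves the DJL-neuronx docker image URI (the version is a placeholder)
image_uri = image_uris.retrieve(
    framework="djl-neuronx", region=sess.boto_session.region_name, version="<version>"
)

# Define inf2 instance type to use for serving
instance_type = "ml.inf2.48xlarge"

endpoint_name = sagemaker.utils.name_from_base("lmi-model")

# Deploy the model for inference (assumption: a minimal deploy call)
role = sagemaker.get_execution_role()
model = Model(image_uri=image_uri, model_data=code_artifact, role=role)
model.deploy(initial_instance_count=1, instance_type=instance_type, endpoint_name=endpoint_name)

# our requests and responses will be in json format so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name, sagemaker_session=sess,
    serializer=serializers.JSONSerializer(), deserializer=deserializers.JSONDeserializer(),
)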

Test model endpoint
After the model is deployed successfully, we can validate the endpoint by sending a sample request to the predictor:

prompt = "What is machine learning?"
input_data = f"<s>[INST] <<SYS>>\nAs a data scientist\n<</SYS>>\n{prompt} [/INST]"

response = predictor.predict(
    {"inputs": input_data, "parameters": {"max_new_tokens": 300, "do_sample": True}}
)
print(response)
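The request above wraps the question in the Llama 2 chat markers by hand. As a small convenience, the following sketch builds the same string and payload from a user message and an optional system prompt. The function names are hypothetical, and the template mirrors the one used in this post rather than any official reference implementation.

```python
def build_llama2_prompt(user_message: str, system_prompt: str = "") -> str:
    """Wrap a single-turn message in the chat markers used in this post."""
    if system_prompt:
        return f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n{user_message} [/INST]"
    return f"<s>[INST] {user_message} [/INST]"


def build_request(user_message: str, system_prompt: str = "", max_new_tokens: int = 300) -> dict:
    """Build the JSON-serializable payload expected by predictor.predict."""
    return {
        "inputs": build_llama2_prompt(user_message, system_prompt),
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": True},
    }


payload = build_request("What is machine learning?", "As a data scientist")
```

The resulting dictionary can be passed directly to predictor.predict(payload), and keeping the template in one function avoids subtle marker typos across requests.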


The sample output is shown as follows:

In the context of data analysis, Machine Learning (ML) refers to a statistical technique capable of extracting predictive power from a dataset with an increasing complexity and accuracy by iteratively narrowing down the scope of a statistic.
Machine Learning is not a new statistical technique, but rather a combination of existing techniques. Furthermore, it has not been designed to be used with a specific dataset or to produce a specific outcome. Rather, it was designed to be flexible enough to adapt to any dataset and to make predictions about any outcome.

Clean up
If you decide that you no longer want to keep the SageMaker endpoint running, you can delete it using the AWS SDK for Python (boto3), the AWS CLI, or the Amazon SageMaker console. You can also shut down any Amazon SageMaker Studio resources that are no longer required.
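As one possible way to do this with boto3, the following sketch deletes the endpoint together with its endpoint configuration and the models that configuration references. The function name is hypothetical. The client is passed in as a parameter, so the calls can be reviewed or exercised against a stub before running them against a real account.

```python
def delete_endpoint_resources(sm_client, endpoint_name: str) -> list[str]:
    """Delete an endpoint, its endpoint config, and the models that config references."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return [endpoint_name, config_name, *model_names]


# Usage (requires AWS credentials):
#   import boto3
#   delete_endpoint_resources(boto3.client("sagemaker"), endpoint_name)
```

Looking up the configuration and model names before deleting the endpoint ensures nothing is orphaned, because those names can no longer be resolved from the endpoint once it is gone.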
Conclusion
In this post, we showed you how to fine-tune a Llama2-7b model using a LoRA adapter with 4-bit quantization on a single GPU instance. We then deployed the model to an Inf2 instance hosted in Amazon SageMaker using a DJL serving container. Finally, we validated the Amazon SageMaker model endpoint with a text generation prediction using the SageMaker Python SDK. Go ahead and give it a try; we would love to hear your feedback. Stay tuned for updates on more capabilities and new innovations with AWS Inferentia.
For more examples about AWS Neuron, see aws-neuron-samples.

About the Authors
Wei Teh is a Senior AI/ML Specialist Solutions Architect at AWS. He is passionate about helping customers advance their AWS journey, focusing on Amazon Machine Learning services and machine learning-based solutions. Outside of work, he enjoys outdoor activities like camping, fishing, and hiking with his family.
Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.