Host ML models on Amazon SageMaker using Triton: CV model with PyTorch …

PyTorch is a machine learning (ML) framework based on the Torch library, used for applications such as computer vision and natural language processing. One of the primary reasons that customers are choosing a PyTorch framework is its simplicity and the fact that it’s designed and assembled to work with Python. PyTorch supports dynamic computational graphs, enabling network behavior to be changed at runtime. This provides a major flexibility advantage over the majority of ML frameworks, which require neural networks to be defined as static objects before runtime. In this post, we dive deep to see how Amazon SageMaker can serve these PyTorch models using NVIDIA Triton Inference Server.
SageMaker provides several options for customers who are looking to host their ML models. One of the key available features is SageMaker real-time inference endpoints. Real-time workloads can have varying levels of performance expectations and service level agreements (SLAs), which materialize as latency and throughput requirements.
With real-time endpoints, different deployment options adjust to different tiers of expected performance. For example, your business may rely on a model that must meet very strict SLAs for latency and throughput with predictable performance. In this case, SageMaker provides single-model endpoints (SMEs), allowing you to deploy a single ML model to a logical endpoint, which will use the underlying server’s networking and compute capacity. For other use cases where you need a better balance between performance and cost, multi-model endpoints (MMEs) allows you to deploy multiple models behind a logical endpoint and invoke them individually, while abstracting their loading and unloading from memory.
SageMaker provides support for single-model and multi-model endpoints through NVIDIA Triton Inference Server. Triton supports various backends as engines to power the running and serving of different framework models, like PyTorch, TensorFlow, TensorRT, or ONNX Runtime. For any Triton deployment, it’s crucial to understand how the backend behavior impacts your workload and what to expect from its unique configuration parameters. In this post, we help you understand the Triton PyTorch backend in depth.
Triton with PyTorch backend
The PyTorch backend is designed to run TorchScript models using the PyTorch C++ API. TorchScript is a static subset of Python that captures the structure of a PyTorch model. To use this backend, you need to convert your PyTorch model to TorchScript using Just-In-Time (JIT) compilation. JIT compiles the TorchScript code into an optimized intermediate representation, making it suitable for deployment in non-Python environments. Triton uses TorchScript for improved performance and flexibility.
Each model deployed with Triton requires a configuration file (config.pbtxt) that specifies model metadata, such as input and output tensors, model name, and platform. The configuration file is essential for Triton to understand how to load, run, and optimize the model. For PyTorch models, the platform field in the configuration file should be set to pytorch_libtorch. You can load Triton PyTorch models on GPU and CPU (see Multiple Model Instances) and model weights will be kept either in GPU memory/VRAM or in host memory/RAM correspondingly.
Note that only the model’s forward method will be called when using the Pytorch backend; if you rely on more complex logic to prepare, iterate, and transform your raw model’s predictions to respond to a request, you should wrap it as a custom model forward. Alternatively, you can use ensemble models or business logic scripting.
You can optimize PyTorch model performance on Triton by using a combination of available configuration-based features. Some of these are backend-agnostic, like dynamic batching and concurrent model runs (see Achieve hyperscale performance for model serving using NVIDIA Triton Inference Server on Amazon SageMaker to learn more), and some are PyTorch-specific. Let’s take a deeper look into these configuration parameters and how you should use them:

DISABLE_OPTIMIZED_EXECUTION – Use this parameter to optimize running TorchScript models. This parameter slows down the initial call to a loaded TorchScript model, and might not benefit or even hinder model performance in some cases. Set to false if your tolerance to scaling or cold start latency is very low.
INFERENCE_MODE – Use this parameter to toggle PyTorch inference mode. In inference mode, computations aren’t recorded in the backward graph, and it allows PyTorch to speed up your model. This better runtime comes with a drawback: you won’t be able to use tensors created in inference mode in computations to be recorded by autograd after exiting inference mode. Set to true if the preceding conditions apply to your use case (mostly true for inference workloads).
ENABLE_NVFUSER – Use this parameter to enable NvFuser (CUDA Graph Fuser) optimization for TorchScript models. If not specified, the default PyTorch fuser is used.
ENABLE_WEIGHT_SHARING – Use this parameter to allow model instances (copies) on the same device to share weights. This can reduce memory usage of model loading and inference. It should not be used with models that maintain state.
ENABLE_CACHE_CLEANING – Use this parameter to enable CUDA cache cleaning after each model run (only has an effect if the model is on GPU). Setting this flag to true will negatively impact the performance due to additional CUDA cache cleaning operations after each model run. You should only use this flag if you serve multiple models with Triton and encounter CUDA out of memory issues during model runs.
ENABLE_JIT_EXECUTOR, ENABLE_JIT_PROFILING, and ENABLE_TENSOR_FUSER – Use these parameters to disable certain PyTorch optimizations that can sometimes cause latency regressions in models with complex run modes and dynamic shapes.

Triton Inference on SageMaker
SageMaker allows you to deploy both SMEs and MMEs with NVIDIA Triton Inference Server. The following figure shows Triton’s high-level architecture. The model repository is a file system-based repository of the models that Triton will make available for inferencing. Inference requests arrive at the server via HTTPS and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The backend performs inferencing using the inputs provided in the batched requests and the outputs are then returned.
When configuring your auto scaling groups for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance as the primary criteria to determine the scaling characteristics of your auto scaling group. In addition, based on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. Note that for SMEs, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For MMEs, we recommend deploying similar models behind a given endpoint to have more steady, predictable performance. In use cases where models of varying sizes and requirements are used, you may want to separate those workloads across multiple MMEs, or spend added time fine-tuning their auto scaling group policy to obtain the best cost and performance balance. See Model hosting patterns in Amazon SageMaker, Part 3: Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints for more information on auto scaling policy considerations for MMEs. (Note that although the MMS configurations don’t apply in this case, the policy considerations still do.)

For a list of NVIDIA Triton Deep Learning Containers (DLCs) supported by SageMaker inference, refer to Available Deep Learning Containers Images.
Solution overview
In the following sections, we walk through an example available on GitHub to understand how we can use Triton and SageMaker MMEs on GPU to deploy a ResNet model for image classification. For demonstration purposes, we use a pre-trained ResNet50 model that can classify images into 1,000 categories.
Prerequisites
You first need an AWS account and an AWS Identity and Access Management (IAM) administrator user. For instructions on how to set up an AWS account, see How do I create and activate a new AWS account. For instructions on how to secure your account with an IAM administrator user, see Creating your first IAM admin user and user group.
SageMaker needs access to the Amazon Simple Storage Service (Amazon S3) bucket that stores your model. Create an IAM role with a policy that gives SageMaker read access to your bucket.
If you plan to run the notebook in Amazon SageMaker Studio, refer to Get Started for setup instructions.
Set up your environment
To set up your environment, complete the following steps:
Launch a SageMaker notebook instance with a g5.xlarge instance.
You can also run this example on a Studio notebook instance.

Select Clone a public git repository to this notebook instance only and specify the GitHub repository URL.
When JupyterLab is ready, launch the resnet_pytorch_python_backend_MME.ipynb notebook with the conda_python3 conda kernel and run through this notebook step by step.

Install the dependencies and import the required library
Use the following code to install dependencies and import the required library:

!pip install nvidia-pyindex –quiet
!pip install tritonclient[http] –quiet

# imports
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
import numpy as np
from PIL import Image
import tritonclient.http as httpclient
# variables
s3_client = boto3.client(“s3″)

# sagemaker variables
role = get_execution_role()
sm_client = boto3.client(service_name=”sagemaker”)
runtime_sm_client = boto3.client(“sagemaker-runtime”)
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
bucket = sagemaker_session.default_bucket()

Prepare the model artifacts
The generate_model_pytorch.sh file in the workspace directory contains scripts to load and save a PyTorch model. First, we load a pre-trained ResNet50 model using the torchvision models package. We save the model as a model.pt file in TorchScript optimized and serialized format. TorchScript needs example inputs to do a model forward pass, so we pass one instance of an RGB image with three color channels of dimension 224X224. The script for exporting this model can be found on the GitHub repo.

!docker run –gpus=all –rm -it 
-v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:23.02-py3 
            /bin/bash generate_model_pytorch.sh

Triton has specific requirements for model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model, as shown in the following example. The value 1 represents version 1 of our Pytorch model. Each model is run by its specific backend, so each version subdirectory must contain the model artifact required by that backend. Because we’re using a PyTorch backend, a model.pt file is required within the version directory. For more details on naming conventions for model files, refer to Model Files.

Every Triton model must also provide a config.pbtxt file describing the model configuration. To learn more about the config settings, refer to Model Configuration. Out config.pbtxt file specifies the backend as pytorch_libtorch, and defines input and output tensor shapes and data type information. We also specify that we want to run this model on the GPU via the instance_group parameter. See the following code:

name: “resnet”
platform: “pytorch_libtorch”

max_batch_size: 128
input {
  name: “INPUT__0”
  data_type: TYPE_FP32
  dims: 3
  dims: 224
  dims: 224
}
output {
  name: “OUTPUT__0″
  data_type: TYPE_FP32
  dims: 1000
}

instance_group [
{
count: 1
kind: KIND_GPU
}

For the instance_group config, when simply a count is specified, Triton loads x counts of the model on each available GPU device. If you want to control which GPU devices to load your models on, you can do so explicitly by specifying the GPU device IDs. Note that for MMEs, explicitly specifying such GPU device IDs might lead to poor memory management because multiple models may explicitly try to allocate the same GPU device.
We then tar.gz the model artifacts, which is the format expected by SageMaker:

!tar -C triton-serve-pt/ -czf resnet_pt_v0.tar.gz 
resnetmodel_uri_pt = sagemaker_session.upload_data(path=”resnet_pt_v0.tar.gz”, key_prefix=prefix)

Now that we have uploaded the model artifacts to Amazon S3, we can create a SageMaker multi-model endpoint.
Deploy the model
We now deploy the Triton model to a SageMaker MME. In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker MME will use to load and serve predictions. Set Mode to MultiModel to indicate SageMaker will create the endpoint with MME container specifications. We set the container with an image that supports deploying MMEs with GPU (refer to the MME container images for more details). Note that the parameter  mode is set to MultiModel. This is the key differentiator.

container = {“Image”: mme_triton_image_uri, “ModelDataUrl”: model_data_url, “Mode”: “MultiModel”}

Using the SageMaker Boto3 client, create the model using the create_model API. We pass the container definition to the create_model API along with ModelName and ExecutionRoleArn:

create_model_response = sm_client.create_model(
ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)
print(“Model Arn: ” + create_model_response[“ModelArn”])

Create MME configurations using the create_endpoint_config Boto3 API. Specify an accelerated GPU computing instance in InstanceType (for this post, we use a g4dn.4xlarge instance). We recommend configuring your endpoints with at least two instances. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.

create_endpoint_config_response = sm_client.create_endpoint_config(
EndpointConfigName=endpoint_config_name,
ProductionVariants=[
{
“InstanceType”: “ml.g4dn.4xlarge”,
“InitialVariantWeight”: 1,
“InitialInstanceCount”: 1,
“ModelName”: sm_model_name,
“VariantName”: “AllTraffic”,
}
],
)
print(“Endpoint Config Arn: ” + create_endpoint_config_response[“EndpointConfigArn”])

Using the preceding endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService when the deployment is successful.

create_endpoint_response = sm_client.create_endpoint(
EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(“Endpoint Arn: ” + create_endpoint_response[“EndpointArn”])

Invoke the model and run predictions
The following method transforms a sample image we will be using for inference into the payload that can be sent for inference to the Triton server.
The tritonclient package provides utility methods to generate the payload without having to know the details of the specification. We use the following methods to convert our inference request into a binary format, which provides lower latencies for inference:

s3_client.download_file(
    “sagemaker-sample-files”, “datasets/image/pets/shiba_inu_dog.jpg”, “shiba_inu_dog.jpg”
)

def get_sample_image():
    image_path = “./shiba_inu_dog.jpg”
    img = Image.open(image_path).convert(“RGB”)
    img = img.resize((224, 224))
    img = (np.array(img).astype(np.float32) / 255) – np.array(
        [0.485, 0.456, 0.406], dtype=np.float32
    ).reshape(1, 1, 3)
    img = img / np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3)
    img = np.transpose(img, (2, 0, 1))
    return img.tolist()

def _get_sample_image_binary(input_name, output_name):
    inputs = []
    outputs = []
    inputs.append(httpclient.InferInput(input_name, [1, 3, 224, 224], “FP32”))
    input_data = np.array(get_sample_image(), dtype=np.float32)
    input_data = np.expand_dims(input_data, axis=0)
    inputs[0].set_data_from_numpy(input_data, binary_data=True)
    outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True))
    request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
        inputs, outputs=outputs
    )
    return request_body, header_length

def get_sample_image_binary_pt():
    return _get_sample_image_binary(“INPUT__0”, “OUTPUT__0″)

After the endpoint is successfully created, we can send inference requests to the MME using the invoke_enpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type:

request_body, header_length = get_sample_image_binary_pt()
response = runtime_sm_client.invoke_endpoint(
EndpointName=endpoint_name,
ContentType=”application/vnd.sagemaker-triton.binary+json;json-header-size={}”.format(
header_length
),
Body=request_body,
TargetModel=”resnet_pt_v0.tar.gz”,
)
# Parse json header size length from the response
header_length_prefix = “application/vnd.sagemaker-triton.binary+json;json-header-size=”
header_length_str = response[“ContentType”][len(header_length_prefix) :]
# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
response[“Body”].read(), header_length=int(header_length_str)
)
output0_data = result.as_numpy(“OUTPUT__0”)
print(output0_data)

Additionally, SageMaker MMEs provide instance-level metrics to monitor using Amazon CloudWatch:

LoadedModelCount – Number of models loaded in the containers
GPUUtilization – Percentage of GPU units that are used by the containers
GPUMemoryUtilization – Percentage of GPU memory used by the containers
DiskUtilization – Percentage of disk space used by the containers

SageMaker MMEs also provides model loading metrics such as the following:

ModelLoadingWaitTime – Time interval for the model to be downloaded or loaded
ModelUnloadingTime – Time interval to unload the model from the container
ModelDownloadingTime – Time to download the model from Amazon S3
ModelCacheHit – Number of invocations to the model that are already loaded onto the container to get model invocation-level insights

For more details, refer to Monitor Amazon SageMaker with Amazon CloudWatch.
Clean up
In order to avoid incurring charges, delete the model endpoint:

sm_client.delete_model(ModelName=sm_model_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_endpoint(EndpointName=endpoint_name)

Best practices
When using the PyTorch backend, most optimization decisions will depend on your specific workload latency or throughput requirements and what model architecture you are using. In general, in order to do a data-driven comparison of configuration parameters to improve performance, you should use Triton’s Performance Analyzer. With this tool, you should adopt the following decision logic:

Experiment and check if your model architecture can be transformed into a TensorRT engine and deployed with the Triton TensorRT backend. This is the preferable way to deploy models with NVIDIA GPUs because both the TensorRT model format and runtime make the best use of the underlying hardware capabilities.
Always set INFERENCE_MODE to true for pure inference workloads where no autograd calculations are required.
If deploying SMEs, maximize hardware utilization by properly defining instance group configuration according to the available GPU memory or RAM (use the Performance Analyzer tool to find the right size).

For more MME-specific best practices, refer to Model hosting patterns in Amazon SageMaker, Part 3: Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints.
Conclusion
In this post, we dove deep into the PyTorch backend supported by Triton Inference Server, which provides acceleration for both CPU and GPU based models. We went through some of the configuration parameters you can adjust to optimize model performance. Finally, we provided a walkthrough of an example notebook to demonstrate deploying a SageMaker multi-model endpoint deployment. Be sure to try it out!

About the Authors
Neelam Koshiya is an Enterprise Solutions Architect at AWS. With a background in software engineering, she organically moved into an architecture role. Her current focus is helping enterprise customers with their cloud adoption journey for strategic business outcomes with the area of depth being AI/ML. She is passionate about innovation and inclusion. In her spare time, she enjoys reading and being outdoors.
João Moura is an AI/ML Specialist Solutions Architect at AWS, based in Spain. He helps customers with deep learning model training and inference optimization, and more broadly building large-scale ML platforms on AWS. He is also an active proponent of ML-specialized hardware and low-code ML solutions.
Vivek Gangasani is a Senior Machine Learning Solutions Architect at Amazon Web Services. He works with machine learning startups to build and deploy AI/ML applications on AWS. He is currently focused on delivering solutions for MLOps, ML inference, and low-code ML. He has worked on projects in different domains, including natural language processing and computer vision.

<