The success of generative AI applications across a wide range of industries has attracted the attention and interest of companies worldwide who are looking to reproduce and surpass the achievements of competitors or solve new and exciting use cases. These customers are looking into foundation models, such as TII Falcon, Stable Diffusion XL, or OpenAI’s GPT-3.5, as the engines that power the generative AI innovation.
Foundation models are a class of generative AI models that are capable of understanding and generating human-like content, thanks to the vast amounts of unstructured data they have been trained on. These models have revolutionized various computer vision (CV) and natural language processing (NLP) tasks, including image generation, translation, and question answering. They serve as the building blocks for many AI applications and have become a crucial component in the development of advanced intelligent systems.
However, the deployment of foundation models can come with significant challenges, particularly in terms of cost and resource requirements. These models are known for their size, often ranging from hundreds of millions to billions of parameters. Their large size demands extensive computational resources, including powerful hardware and significant memory capacity. In fact, deploying foundation models usually requires at least one (often more) GPUs to handle the computational load efficiently. For example, the TII Falcon-40B Instruct model requires at least an ml.g5.12xlarge instance to be loaded into memory successfully, but performs best with bigger instances. As a result, the return on investment (ROI) of deploying and maintaining these models can be too low to prove business value, especially during development cycles or for spiky workloads. This is due to the running costs of having GPU-powered instances for long sessions, potentially 24/7.
Earlier this year, we announced Amazon Bedrock, a serverless API to access foundation models from Amazon and our generative AI partners. Although it’s currently in Private Preview, its serverless API allows you to use foundation models from Amazon, Anthropic, Stability AI, and AI21, without having to deploy any endpoints yourself. However, open-source models from communities such as Hugging Face have been growing a lot, and not every one of them has been made available through Amazon Bedrock.
In this post, we target these situations and solve the problem of risking high costs by deploying large foundation models to Amazon SageMaker asynchronous endpoints from Amazon SageMaker JumpStart. This can help cut costs of the architecture, allowing the endpoint to run only when requests are in the queue and for a short time-to-live, while scaling down to zero when no requests are waiting to be serviced. This sounds great for a lot of use cases; however, an endpoint that has scaled down to zero will introduce a cold start time before being able to serve inferences.
Solution overview
The following diagram illustrates our solution architecture.
The architecture we deploy is very straightforward:
The user interface is a notebook, which can be replaced by a web UI built on Streamlit or similar technology. In our case, the notebook is an Amazon SageMaker Studio notebook, running on an ml.m5.large instance with the PyTorch 2.0 Python 3.10 CPU kernel.
The notebook queries the endpoint in three ways: the SageMaker Python SDK, the AWS SDK for Python (Boto3), and LangChain.
The endpoint is running asynchronously on SageMaker, and on the endpoint, we deploy the Falcon-40B Instruct model. It’s currently the state of the art in terms of instruct models and available in SageMaker JumpStart. A single API call allows us to deploy the model on the endpoint.
What is SageMaker asynchronous inference
SageMaker asynchronous inference is one of the four deployment options in SageMaker, together with real-time endpoints, batch inference, and serverless inference. To learn more about the different deployment options, refer to Deploy models for Inference.
SageMaker asynchronous inference queues incoming requests and processes them asynchronously, making this option ideal for requests with large payload sizes up to 1 GB, long processing times, and near-real-time latency requirements. However, the main advantage that it provides when dealing with large foundation models, especially during a proof of concept (POC) or during development, is the capability to configure asynchronous inference to scale in to an instance count of zero when there are no requests to process, thereby saving costs. For more information about SageMaker asynchronous inference, refer to Asynchronous inference. The following diagram illustrates this architecture.
To deploy an asynchronous inference endpoint, you need to create an AsyncInferenceConfig object. If you create AsyncInferenceConfig without specifying its arguments, the default S3OutputPath will be s3://sagemaker-{REGION}-{ACCOUNTID}/async-endpoint-outputs/{UNIQUE-JOB-NAME} and S3FailurePath will be s3://sagemaker-{REGION}-{ACCOUNTID}/async-endpoint-failures/{UNIQUE-JOB-NAME}.
What is SageMaker JumpStart
Our model comes from SageMaker JumpStart, a feature of SageMaker that accelerates the machine learning (ML) journey by offering pre-trained models, solution templates, and example notebooks. It provides access to a wide range of pre-trained models for different problem types, allowing you to start your ML tasks with a solid foundation. SageMaker JumpStart also offers solution templates for common use cases and example notebooks for learning. With SageMaker JumpStart, you can reduce the time and effort required to start your ML projects with one-click solution launches and comprehensive resources for practical ML experience.
The following screenshot shows an example of just some of the models available on the SageMaker JumpStart UI.
Deploy the model
Our first step is to deploy the model to SageMaker. To do that, we can use the UI for SageMaker JumpStart or the SageMaker Python SDK, which provides an API that we can use to deploy the model to the asynchronous endpoint:
%%time
from sagemaker.jumpstart.model import JumpStartModel, AsyncInferenceConfig
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
model_id, model_version = “huggingface-llm-falcon-40b-instruct-bf16”, “*”
my_model = JumpStartModel(model_id=model_id)
predictor = my_model.deploy(
initial_instance_count=0,
instance_type=”ml.g5.12xlarge”,
async_inference_config=AsyncInferenceConfig()
)
This call can take approximately10 minutes to complete. During this time, the endpoint is spun up, the container together with the model artifacts are downloaded to the endpoint, the model configuration is loaded from SageMaker JumpStart, then the asynchronous endpoint is exposed via a DNS endpoint. To make sure that our endpoint can scale down to zero, we need to configure auto scaling on the asynchronous endpoint using Application Auto Scaling. You need to first register your endpoint variant with Application Auto Scaling, define a scaling policy, and then apply the scaling policy. In this configuration, we use a custom metric using CustomizedMetricSpecification, called ApproximateBacklogSizePerInstance, as shown in the following code. For a detailed list of Amazon CloudWatch metrics available with your asynchronous inference endpoint, refer to Monitoring with CloudWatch.
import boto3
client = boto3.client(“application-autoscaling”)
resource_id = “endpoint/” + my_model.endpoint_name + “/variant/” + “AllTraffic”
# Configure Autoscaling on asynchronous endpoint down to zero instances
response = client.register_scalable_target(
ServiceNamespace=”sagemaker”,
ResourceId=resource_id,
ScalableDimension=”sagemaker:variant:DesiredInstanceCount”,
MinCapacity=0, # Miminum number of instances we want to scale down to – scale down to 0 to stop incurring in costs
MaxCapacity=1, # Maximum number of instances we want to scale up to – scale up to 1 max is good enough for dev
)
response = client.put_scaling_policy(
PolicyName=”Invocations-ScalingPolicy”,
ServiceNamespace=”sagemaker”, # The namespace of the AWS service that provides the resource.
ResourceId=resource_id, # Endpoint name
ScalableDimension=”sagemaker:variant:DesiredInstanceCount”, # SageMaker supports only Instance Count
PolicyType=”TargetTrackingScaling”, # ‘StepScaling’|’TargetTrackingScaling’
TargetTrackingScalingPolicyConfiguration={
“TargetValue”: 5.0, # The target value for the metric. – here the metric is – SageMakerVariantInvocationsPerInstance
“CustomizedMetricSpecification”: {
“MetricName”: “ApproximateBacklogSizePerInstance”,
“Namespace”: “AWS/SageMaker”,
“Dimensions”: [{“Name”: “EndpointName”, “Value”: my_model.endpoint_name}],
“Statistic”: “Average”,
},
“ScaleInCooldown”: 600, # The amount of time, in seconds, after a scale in activity completes before another scale in activity can start.
“ScaleOutCooldown”: 300, # ScaleOutCooldown – The amount of time, in seconds, after a scale out activity completes before another scale out activity can start.
# ‘DisableScaleIn’: True|False – indicates whether scale in by the target tracking policy is disabled.
# If the value is true, scale in is disabled and the target tracking policy won’t remove capacity from the scalable resource.
},
)
You can verify that this policy has been set successfully by navigating to the SageMaker console, choosing Endpoints under Inference in the navigation pane, and looking for the endpoint we just deployed.
Invoke the asynchronous endpoint
To invoke the endpoint, you need to place the request payload in Amazon Simple Storage Service (Amazon S3) and provide a pointer to this payload as a part of the InvokeEndpointAsync request. Upon invocation, SageMaker queues the request for processing and returns an identifier and output location as a response. Upon processing, SageMaker places the result in the Amazon S3 location. You can optionally choose to receive success or error notifications with Amazon Simple Notification Service (Amazon SNS).
SageMaker Python SDK
After deployment is complete, it will return an AsyncPredictor object. To perform asynchronous inference, you need to upload data to Amazon S3 and use the predict_async() method with the S3 URI as the input. It will return an AsyncInferenceResponse object, and you can check the result using the get_response() method.
Alternatively, if you would like to check for a result periodically and return it upon generation, use the predict() method. We use this second method in the following code:
import time
# Invoking the asynchronous endpoint with the SageMaker Python SDK
def query_endpoint(payload):
“””Query endpoint and print the response”””
response = predictor.predict_async(
data=payload,
input_path=”s3://{}/{}”.format(bucket, prefix),
)
while True:
try:
response = response.get_result()
break
except:
print(“Inference is not ready …”)
time.sleep(5)
print(f”