Elevating the generative AI experience: Introducing streaming support …

We’re excited to announce the availability of response streaming through Amazon SageMaker real-time inference. Now you can continuously stream inference responses back to the client when using SageMaker real-time inference to help you build interactive experiences for generative AI applications such as chatbots, virtual assistants, and music generators. With this new feature, you can start streaming the responses immediately when they’re available instead of waiting for the entire response to be generated. This lowers the time-to-first-byte for your generative AI applications.
In this post, we’ll show how to build a streaming web application using SageMaker real-time endpoints with the new response streaming feature for an interactive chat use case. We use Streamlit for the sample demo application UI.
Solution overview
To get responses streamed back from SageMaker, you can use our new InvokeEndpointWithResponseStream API. It helps enhance customer satisfaction by delivering a faster time-to-first-response-byte. This reduction in customer-perceived latency is particularly crucial for applications built with generative AI models, where immediate processing is valued over waiting for the entire payload. Moreover, it introduces a sticky session that will enable continuity in interactions, benefiting use cases such as chatbots, to create more natural and efficient user experiences.
The implementation of response streaming in SageMaker real-time endpoints is achieved through HTTP 1.1 chunked encoding, which is a mechanism for sending multiple responses. This is a HTTP standard that supports binary content and is supported by most client/server frameworks. HTTP chunked encoding supports both text and image data streaming, which means the models hosted on SageMaker endpoints can send back streamed responses as text or image, such as Falcon, Llama 2, and Stable Diffusion models. In terms of security, both the input and output are secured using TLS using AWS Sigv4 Auth. Other streaming techniques like Server-Sent Events (SSE) are also implemented using the same HTTP chunked encoding mechanism. To take advantage of the new streaming API, you need to make sure the model container returns the streamed response as chunked encoded data.
The following diagram illustrates the high-level architecture for response streaming with a SageMaker inference endpoint.

One of the use cases that will benefit from streaming response is generative AI model-powered chatbots. Traditionally, users send a query and wait for the entire response to be generated before receiving an answer. This could take precious seconds or even longer, which can potentially degrade the performance of the application. With response streaming, the chatbot can begin sending back partial inference results as they are generated. This means that users can see the initial response almost instantaneously, even as the AI continues refining its answer in the background. This creates a seamless and engaging conversation flow, where users feel like they’re chatting with an AI that understands and responds in real time.
In this post, we showcase two container options to create a SageMaker endpoint with response streaming: using an AWS Large Model Inference (LMI) and Hugging Face Text Generation Inference (TGI) container. In the following sections, we walk you through the detailed implementation steps to deploy and test the Falcon-7B-Instruct model using both LMI and TGI containers on SageMaker. We chose Falcon 7B as an example, but any model can take advantage of this new streaming feature.
Prerequisites
You need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution. For details, refer to Creating an AWS account. If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you may need to request a service quota increase for the corresponding SageMaker hosting instances. For the Falcon-7B-Instruct model, we use an ml.g5.2xlarge SageMaker hosting instance. For hosting a Falcon-40B-Instruct model, we use an ml.g5.48xlarge SageMaker hosting instance. You can request a quota increase from the Service Quotas UI. For more information, refer to Requesting a quota increase.
Option 1: Deploy a real-time streaming endpoint using an LMI container
The LMI container is one of the Deep Learning Containers for large model inference hosted by SageMaker to facilitate hosting large language models (LLMs) on AWS infrastructure for low-latency inference use cases. The LMI container uses Deep Java Library (DJL) Serving, which is an open-source, high-level, engine-agnostic Java framework for deep learning. With these containers, you can use corresponding open-source libraries such as DeepSpeed, Accelerate, Transformers-neuronx, and FasterTransformer to partition model parameters using model parallelism techniques to use the memory of multiple GPUs or accelerators for inference. For more details on the benefits using the LMI container to deploy large models on SageMaker, refer to Deploy large models at high performance using FasterTransformer on Amazon SageMaker and Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference. You can also find more examples of hosting open-source LLMs on SageMaker using the LMI containers in this GitHub repo.
For the LMI container, we expect the following artifacts to help set up the model for inference:

serving.properties (required) – Defines the model server settings
model.py (optional) – A Python file to define the core inference logic
requirements.txt (optional) – Any additional pip wheel will need to install

LMI containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom preprocessing of the input data or postprocessing of the model’s predictions. We use the following configuration:

For this example, we host the Falcon-7B-Instruct model. We need to create a serving.properties configuration file with our desired hosting options and package it up into a tar.gz artifact. Response streaming can be enabled in DJL Serving by setting the enable_streaming option in the serving.properties file. For all the supported parameters, refer to Streaming Python configuration.
In this example, we use the default handlers in DJL Serving to stream responses, so we only care about sending requests and parsing the output response. You can also provide an entrypoint code with a custom handler in a model.py file to customize input and output handlers. For more details on the custom handler, refer to Custom model.py handler.
Because we’re hosting the Falcon-7B-Instruct model on a single GPU instance (ml.g5.2xlarge), we set option.tensor_parallel_degree to 1. If you plan to run in multiple GPUs, use this to set the number of GPUs per worker.
We use option.output_formatter to control the output content type. The default output content type is application/json, so if your application requires a different output, you can overwrite this value. For more information on the available options, refer to Configurations and settings and All DJL configuration options.

%%writefile serving.properties
engine=MPI
option.model_id=tiiuae/falcon-7b-instruct
option.trust_remote_code=true
option.tensor_parallel_degree=1
option.max_rolling_batch_size=32
option.rolling_batch=auto
option.output_formatter=jsonlines
option.paged_attention=false
option.enable_streaming=true

To create the SageMaker model, retrieve the container image URI:

image_uri = image_uris.retrieve(
framework=”djl-deepspeed”,
region=sess.boto_session.region_name,
version=”0.23.0″
)

Use the SageMaker Python SDK to create the SageMaker model and deploy it to a SageMaker real-time endpoint using the deploy method:

instance_type = “ml.g5.2xlarge”
endpoint_name = sagemaker.utils.name_from_base(“lmi-model-falcon-7b”)

model = Model(sagemaker_session=sess,
image_uri=image_uri,
model_data=code_artifact,
role=role)

model.deploy(
initial_instance_count=1,
instance_type=instance_type,
endpoint_name=endpoint_name,
container_startup_health_check_timeout=900
)

When the endpoint is in service, you can use the InvokeEndpointWithResponseStream API call to invoke the model. This API allows the model to respond as a stream of parts of the full response payload. This enables models to respond with responses of larger size and enables faster-time-to-first-byte for models where there is a significant difference between the generation of the first and last byte of the response.
The response content type shown in x-amzn-sagemaker-content-type for the LMI container is application/jsonlines as specified in the model properties configuration. Because it’s part of the common data formats supported for inference, we can use the default deserializer provided by the SageMaker Python SDK to deserialize the JSON lines data. We create a helper LineIterator class to parse the response stream received from the inference request:

class LineIterator:
“””
A helper class for parsing the byte stream input.

The output of the model will be in the following format:
“`
b'{“outputs”: [” a”]}n’
b'{“outputs”: [” challenging”]}n’
b'{“outputs”: [” problem”]}n’

“`

While usually each PayloadPart event from the event stream will contain a byte array
with a full json, this is not guaranteed and some of the json objects may be split across
PayloadPart events. For example:
“`
{‘PayloadPart’: {‘Bytes’: b'{“outputs”: ‘}}
{‘PayloadPart’: {‘Bytes’: b'[” problem”]}n’}}
“`

This class accounts for this by concatenating bytes written via the ‘write’ function
and then exposing a method which will return lines (ending with a ‘n’ character) within
the buffer via the ‘scan_lines’ function. It maintains the position of the last read
position to ensure that previous bytes are not exposed again.
“””

def __init__(self, stream):
self.byte_iterator = iter(stream)
self.buffer = io.BytesIO()
self.read_pos = 0

def __iter__(self):
return self

def __next__(self):
while True:
self.buffer.seek(self.read_pos)
line = self.buffer.readline()
if line and line[-1] == ord(‘n’):
self.read_pos += len(line)
return line[:-1]
try:
chunk = next(self.byte_iterator)
except StopIteration:
if self.read_pos < self.buffer.getbuffer().nbytes:
continue
raise
if ‘PayloadPart’ not in chunk:
print(‘Unknown event type:’ + chunk)
continue
self.buffer.seek(0, io.SEEK_END)
self.buffer.write(chunk[‘PayloadPart’][‘Bytes’])

With the class in the preceding code, each time a response is streamed, it will return a binary string (for example, b'{“outputs”: [” a”]}n’) that can be deserialized again into a Python dictionary using JSON package. We can use the following code to iterate through each streamed line of text and return text response:

body = {“inputs”: “what is life”, “parameters”: {“max_new_tokens”:400}}
resp = smr.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=json.dumps(body), ContentType=”application/json”)
event_stream = resp[‘Body’]

for line in LineIterator(event_stream):
resp = json.loads(line)
print(resp.get(“outputs”)[0], end=”)

The following screenshot shows what it would look like if you invoked the model through the SageMaker notebook using an LMI container.

Option 2: Implement a chatbot using a Hugging Face TGI container
In the previous section, you saw how to deploy the Falcon-7B-Instruct model using an LMI container. In this section, we show how to do the same using a Hugging Face Text Generation Inference (TGI) container on SageMaker. TGI is an open source, purpose-built solution for deploying LLMs. It incorporates optimizations including tensor parallelism for faster multi-GPU inference, dynamic batching to boost overall throughput, and optimized transformers code using flash-attention for popular model architectures including BLOOM, T5, GPT-NeoX, StarCoder, and LLaMa.
TGI deep learning containers support token streaming using Server-Sent Events (SSE). With token streaming, the server can start answering after the first prefill pass directly, without waiting for all the generation to be done. For extremely long queries, this means clients can start to see something happening orders of magnitude before the work is done. The following diagram shows a high-level end-to-end request/response workflow for hosting LLMs on a SageMaker endpoint using the TGI container.

To deploy the Falcon-7B-Instruct model on a SageMaker endpoint, we use the HuggingFaceModel class from the SageMaker Python SDK. We start by setting our parameters as follows:

hf_model_id = “tiiuae/falcon-7b-instruct” # model id from huggingface.co/models
number_of_gpus = 1 # number of gpus to use for inference and tensor parallelism
health_check_timeout = 300 # Increase the timeout for the health check to 5 minutes for downloading the model
instance_type = “ml.g5.2xlarge” # instance type to use for deployment

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class with image_uri pointing to the image. To retrieve the new Hugging Face LLM DLC in SageMaker, we can use the get_huggingface_llm_image_uri method provided by the SageMaker SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified backend, session, Region, and version. For more details on the available versions, refer to HuggingFace Text Generation Inference Containers.

llm_image = get_huggingface_llm_image_uri(
“huggingface”,
version=”0.9.3″
)

We then create the HuggingFaceModel and deploy it to SageMaker using the deploy method:

endpoint_name = sagemaker.utils.name_from_base(“tgi-model-falcon-7b”)
llm_model = HuggingFaceModel(
role=role,
image_uri=llm_image,
env={
‘HF_MODEL_ID’: hf_model_id,
# ‘HF_MODEL_QUANTIZE’: “bitsandbytes”, # comment in to quantize
‘SM_NUM_GPUS’: str(number_of_gpus),
‘MAX_INPUT_LENGTH’: “1900”, # Max length of input text
‘MAX_TOTAL_TOKENS’: “2048”, # Max length of the generation (including input text)
}
)

llm = llm_model.deploy(
initial_instance_count=1,
instance_type=instance_type,
container_startup_health_check_timeout=health_check_timeout,
endpoint_name=endpoint_name,
)

The main difference compared to the LMI container is that you enable response streaming when you invoke the endpoint by supplying stream=true as part of the invocation request payload. The following code is an example of the payload used to invoke the TGI container with streaming:

body = {
“inputs”:”tell me one sentence”,
“parameters”:{
“max_new_tokens”:400,
“return_full_text”: False
},
“stream”: True
}

Then you can invoke the endpoint and receive a streamed response using the following command:

from sagemaker.base_deserializers import StreamDeserializer

llm.deserializer=StreamDeserializer()
resp = smr.invoke_endpoint_with_response_stream(EndpointName=llm.endpoint_name, Body=json.dumps(body), ContentType=’application/json’)

The response content type shown in x-amzn-sagemaker-content-type for the TGI container is text/event-stream. We use StreamDeserializer to deserialize the response into the EventStream class and parse the response body using the same LineIterator class as that used in the LMI container section.
Note that the streamed response from the TGI containers will return a binary string (for example, b`data:{“token”: {“text”: ” sometext”}}`), which can be deserialized again into a Python dictionary using a JSON package. We can use the following code to iterate through each streamed line of text and return a text response:

event_stream = resp[‘Body’]
start_json = b'{‘
for line in LineIterator(event_stream):
if line != b” and start_json in line:
data = json.loads(line[line.find(start_json):].decode(‘utf-8’))
if data[‘token’][‘text’] != stop_token:
print(data[‘token’][‘text’],end=”)

The following screenshot shows what it would look like if you invoked the model through the SageMaker notebook using a TGI container.

Run the chatbot app on SageMaker Studio
In this use case, we build a dynamic chatbot on SageMaker Studio using Streamlit, which invokes the Falcon-7B-Instruct model hosted on a SageMaker real-time endpoint to provide streaming responses. First, you can test that the streaming responses work in the notebook as shown in the previous section. Then, you can set up the Streamlit application in the SageMaker Studio JupyterServer terminal and access the chatbot UI from your browser by completing the following steps:

Open a system terminal in SageMaker Studio.
On the top menu of the SageMaker Studio console, choose File, then New, then Terminal.
Install the required Python packages that are specified in the requirements.txt file:

$ pip install -r requirements.txt

Set up the environment variable with the endpoint name deployed in your account:

$ export endpoint_name=<Falcon-7B-instruct endpoint name deployed in your account>

Launch the Streamlit app from the streamlit_chatbot_<LMI or TGI>.py file, which will automatically update the endpoint names in the script based on the environment variable that was set up earlier:

$ streamlit run streamlit_chatbot_LMI.py –server.port 6006

To access the Streamlit UI, copy your SageMaker Studio URL to another tab in your browser and replace lab? with proxy/[PORT NUMBER]/. Because we specified the server port to 6006, the URL should look as follows:

https://<domain ID>.studio.<region>.sagemaker.aws/jupyter/default/proxy/6006/

Replace the domain ID and Region in the preceding URL with your account and Region to access the chatbot UI. You can find some suggested prompts in the left pane to get started.
The following demo shows how response streaming revolutionizes the user experience. It can make interactions feel fluid and responsive, ultimately enhancing user satisfaction and engagement. Refer to the GitHub repo for more details of the chatbot implementation.

Clean up
When you’re done testing the models, as a best practice, delete the endpoint to save costs if the endpoint is no longer required:

# – Delete the end point
sm_client.delete_endpoint(EndpointName=endpoint_name)

Conclusion
In this post, we provided an overview of building applications with generative AI, the challenges, and how SageMaker real-time response streaming helps you address these challenges. We showcased how to build a chatbot application to deploy the Falcon-7B-Instruct model to use response streaming using both SageMaker LMI and HuggingFace TGI containers using an example available on GitHub.
Start building your own cutting-edge streaming applications with LLMs and SageMaker today! Reach out to us for expert guidance and unlock the potential of large model streaming for your projects.

About the Authors
Raghu Ramesha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.
Abhi Shivaditya is a Senior Solutions Architect at AWS, working with strategic global enterprise organizations to facilitate the adoption of AWS services in areas such as artificial intelligence, distributed computing, networking, and storage. His expertise lies in deep learning in the domains of natural language processing (NLP) and computer vision. Abhi assists customers in deploying high-performance machine learning models efficiently within the AWS ecosystem.
Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.
Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers build solutions using state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing ML solutions with best practices. In her spare time, she loves to explore nature and spend time with family and friends.
Sam Edwards, is a Cloud Engineer (AI/ML) at AWS Sydney specialized in machine learning and Amazon SageMaker. He is passionate about helping customers solve issues related to machine learning workflows and creating new solutions for them. Outside of work, he enjoys playing racquet sports and traveling.
James Sanders is a Senior Software Engineer at Amazon Web Services. He works on the real-time inference platform for Amazon SageMaker.

<