Move Beyond Chain-of-Thought with Chain-of-Draft on Amazon Bedrock

As organizations scale their generative AI implementations, the critical challenge of balancing quality, cost, and latency becomes increasingly complex. With inference costs dominating 70–90% of large language model (LLM) operational expenses, and verbose prompting strategies inflating token volume by 3–5x, organizations are actively seeking more efficient approaches to model interaction. Traditional prompting methods, while effective, often create unnecessary overhead that impacts both cost efficiency and response times.
This post explores Chain-of-Draft (CoD), an innovative prompting technique introduced in a Zoom AI Research paper Chain of Draft: Thinking Faster by Writing Less, that revolutionizes how models approach reasoning tasks. While Chain-of-Thought (CoT) prompting has been the go-to method for enhancing model reasoning, CoD offers a more efficient alternative that mirrors human problem-solving patterns—using concise, high-signal thinking steps rather than verbose explanations.
Using Amazon Bedrock and AWS Lambda, we demonstrate a practical implementation of CoD that can achieve remarkable efficiency gains: up to 75% reduction in token usage and over 78% decrease in latency, all while maintaining the accuracy levels of traditional CoT approaches. Through detailed examples, code samples, and performance metrics, we walk through deploying CoD in an AWS environment and measuring its impact on AI implementations. This approach not only optimizes costs but also enhances the overall user experience through faster response times.

Understanding Chain-of-Thought prompting
Chain-of-Thought (CoT) prompting is a technique that guides large language models to reason through problems step by step, rather than jumping directly to an answer. This method has proven particularly effective for complex tasks such as logical puzzles, mathematical problems, and common-sense reasoning scenarios. By mimicking human problem-solving patterns, CoT helps models break down complex problems into manageable steps, improving both accuracy and transparency.
Example of CoT prompting:
Question: If there are 5 apples and you eat 2 apples, how many apples remain?
CoT response: Start with 5 apples. I eat 2 apples. Subtract 2 from 5. 5 – 2 = 3 apples remaining.
However, as the example above shows, this approach comes with some drawbacks in production environments. The verbose nature of CoT responses leads to increased token usage and higher costs. The extended processing time required for generating detailed explanations results in higher latency, making it in some cases less suitable for real-time applications. Additionally, the detailed outputs can complicate downstream processing and integration with other systems.
Introducing Chain-of-Draft prompting
Chain-of-Draft (CoD) is a novel prompting technique that aims to reduce verbosity by limiting the number of words used in each reasoning step, focusing only on the essential calculations or transformations needed to progress, while significantly reducing token usage and inference latency. CoD draws inspiration from how humans solve problems with brief mental notes rather than verbose explanations—encouraging LLMs to generate compact, high-signal reasoning steps.
The key innovation of CoD lies in its constraint: each reasoning step is limited to five words or less. This limitation forces the model to focus on essential logical components while minimizing unnecessary verbosity. For instance, when solving a mathematical word problem, instead of generating full sentences explaining each step, CoD produces concise numerical operations and key logical markers.
Consider this example:
Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A CoT response might include several sentences explaining the reasoning process like, “Jason had 20 lollipops. He gave some to Denny and now has 12 left. So he gave away 8.”
In contrast, a CoD response would simply state “Start: 20, End: 12, 20 – 12 = 8.”
This minimalist approach achieves the same logical reasoning while using significantly fewer tokens.
Why CoD works
The key idea behind CoD is that most reasoning chains contain high redundancy. By distilling steps to their semantic core, CoD helps the model focus on the logical structure of the task rather than language fluency. This results in lower inference latency due to shorter outputs, reduced token cost from minimized generation and cleaner output for downstream parsing or automation.
This minimalism is achieved without sacrificing accuracy. In fact, according to the original Zoom AI paper, CoD “achieved 91.4% accuracy on GSM8K (vs. 95.3% for CoT), while reducing output tokens by up to 92.1%, and cutting latency nearly in half in several models tested.”
Under the hood, the CoD technique uses natural language prompts that instruct the model to “think step by step” while explicitly limiting the length of each reasoning step: “Only keep a minimum draft for each thinking step, with 5 words at most.”
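To make this concrete, here is a minimal sketch of how such a CoD instruction can be appended to a task question before sending it to a model. The helper name and example question are ours, for illustration only; they are not from the paper's code.

# Minimal sketch: wrap any question with the Chain-of-Draft instruction.
# The helper and example question are illustrative, not from the paper.
COD_INSTRUCTION = (
    "Think step by step to answer the question, but only keep a minimum draft "
    "for each thinking step, with 5 words at most. "
    "Return the answer at the end of the response after separator ###."
)

def build_cod_prompt(question: str) -> str:
    """Append the CoD drafting constraint to a task question."""
    return f"{question}\n{COD_INSTRUCTION}"

print(build_cod_prompt(
    "Jason had 20 lollipops. He gave Denny some. Now he has 12. How many did he give away?"
))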
The researchers found that models like GPT-4, Claude, and Cohere Command R+ performed especially well under these constraints, particularly when using few-shot examples to demonstrate the concise reasoning pattern.

Beyond arithmetic tasks, CoD has demonstrated strong performance in commonsense reasoning tasks. In the original Zoom AI paper, the authors evaluated CoD using big-bench benchmarks, specifically focused on date understanding and sports understanding tasks. The same system prompts were used as in arithmetic evaluations, maintaining consistency across experiments. The results revealed that CoD not only significantly reduces token generation and latency, but in several cases, outperforms CoT in accuracy—especially when verbose output isn’t necessary.
One notable finding was with a large language model on the sports understanding task: CoT produced long, verbose responses with an average of 172.5 output tokens, while CoD reduced this to 31.3 tokens, achieving an ~82% reduction. Interestingly, accuracy improved slightly, demonstrating that CoD can be more effective with fewer words.
Here’s a snapshot from the original paper showing the evaluation across two LLMs:

| Model | Prompt | Accuracy | Tokens | Latency |
|---|---|---|---|---|
| LLM-1 | Standard | 72.60% | 5.2 | 0.6 s |
| LLM-1 | Chain-of-Thought | 90.20% | 75.7 | 1.7 s |
| LLM-1 | Chain-of-Draft | 88.10% | 30.2 | 1.3 s |
| LLM-2 | Standard | 84.30% | 5.2 | 1.0 s |
| LLM-2 | Chain-of-Thought | 87.00% | 172.5 | 3.2 s |
| LLM-2 | Chain-of-Draft | 89.70% | 31.3 | 1.4 s |

Table 1. Date understanding evaluation results. (Chain of Draft: Thinking Faster by Writing Less)
These results further validate CoD’s value in real-world reasoning scenarios, showing that models can reason effectively with fewer, smarter tokens. The implication for production use is clear: faster responses and lower cost, without a trade-off in quality.

In the next section, we show how we implemented this prompting strategy using Amazon Bedrock and AWS Lambda, and how CoD compares to CoT across foundation models in real-world conditions.
Implementation and evaluation on AWS
To evaluate the efficiency of CoD prompting techniques, we run a test in Amazon Bedrock and solve the “Red, Blue, and Green Balls” puzzle using an LLM.
The Puzzle: You have three boxes. Each box contains three balls, but the balls can be red, blue, or green. Box 1 is labelled “Red Balls Only.” Box 2 is labelled “Blue Balls Only.” Box 3 is labelled “Red and Blue Balls Only.” The labels on the boxes are all incorrect. The task is that you must determine the contents of each box, knowing that all labels are incorrect. You can only take a single ball from one box and observe its color. Then you must deduce the contents of all three boxes.
We chose this puzzle because solving it requires a measurable number of tokens, as the problem needs to be broken down into several logical steps, each requiring the LLM to process and retain information. The LLM needs to handle “if-then” statements and consider different possibilities leading to logical reasoning. The LLM also needs to maintain the context of the puzzle throughout the reasoning process, and lastly, the LLM needs to understand the symbols and relationships between the colors, labels, and balls.
Prerequisites
To test and compare the prompting techniques in Amazon Bedrock, verify you have the following prerequisites:

AWS account with permission to create and execute Lambda functions
Amazon Bedrock access enabled in your AWS Region (for example, us-east-1), along with model access for the models you want to test (for example, Model-1 and Model-2); select any models of your choice
AWS IAM role for the Lambda function execution
Permissions to invoke Amazon Bedrock models (bedrock:Converse)
Permissions to put custom metrics in Amazon CloudWatch (cloudwatch:PutMetricData)
(Optional) CloudWatch Logs permissions for logging
Necessary Python libraries (boto3), included in the AWS Lambda runtime environment for Python 3.9 or later

Evaluation with Amazon Bedrock Converse API
We start by creating a Python Lambda function designed to interact with models using Amazon Bedrock to solve the puzzle. This AWS Lambda function uses the Amazon Bedrock Converse API, which provides a unified, consistent interface to interact with various foundation models. The Converse API simplifies sending conversational messages to models and receiving their replies, supporting multi-turn dialogue and advanced features while managing AWS authentication and infrastructure. The Lambda function initializes clients for Amazon Bedrock Runtime and CloudWatch, sends a static puzzle prompt as a user message to the Converse API, retrieves the response text, and calculates latency and approximate token usage for both input and output. These metrics are published to CloudWatch, and relevant logs are recorded. Finally, the function returns the model’s answer along with input/output token counts. Errors are logged and returned with an appropriate HTTP error code.
The Lambda function
import json
import time
import logging

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch")

MODEL_ID = "model1-id"  # Replace with the actual Model 1 ID
PROMPT = (
    "You have three boxes. Each box contains three balls, but the balls can be red, blue, or green. "
    "Box 1 is labeled as 'Red Balls Only'. Box 2 is labeled 'Blue Balls Only'. "
    "Box 3 is labeled 'Red and Blue Balls Only'. The labels on the boxes are all incorrect. "
    "The Task: You must determine the contents of each box, knowing that all labels are incorrect. "
    "You can only take a single ball from one box and observe its color. "
    "Then you must deduce the contents of all three boxes. "
    "Think step by step to answer the question, but only keep a minimum draft for each thinking step, with 5 words at most. "
    "Return the answer at the end of the response after separator ###."
)


def lambda_handler(event, context):
    conversation = [{"role": "user", "content": [{"text": PROMPT}]}]
    start_time = time.time()
    try:
        response = bedrock.converse(
            modelId=MODEL_ID,
            messages=conversation,
            inferenceConfig={"maxTokens": 2000, "temperature": 0.7},
        )
        response_text = response["output"]["message"]["content"][0]["text"]
        latency = time.time() - start_time

        # Approximate token counts by whitespace splitting; the Converse API
        # also returns exact counts in response["usage"] if you prefer those.
        input_tokens = len(PROMPT.split())
        output_tokens = len(response_text.split())

        cloudwatch.put_metric_data(
            Namespace="ChainOfDraft",
            MetricData=[
                {"MetricName": "Latency", "Value": latency, "Unit": "Seconds"},
                {"MetricName": "TokensUsed", "Value": input_tokens + output_tokens, "Unit": "Count"},
            ],
        )

        logger.info({
            "request_id": context.aws_request_id,
            "latency_seconds": round(latency, 2),
            "total_tokens": input_tokens + output_tokens,
        })

        return {
            "statusCode": 200,
            "body": json.dumps({
                "response": response_text,
                "input_tokens": input_tokens,
                "output_tokens": output_tokens,
                "metrics": {
                    "latency_seconds": round(latency, 2),
                    "total_tokens": input_tokens + output_tokens,
                },
            }),
        }

    except ClientError as e:
        logger.error(f"AWS service error: {e}")
        return {"statusCode": 500, "body": json.dumps("Service error occurred")}

    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        return {"statusCode": 500, "body": json.dumps(f"Internal error occurred: {e}")}
If you’re using Model 2, change the MODEL_ID in the above code to the Model 2 ID. The rest of the code remains the same.
Testing
Here are the three prompts used with the models to test the Lambda function. Change the PROMPT in the Lambda function to test each technique, or use the comparison sketch shown after the prompts.
Standard prompt:
“You have three boxes. Each box contains three balls, but the balls can be red, blue, or green. Box 1 is labelled as ‘Red Balls Only’. Box 2 is labelled ‘Blue Balls Only’. Box 3 is labelled ‘Red and Blue Balls Only’. The labels on the boxes are all incorrect. The Task: You must determine the contents of each box, knowing that all labels are incorrect. You can only take a single ball from one box and observe its color. Then you must deduce the contents of all three boxes. Answer the question directly. Do not return any preamble explanation or reasoning.”
Chain-of-Thought prompt:
“You have three boxes. Each box contains three balls, but the balls can be red, blue, or green. Box 1 is labelled as ‘Red Balls Only’. Box 2 is labelled ‘Blue Balls Only’. Box 3 is labelled ‘Red and Blue Balls Only’. The labels on the boxes are all incorrect. The Task: You must determine the contents of each box, knowing that all labels are incorrect. You can only take a single ball from one box and observe its color. Then you must deduce the contents of all three boxes. Think step by step to answer the question. Return the answer at the end of the response after separator.”
Chain-of-Draft prompt:
“You have three boxes. Each box contains three balls, but the balls can be red, blue, or green. Box 1 is labelled as ‘Red Balls Only’. Box 2 is labelled ‘Blue Balls Only’. Box 3 is labelled ‘Red and Blue Balls Only’. The labels on the boxes are all incorrect. The Task: You must determine the contents of each box, knowing that all labels are incorrect. You can only take a single ball from one box and observe its color. Then you must deduce the contents of all three boxes. Think step by step to answer the question but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after separator.”
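Rather than editing the Lambda function for every run, one option is to compare the three prompts directly with the Converse API from a notebook or script, reading the exact token counts that the API reports in its usage field. The following is a minimal sketch under that assumption; replace the placeholder strings with the full prompts above and the model ID you enabled.

# Comparison sketch: run all three prompting techniques against one model.
import time
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "model1-id"  # Replace with the model ID you enabled

PROMPTS = {
    "standard": "<standard prompt from above>",
    "chain_of_thought": "<CoT prompt from above>",
    "chain_of_draft": "<CoD prompt from above>",
}

for name, prompt in PROMPTS.items():
    start = time.time()
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 2000, "temperature": 0.7},
    )
    elapsed = time.time() - start
    usage = response.get("usage", {})  # exact token counts reported by the API
    print(f"{name}: {usage.get('inputTokens')} in / {usage.get('outputTokens')} out, {elapsed:.2f}s")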
Results
Testing the Lambda function with the preceding prompts across the two models produced the following results:

| Model | Prompt technique | Input tokens | Output tokens | Total tokens | Token reduction (CoD vs. CoT) | Latency (seconds) | Latency reduction (CoD vs. CoT) |
|---|---|---|---|---|---|---|---|
| Model-1 | Standard prompt | 102 | 23 | 125 | – | 0.8 | – |
| Model-1 | Chain-of-Thought | 109 | 241 | 350 | – | 3.28 | – |
| Model-1 | Chain-of-Draft | 123 | 93 | 216 | ((350-216)/350) × 100 ≈ 38% | 1.58 | ((3.28-1.58)/3.28) × 100 ≈ 52% |
| Model-2 | Standard prompt | 102 | 17 | 119 | – | 0.6 | – |
| Model-2 | Chain-of-Thought | 109 | 492 | 601 | – | 3.81 | – |
| Model-2 | Chain-of-Draft | 123 | 19 | 142 | ((601-142)/601) × 100 ≈ 76% | 0.79 | ((3.81-0.79)/3.81) × 100 ≈ 79% |

Table 2: Results of Testing with Standard prompt, CoD prompt and CoT prompt across the models
The comparison shows that Chain-of-Draft (CoD) is far more efficient than Chain-of-Thought (CoT) across both models. For Model-1, CoD reduces total token usage from 350 to 216 (a 38% reduction) and cuts latency from 3.28 to 1.58 seconds (a 52% reduction). The gains are even greater for Model-2, where CoD lowers tokens from 601 to 142 (a 76% reduction) and latency from 3.81 to 0.79 seconds (a 79% reduction). Overall, CoD delivers significant improvements in speed and token efficiency compared to CoT, with especially strong results on Model-2.
When to avoid using CoD
While CoD prompting offers compelling benefits in terms of efficiency and performance, it’s not universally applicable. There are scenarios where traditional CoT or even more verbose reasoning may be more effective or appropriate. Based on our experimentation and findings from the original research, here are some key considerations:

Zero-shot or prompt-only use cases: CoD performs best when paired with strong few-shot examples. In zero-shot scenarios—where no reasoning patterns are provided—models often struggle to adopt the minimalist drafting style on their own. This can lead to lower accuracy or incomplete reasoning steps.
Tasks requiring high interpretability: For use cases like legal or medical document review, audit trails, or regulated environments, verbose reasoning may be essential. In such cases, CoT’s more transparent, step-by-step explanations provide better traceability and trust.
Small language models: CoD underperformed on models with fewer than 3 billion parameters. These models lack the instruction-following fidelity and reasoning power needed to execute CoD-style prompts effectively. CoT may yield better results in these cases.
Creative or open-ended tasks: Tasks that benefit from elaboration—like writing, ideation, or user-facing conversations—may lose value if too condensed. CoD is best suited for structured reasoning, logic, and deterministic tasks where brevity improves performance.

In short, CoD shines when the goal is efficient reasoning with minimal overhead—but careful prompt design, model selection, and task fit are key to success.
Conclusion and key takeaways
CoD prompting emerges as an efficient technique for organizations seeking to optimize their generative AI implementations. By encouraging language models to reason in concise, focused steps, CoD achieves remarkable improvements in both performance and resource utilization. Our implementation using Amazon Bedrock and AWS Lambda demonstrated significant benefits in token usage and improvement in latency compared to traditional CoT prompting, while maintaining comparable accuracy across various foundation models and complex reasoning tasks. As AI continues to evolve, CoD represents a significant step towards more efficient and performant language models. It’s particularly valuable for structured reasoning tasks where speed and token efficiency are critical, though it’s not a one-size-fits-all solution. We encourage practitioners to explore CoD in their own AI workflows, leveraging its potential to reduce costs, improve response times, and enhance scalability. The future of AI lies in smarter, more efficient reasoning approaches, and CoD prompting is at the forefront of this transformation.
To learn more about prompt engineering and CoD technique, refer to the following resources:

What is prompt engineering?
Chain of Draft: Thinking Faster by Writing Less
Prompt Engineering concepts

About the authors
Ahmed Raafat is a Senior Manager at AWS leading the AI/ML Specialist team in the UK & Ireland, with over 20 years of technology experience helping major companies transform through AI and cloud technologies. As a trusted C-suite advisor and thought leader, he guides organizations in AI strategy and adoption, helping them use emerging technologies for innovation and growth.
Kiranpreet Chawla is a Solutions Architect at Amazon Web Services, leveraging over 15 years of diverse technology experience to drive cloud and AI transformations. Kiranpreet’s expertise spans from cloud modernization to AI/ML implementations, enabling her to provide comprehensive guidance to customers across various industries.

Deploy Mistral AI’s Voxtral on Amazon SageMaker AI

Mistral AI’s Voxtral models combine text and audio processing capabilities in a single framework. The Voxtral family includes two distinct variants designed for different use cases and resource requirements. The Voxtral-Mini-3B-2507 is a compact 3-billion-parameter model optimized for efficient audio transcription and basic multimodal understanding, making it ideal for applications where speed and resource efficiency are priorities. The Voxtral-Small-24B-2507 is a 24-billion-parameter model built on the Mistral Small 3 backbone that supports advanced chat capabilities, function calling directly from voice input, and complex audio-text intelligence, perfect for enterprise applications requiring nuanced understanding and multilingual audio processing. Both models support long-form audio context of up to 30–40 minutes, feature automatic language detection, and maintain a 32,000-token context length. They are released under the Apache 2.0 license, making them readily available for both commercial and research applications.
Voxtral models feature multimodal intelligence that processes spoken and written communication within a unified pipeline, alleviating the need for separate transcription and processing stages. The models demonstrate advanced audio understanding by extracting context and sentiment directly from audio inputs and can handle multiple audio files within single conversation threads. Voxtral Small includes function calling capabilities that convert audio inputs into executable tool calls. These capabilities enable applications such as contextual voice assistants, automated meeting transcription with insight extraction, intelligent call processing for customer service, accessibility tools, and multilingual communication systems for global organizations.
In this post, we demonstrate hosting Voxtral models on Amazon SageMaker AI endpoints using vLLM and the Bring Your Own Container (BYOC) approach. vLLM is a high-performance library for serving large language models (LLMs) that features paged attention for improved memory management and tensor parallelism for distributing models across multiple GPUs. The BYOC capability of SageMaker supports deployment with custom container images, providing precise version control for vLLM 0.10.0+ compatibility, optimization flexibility for Voxtral’s multimodal processing requirements (including specialized audio libraries and custom memory management), and support for both Voxtral-Mini and Voxtral-Small models through simple configuration updates.
Solution overview
In this solution, the SageMaker notebook environment serves as the central orchestration point for the entire deployment process. It manages the building and pushing of custom Docker images to Amazon Elastic Container Registry (Amazon ECR), handles model configuration and deployment workflows, and provides testing and validation capabilities to facilitate successful model deployment.
A key part of this solution is a custom Docker container that builds on the official vLLM server by adding specialized audio processing libraries (librosa, soundfile, pydub) and mistral_common for Voxtral tokenization, with everything set up to work seamlessly with the SageMaker BYOC approach. Amazon ECR provides secure storage and scalable distribution of this container image, integrating seamlessly with the SageMaker deployment mechanisms. The SageMaker inference endpoint serves as the production runtime where the Voxtral model is hosted, offering automatic scaling and load balancing with recommended instance types of ml.g6.4xlarge for Voxtral-Mini and ml.g6.12xlarge for Voxtral-Small deployments. Amazon Simple Storage Service (Amazon S3) completes the architecture by storing three critical files from our vLLM-BYOC implementation: the custom inference handler (model.py), model configuration (serving.properties), and dependencies (requirements.txt), creating a modular approach that separates configuration from container images to enable flexible model updates and configuration changes without container rebuilds, so teams can seamlessly switch between Voxtral-Mini and Voxtral-Small deployments by simply updating the serving.properties file.
The following diagram illustrates the solution architecture.

A three-step workflow diagram showing how to deploy Voxtral models on Amazon SageMaker using custom Docker containers, S3 storage, and multi-GPU endpoints.

The solution supports multiple use case patterns for different organizational needs. Text-only processing uses the standard chat completion API for traditional conversational AI where audio processing isn’t required. Transcription-only mode provides accurate audio file transcription, ideal for meeting notes or searchable audio archives. More sophisticated applications combine audio and text intelligence, where audio provides context while text delivers specific instructions, enabling voice-controlled applications with written clarifications. The advanced pattern involves function calling from audio inputs, where spoken commands directly trigger automated actions. For example, saying “Calculate the square root of 144” automatically executes the calculator tool and returns results, creating hands-free workflows.
This post also demonstrates integrating the Voxtral model deployed on SageMaker with Strands Agents to build agentic applications with minimal code.
The following sections provide a complete implementation guide to get your Voxtral model running on SageMaker endpoints.
Prerequisites
To get started, you must have the following prerequisites:

The following software requirements:

vLLM >= 0.10.0
mistral_common >= 1.8.1

AWS account setup, including:

A SageMaker notebook using ml.m5.4xlarge with 100 GB storage.
AWS Identity and Access Management (IAM) permissions. Add the EC2InstanceProfileForImageBuilderECRContainerBuilds policy to the SageMaker execution role.
Service quotas: ml.g6.4xlarge (Voxtral-Mini) and ml.g6.12xlarge (Voxtral-Small) instances available. Refer to Requesting a quota increase to request a service quota increase in your account.

Deploy Voxtral models
Complete the following steps to quickly deploy and test Voxtral models:

Download the code from the GitHub repo:

git clone https://github.com/aws-samples/mistral-on-aws.git
cd "mistral-on-aws/Mistral Voxtral/Voxtral-vllm-byoc"

Build your container:

chmod +x build_and_push.sh
./build_and_push.sh

Configure your model in code/serving.properties:

To deploy Voxtral-Mini, use the following code:

option.model_id=mistralai/Voxtral-Mini-3B-2507
option.tensor_parallel_degree=1

To deploy Voxtral-Small, use the following code:

option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4

Open and run Voxtral-vLLM-BYOC-SageMaker.ipynb to deploy your endpoint and test with text, audio, and function calling capabilities.
Docker container configuration
The GitHub repo contains the full Dockerfile. The following code snippet highlights the key parts:

# Custom vLLM container for Voxtral model deployment on SageMaker
FROM --platform=linux/amd64 vllm/vllm-openai:latest

# Set environment variables for SageMaker model loading and caching
ENV MODEL_CACHE_DIR=/opt/ml/model
ENV TRANSFORMERS_CACHE=/tmp/transformers_cache
ENV HF_HOME=/tmp/hf_home
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn

# Install audio processing dependencies
RUN pip install --no-cache-dir \
    "mistral_common>=1.8.1" \
    "librosa>=0.10.2" \
    "soundfile>=0.12.1" \
    "pydub>=0.25.1"
This Dockerfile creates a specialized container that extends the official vLLM server with Voxtral-specific capabilities by adding essential audio processing libraries (mistral_common for tokenization, librosa/soundfile/pydub for audio handling) while configuring the proper SageMaker environment variables for model loading and caching. The approach separates infrastructure from business logic by keeping the container generic and allowing SageMaker to dynamically inject model-specific code (model.py and serving.properties) from Amazon S3 at runtime, enabling flexible deployment of different Voxtral variants without requiring container rebuilds.
Model configurations
The full model configurations are in the serving.properties file located in the code folder. The following code snippet highlights the key configurations:

# Model configuration
option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4
option.dtype=bfloat16
# Voxtral-specific settings (as per official documentation)
option.tokenizer_mode=mistral
option.config_format=mistral
option.load_format=mistral
option.trust_remote_code=true
# Audio processing (Voxtral specifications)
option.limit_mm_per_prompt=audio:8
option.mm_processor_kwargs={"audio_sampling_rate": 16000, "audio_max_length": 1800.0}
# Performance optimizations (vLLM v0.10.0+ features)
option.enable_chunked_prefill=true
option.enable_prefix_caching=true
option.use_v2_block_manager=true
This configuration file provides Voxtral-specific optimizations that follow Mistral’s official recommendations for vLLM server deployment, setting up proper tokenization modes, audio processing parameters (supporting up to eight audio files per prompt with 30-minute transcription capability), and using the latest vLLM v0.10.0+ performance features like chunked prefill and prefix caching. The modular design supports seamless switching between Voxtral-Mini and Voxtral-Small by simply changing the model_id and tensor_parallel_degree parameters, while maintaining optimal memory utilization and enabling advanced caching mechanisms for improved inference performance.
Custom inference handler
The full custom inference code is in the model.py file located in the code folder. The following code snippet highlights the key functions:

# FastAPI app for SageMaker compatibility
app = FastAPI(title="Voxtral vLLM Inference Server", version="1.1.0")
model_engine = None

# vLLM server initialization for Voxtral
def start_vllm_server():
    """Start the vLLM server with Voxtral-specific configuration."""
    config = load_serving_properties()

    cmd = [
        "vllm", "serve", config.get("option.model_id"),
        "--tokenizer-mode", "mistral",
        "--config-format", "mistral",
        "--tensor-parallel-size", config.get("option.tensor_parallel_degree"),
        "--host", "127.0.0.1",
        "--port", "8000",
    ]

    vllm_server_process = subprocess.Popen(cmd, env=vllm_env)
    server_ready = wait_for_server()
    return server_ready

@app.post("/invocations")
async def invoke_model(request: Request):
    """Handle chat, transcription, and function calling requests."""
    # Transcription requests
    if "transcription" in request_data:
        audio_source = request_data["transcription"]["audio"]
        return transcribe_audio(audio_source)

    # Chat requests with multimodal support
    messages = format_messages_for_openai(request_data["messages"])
    tools = request_data.get("tools")

    # Generate via the vLLM OpenAI-compatible client
    response = openai_client.chat.completions.create(
        model=model_config["model_id"],
        messages=messages,
        tools=tools if supports_function_calling() else None,
    )
    return response
This custom inference handler creates a FastAPI-based server that directly integrates with the vLLM server for optimal Voxtral performance. The handler processes multimodal content including base64-encoded audio and audio URLs, dynamically loads model configurations from the serving.properties file, and supports advanced features like function calling for Voxtral-Small deployments.
SageMaker deployment code
The Voxtral-vLLM-BYOC-SageMaker.ipynb notebook included in the Voxtral-vllm-byoc folder orchestrates the entire deployment process for both Voxtral models:

import boto3
import sagemaker
from sagemaker.model import Model

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = "your-s3-bucket"

# Upload model artifacts to S3
byoc_config_uri = sagemaker_session.upload_data(
    path="./code",
    bucket=bucket,
    key_prefix="voxtral-vllm-byoc/code"
)

# Configure custom container image
account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.Session().region_name
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/voxtral-vllm-byoc:latest"

# Create SageMaker model
voxtral_model = Model(
    image_uri=image_uri,
    model_data={
        "S3DataSource": {
            "S3Uri": f"{byoc_config_uri}/",
            "S3DataType": "S3Prefix",
            "CompressionType": "None"
        }
    },
    role=role,
    env={
        "MODEL_CACHE_DIR": "/opt/ml/model",
        "TRANSFORMERS_CACHE": "/tmp/transformers_cache",
        "SAGEMAKER_BIND_TO_PORT": "8080"
    }
)

# Deploy to endpoint
predictor = voxtral_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.12xlarge",  # For Voxtral-Small
    container_startup_health_check_timeout=1200,
    wait=True
)
Model use cases
The Voxtral models support various text and speech-to-text use cases, and the Voxtral-Small model supports tool use with voice input. Refer to the GitHub repository for the complete code. In this section, we provide code snippets for different use cases that the model supports.
Text-only
The following code shows a basic text-based conversation with the model. The user sends a text query and receives a structured response:

payload = {
    "messages": [
        {
            "role": "user",
            "content": "Hello! Can you tell me about the advantages of using vLLM for model inference?"
        }
    ],
    "max_tokens": 200,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)
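The exact response shape depends on how the custom handler serializes its output; because model.py returns the result of the vLLM OpenAI-compatible chat completion call, a reasonable assumption is an OpenAI-style JSON body. The following is a hedged sketch of pulling out the assistant text under that assumption; inspect the raw response from your deployment to confirm the keys.

# Hedged sketch: assumes an OpenAI-style chat completion dict.
# Adjust the keys if your deployed handler serializes the response differently.
def extract_text(response: dict) -> str:
    try:
        return response["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        # Fall back to the raw payload so you can inspect its actual structure
        return str(response)

print(extract_text(response))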
Transcription-only
The following example focuses on speech-to-text transcription by setting temperature to 0 for deterministic output. The model processes an audio file URL or a base64-encoded audio file, then returns the transcribed text without additional interpretation:

payload = {
    "transcription": {
        "audio": "https://audiocdn.frenchtoday.com/file/ft-public-files/audiobook-samples/AMPFE/AMP%20FE%20Ch%2002%20Story%20Slower.mp3",
        "language": "fr",
        "temperature": 0.0
    }
}
response = predictor.predict(payload)
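As noted above, the transcription request can also carry base64-encoded audio instead of a URL. The following sketch shows one way to prepare such a payload; the local file name is illustrative, and the exact field format the handler accepts should be confirmed against the sample notebook in the repository.

import base64

# Hedged sketch: encode a local audio file and pass it as base64 instead of a URL.
# The file name is illustrative; confirm the expected field format in the sample notebook.
with open("meeting_recording.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "transcription": {
        "audio": audio_b64,   # base64-encoded audio in place of an https:// URL
        "language": "en",
        "temperature": 0.0
    }
}
response = predictor.predict(payload)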
Text and audio understanding
The following code combines both text instructions and audio input for multimodal processing. The model can follow specific text commands while analyzing the provided audio file in one inference pass, enabling more complex interactions like guided transcription or audio analysis tasks:

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Can you summarise this audio file"
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3"
                }
            ]
        }
    ],
    "max_tokens": 300,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)
response = predictor.predict(payload)
Tool use
The following code showcases function calling capabilities, where the model can interpret voice commands and execute predefined tools. The example demonstrates weather queries through voice input, with the model automatically calling the appropriate function and returning structured results:

# Define weather tool configuration
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a specific location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use."
                }
            },
            "required": ["location", "format"]
        }
    }
}

# Mock weather function
def mock_weather(location, format="celsius"):
    """Always returns sunny weather at 25°C/77°F."""
    temp = 77 if format.lower() == "fahrenheit" else 25
    unit = "°F" if format.lower() == "fahrenheit" else "°C"
    return f"It's sunny in {location} with {temp}{unit}"

# Test payload with audio
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/patrickvonplaten/audio_samples/resolve/main/fn_calling.wav"
                }
            ]
        }
    ],
    "temperature": 0.2,
    "top_p": 0.95,
    "tools": [WEATHER_TOOL]
}
response = predictor.predict(payload)
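The mock_weather function above only runs once the model’s tool call is read back out of the response. The following is a hedged sketch of that step, assuming the handler returns an OpenAI-style message with tool_calls; verify the field names against the sample notebook before relying on them.

import json

# Hedged sketch: execute the tool requested by the model, assuming an
# OpenAI-style response with choices[0].message.tool_calls. Field names may
# differ in your deployment, so inspect the raw response first.
message = response["choices"][0]["message"]
for tool_call in message.get("tool_calls", []):
    if tool_call["function"]["name"] == "get_current_weather":
        args = json.loads(tool_call["function"]["arguments"])
        result = mock_weather(args["location"], args.get("format", "celsius"))
        print(f"Tool result: {result}")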
Strands Agents integration
The following example shows how to integrate Voxtral with the Strands framework to create intelligent agents capable of using multiple tools. The agent can automatically select and execute appropriate tools (such as calculator, file operations, or shell commands from Strands prebuilt tools) based on user queries, enabling complex multi-step workflows through natural language interaction:

# SageMaker integration with Strands Agents
from strands import Agent
from strands.models.sagemaker import SageMakerAIModel
from strands_tools import calculator, current_time, file_read, shell

model = SageMakerAIModel(
    endpoint_config={
        "endpoint_name": endpoint_name,
        "region_name": "us-west-2",
    },
    payload_config={
        "max_tokens": 1000,
        "temperature": 0.7,
        "stream": False,
    }
)

agent = Agent(model=model, tools=[calculator, current_time, file_read, shell])
response = agent("What is the square root of 12?")
Clean up
When you finish experimenting with this example, delete the SageMaker endpoints that you created in the notebook to avoid unnecessary costs:

# Delete the SageMaker endpoint to stop incurring charges
print(f"Deleting endpoint: {endpoint_name}")
predictor.delete_endpoint(delete_endpoint_config=True)
print("Endpoint deleted successfully")
Conclusion
In this post, we demonstrated how to successfully self-host Mistral’s open source Voxtral models on SageMaker using the BYOC approach. We’ve created a production-ready system that uses the latest vLLM framework and official Voxtral optimizations for both Mini and Small model variants. The solution supports the full spectrum of Voxtral capabilities, including text-only conversations, audio transcription, sophisticated multimodal understanding, and function calling directly from voice input. With this flexible architecture, you can switch between Voxtral-Mini and Voxtral-Small models through simple configuration updates without requiring container rebuilds. Take your multimodal AI applications to the next level by trying out the complete code from the GitHub repository to host the Voxtral model on SageMaker and start building your own voice-enabled applications. Explore Voxtral’s full potential by visiting Mistral’s official website to discover detailed capabilities, performance benchmarks, and technical specifications. Finally, explore the Strands Agents framework to seamlessly create agentic applications that can execute complex workflows.
About the authors
Ying Hou, PhD, is a Sr. Specialist Solution Architect for GenAI at AWS, where she collaborates with model providers to onboard the latest and most intelligent AI models onto AWS platforms. With deep expertise in Gen AI, ASR, computer vision, NLP, and time-series forecasting models, she works closely with customers to design and build cutting-edge ML and GenAI applications.

Enhance document analytics with Strands AI Agents for the GenAI IDP Ac …

Extracting structured information from unstructured data is a critical first step to unlocking business value. Our Generative AI Intelligent Document Processing (GenAI IDP) Accelerator has been at the forefront of this transformation, already having processed tens of millions of documents for hundreds of customers.
Although organizations can use intelligent document processing (IDP) solutions to digitize their documents by extracting structured data, the methods to efficiently analyze this processed data remain elusive. After documents are processed and structured, a new challenge emerges: how can businesses quickly analyze this wealth of information and unlock actionable insights?
To address this need, we are announcing Analytics Agent, a new feature that is seamlessly integrated into the GenAI IDP Accelerator. With this feature, users can perform advanced searches and complex analyses using natural language queries without SQL or data analysis expertise.
In this post, we discuss how non-technical users can use this tool to analyze and understand the documents they have processed at scale with natural language.
GenAI IDP Accelerator
The GenAI IDP Accelerator, an open source solution, helps organizations use generative AI to automatically extract information from various document types. The accelerator combines Amazon Bedrock and other AWS services, including AWS Lambda, AWS Step Functions, Amazon Simple Queue Service (Amazon SQS), and Amazon DynamoDB, to create a serverless system. The GenAI IDP Accelerator is designed to work at scale and can handle thousands of documents daily. It offers three processing patterns for users to build custom solutions for complex document processing workflows. The accelerator can be deployed using AWS CloudFormation templates, and users can start processing documents immediately through either the web interface or by uploading files directly to Amazon Simple Storage Service (Amazon S3). The accelerator consists of multiple modules like document classification, data extraction, assessment, summarization, and evaluation. To learn more about the GenAI IDP Accelerator, see Accelerate intelligent document processing with generative AI on AWS.
Now, using natural language queries through the Analytics Agent feature, you can extract valuable information to understand the performance of the solution. To access this feature, simply deploy the latest version of the GenAI IDP Accelerator and choose Agent Companion Chat in the navigation pane, as shown in the following screenshot (from accelerator version 0.4.7). Queries related to analytics automatically get routed to the Analytics Agent.

The Analytics Agent acts as an intelligent interface between business users and their processed document data. It can handle intricate queries that would typically require a skilled data scientist, making advanced analytics accessible to the average business user. For example, a healthcare provider could ask, “What percentage of insurance claims were denied last month? Of those, how many were due to incomplete documentation? Show me a trend of denial reasons over the past six months.” Or a tax accounting firm could ask, “Which of my clients are paying state tax in more than one state on their W2 forms?”
The following screenshot is an example of an analysis using the Analytics Agent feature through the Agent Companion Chat interface. A user in the accounting vertical queried “Make a histogram of gross earnings from all uploaded W2s in the last 180 days with 25 bins between $0 and $300,000,” and the agent analyzed data extracted from over 1,000 W2 forms in under a minute.

Analytics Agent
The Analytics Agent is built using Strands Agents, an open source SDK with a model-driven approach for building AI agents. The agent, using several tools, is designed to make working with enterprise data more intuitive by providing natural language to data and visualization conversion. The Analytics Agent workflow consists of the following steps:

The agent uses a database exploration tool if needed to understand data structures stored in Amazon Athena tables within the IDP solution. This is required because the tables within the IDP solution can have different schemas based on how users have configured the processing pipeline.
The agent converts natural language queries into optimized SQL queries compatible with the available databases and tables. These queries can scale to tables of arbitrary size.
The agent runs the SQL against Athena and stores query results in Amazon S3. These results can be thousands of rows long. It automatically fixes and reruns failed queries based on the error message generated by Athena (a minimal sketch of this step follows the list).
The agent securely transfers query results from Amazon S3 into an Amazon Bedrock AgentCore Code Interpreter sandbox.
The agent writes Python code designed to analyze the query results and generate charts or tables in a structured output compatible with the UI. The code is copied into the sandbox and is executed securely there.
Lastly, final visualizations are presented in the web interface for straightforward interpretation.
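To make the query-execution step concrete, the following is a minimal sketch (not the accelerator’s actual code) of running a generated SQL statement against Athena with boto3, storing results in Amazon S3, and surfacing the Athena error message so an agent could revise and rerun a failed query. The database name and output location are placeholders.

import time
import boto3

athena = boto3.client("athena")

def run_athena_query(sql: str, database: str = "idp_documents",
                     output_s3: str = "s3://my-idp-results/athena/") -> str:
    """Run a SQL query and return the S3 path of the results, or raise with Athena's error."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]

    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            return f"{output_s3}{qid}.csv"
        if state in ("FAILED", "CANCELLED"):
            # The agent feeds this reason back to the LLM to fix and rerun the query
            raise RuntimeError(status.get("StateChangeReason", "query failed"))
        time.sleep(1)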

The following diagram illustrates the workflow of the Analytics Agent.

Solution overview
The following architecture diagram illustrates the serverless Analytics Agent deployment and its integration with the existing IDP solution through the AWS AppSync API.

The Analytics Agent is deployed primarily within Lambda functions. When a user query is provided to the AppSync API from the IDP frontend, an ephemeral request handler Lambda function creates and stores a unique job ID in DynamoDB to track the asynchronous processing flow, and launches a long-running agent request processor Lambda function that instantiates a Strands agent and launches it. The frontend polls the job status and retrieves final results (including from prior jobs) from DynamoDB. The agent request processor Lambda function has AWS Identity and Access Management (IAM) permissions to access the IDP tables in Athena as well as to launch and execute an AgentCore Code Interpreter sandbox for more secure Python code execution.
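As a rough illustration of this flow (not the accelerator’s actual code), the ephemeral request handler might look like the following sketch. The DynamoDB table name, processor function name, and payload fields are placeholders for illustration.

import json
import time
import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
lambda_client = boto3.client("lambda")

# Placeholder resource names, for illustration only
JOBS_TABLE = "idp-analytics-jobs"
PROCESSOR_FUNCTION = "idp-agent-request-processor"

def handle_query(user_query: str) -> dict:
    """Create a job record and hand the query to the long-running agent processor."""
    job_id = str(uuid.uuid4())
    dynamodb.Table(JOBS_TABLE).put_item(Item={
        "job_id": job_id,
        "status": "PENDING",
        "query": user_query,
        "created_at": int(time.time()),
    })
    # Asynchronous invocation so the handler can return immediately
    lambda_client.invoke(
        FunctionName=PROCESSOR_FUNCTION,
        InvocationType="Event",
        Payload=json.dumps({"job_id": job_id, "query": user_query}),
    )
    return {"job_id": job_id, "status": "PENDING"}  # the frontend polls DynamoDB for results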
The architecture follows a security-first design:

Sandboxed execution – The Python code runs in AgentCore Code Interpreter, completely isolated from the rest of the AWS environment and the internet
Secure data transfer – Query results are transferred through Amazon S3 and AgentCore APIs, not through the context window of an LLM
Session management – AgentCore Code Interpreter sessions are properly managed and cleaned up after use
Minimal permissions – Each component requests only the necessary AWS permissions
Audit trail – The solution offers comprehensive logging and monitoring for security reviews

Intelligent document insights with the Analytics Agent
To demonstrate the capabilities of the Analytics Agent, we processed 10,000 documents from the RVL-CDIP dataset using the GenAI IDP Accelerator. The dataset, containing diverse document types including memos, letters, forms, and reports, was processed using Pattern 2 configuration to extract structured information including document type, sender, recipient, and department details. In the following sections, we walk through the details of a single sample user query.
Real-world query: Departmental memo analysis
A business user posed a straightforward question in natural language: “Which departments generate the most memos?” This seemingly simple query would traditionally require a data analyst to complete the following steps:

Obtain credentials and connect to an internal database
Understand the database schema by executing exploratory queries or reading internal documentation
Write complex SQL with proper Athena syntax
Execute and validate the query
Process results and create visualizations
Format findings for presentation

The Analytics Agent handled this entire workflow autonomously in under 60 seconds.
Generated visualization using the Analytics Agent
The following figure shows the visualization the agent generated based on a single natural language query.

The analysis revealed that Lorillard generated the most memos (11 documents), followed by INBIFO, Corporate Affairs, and Philip Morris departments (10 documents each). The visualization showed the distribution across major organizational units, with tobacco research and corporate departments dominating memo generation. If the user wants a different visualization style, they can quickly toggle through options like pie charts, line charts, and bar charts, or display the results as a table. We toggled the original bar chart to a doughnut chart for aesthetic purposes in this post.
Agent thought process
The agent’s transparent reasoning process reveals the comprehensive orchestration happening behind the scenes.

The agent first explored the database structure, identifying the document_sections_memo table and discovering the inference_result.department column containing the needed information.
The agent crafted an optimized Athena query with proper column quoting and null handling, which can be displayed by clicking “View Details” in the chat window:

After retrieving unique departments from the query results, the agent automatically performed the following actions:

Generated Python code to analyze and visualize the data
Copied the Python code and SQL query results into a secure AgentCore Code Interpreter sandbox
Executed the Python code within the sandbox, returning a JSON dictionary with chart data
Identified and fixed an issue with a NaN value in the data
Created a horizontal bar chart highlighting the top 15 departments
Formatted the output for seamless web display

The Python code it wrote to load the query results into sandbox memory and generate a plot for the frontend can be displayed by clicking “View Details” in the chat window (screenshot cropped for brevity):

Agent capabilities
This example showcases three transformative capabilities:

Autonomous problem-solving – The agent independently discovered the database schema, identified the correct table and columns, and handled data quality issues (null values) without human intervention. This means that the agent can work on different documents analyzed by the IDP solution, regardless of document type or IDP processing configurations.
Adaptive reasoning – When the agent detected null values in the initial visualization, it automatically corrected the issue by filtering the data and regenerating the chart, demonstrating self-correction capabilities.
End-to-end interpretability – The entire workflow, from natural language query to polished visualization, executed in 90 seconds with complete transparency. Users can review each decision the agent made through the detailed thought process log.

The Analytics Agent transforms processed document data into actionable intelligence, helping business users explore their document corpus with the same ease as asking a colleague a question. This democratization of data analysis makes sure valuable insights aren’t locked away behind technical barriers, and are immediately accessible to decision-makers across the organization.
How customers can use this feature
The power of this feature lies in its ability to democratize data analysis, turning business users into data analysts through the simple power of conversation. Customers can use this feature in the following use cases:

Instant business insights:

Ask complex questions in plain English, like “What percentage of invoices exceeded $50,000 last quarter?”
Get immediate visualizations of trends and patterns with queries like “How has the average value of invoices trended over the past 12 months?”
Make data-driven decisions without waiting for IT or data science teams with queries like “Show me which employees based out of the Seattle office submitted the most invoices.”

Risk and compliance monitoring:

Detect anomalies in real time with queries like “Show me all contracts missing mandatory clauses.”
Track compliance rates across document types.
Identify high-risk documents requiring immediate attention.

Operational excellence:

Monitor processing bottlenecks with queries like “Which document types have the longest processing times?”
Track accuracy rates across different document categories.
Optimize resource allocation based on volume patterns.

Customer experience enhancement:

Analyze customer-specific processing metrics with queries like “How close are we to using up our monthly processing allocation budget of $100 this month?”
Identify opportunities for process automation.
Track SLA compliance in real time with queries like “Which processed invoices don’t have an associated processed pay slip yet?”

Strategic planning:

Forecast processing volumes based on historical patterns with queries like “We are expecting our number of uploaded documents to increase 20% year over year. How many documents will we expect to process in the next five years?”
Identify seasonal trends and plan accordingly.
Track ROI metrics for document processing investments.
Make data-backed decisions for system scaling.

Best practices
Consider the following best practices when using the Analytics Agent:

Start broad – Begin with general questions before diving into specifics.
Be specific – Clearly state what information you’re looking for. Don’t be afraid to provide an entire paragraph describing what you need if necessary.
Use follow-up queries – Build on what you learned in previous questions to explore topics in depth. Chat messages sent in the Agent Companion Chat are stateful, enabling you to ask follow-up questions.
Check results – Verify visualizations make sense for your data, and read through the displayed agent thought process to validate the decisions it made.

Integration with external agentic AI systems
The Analytics Agent can be easily integrated into other agentic AI systems, such as Amazon Quick Suite, through the IDP Accelerator’s new Model Context Protocol (MCP) Server. Organizations can incorporate document analytics capabilities into their broader AI workflows and automation platforms using this integration. For implementation guidance and technical details, see the MCP integration documentation.
Clean up
When you’re finished experimenting with the Agent Analysis feature, you have two cleanup options depending on your needs:

Remove individual analytics queries – Navigate to the Agent Analysis section in the web UI and use the “load previous chat” pane to delete specific queries. Alternatively, you can remove query entries directly from the DynamoDB analytics jobs table associated with your stack.
Delete the entire IDP deployment – Use the CloudFormation console to delete the IDP stack. For automated cleanup with S3 bucket emptying, you can use the IDP CLI:

idp-cli delete --stack-name my-idp-stack --empty-buckets --force
For more detailed cleanup procedures and options, see the IDP CLI documentation.
Conclusion
In this post, we discussed the new Analytics Agent feature for the GenAI IDP Accelerator, an autonomous agent built on Strands that helps non-technical users analyze and understand the documents they have processed at scale with natural language. With this agent, users no longer need SQL expertise or knowledge of underlying database structures to retrieve data or generate visualizations.
Visit the GenAI IDP Accelerator GitHub repository for detailed guides and examples and choose Watch to stay informed on new releases and features. AWS Professional Services and AWS Partners are available to help with implementation. You can also join the GitHub community to contribute improvements and share your experiences.

About the authors
David Kaleko is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where he leads applied research efforts into cutting-edge generative AI implementation strategies for AWS customers. He holds a PhD in particle physics from Columbia University.
Tryambak Gangopadhyay is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where he collaborates with organizations across a diverse spectrum of industries. His role involves researching and developing generative AI solutions to address crucial business challenges and accelerate AI adoption. Prior to joining AWS, Tryambak completed his PhD at Iowa State University.
Mofijul Islam is an Applied Scientist II and Tech Lead at the AWS Generative AI Innovation Center, where he helps customers tackle customer-centric research and business challenges using generative AI, large language models, multi-agent learning, code generation, and multimodal learning. He holds a PhD in machine learning from the University of Virginia, where his work focused on multimodal machine learning, multilingual NLP, and multitask learning. His research has been published in top-tier conferences like NeurIPS, ICLR, EMNLP, AISTATS, and AAAI, as well as IEEE and ACM Transactions.
Jordan Ratner is a Senior Generative AI Strategist at Amazon Web Services, where he helps companies of different sizes design, deploy, and scale AI solutions. He previously co-founded Deloitte’s global AI practice and led OneReach.ai as Managing Partner, scaling conversational and generative AI deployments worldwide. Jordan now focuses on turning fast-moving AI trends into reusable products and frameworks, driving real adoption across industries.
Bob Strahan is a Principal Solutions Architect in the AWS Generative AI Innovation Center.

Anthropic AI Releases Bloom: An Open-Source Agentic Framework for Auto …

Anthropic has released Bloom, an open source agentic framework that automates behavioral evaluations for frontier AI models. The system takes a researcher specified behavior and builds targeted evaluations that measure how often and how strongly that behavior appears in realistic scenarios.

Why Bloom?

Behavioral evaluations for safety and alignment are expensive to design and maintain. Teams must hand-craft scenarios, run many interactions, read long transcripts, and aggregate scores. As models evolve, old benchmarks can become obsolete or leak into training data. Anthropic’s research team frames this as a scalability problem: they need a way to generate fresh evaluations for misaligned behaviors faster while keeping metrics meaningful.

Bloom targets this gap. Instead of a fixed benchmark with a small set of prompts, Bloom grows an evaluation suite from a seed configuration. The seed anchors what behavior to study, how many scenarios to generate and what interaction style to use. The framework then produces new but behavior consistent scenarios on each run, while still allowing reproducibility through the recorded seed.

https://www.anthropic.com/research/bloom

Seed configuration and system design

Bloom is implemented as a Python pipeline and is released under the MIT license on GitHub. The core input is the evaluation “seed”, defined in seed.yaml. This file references a behavior key in behaviors/behaviors.json, optional example transcripts and global parameters that shape the whole run.

Key configuration elements include the following (a sketch of a complete seed appears after this list):

behavior, a unique identifier defined in behaviors.json for the target behavior, for example sycophancy or self preservation

examples, zero or more few shot transcripts stored under behaviors/examples/

total_evals, the number of rollouts to generate in the suite

rollout.target, the model under evaluation such as claude-sonnet-4

controls such as diversity, max_turns, modality, reasoning effort and additional judgment qualities
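
Putting these elements together, the snippet below writes a hypothetical seed as a Python dict and dumps it to seed.yaml with PyYAML. The field names mirror the list above, but the exact nesting and value formats in Bloom's real seed schema are assumptions, so treat this as an illustration rather than a verbatim config.

import yaml  # pip install pyyaml

# Hypothetical seed configuration; keys mirror the elements listed above.
seed = {
    "behavior": "sycophancy",                               # key defined in behaviors/behaviors.json
    "examples": ["behaviors/examples/sycophancy_1.json"],   # optional few-shot transcripts
    "total_evals": 100,                                      # number of rollouts in the suite
    "rollout": {"target": "claude-sonnet-4"},                # model under evaluation
    "diversity": 0.5,                                        # distinct scenarios vs. variations per scenario
    "max_turns": 10,
    "modality": "conversation",
}

with open("seed.yaml", "w") as f:
    yaml.safe_dump(seed, f, sort_keys=False)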

Bloom uses LiteLLM as a backend for model API calls and can talk to Anthropic and OpenAI models through a single interface. It integrates with Weights and Biases for large sweeps and exports Inspect compatible transcripts.

Four stage agentic pipeline

Bloom’s evaluation process is organized into four agent stages that run in sequence:

Understanding agent: This agent reads the behavior description and example conversations. It builds a structured summary of what counts as a positive instance of the behavior and why this behavior matters. It attributes specific spans in the examples to successful behavior demonstrations so that later stages know what to look for.

Ideation agent: The ideation stage generates candidate evaluation scenarios. Each scenario describes a situation, the user persona, the tools that the target model can access and what a successful rollout looks like. Bloom batches scenario generation to use token budgets efficiently and uses the diversity parameter to trade off between more distinct scenarios and more variations per scenario.

Rollout agent: The rollout agent instantiates these scenarios with the target model. It can run multi turn conversations or simulated environments, and it records all messages and tool calls. Configuration parameters such as max_turns, modality and no_user_mode control how autonomous the target model is during this phase.

Judgment and meta judgment agents: A judge model scores each transcript for behavior presence on a numerical scale and can also rate additional qualities like realism or evaluator forcefulness. A meta judge then reads summaries of all rollouts and produces a suite level report that highlights the most important cases and patterns. The main metric is an elicitation rate, the share of rollouts that score at least 7 out of 10 for behavior presence.
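
To make the suite-level metric concrete, here is a minimal sketch of how an elicitation rate could be computed from per-rollout judge scores. The 7-out-of-10 threshold comes from the description above; the function and variable names are illustrative, not Bloom's API.

def elicitation_rate(judge_scores, threshold=7):
    """Share of rollouts whose behavior-presence score meets the threshold."""
    hits = sum(1 for score in judge_scores if score >= threshold)
    return hits / len(judge_scores)

# Example: 100 rollouts scored 1-10 by the judge model.
scores = [8, 3, 7, 10, 2, 9, 6, 7, 4, 8] * 10
print(f"Elicitation rate: {elicitation_rate(scores):.0%}")  # prints 60%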

Validation on frontier models

Anthropic used Bloom to build four alignment relevant evaluation suites, for delusional sycophancy, instructed long horizon sabotage, self preservation and self preferential bias. Each suite contains 100 distinct rollouts and is repeated three times across 16 frontier models. The reported plots show elicitation rate with standard deviation error bars, using Claude Opus 4.1 as the evaluator across all stages.

Bloom is also tested on intentionally misaligned ‘model organisms’ from earlier alignment work. Across 10 quirky behaviors, Bloom separates the organism from the baseline production model in 9 cases. In the remaining self promotion quirk, manual inspection shows that the baseline model exhibits similar behavior frequency, which explains the overlap in scores. A separate validation exercise compares human labels on 40 transcripts against 11 candidate judge models. Claude Opus 4.1 reaches a Spearman correlation of 0.86 with human scores, and Claude Sonnet 4.5 reaches 0.75, with especially strong agreement at high and low scores where thresholds matter.

https://alignment.anthropic.com/2025/bloom-auto-evals/

Relationship to Petri and Positioning

Anthropic positions Bloom as complementary to Petri. Petri is a broad coverage auditing tool that takes seed instructions describing many scenarios and behaviors, then uses automated agents to probe models through multi turn interactions and summarize diverse safety relevant dimensions. Bloom instead starts from one behavior definition and automates the engineering needed to turn that into a large, targeted evaluation suite with quantitative metrics like elicitation rate.

Key Takeaways

Bloom is an open source agentic framework that turns a single behavior specification into a complete behavioral evaluation suite for large models, using a four stage pipeline of understanding, ideation, rollout and judgment.

The system is driven by a seed configuration in seed.yaml and behaviors/behaviors.json, where researchers specify the target behavior, example transcripts, total evaluations, rollout model and controls such as diversity, max turns and modality.

Bloom relies on LiteLLM for unified access to Anthropic and OpenAI models, integrates with Weights and Biases for experiment tracking and exports Inspect compatible JSON plus an interactive viewer for inspecting transcripts and scores.

Anthropic validates Bloom on 4 alignment focused behaviors across 16 frontier models with 100 rollouts repeated 3 times, and on 10 model organism quirks, where Bloom separates intentionally misaligned organisms from baseline models in 9 cases and judge models match human labels with Spearman correlation up to 0.86.

AI Interview Series #4: Explain KV Caching

Question:

You’re deploying an LLM in production. Generating the first few tokens is fast, but as the sequence grows, each additional token takes progressively longer to generate—even though the model architecture and hardware remain the same.

If compute isn’t the primary bottleneck, what inefficiency is causing this slowdown, and how would you redesign the inference process to make token generation significantly faster?

What is KV Caching and how does it make token generation faster?

KV caching is an optimization technique used during text generation in large language models to avoid redundant computation. In autoregressive generation, the model produces text one token at a time, and at each step it normally recomputes attention over all previous tokens. However, the keys (K) and values (V) computed for earlier tokens never change.

With KV caching, the model stores these keys and values the first time they are computed. When generating the next token, it reuses the cached K and V instead of recomputing them from scratch, and only computes the query (Q), key, and value for the new token. Attention is then calculated using the cached information plus the new token.

This reuse of past computations significantly reduces redundant work, making inference faster and more efficient—especially for long sequences—at the cost of additional memory to store the cache.
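
To make the mechanics concrete, the loop below sketches greedy decoding with an explicit cache in Hugging Face Transformers: the full prompt is processed once, and every later step feeds only the newest token plus the cached past_key_values. It is a minimal illustration under those assumptions, not production decoding code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

generated = tokenizer("KV caching speeds up decoding because", return_tensors="pt").input_ids
past = None

with torch.no_grad():
    for _ in range(20):
        if past is None:
            out = model(generated, use_cache=True)                                # prefill: full prompt
        else:
            out = model(generated[:, -1:], past_key_values=past, use_cache=True)  # decode: newest token only
        past = out.past_key_values                                                # cached K and V grow by one position
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_id], dim=-1)

print(tokenizer.decode(generated[0]))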

Evaluating the Impact of KV Caching on Inference Speed

In this code, we benchmark the impact of KV caching during autoregressive text generation. We run the same prompt through the model multiple times, once with KV caching enabled and once without it, and measure the average generation time. By keeping the model, prompt, and generation length constant, this experiment isolates how reusing cached keys and values significantly reduces redundant attention computation and speeds up inference.

import numpy as np
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

prompt = "Explain KV caching in transformers."

inputs = tokenizer(prompt, return_tensors="pt").to(device)

for use_cache in (True, False):
    times = []
    for _ in range(5):
        start = time.time()
        model.generate(
            **inputs,
            use_cache=use_cache,
            max_new_tokens=1000
        )
        times.append(time.time() - start)

    print(
        f"{'with' if use_cache else 'without'} KV caching: "
        f"{round(np.mean(times), 3)} ± {round(np.std(times), 3)} seconds"
    )

The results clearly demonstrate the impact of KV caching on inference speed. With KV caching enabled, generating 1000 tokens takes around 21.7 seconds, whereas disabling KV caching increases the generation time to over 107 seconds—nearly a 5× slowdown. This sharp difference occurs because, without KV caching, the model recomputes attention over all previously generated tokens at every step, leading to quadratic growth in computation.

With KV caching, past keys and values are reused, eliminating redundant work and keeping generation time nearly linear as the sequence grows. This experiment highlights why KV caching is essential for efficient, real-world deployment of autoregressive language models.

NVIDIA AI Releases Nemotron 3: A Hybrid Mamba Transformer MoE Stack for Long Context Agentic AI

NVIDIA has released the Nemotron 3 family of open models as part of a full stack for agentic AI, including model weights, datasets and reinforcement learning tools. The family has three sizes, Nano, Super and Ultra, and targets multi agent systems that need long context reasoning with tight control over inference cost. Nano has about 30 billion parameters with about 3 billion active per token, Super has about 100 billion parameters with up to 10 billion active per token, and Ultra has about 500 billion parameters with up to 50 billion active per token.

https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf

Model family and target workloads

Nemotron 3 is presented as an efficient open model family for agentic applications. The line consists of Nano, Super and Ultra models, each tuned for different workload profiles.

Nemotron 3 Nano is a Mixture of Experts hybrid Mamba Transformer language model with about 31.6 billion parameters. Only about 3.2 billion parameters are active per forward pass, or 3.6 billion including embeddings. This sparse activation allows the model to keep high representational capacity while keeping compute low.

Nemotron 3 Super has about 100 billion parameters with up to 10 billion active per token. Nemotron 3 Ultra scales this design to about 500 billion parameters with up to 50 billion active per token. Super targets high accuracy reasoning for large multi agent applications, while Ultra is intended for complex research and planning workflows.

Nemotron 3 Nano is available now with open weights and recipes, on Hugging Face and as an NVIDIA NIM microservice. Super and Ultra are scheduled for the first half of 2026.

NVIDIA Nemotron 3 Nano delivers about 4 times higher token throughput than Nemotron 2 Nano and reduces reasoning token usage significantly, while supporting a native context length of up to 1 million tokens. This combination is intended for multi agent systems that operate on large workspaces such as long documents and large code bases.

Hybrid Mamba Transformer MoE architecture

The core design of Nemotron 3 is a Mixture of Experts hybrid Mamba Transformer architecture. The models mix Mamba sequence blocks, attention blocks and sparse expert blocks inside a single stack.

For Nemotron 3 Nano, the research team describes a pattern that interleaves Mamba 2 blocks, attention blocks and MoE blocks. Standard feedforward layers from earlier Nemotron generations are replaced by MoE layers. A learned router selects a small subset of experts per token, for example 6 out of 128 routable experts for Nano, which keeps the active parameter count close to 3.2 billion while the full model holds 31.6 billion parameters.
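
The routing idea can be illustrated with a few lines of PyTorch. This is a schematic sketch of top-k expert selection (6 of 128 experts, as described above), not NVIDIA's implementation; real MoE layers use batched expert dispatch rather than a per-token loop.

import torch
import torch.nn.functional as F

def moe_forward(x, router_w, experts, top_k=6):
    """Schematic top-k MoE routing.
    x: (tokens, d_model), router_w: (d_model, n_experts), experts: list of callables."""
    logits = x @ router_w                                        # (tokens, n_experts) router scores
    weights, idx = torch.topk(F.softmax(logits, dim=-1), top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)        # renormalize over the selected experts
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                                  # naive per-token loop for clarity only
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[int(e)](x[t])                  # only top_k experts run per token
    return out

d_model, n_experts = 64, 128
experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
router_w = torch.randn(d_model, n_experts)
tokens = torch.randn(10, d_model)
print(moe_forward(tokens, router_w, experts).shape)  # torch.Size([10, 64])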

Mamba 2 handles long range sequence modeling with state space style updates, attention layers provide direct token to token interactions for structure sensitive tasks, and MoE provides parameter scaling without proportional compute scaling. The important point is that most layers are either fast sequence or sparse expert computations, and full attention is used only where it matters most for reasoning.

For Nemotron 3 Super and Ultra, NVIDIA adds LatentMoE. Tokens are projected into a lower dimensional latent space, experts operate in that latent space, then outputs are projected back. This design allows several times more experts at similar communication and compute cost, which supports more specialization across tasks and languages.

Super and Ultra also include multi token prediction. Multiple output heads share a common trunk and predict several future tokens in a single pass. During training this improves optimization, and at inference it enables speculative decoding like execution with fewer full forward passes.

Training data, precision format and context window

Nemotron 3 is trained on large scale text and code data. The research team reports pretraining on about 25 trillion tokens, with more than 3 trillion new unique tokens over the Nemotron 2 generation. Nemotron 3 Nano uses Nemotron Common Crawl v2.1, Nemotron CC Code and Nemotron Pretraining Code v2, plus specialized datasets for scientific and reasoning content.

Super and Ultra are trained mostly in NVFP4, a 4 bit floating point format optimized for NVIDIA accelerators. Matrix multiply operations run in NVFP4 while accumulations use higher precision. This reduces memory pressure and improves throughput while keeping accuracy close to standard formats.

All Nemotron 3 models support context windows up to 1 million tokens. The architecture and training pipeline are tuned for long horizon reasoning across this length, which is essential for multi agent environments that pass large traces and shared working memory between agents.

Key Takeaways

Nemotron 3 is a three tier open model family for agentic AI: Nemotron 3 comes in Nano, Super and Ultra variants. Nano has about 30 billion parameters with about 3 billion active per token, Super has about 100 billion parameters with up to 10 billion active per token, and Ultra has about 500 billion parameters with up to 50 billion active per token. The family targets multi agent applications that need efficient long context reasoning.

Hybrid Mamba Transformer MoE with 1 million token context: Nemotron 3 models use a hybrid Mamba 2 plus Transformer architecture with sparse Mixture of Experts and support a 1 million token context window. This design gives long context handling with high throughput, where only a small subset of experts is active per token and attention is used where it is most useful for reasoning.

Latent MoE and multi token prediction in Super and Ultra: The Super and Ultra variants add latent MoE where expert computation happens in a reduced latent space, which lowers communication cost and allows more experts, and multi token prediction heads that generate several future tokens per forward pass. These changes improve quality and enable speculative style speedups for long text and chain of thought workloads.

Large scale training data and NVFP4 precision for efficiency: Nemotron 3 is pretrained on about 25 trillion tokens, with more than 3 trillion new tokens over the previous generation, and Super and Ultra are trained mainly in NVFP4, a 4 bit floating point format for NVIDIA GPUs. This combination improves throughput and reduces memory use while keeping accuracy close to standard precision.

A Coding Guide to Design a Complete Agentic Workflow in Gemini for Automated Medical Evidence Gathering and Prior Authorization Submission

In this tutorial, we show how to orchestrate a fully functional, tool-using medical prior-authorization agent powered by Gemini. We walk through each component step by step, from securely configuring the model to building realistic external tools and finally constructing an intelligent agent loop that reasons, acts, and responds entirely through structured JSON. As we progress, we see how the system thinks, retrieves evidence, and interacts with simulated medical systems to complete a complex workflow.

!pip install -q -U google-generativeai  # provides the google.generativeai client

import google.generativeai as genai
from google.colab import userdata
import os
import getpass
import json
import time

try:
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
except Exception:
    print("Please enter your Google API Key:")
    GOOGLE_API_KEY = getpass.getpass("API Key: ")

genai.configure(api_key=GOOGLE_API_KEY)

print("\n Scanning for available models...")
available_models = [m.name for m in genai.list_models()]
target_model = ""

if 'models/gemini-1.5-flash' in available_models:
    target_model = 'gemini-1.5-flash'
elif 'models/gemini-1.5-flash-001' in available_models:
    target_model = 'gemini-1.5-flash-001'
elif 'models/gemini-pro' in available_models:
    target_model = 'gemini-pro'
else:
    for m in available_models:
        if 'generateContent' in genai.get_model(m).supported_generation_methods:
            target_model = m
            break

if not target_model:
    raise ValueError("No text generation models found for this API key.")

print(f"Selected Model: {target_model}")
model = genai.GenerativeModel(target_model)

We set up our environment and automatically detect the best available Gemini model. We configure the API key securely and let the system choose the most capable model without hardcoding anything. This ensures that we start the tutorial with a clean, flexible, and reliable foundation.

class MedicalTools:
    def __init__(self):
        self.ehr_docs = [
            "Patient: John Doe | DOB: 1980-05-12",
            "Visit 2023-01-10: Diagnosed with Type 2 Diabetes. Prescribed Metformin.",
            "Visit 2023-04-15: Patient reports severe GI distress with Metformin. Discontinued.",
            "Visit 2023-04-20: BMI recorded at 32.5. A1C is 8.4%.",
            "Visit 2023-05-01: Doctor recommends starting Ozempic (Semaglutide)."
        ]

    def search_ehr(self, query):
        print(f" [Tool] Searching EHR for: '{query}'...")
        results = [doc for doc in self.ehr_docs if any(q.lower() in doc.lower() for q in query.split())]
        if not results:
            return "No records found."
        return "\n".join(results)

    def submit_prior_auth(self, drug_name, justification):
        print(f" [Tool] Submitting claim for {drug_name}...")
        justification_lower = justification.lower()
        # Approve only when the justification documents Metformin failure and a BMI above 30.
        if "metformin" in justification_lower and ("discontinued" in justification_lower or "intolerance" in justification_lower):
            if "bmi" in justification_lower and "32" in justification_lower:
                return "SUCCESS: Authorization Approved. Auth ID: #998877"
        return "DENIED: Policy requires proof of (1) Metformin failure and (2) BMI > 30."

We define the medical tools that our agent can use during the workflow. We simulate an EHR search and a prior-authorization submission system so the agent has real actions to perform. By doing this, we ground the agent’s reasoning in tool-enabled interactions rather than plain text generation.

class AgenticSystem:
    def __init__(self, model, tools):
        self.model = model
        self.tools = tools
        self.history = []
        self.max_steps = 6

        self.system_prompt = """
You are an expert Medical Prior Authorization Agent.
Your goal is to get approval for a medical procedure/drug.

You have access to these tools:
1. search_ehr(query)
2. submit_prior_auth(drug_name, justification)

RULES:
1. ALWAYS think before you act.
2. You MUST output your response in STRICT JSON format:
{
  "thought": "Your reasoning here",
  "action": "tool_name_or_finish",
  "action_input": "argument_string_or_dict"
}
3. Do not guess patient data. Use 'search_ehr'.
4. If you have the evidence, use 'submit_prior_auth'.
5. If the task is done, use action "finish".
"""

We initialize the agent and provide its full system prompt. We define the rules, the JSON response format, and the expectation that the agent must think before acting. This gives us a controlled, deterministic structure for building a safe and traceable agent loop.

    def execute_tool(self, action_name, action_input):
        if action_name == "search_ehr":
            return self.tools.search_ehr(action_input)
        elif action_name == "submit_prior_auth":
            if isinstance(action_input, str):
                return "Error: submit_prior_auth requires a dictionary."
            return self.tools.submit_prior_auth(**action_input)
        else:
            return "Error: Unknown tool."

    def run(self, objective):
        print(f" AGENT STARTING. Objective: {objective}\n" + "-" * 50)
        self.history.append(f"User: {objective}")

        for i in range(self.max_steps):
            print(f"\n STEP {i+1}")
            prompt = self.system_prompt + "\n\nHistory:\n" + "\n".join(self.history) + "\n\nNext JSON:"

            try:
                response = self.model.generate_content(prompt)
                text_response = response.text.strip().replace("```json", "").replace("```", "")
                agent_decision = json.loads(text_response)
            except Exception as e:
                print(f" Error parsing AI response. Retrying... ({e})")
                continue

            print(f" THOUGHT: {agent_decision['thought']}")
            print(f" ACTION: {agent_decision['action']}")

            if agent_decision['action'] == "finish":
                print(f"\n TASK COMPLETED: {agent_decision['action_input']}")
                break

            tool_result = self.execute_tool(agent_decision['action'], agent_decision['action_input'])
            print(f" OBSERVATION: {tool_result}")

            self.history.append(f"Assistant: {text_response}")
            self.history.append(f"System: {tool_result}")

            if "SUCCESS" in str(tool_result):
                print("\n SUCCESS! The Agent successfully navigated the insurance portal.")
                break

We implement the core agent loop where reasoning, tool execution, and observations happen step by step. We watch the agent decide its next action, execute tools, update history, and evaluate success conditions. This is where the agent truly comes alive and performs iterative reasoning.

tools_instance = MedicalTools()
agent = AgenticSystem(model, tools_instance)
agent.run("Please get prior authorization for Ozempic for patient John Doe.")

We instantiate the tools and agent, then run the entire system end-to-end with a real objective. We see the full workflow unfold as the agent navigates through medical history, validates evidence, and attempts prior authorization. This final snippet demonstrates the complete pipeline working seamlessly.

In conclusion, we reflect on how this compact yet powerful framework enables us to design real-world agentic behaviors that go beyond simple text responses. We watch our agent plan, consult tools, gather evidence, and ultimately complete a structured insurance authorization task, entirely through autonomous reasoning. It provides confidence that we can now expand the system with additional tools, stronger policies, domain-specific logic, or even multi-agent collaboration.

Mistral AI Releases OCR 3: A Smaller Optical Character Recognition (OCR) Model for Structured Document AI at Scale

Mistral AI has released Mistral OCR 3, its latest optical character recognition service that powers the company’s Document AI stack. The model, named mistral-ocr-2512, is built to extract interleaved text and images from PDFs and other documents while preserving structure, and it does this at an aggressive price of $2 per 1,000 pages with a 50% discount when used through the Batch API.

What Mistral OCR 3 Is Optimized For

Mistral OCR 3 targets typical enterprise document workloads. The model is tuned for forms, scanned documents, complex tables, and handwriting. It is evaluated on internal benchmarks drawn from real business use cases, where it achieves a 74% overall win rate over Mistral OCR 2 across these document categories using a fuzzy match metric against ground truth.

The model outputs markdown that preserves document layout, and when table formatting is enabled, it enriches the output with HTML based table representations. This combination gives downstream systems both the content and the structural information that is needed for retrieval pipelines, analytics, and agent workflows.

Role in Mistral Document AI

OCR 3 sits inside Mistral Document AI, the company’s document processing capability that combines OCR with structured data extraction and Document QnA.

It now powers the Document AI Playground in Mistral AI Studio. In this interface, users upload PDFs or images and get back either clean text or structured JSON without writing code. The same underlying OCR pipeline is accessible via the public API, which allows teams to move from interactive exploration to production workloads without changing the core model.

Inputs, Outputs, And Structure

The OCR processor accepts multiple document formats through a single API. The document field can point to:

document_url for PDFs, pptx, docx and more

image_url for image types such as png, jpeg or avif

Uploaded or base64 encoded PDFs or images through the same schema

This is documented in the OCR Processor section of Mistral’s Document AI docs.

The response is a JSON object with a pages array. Each page contains an index, a markdown string, a list of images, a list of tables when table_format=”html” is used, detected hyperlinks, optional header and footer fields when header or footer extraction is enabled, and a dimensions object with page size. There is also a document_annotation field for structured annotations and a usage_info block for accounting information.

When images and HTML tables are extracted, the markdown includes placeholders such as ![img-0.jpeg](img-0.jpeg) and [tbl-3.html](tbl-3.html). These placeholders are mapped back to actual content using the images and tables arrays in the response, which simplifies downstream reconstruction.
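
A minimal sketch of calling the OCR endpoint described above is shown below, using the plain HTTP API. The endpoint path, model name, and document_url field follow this post, but the exact request and response field names are assumptions here, so verify them against the official Document AI documentation before use.

import os
import requests

API_KEY = os.environ["MISTRAL_API_KEY"]

payload = {
    "model": "mistral-ocr-2512",
    "document": {"type": "document_url", "document_url": "https://example.com/invoice.pdf"},
    "table_format": "html",  # assumed flag name for HTML table reconstruction
}

resp = requests.post(
    "https://api.mistral.ai/v1/ocr",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()

for page in resp.json()["pages"]:
    print(f"--- page {page['index']} ---")
    # Markdown with ![img-0.jpeg](img-0.jpeg) and [tbl-3.html](tbl-3.html) placeholders,
    # which map back to the images and tables arrays in the response.
    print(page["markdown"][:500])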

Upgrades Over Mistral OCR 2

Mistral OCR 3 introduces several concrete upgrades relative to OCR 2. The public release notes emphasize four main areas.

Handwriting: Mistral OCR 3 more accurately interprets cursive, mixed content annotations, and handwritten text placed on top of printed templates.

Forms: It improves detection of boxes, labels, and handwritten entries in dense layouts such as invoices, receipts, compliance forms, and government documents.

Scanned and complex documents: The model is more robust to compression artifacts, skew, distortion, low DPI, and background noise in scanned pages.

Complex tables: It reconstructs table structures with headers, merged cells, multi row blocks, and column hierarchies, and it can return HTML tables with proper colspan and rowspan tags so that layout is preserved.

https://mistral.ai/news/mistral-ocr-3

Pricing, Batch Inference, And Annotations

The OCR 3 model card lists pricing at $2 per 1,000 pages for standard OCR and $3 per 1,000 annotated pages when structured annotations are used.

Mistral also exposes OCR 3 through its Batch Inference API /v1/batch, which is documented under the batching section of the platform. Batch processing halves the effective OCR price to $1 per 1,000 pages by applying a 50% discount for jobs that run through the batch pipeline.

The model integrates with two important features on the same endpoint, Annotations – Structured and BBox Extraction. These allow developers to attach schema driven labels to regions of a document and get bounding boxes for text and other elements, which is useful when mapping content into downstream systems or UI overlays.

Key Takeaways

Model and role: Mistral OCR 3, named mistral-ocr-2512, is the new OCR service that powers Mistral’s Document AI stack for page based document understanding.

Accuracy gains: On internal benchmarks covering forms, scanned documents, complex tables, and handwriting, OCR 3 achieves a 74% overall win rate over Mistral OCR 2, and Mistral positions it as state of the art against both traditional and AI native OCR systems.

Structured outputs for RAG: The service extracts interleaved text and embedded images and returns markdown enriched with HTML reconstructed tables, preserving layout and table structure so outputs can feed directly into RAG, agents, and search pipelines with minimal extra parsing.

API and document formats: Developers access OCR 3 via the /v1/ocr endpoint or SDK, passing PDFs as document_url and images such as png or jpeg as image_url, and can enable options like HTML table output, header or footer extraction, and base64 images in the response.

Pricing and batch processing: OCR 3 is priced at $2 per 1,000 pages and $3 per 1,000 annotated pages, and when used through the Batch API the effective price for standard OCR drops to $1 per 1,000 pages for large scale processing.

How to Build a High-Performance Distributed Task Routing System Using Kombu with Topic Exchanges and Concurrent Workers

In this tutorial, we build a fully functional event-driven workflow using Kombu, treating messaging as a core architectural capability. We walk step by step through setting up exchanges, routing keys, background workers, and concurrent producers, allowing us to observe a real distributed system. As we implement each component, we see how clean message flow, asynchronous processing, and routing patterns give us the same power that production microservices rely on every day.

!pip install kombu

import threading
import time
import logging
import uuid
import datetime
import sys

from kombu import Connection, Exchange, Queue, Producer, Consumer
from kombu.mixins import ConsumerMixin

logging.basicConfig(
    level=logging.INFO,
    format='%(message)s',
    handlers=[logging.StreamHandler(sys.stdout)],
    force=True
)
logger = logging.getLogger(__name__)

BROKER_URL = "memory://localhost/"

We begin by installing Kombu, importing dependencies, and configuring logging so we can clearly see every message flowing through the system. We also set the in-memory broker URL, allowing us to run everything locally in Colab without needing RabbitMQ. This setup forms the foundation for our distributed messaging workflow.

media_exchange = Exchange('media_exchange', type='topic', durable=True)

task_queues = [
    Queue('video_queue', media_exchange, routing_key='video.#'),
    Queue('audit_queue', media_exchange, routing_key='#'),
]

We define a topic exchange to flexibly route messages using wildcard patterns. We also create two queues: one dedicated to video-related tasks and another audit queue that listens to everything. Using topic routing, we can precisely control how messages flow across the system.

class Worker(ConsumerMixin):
    def __init__(self, connection, queues):
        self.connection = connection
        self.queues = queues
        self.should_stop = False

    def get_consumers(self, Consumer, channel):
        return [
            Consumer(queues=self.queues,
                     callbacks=[self.on_message],
                     accept=['json'],
                     prefetch_count=1)
        ]

    def on_message(self, body, message):
        routing_key = message.delivery_info['routing_key']
        payload_id = body.get('id', 'unknown')

        logger.info(f"\n RECEIVED MSG via key: [{routing_key}]")
        logger.info(f" Payload ID: {payload_id}")

        try:
            if 'video' in routing_key:
                self.process_video(body)
            elif 'audit' in routing_key:
                logger.info(" [Audit] Logging event...")

            message.ack()
            logger.info(" ACKNOWLEDGED")

        except Exception as e:
            logger.error(f" ERROR: {e}")

    def process_video(self, body):
        logger.info(" [Processor] Transcoding video (Simulating work...)")
        time.sleep(0.5)

We implement a custom worker using Kombu’s ConsumerMixin to run it in a background thread. In the message callback, we inspect the routing key, invoke the appropriate processing function, and acknowledge the message. This worker architecture gives us clean, concurrent message consumption with full control.

def publish_messages(connection):
    producer = Producer(connection)

    tasks = [
        ('video.upload', {'file': 'movie.mp4'}),
        ('user.login', {'user': 'admin'}),
    ]

    logger.info("\n PRODUCER: Starting to publish messages...")

    for r_key, data in tasks:
        data['id'] = str(uuid.uuid4())[:8]

        logger.info(f" SENDING: {r_key} -> {data}")

        producer.publish(
            data,
            exchange=media_exchange,
            routing_key=r_key,
            serializer='json'
        )
        time.sleep(1.5)

    logger.info(" PRODUCER: Done.")

We now build a producer that sends structured JSON payloads into the exchange with different routing keys. We generate unique IDs for each event and observe how they are routed to the appropriate queues. This mirrors real-world microservice event publishing, where producers and consumers remain decoupled.

def run_example():
    with Connection(BROKER_URL) as conn:
        worker = Worker(conn, task_queues)
        worker_thread = threading.Thread(target=worker.run)
        worker_thread.daemon = True
        worker_thread.start()

        logger.info(" SYSTEM: Worker thread started.")
        time.sleep(1)

        try:
            publish_messages(conn)
            time.sleep(2)
        except KeyboardInterrupt:
            pass
        finally:
            worker.should_stop = True
            logger.info("\n SYSTEM: Execution complete.")

if __name__ == "__main__":
    run_example()

We start the worker in a background thread and fire the producer in the main thread. This structure gives us a mini distributed system running in Colab. By observing the logs, we see messages published → routed → consumed → acknowledged, completing the full event-processing lifecycle.

In conclusion, we orchestrated a dynamic, distributed task-routing pipeline that processes real-time events with clarity and precision. We witnessed how Kombu abstracts away the complexity of messaging systems while still giving us fine-grained control over routing, consumption, and worker concurrency. As we see messages move from producer to exchange to queue to worker, we gained a deeper appreciation for the elegance of event-driven system design, and we are now well-equipped to scale this foundation into robust microservices, background processors, and enterprise-grade workflows.

Google Introduces T5Gemma 2: Encoder Decoder Models with Multimodal Inputs via SigLIP and 128K Context

Google has released T5Gemma 2, a family of open encoder-decoder Transformer checkpoints built by adapting Gemma 3 pretrained weights into an encoder-decoder layout, then continuing pretraining with the UL2 objective. The release is pretrained only, intended for developers to post-train for specific tasks, and Google explicitly notes it is not releasing post-trained or IT checkpoints for this drop.

T5Gemma 2 is positioned as an encoder-decoder counterpart to Gemma 3 that keeps the same low level building blocks, then adds 2 structural changes aimed at small model efficiency. The models inherit Gemma 3 features that matter for deployment, notably multimodality, long context up to 128K tokens, and broad multilingual coverage, with the blog stating over 140 languages.

https://arxiv.org/pdf/2512.14856

What Google actually released

The release includes 3 pretrained sizes, 270M-270M, 1B-1B, and 4B-4B, where the notation means the encoder and decoder are the same size. The research team reports approximate totals, excluding the vision encoder, of about 370M, 1.7B, and 7B parameters. The multimodal accounting lists a 417M parameter vision encoder, along with encoder and decoder parameters broken into embedding and non-embedding components.

The adaptation, encoder-decoder without training from scratch

T5Gemma 2 follows the same adaptation idea introduced in T5Gemma: initialize an encoder-decoder model from a decoder-only checkpoint, then adapt with UL2. The research team shows encoder and decoder parameters initialized from the pretrained decoder-only model and then pretrained with UL2, with images first converted by SigLIP into 256 tokens.

This matters because encoder-decoder splits the workload, the encoder can read the full input bidirectionally, while the decoder focuses on autoregressive generation. The research team argues this separation can help long context tasks where the model must retrieve relevant evidence from a large input before generating.

Two efficiency changes that are easy to miss but affect small models

First, T5Gemma 2 uses tied word embeddings across the encoder input embedding, decoder input embedding, and decoder output or softmax embedding. This reduces parameter redundancy, and the report references an ablation showing little quality change while reducing embedding parameters.

Second, it introduces merged attention in the decoder. Instead of separate self-attention and cross-attention sublayers, the decoder performs a single attention operation where K and V are formed by concatenating encoder outputs and decoder states, and masking preserves causal visibility for decoder tokens. This simplifies initialization, because it narrows the differences between the adapted decoder and the original Gemma-style decoder stack, and the report notes parameter savings with a small average quality drop in its ablations.
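
As an illustration of that merged attention idea, the sketch below has decoder queries attend over the concatenation of encoder outputs and decoder states in a single call, with a mask that lets every decoder position see all encoder tokens but only causal decoder positions. It is a toy single-head example under these assumptions, not the T5Gemma 2 implementation.

import torch
import torch.nn.functional as F

def merged_attention(dec_q, dec_kv, enc_kv):
    """Schematic merged attention.
    dec_q, dec_kv: (T_dec, d) decoder queries and states; enc_kv: (T_enc, d) encoder outputs."""
    T_dec, d = dec_q.shape
    T_enc = enc_kv.shape[0]
    kv = torch.cat([enc_kv, dec_kv], dim=0)                  # single K/V built from both streams
    scores = dec_q @ kv.T / d ** 0.5                         # (T_dec, T_enc + T_dec)
    # Full visibility over encoder tokens, causal visibility over decoder tokens.
    causal = torch.tril(torch.ones(T_dec, T_dec, dtype=torch.bool))
    mask = torch.cat([torch.ones(T_dec, T_enc, dtype=torch.bool), causal], dim=1)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ kv

enc = torch.randn(8, 16)   # encoder outputs
dec = torch.randn(4, 16)   # decoder states
print(merged_attention(dec, dec, enc).shape)  # torch.Size([4, 16])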

Multimodality, image understanding is encoder side, not decoder side

T5Gemma 2 is multimodal by reusing Gemma 3’s vision encoder and keeping it frozen during training. Vision tokens are always fed to the encoder and encoder tokens have full visibility to each other in self attention. This is a pragmatic encoder-decoder design, the encoder fuses image tokens with text tokens into contextual representations, and the decoder can then attend to those representations while generating text.

On the tooling side, T5Gemma 2 is placed under an image-text-to-text pipeline, which matches the model’s design: image in, text prompt in, text out. That pipeline example is the fastest way to validate the end-to-end multimodal path, including dtype choices like bfloat16 and automatic device mapping.

Long context to 128K, what enables it

Google researchers attribute the 128K context window to Gemma 3’s alternating local and global attention mechanism. The Gemma 3 team describes a repeating 5 to 1 pattern, 5 local sliding window attention layers followed by 1 global attention layer, with a local window size of 1024. This design reduces KV cache growth relative to making every layer global, which is one reason long context becomes feasible at smaller footprints.
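
As a toy illustration of that interleaving, the helper below generates the attention-type schedule described above, with every sixth layer global and the rest using a 1024-token sliding window; the function name and string labels are purely illustrative.

def attention_schedule(num_layers, local_per_global=5, window=1024):
    """Illustrative 5:1 local/global layer pattern."""
    return [
        "global" if (i + 1) % (local_per_global + 1) == 0 else f"local(window={window})"
        for i in range(num_layers)
    ]

print(attention_schedule(12))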

In the T5Gemma 2 report, the research team also mentions adopting positional interpolation methods for long context, and pretrains on sequences of up to 16K input tokens paired with 16K target tokens, then evaluates long context performance up to 128K on benchmarks including RULER and MRCR. The detailed pretraining results table includes 32K and 128K evaluations, showing the long context deltas they claim over Gemma 3 at the same scale.

Training setup and what “pretrained only” implies for users

The research team states the models are pretrained on 2T tokens and describes a training setup that includes a batch size of 4.2M tokens, cosine learning rate decay with 100 warmup steps, global gradient clipping at 1.0, and checkpoint averaging over the last 5 checkpoints.

Key Takeaways

T5Gemma 2 is an encoder decoder family adapted from Gemma 3 and continued with UL2, it reuses Gemma 3 pretrained weights, then applies the same UL2 based adaptation recipe used in T5Gemma.

Google released pretrained checkpoints only, no post trained or instruction tuned variants are included in this drop, so downstream use requires your own post training and evaluation.

Multimodal input is handled by a SigLIP vision encoder that outputs 256 image tokens and stays frozen, those vision tokens go into the encoder, the decoder generates text.

Two parameter efficiency changes are central, tied word embeddings share encoder, decoder, and output embeddings, merged attention unifies decoder self attention and cross attention into a single module.

Long context up to 128K is enabled by Gemma 3’s interleaved attention design, a repeating 5 local sliding window layers with window size 1024 followed by 1 global layer, and T5Gemma 2 inherits this mechanism.

Introducing SOCI indexing for Amazon SageMaker Studio: Faster containe …

Today, we are excited to introduce a new feature for SageMaker Studio: SOCI (Seekable Open Container Initiative) indexing. SOCI supports lazy loading of container images, where only the necessary parts of an image are downloaded initially rather than the entire container.
SageMaker Studio serves as a web Integrated Development Environment (IDE) for end-to-end machine learning (ML) development, so users can build, train, deploy, and manage both traditional ML models and foundation models (FM) for the complete ML workflow.
Each SageMaker Studio application runs inside a container that packages the required libraries, frameworks, and dependencies for consistent execution across workloads and user sessions. This containerized architecture allows SageMaker Studio to support a wide range of ML frameworks such as TensorFlow, PyTorch, scikit-learn, and more while maintaining strong environment isolation. Although SageMaker Studio provides containers for the most common ML environments, data scientists may need to tailor these environments for specific use cases by adding or removing packages, configuring custom environment variables, or installing specialized dependencies. SageMaker Studio supports this customization through Lifecycle Configurations (LCCs), which allow users to run bash scripts at the startup of a Studio IDE space. However, repeatedly customizing environments using LCCs can become time-consuming and difficult to maintain at scale. To address this, SageMaker Studio supports building and registering custom container images with preconfigured libraries and frameworks. These reusable custom images reduce setup friction and improve reproducibility for consistency across projects, so data scientists can focus on model development rather than environment management.
As ML workloads become increasingly complex, the container images that power these environments have grown in size, leading to longer startup times that can delay productivity and interrupt development workflows. Data scientists, ML engineers, and developers may have longer wait times for their environments to initialize, particularly when switching between different frameworks or when using images with extensive pre-installed libraries and dependencies. This startup latency becomes a significant bottleneck in iterative ML development where quick experimentation and rapid prototyping are essential.
Instead of downloading the entire container image upfront, SOCI creates an index that allows the system to fetch only the specific files and layers needed to start the application, with additional components loaded on-demand as required. This significantly reduces container startup times from minutes to seconds, allowing your SageMaker Studio environments to launch faster and get you working on your ML projects sooner, ultimately improving developer productivity and reducing time-to-insight for ML experiments.
Prerequisites
To use SOCI indexing with SageMaker Studio, you need:

An AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage SageMaker and ECR resources. For details, refer to Create an AWS account.
If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain.
A private Amazon Elastic Container Registry (ECR) repository to store your container images with SOCI indexes.
Verify you have AWS CLI version 2.0 or higher installed to interact with these services and manage your SOCI-indexed images.

SageMaker Studio SOCI Indexing – Feature overview
The SOCI (Seekable Open Container Initiative), originally open sourced by AWS, addresses container startup delays in SageMaker Studio through selective image loading. This technology creates a specialized index that maps the internal structure of container images for granular access to individual files without downloading the entire container archive first. Traditional container images are stored as ordered lists of layers in gzipped tar files, which typically require complete download before accessing any content. SOCI overcomes this limitation by generating a separate index stored as an OCI Artifact that links to the original container image through OCI Reference Types. This design preserves all original container images, maintains consistent image digests, and ensures signature validity—critical factors for AI/ML environments with strict security requirements.
SageMaker Studio users can implement SOCI indexing through the integration with the Finch container runtime, which translates to a 35-70% reduction in container startup times across all instance types when using Bring Your Own Image (BYOI). This implementation extends beyond current optimization strategies, which are limited to specific first-party image and instance type combinations, providing faster app launch times in SageMaker AI Studio and SageMaker Unified Studio environments.
Creating a SOCI index
To create and manage SOCI indices, you can use several container management tools, each offering different advantages depending on your development environment and preferences:

Finch CLI is a Docker-compatible command-line tool developed by AWS that provides native support for building and pushing SOCI indices. It offers a familiar Docker-like interface while including built-in SOCI functionality, making it straightforward to create indexed images without additional tooling.
nerdctl serves as an alternative container CLI for containerd, the industry-standard container runtime. It provides Docker-compatible commands while offering direct integration with containerd features, including SOCI support for lazy loading capabilities.
Docker + SOCI CLI combines the widely used Docker toolchain with the dedicated SOCI command-line interface. This approach allows you to leverage existing Docker workflows while adding SOCI indexing capabilities through a separate CLI tool, providing flexibility for teams already invested in Docker-based development processes.

In the standard SageMaker Studio workflow, launching a machine learning environment requires downloading the complete container image before any application can start. When a user initiates a new SageMaker Studio session, the system must pull the entire image containing frameworks like TensorFlow, PyTorch, scikit-learn, Jupyter, and associated dependencies from the container registry. This process is sequential and time consuming—the container runtime downloads each compressed layer, extracts the complete filesystem to local storage, and only then can the application begin initialization. For typical ML images ranging from 2-5 GB, this results in startup times of 3-5 minutes, creating significant friction in iterative development workflows where data scientists frequently switch between different environments or restart sessions.
The SOCI-enhanced workflow transforms container startup by enabling intelligent, on-demand file retrieval. Instead of downloading entire images, SOCI creates a searchable index that maps the precise location of every file within the compressed container layers. When launching a SageMaker Studio application, the system downloads only the SOCI index (typically 10-20 MB) and the minimal set of files required for application startup—usually 5-10% of the total image size. The container begins running immediately while a background process continues downloading remaining files as the application requests them. This lazy loading approach reduces initial startup times from a few minutes to seconds, allowing users to begin productive work almost immediately while the environment completes initialization transparently in the background.
Converting the image to SOCI
You can convert your existing image into a SOCI image and push it to your private ECR using the following commands:

#!/bin/bash
# Download and install soci-snapshotter, containerd, and nerdctl
sudo yum install soci-snapshotter
sudo yum install containerd jq
sudo systemctl start soci-snapshotter
sudo systemctl restart containerd
sudo yum install nerdctl

# Set your registry variables
REGISTRY="123456789012.dkr.ecr.us-west-2.amazonaws.com"
REPOSITORY_NAME="my-sagemaker-image"

# Authenticate for image pull and push
AWS_REGION=us-west-2
REGISTRY_USER=AWS
REGISTRY_PASSWORD=$(/usr/local/bin/aws ecr get-login-password --region $AWS_REGION)
echo $REGISTRY_PASSWORD | sudo nerdctl login -u $REGISTRY_USER --password-stdin $REGISTRY

# Pull the original image
sudo nerdctl pull $REGISTRY/$REPOSITORY_NAME:original-image

# Create SOCI index using the convert subcommand
sudo nerdctl image convert --soci $REGISTRY/$REPOSITORY_NAME:original-image $REGISTRY/$REPOSITORY_NAME:soci-image

# Push the SOCI v2 indexed image
sudo nerdctl push --platform linux/amd64 $REGISTRY/$REPOSITORY_NAME:soci-image

This process creates two artifacts for the original container image in your ECR repository:

SOCI index – Metadata enabling lazy loading.
Image index manifest – OCI-compliant manifest linking them together.

To use SOCI-indexed images in SageMaker Studio, you must reference the image index URI rather than the original container image URI when creating SageMaker Image and SageMaker Image Version resources. The image index URI corresponds to the tag you specified during the SOCI conversion process (for example, soci-image in the previous example).

#!/bin/bash
# Use the SOCI v2 image index URI
IMAGE_INDEX_URI="123456789012.dkr.ecr.us-west-2.amazonaws.com/my-sagemaker-image:soci-image"

# Create SageMaker Image
aws sagemaker create-image \
  --image-name "my-sagemaker-image" \
  --role-arn "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Create SageMaker Image Version with SOCI index
aws sagemaker create-image-version \
  --image-name "my-sagemaker-image" \
  --base-image "$IMAGE_INDEX_URI"

# Create App Image Config for JupyterLab
aws sagemaker create-app-image-config \
  --app-image-config-name "my-sagemaker-image-config" \
  --jupyter-lab-app-image-config '{ "FileSystemConfig": { "MountPath": "/home/sagemaker-user", "DefaultUid": 1000, "DefaultGid": 100 } }'

# Update domain to include the custom image (required step)
aws sagemaker update-domain \
  --domain-id "d-xxxxxxxxxxxx" \
  --default-user-settings '{
    "JupyterLabAppSettings": {
      "CustomImages": [{
        "ImageName": "my-sagemaker-image",
        "AppImageConfigName": "my-sagemaker-image-config"
      }]
    }
  }'

The image index URI contains references to both the container image and its associated SOCI index through the OCI Image Index manifest. When SageMaker Studio launches applications using this URI, it automatically detects the SOCI index and enables lazy loading capabilities.
SOCI indexing is supported for all ML environments (JupyterLab, CodeEditor, etc.) for both SageMaker Unified Studio and SageMaker AI. For additional information on setting up your custom image, refer to the SageMaker Bring Your Own Image documentation.
Benchmarking SOCI impact on SageMaker Studio JupyterLab startup
The primary objective of this new feature in SageMaker Studio is to streamline the end user experience by reducing the startup durations for SageMaker Studio applications launched with custom images. To measure the effectiveness of lazy loading custom container images in SageMaker Studio using SOCI, we will empirically quantify and contrast start-up durations for a given custom image both with and without SOCI. Further, we'll conduct this test for a variety of custom images representing diverse sets of dependencies, files, and data, to evaluate how effectiveness may vary for end users with different custom image needs.
To empirically quantify the startup durations for custom image app launches, we will programmatically launch JupyterLab and CodeEditor apps with the SageMaker CreateApp API—specifying the candidate sageMakerImageArn and sageMakerImageVersionAlias along with an appropriate instanceType—and record the request eventTime for analysis. We will then poll the SageMaker ListApps API every second to monitor the app startup, recording the eventTime of the first response where Status is reported as InService. The delta between these two times for a particular app is the startup duration; a boto3 sketch of this measurement loop follows the example request below.
For this analysis, we have created two sets of private ECR repositories, each with the same SageMaker custom container images but with only one set implementing SOCI indices. When comparing the equivalent images in ECR, we can see the SOCI artifacts present in only one repo. We will be deploying the apps into a single SageMaker AI domain. All custom images are attached to that domain so that its SageMaker Studio users can choose those custom images when invoking startup of a JupyterLab space.
To run the tests, for each custom image, we invoke a series of ten CreateApp API calls:

"requestParameters": {
    "domainId": "<>",
    "spaceName": "<>",
    "appType": "JupyterLab",
    "appName": "default",
    "tags": [],
    "resourceSpec": {
        "sageMakerImageArn": "<>",
        "sageMakerImageVersionAlias": "<>",
        "instanceType": "<>"
    },
    "recoveryMode": false
}
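
The measurement loop can be expressed roughly as the following boto3 sketch. The domain ID, space name, image ARN, alias, and instance type are placeholders, error handling is omitted, and DescribeApp is used here for brevity where the benchmark above polls ListApps; treat it as an illustration of the method rather than the exact benchmarking harness.

import time
import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

start = time.time()
sm.create_app(
    DomainId="d-xxxxxxxxxxxx",
    SpaceName="benchmark-space",
    AppType="JupyterLab",
    AppName="default",
    ResourceSpec={
        "SageMakerImageArn": "arn:aws:sagemaker:us-west-2:123456789012:image/my-sagemaker-image",
        "SageMakerImageVersionAlias": "soci-image",
        "InstanceType": "ml.t3.medium",
    },
)

# Poll every second until the app reports InService, then record the delta.
while True:
    app = sm.describe_app(
        DomainId="d-xxxxxxxxxxxx",
        SpaceName="benchmark-space",
        AppType="JupyterLab",
        AppName="default",
    )
    if app["Status"] == "InService":
        break
    time.sleep(1)

print(f"Startup duration: {time.time() - start:.1f} seconds")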

The following table captures the startup acceleration with SOCI index enabled for Amazon SageMaker distribution images:

App type          Instance type   Image       Regular image startup (sec)   SOCI image startup (sec)   % Reduction in app startup duration
SMAI JupyterLab   t3.medium       SMD 3.4.2   231                           150                        35.06%
SMAI JupyterLab   t3.medium       SMD 3.4.2   350                           191                        45.43%
SMAI JupyterLab   c7i.large       SMD 3.4.2   331                           141                        57.40%
SMAI CodeEditor   t3.medium       SMD 3.4.2   202                           110                        45.54%
SMAI CodeEditor   t3.medium       SMD 3.4.2   213                           78                         63.38%
SMAI CodeEditor   c7i.large       SMD 3.4.2   279                           91                         67.38%

Note: App startup latency and the corresponding improvement may vary depending on the availability of SageMaker ML instances.
Based on these findings, we see that running SageMaker Studio custom images with SOCI indexes allows SageMaker Studio users to launch their apps faster compared to without SOCI indexes. Specifically, we see ~35-70% faster container start-up time.
Conclusion
In this post, we showed you how the introduction of SOCI indexing to SageMaker Studio improves the developer experience for machine learning practitioners. By optimizing container startup times through lazy loading—reducing wait times from several minutes to under a minute—AWS helps data scientists, ML engineers, and developers spend less time waiting and more time innovating. This improvement addresses one of the most common friction points in iterative ML development, where frequent environment switches and restarts impact productivity. With SOCI, teams can maintain their development velocity, experiment with different frameworks and configurations, and accelerate their path from experimentation to production deployment.

About the authors
Pranav Murthy is a Senior Generative AI Data Scientist at AWS, specializing in helping organizations innovate with Generative AI, Deep Learning, and Machine Learning on Amazon SageMaker AI. Over the past 10+ years, he has developed and scaled advanced computer vision (CV) and natural language processing (NLP) models to tackle high-impact problems—from optimizing global supply chains to enabling real-time video analytics and multilingual search. When he’s not building AI solutions, Pranav enjoys playing strategic games like chess, traveling to discover new cultures, and mentoring aspiring AI practitioners. You can find Pranav on LinkedIn.
Raj Bagwe is a Senior Solutions Architect at Amazon Web Services, based in San Francisco, California. With over 6 years at AWS, he helps customers navigate complex technological challenges and specializes in Cloud Architecture, Security and Migrations. In his spare time, he coaches a robotics team and plays volleyball. You can find Raj on LinkedIn.
Nikita Arbuzov is a Software Development Engineer at Amazon Web Services, based in New York, NY, working on and maintaining the SageMaker Studio platform and its applications. With over 3 years of experience in backend platform latency optimization, he works on improving the customer experience and usability of SageMaker AI and SageMaker Unified Studio. In his spare time, Nikita enjoys outdoor activities like mountain biking, kayaking, and snowboarding, loves traveling around the US, and enjoys making new friends. You can find Nikita on LinkedIn.

Build and deploy scalable AI agents with NVIDIA NeMo, Amazon Bedrock A …

This post is co-written with Ranjit Rajan, Abdullahi Olaoye, and Abhishek Sawarkar from NVIDIA.
AI’s next frontier isn’t merely smarter chat-based assistants, it’s autonomous agents that reason, plan, and execute across entire systems. But to accomplish this, enterprise developers need to move from prototypes to production-ready AI agents that scale securely. This challenge grows as enterprise problems become more complex, requiring architectures where multiple specialized agents collaborate to accomplish sophisticated tasks.
Building AI agents in development differs fundamentally from deploying them at scale. Developers face a chasm between prototype and production, struggling with performance optimization, resource scaling, security implementation, and operational monitoring. Typical approaches leave teams juggling multiple disconnected tools and frameworks, making it difficult to maintain consistency from development through deployment with optimal performance. That’s where the powerful combination of Strands Agents, Amazon Bedrock AgentCore, and NVIDIA NeMo Agent Toolkit shines. You can use these tools together to design sophisticated multi-agent systems, orchestrate them, and scale them securely in production with built-in observability, agent evaluation, profiling, and performance optimization. This post demonstrates how to use this integrated solution to build, evaluate, optimize, and deploy AI agents on Amazon Web Services (AWS) from initial development through production deployment.
Foundation for enterprise-ready agents
The open source Strands Agents framework simplifies AI agent development through its model-driven approach. Developers create agents using three components:

Foundation models (FMs) such as Amazon Nova, Claude by Anthropic, and Meta’s Llama
Tools (over 20 built-in, plus custom tools using Python decorators)
Prompts that guide agent behavior.

The framework includes built-in integrations with AWS services such as Amazon Bedrock and Amazon Simple Storage Service (Amazon S3), local testing support, continuous integration and continuous development (CI/CD) workflows, multiple deployment options, and OpenTelemetry observability.
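As a quick illustration of the model-driven approach, the following is a minimal sketch of a Strands agent; the custom tool and prompt are illustrative, and the SDK's default Amazon Bedrock model is assumed:

from strands import Agent, tool
from strands_tools import calculator  # one of the built-in tools

@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

# Foundation model (the SDK default), tools, and a prompt: the three components above
agent = Agent(
    tools=[calculator, word_count],
    system_prompt="You are a concise assistant for quick calculations.",
)

print(agent("How many words are in 'hello agentic world', multiplied by 12?"))
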
Amazon Bedrock AgentCore is an agentic platform for building, deploying, and operating effective agents securely at scale. It has composable, fully managed services:

Runtime for secure, serverless agent deployment
Memory for short-term and long-term context retention
Gateway for secure tool access by transforming APIs and AWS Lambda functions into agent-compatible tools and connecting to existing Model Context Protocol (MCP) servers
Identity for secure agent identity and access management
Code Interpreter for secure code execution in sandbox environments
Browser for fast, secure web interactions
Observability for comprehensive operational insights to trace, debug, and monitor agent performance
Evaluations for continuously inspecting agent quality based on real-world behavior
Policy to keep agents within defined boundaries

These services, designed to work independently or together, abstract the complexity of building, deploying, and operating sophisticated agents while working with open source frameworks or models, delivering enterprise-grade security and reliability.
Agent evaluation, profiling, and optimization with NeMo Agent Toolkit
NVIDIA NeMo Agent Toolkit is an open source framework designed to help developers build, profile, and optimize AI agents regardless of their underlying framework. Its framework-agnostic approach means it works seamlessly with Strands Agents, LangChain, LlamaIndex, CrewAI, and custom enterprise frameworks. In addition, different frameworks can interoperate when they’re connected in the NeMo Agent Toolkit.
The toolkit’s profiler provides complete agent workflow analysis that tracks token usage, timing, workflow-specific latency, throughput, and run times for individual agents and tools, enabling targeted performance improvements. Built on the toolkit’s evaluation harness, it includes Retrieval Augmented Generation (RAG)-specific evaluators (such as answer accuracy, context relevance, response groundedness, and agent trajectory) and supports custom evaluators for specialized use cases, enabling targeted performance optimization. The automated hyperparameter optimizer profiles and systematically discovers optimal settings for parameters such as temperature, top_p, and max_tokens, maximizing accuracy, groundedness, and context relevance while minimizing token usage and latency, and it can optimize for other custom metrics as well. This automated approach profiles your complete agent workflows, identifies bottlenecks, and uncovers optimal parameter combinations that manual tuning might miss. The toolkit’s intelligent GPU sizing calculator alleviates guesswork by simulating agent latency and concurrency scenarios and predicting precise GPU infrastructure requirements for production deployment.
The toolkit’s observability integration connects with popular monitoring services including Arize Phoenix, Weights & Biases Weave, Langfuse, and OpenTelemetry supported systems, like Amazon Bedrock AgentCore Observability, creating a continuous feedback loop for ongoing optimization and maintenance.
Real-world implementation
This example demonstrates a knowledge-based agent that retrieves and synthesizes information from web URLs to answer user queries. Built using Strands Agents with integrated NeMo Agent Toolkit, the solution is containerized for quick deployment in Amazon Bedrock AgentCore Runtime and takes advantage of Bedrock AgentCore services, such as AgentCore Observability. Additionally, developers have the flexibility to integrate with fully managed models in Amazon Bedrock, models hosted in Amazon SageMaker AI, containerized models in Amazon Elastic Kubernetes Service (Amazon EKS) or other model API endpoints. The overall architecture is designed for a streamlined workflow, moving from agent definition and optimization to containerization and scalable deployment.
The following architecture diagram illustrates an agent built with Strands Agents integrating NeMo Agent Toolkit deployed in Amazon Bedrock AgentCore.

Agent development and evaluation
Start by defining your agent and workflows in Strands Agents, then wrap it with NeMo Agent Toolkit to configure components such as a large language model (LLM) for inference and tools. Refer to the Strands Agents and NeMo Agent Toolkit integration example in GitHub for a detailed setup guide. After configuring your environment, validate your agent logic by running a single workflow from the command line with an example prompt:

nat run --config_file examples/frameworks/strands_demo/configs/config.yml --input "How do I use the Strands Agents API?"

The following is the truncated terminal output:

Workflow Result:
['The Strands Agents API is a flexible system for managing prompts, including both
system prompts and user messages. System prompts provide high-level instructions to
the model about its role, capabilities, and constraints, while user messages are your
queries or requests to the agent. The API supports multiple techniques for prompting,
including text prompts, multi-modal prompts, and direct tool calls. For guidance on
how to write safe and responsible prompts, please refer to the Safety & Security –
Prompt Engineering documentation.']

Instead of executing a single workflow and exiting, to simulate a real-world scenario, you can spin up a long-running API server capable of handling concurrent requests with the serve command:

nat serve --config_file examples/frameworks/strands_demo/configs/config.yml

The following is the truncated terminal output:

INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)

The agent is now running locally on port 8000. To interact with the agent, open a new terminal and execute the following cURL command. This will generate output similar to the previous nat run step, but the agent runs continuously as a persistent service rather than executing one time and exiting. This simulates the production environment where Amazon Bedrock AgentCore will run the agent as a containerized service:

curl -X 'POST' 'http://localhost:8000/generate' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"inputs" : "How do I use the Strands Agents API?"}'

The following is the truncated terminal output:

{“value”:”The Strands Agents API provides a flexible system for managing prompts,
including both system prompts and user messages. System prompts provide high-level
instructions to the model about its role, capabilities, and constraints, while user
messages are your queries or requests to the agent. The SDK supports multiple techniques
for prompting, including text prompts, multi-modal prompts, and direct tool calls.
For guidance on how to write safe and responsible prompts, please refer to the
Safety & Security – Prompt Engineering documentation.”}

Agent profiling and workflow performance monitoring
With the agent running, the next step is to establish a performance baseline. To illustrate the depth of insights available, in this example, we use a self-managed Llama 3.3 70B Instruct NIM on an Amazon Elastic Compute Cloud (Amazon EC2) P4de.24xlarge instance powered by NVIDIA A100 Tensor Core GPUs (8xA100 80 GB GPU) running on Amazon EKS. We use the nat eval command to evaluate the agent and generate the analysis:
nat eval --config_file examples/frameworks/strands_demo/configs/eval_config.yml
The following is the truncated terminal output:

Evaluating Trajectory: 100%|████████████████████████████████████████████████████████████████████| 10/10 [00:10<00:00, 1.00s/it]
2025-11-24 16:59:18 - INFO - nat.profiler.profile_runner:127 - Wrote combined data to: .tmp/nat/examples/frameworks/strands_demo/eval/all_requests_profiler_traces.json
2025-11-24 16:59:18 - INFO - nat.profiler.profile_runner:146 - Wrote merged standardized DataFrame to .tmp/nat/examples/frameworks/strands_demo/eval/standardized_data_all.csv
2025-11-24 16:59:18 - INFO - nat.profiler.profile_runner:200 - Wrote inference optimization results to: .tmp/nat/examples/frameworks/strands_demo/eval/inference_optimization.json
2025-11-24 16:59:28 - INFO - nat.profiler.profile_runner:224 - Nested stack analysis complete
2025-11-24 16:59:28 - INFO - nat.profiler.profile_runner:235 - Concurrency spike analysis complete
2025-11-24 16:59:28 - INFO - nat.profiler.profile_runner:264 - Wrote workflow profiling report to: .tmp/nat/examples/frameworks/strands_demo/eval/workflow_profiling_report.txt
2025-11-24 16:59:28 - INFO - nat.profiler.profile_runner:271 - Wrote workflow profiling metrics to: .tmp/nat/examples/frameworks/strands_demo/eval/workflow_profiling_metrics.json
2025-11-24 16:59:28 - INFO - nat.eval.evaluate:345 - Workflow output written to .tmp/nat/examples/frameworks/strands_demo/eval/workflow_output.json
2025-11-24 16:59:28 - INFO - nat.eval.evaluate:356 - Evaluation results written to .tmp/nat/examples/frameworks/strands_demo/eval/rag_relevance_output.json
2025-11-24 16:59:28 - INFO - nat.eval.evaluate:356 - Evaluation results written to .tmp/nat/examples/frameworks/strands_demo/eval/rag_groundedness_output.json
2025-11-24 16:59:28 - INFO - nat.eval.evaluate:356 - Evaluation results written to .tmp/nat/examples/frameworks/strands_demo/eval/rag_accuracy_output.json
2025-11-24 16:59:28 - INFO - nat.eval.evaluate:356 - Evaluation results written to .tmp/nat/examples/frameworks/strands_demo/eval/trajectory_accuracy_output.json
2025-11-24 16:59:28 - INFO - nat.eval.utils.output_uploader:62 - No S3 config provided; skipping upload.

The command generates detailed artifacts that include JSON files per evaluation metric (such as accuracy, groundedness, relevance, and trajectory accuracy) showing scores from 0–1, reasoning traces, retrieved contexts, and aggregated averages. Additional information in the generated artifacts includes workflow outputs, standardized tables, profile traces, and compact summaries for latency and token efficiency. This multi-metric sweep provides a holistic view of agent quality and behavior. The evaluation highlights that while the agent achieved consistent groundedness scores—meaning answers were reliably supported by sources—there is still an opportunity to improve retrieval relevance. The profile trace output contains workflow-specific latency, throughput, and runtime at 90%, 95%, and 99% confidence intervals. The command generates a Gantt chart of the agent flow and a nested stack analysis to pinpoint exactly where bottlenecks exist, as seen in the following figure. It also reports concurrency spikes and token efficiency so you can understand precisely how scaling impacts prompt and completion usage.

During the profiling, nat spawns eight concurrent agent workflows (shown in orange bars in the chart), which is the default concurrency configuration during evaluation. The p90 latency for the workflow shown is approximately 58.9 seconds. Crucially, the data confirmed that response generation was the primary bottleneck, with the longest LLM segments taking roughly 61.4 seconds. Meanwhile, non-LLM overhead remained minimal. HTTP requests averaged only 0.7–1.2 seconds, and knowledge base access was negligible. Using this level of granularity, you can now identify and optimize specific bottlenecks in the agent workflows.
Agent performance optimization
After profiling, refine the agent’s parameters to balance quality, performance, and cost. Manual tuning of LLM settings like temperature and top_p is often a game of guesswork. The NeMo Agent Toolkit turns this into a data-driven science. You can use the built-in optimizer to perform a systematic sweep across your parameter search space:
nat optimize --config_file examples/frameworks/strands_demo/configs/optimizer_config.yml
The following is the truncated terminal output:

Evaluating Trajectory: 100%|██████████████████████████████████████████████████████████████| 10/10 [00:10<00:00, 1.00it/s]
2025-10-31 16:50:41 - INFO - nat.profiler.profile_runner:127 - Wrote combined data to: ./tmp/nat/strands_demo/eval/all_requests_profiler_traces.json
2025-10-31 16:50:41 - INFO - nat.profiler.profile_runner:146 - Wrote merged standardized DataFrame to: ./tmp/nat/strands_demo/eval/standardized_data_all.csv
2025-10-31 16:50:41 - INFO - nat.profiler.profile_runner:208 - Wrote inference optimization results to: ./tmp/nat/strands_demo/eval/inference_optimization.json
2025-10-31 16:50:41 - INFO - nat.eval.evaluate:337 - Workflow output written to ./tmp/nat/strands_demo/eval/workflow_output.json
2025-10-31 16:50:41 - INFO - nat.eval.evaluate:348 - Evaluation results written to ./tmp/nat/strands_demo/eval/token_efficiency_output.json
2025-10-31 16:50:41 - INFO - nat.eval.evaluate:348 - Evaluation results written to ./tmp/nat/strands_demo/eval/llm_latency_output.json
2025-10-31 16:50:41 - INFO - nat.eval.evaluate:348 - Evaluation results written to ./tmp/nat/strands_demo/eval/rag_relevance_output.json
2025-10-31 16:50:41 - INFO - nat.eval.evaluate:348 - Evaluation results written to ./tmp/nat/strands_demo/eval/rag_groundedness_output.json
2025-10-31 16:50:41 - INFO - nat.eval.evaluate:348 - Evaluation results written to ./tmp/nat/strands_demo/eval/rag_accuracy_output.json
2025-10-31 16:50:41 - INFO - nat.eval.evaluate:348 - Evaluation results written to ./tmp/nat/strands_demo/eval/trajectory_accuracy_output.json
2025-10-31 16:50:41 - INFO - nat.eval.utils.output_uploader:61 - No S3 config provided; skipping upload.
Evaluating Regex-Ex_Accuracy: 100%|████████████████████████████████████████████████████████| 10/10 [00:21<00:00, 2.15s/it]
2025-10-31 16:50:44 - INFO - nat.profiler.profile_runner:127 - Wrote combined data to: ./tmp/nat/strands_demo/eval/all_requests_profiler_traces.json
2025-10-31 16:50:44 - INFO - nat.profiler.profile_runner:146 - Wrote merged standardized DataFrame to: ./tmp/nat/strands_demo/eval/standardized_data_all.csv
2025-10-31 16:50:45 - INFO - nat.profiler.profile_runner:208 - Wrote inference optimization results to: ./tmp/nat/strands_demo/eval/inference_optimization.json
2025-10-31 16:50:46 - INFO - nat.eval.evaluate:337 - Workflow output written to ./tmp/nat/strands_demo/eval/workflow_output.json
2025-10-31 16:50:47 - INFO - nat.eval.evaluate:348 - Evaluation results written to ./tmp/nat/strands_demo/eval/token_efficiency_output.json
2025-10-31 16:50:48 - INFO - nat.eval.evaluate:348 - Evaluation results written to ./tmp/nat/strands_demo/eval/llm_latency_output.json
2025-10-31 16:50:49 - INFO - nat.eval.evaluate:348 - Evaluation results written to ./tmp/nat/strands_demo/eval/rag_relevance_output.json
2025-10-31 16:50:50 - INFO - nat.eval.evaluate:348 - Evaluation results written to ./tmp/nat/strands_demo/eval/rag_groundedness_output.json
2025-10-31 16:50:51 - INFO - nat.eval.evaluate:348 - Evaluation results written to ./tmp/nat/strands_demo/eval/trajectory_accuracy_output.json
2025-10-31 16:50:52 - INFO - nat.eval.evaluate:348 - Evaluation results written to ./tmp/nat/strands_demo/eval/rag_accuracy_output.json
2025-10-31 16:50:53 - INFO - nat.eval.utils.output_uploader:61 - No S3 config provided; skipping upload.
[I 2025-10-31 16:50:53,361] Trial 19 finished with values: [0.6616666666666667, 1.0, 0.38000000000000007, 0.26800000000000006, 2.1433333333333333, 2578.222222222222] and parameters: {'llm_sim_llm.top_p': 0.8999999999999999, 'llm_sim_llm.temperature': 0.38000000000000006, 'llm_sim_llm.max_tokens': 5632}.
2025-10-31 16:50:53 - INFO - nat.profiler.parameter_optimization.parameter_optimizer:120 - Numeric optimization finished
2025-10-31 16:50:53 - INFO - nat.profiler.parameter_optimization.parameter_optimizer:162 - Generating Pareto front visualizations…
2025-10-31 16:50:53 - INFO - nat.profiler.parameter_optimization.pareto_visualizer:320 - Creating Pareto front visualizations…
2025-10-31 16:50:53 - INFO - nat.profiler.parameter_optimization.pareto_visualizer:330 - Total trials: 20
2025-10-31 16:50:53 - INFO - nat.profiler.parameter_optimization.pareto_visualizer:331 - Pareto optimal trials: 14
2025-10-31 16:50:54 - INFO - nat.profiler.parameter_optimization.pareto_visualizer:345 - Parallel coordinates plot saved to: ./tmp/nat/strands_demo/optimizer/plots/pareto_parallel_coordinates.png
2025-10-31 16:50:56 - INFO - nat.profiler.parameter_optimization.pareto_visualizer:374 - Pairwise matrix plot saved to: ./tmp/nat/strands_demo/optimizer/plots/pareto_pairwise_matrix.png
2025-10-31 16:50:56 - INFO - nat.profiler.parameter_optimization.pareto_visualizer:387 - Visualization complete!
2025-10-31 16:50:56 - INFO - nat.profiler.parameter_optimization.pareto_visualizer:389 - Plots saved to: ./tmp/nat/strands_demo/optimizer/plots
2025-10-31 16:50:56 - INFO - nat.profiler.parameter_optimization.parameter_optimizer:171 - Pareto visualizations saved to: ./tmp/nat/strands_demo/optimizer/plots
2025-10-31 16:50:56 - INFO - nat.profiler.parameter_optimization.optimizer_runtime:88 - All optimization phases complete.

This command launches an automated sweep across key LLM parameters, such as temperature, top_p, and max_tokens, as defined in the search space of the config file (in this case, optimizer_config.yml). The optimizer runs 20 trials with three repetitions each, using weighted evaluation metrics to automatically discover optimal model settings. It might take up to 15–20 minutes for the optimizer to run the 20 trials.
The toolkit evaluates each parameter set against a weighted multi-objective score, aiming to maximize quality (for example, accuracy, groundedness, or tool use) while minimizing token cost and latency. Upon completion, it generates detailed performance artifacts and summary tables so you can quickly identify and select the optimal configuration for production. The following is the hyperparameter optimizer configuration:

llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    temperature: 0.5
    top_p: 0.9
    max_tokens: 4096
    # Enable optimization for these parameters
    optimizable_params:
      - temperature
      - top_p
      - max_tokens
    # Define search spaces
    search_space:
      temperature:
        low: 0.1
        high: 0.7
        step: 0.2  # Tests: 0.1, 0.3, 0.5, 0.7
      top_p:
        low: 0.7
        high: 1.0
        step: 0.1  # Tests: 0.7, 0.8, 0.9, 1.0
      max_tokens:
        low: 4096
        high: 8192
        step: 512  # Tests: 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
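
To make the weighted multi-objective scoring concrete, here is a hedged sketch of how a trial's metrics could be folded into a single score. This is illustrative only, not the toolkit's internal implementation, and the metric values and weights are made up:

def weighted_score(metrics: dict, weights: dict) -> float:
    """Combine metrics into one score: quality metrics carry positive weights,
    cost metrics (latency, tokens) carry negative weights. Values are assumed
    to be normalized to the 0-1 range before scoring."""
    return sum(weights[name] * value for name, value in metrics.items())

# Illustrative trial results and weights
trial = {"accuracy": 0.82, "groundedness": 0.91, "relevance": 0.76,
         "latency": 0.40, "tokens": 0.35}
weights = {"accuracy": 1.0, "groundedness": 1.0, "relevance": 0.5,
           "latency": -0.5, "tokens": -0.5}
print(round(weighted_score(trial, weights), 3))  # higher is better across trials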

In this example, NeMo Agent Toolkit Optimize systematically evaluated parameter configurations and identified temperature ≈ 0.7, top_p ≈ 1.0, and max_tokens ≈ 6k (6144) as the optimal configuration, yielding the highest accuracy across 20 trials. This configuration delivered a 35% accuracy improvement over the baseline while simultaneously achieving 20% token efficiency gains compared to the 8192 max_tokens setting, maximizing both performance and cost efficiency for production deployment.
The optimizer plots pairwise Pareto curves, shown in the following pairwise matrix comparison charts, to analyze trade-offs between different parameters. The parallel coordinates plot, which follows the matrix comparison chart, shows optimal trials (red lines) achieving high quality scores (0.8–1.0) across accuracy, groundedness, and relevance while trading off some efficiency as token usage and latency drop to 0.6–0.8 on the normalized scale. The pairwise matrix confirms strong correlations between quality metrics and reveals actual token consumption clustered tightly around 2,500–3,100 tokens across all trials. These results indicate that further gains in accuracy and token efficiency might be possible through prompt engineering, something that development teams can pursue using NeMo Agent Toolkit’s prompt optimization capabilities, helping reduce costs while maximizing performance.
The following image shows the pairwise matrix comparison:

The following image shows the parallel coordinates plot:

Right-sizing production GPU infrastructure
After your agent is optimized and you’ve finalized the runtime or inference configuration, you can shift your focus to assessing your model deployment infrastructure. If you’re self-managing your model deployment on a fleet of EC2 GPU-powered instances, then one of the most difficult aspects of moving agents to production is predicting exactly what compute resources are necessary to support a target use case and concurrent users without overrunning the budget or causing timeouts. The NeMo Agent Toolkit GPU sizing calculator addresses this challenge by using your agent’s actual performance profile to determine the optimal cluster size for specific service level objectives (SLOs), enabling right-sizing that alleviates the trade-off between performance and cost. To generate a sizing profile, you run the sizing calculator across a range of concurrency levels (for example, 1–32 simultaneous users):

nat sizing calc --config_file examples/frameworks/strands_demo/configs/sizing_config.yml --calc_output_dir /tmp/strands_demo/sizing_calc_run1/ --concurrencies 1,2,4,8,12,20,24,28,32 --num_passes 2

Executing this on our reference EC2 P4de.24xlarge instance powered by NVIDIA A100 Tensor Core GPUs running on Amazon EKS for a Llama 3.3 70B Instruct NIM produced the following capacity analysis:

Per concurrency results:
Alerts!: W = Workflow interrupted, L = LLM latency outlier, R = Workflow runtime outlier
| Alerts | Concurrency | p95 LLM Latency | p95 WF Runtime | Total Runtime |
|--------|--------------|-----------------|----------------|---------------|
| | 1 | 11.8317 | 21.3647 | 33.2416 |
| | 2 | 19.3583 | 26.2694 | 36.931 |
| | 4 | 25.728 | 32.4711 | 61.13 |
| | 8 | 38.314 | 57.1838 | 89.8716 |
| | 12 | 55.1766 | 72.0581 | 130.691 |
| | 20 | 103.68 | 131.003 | 202.791 |
| !R | 24 | 135.785 | 189.656 | 221.721 |
| !R | 28 | 125.729 | 146.322 | 245.654 |
| | 32 | 169.057 | 233.785 | 293.562 |

As shown in the following chart, latency and end-to-end runtime scale almost linearly with concurrency, with p95 LLM latency and workflow runtime demonstrating near-perfect trend fits (R² ≈ 0.977/0.983). Each additional concurrent request introduces a predictable latency penalty, suggesting the system operates within a linear capacity zone where throughput can be optimized by adjusting latency tolerance.
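
As a quick check on where these trend numbers come from, the following sketch fits a line to the concurrency and p95 workflow runtime columns of the table above and reports the fit quality; the values are transcribed from that table, and your own runs will differ:

import numpy as np

# Concurrency levels and p95 workflow runtimes (seconds) from the capacity analysis above
concurrency = np.array([1, 2, 4, 8, 12, 20, 24, 28, 32])
p95_runtime = np.array([21.3647, 26.2694, 32.4711, 57.1838, 72.0581,
                        131.003, 189.656, 146.322, 233.785])

slope, intercept = np.polyfit(concurrency, p95_runtime, 1)
predicted = slope * concurrency + intercept
ss_res = np.sum((p95_runtime - predicted) ** 2)
ss_tot = np.sum((p95_runtime - np.mean(p95_runtime)) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.2f} s per concurrent request, intercept={intercept:.2f} s, R^2={r_squared:.3f}")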

With the sizing metrics captured, you can estimate the GPU cluster size for a specific concurrency and latency. For example, to support 25 concurrent users with a target workflow runtime of 50 seconds, you can run the calculator:

nat sizing calc --offline_mode --calc_output_dir /tmp/strands_demo/sizing_calc_run1/ --test_gpu_count 8 --target_workflow_runtime 50 --target_users 25

This workflow analyzes current performance metrics and generates a resource recommendation. In our example scenario, the tool calculates that to meet strict latency requirements for 25 simultaneous users, approximately 30 GPUs are required based on the following formula:

gpu_estimate = (target_users / calculated_concurrency) * test_gpu_count
calculated_concurrency = (target_time_metric - intercept) / slope
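
Plugging in the example targets (25 users, a 50-second workflow runtime target, and 8 test GPUs) reproduces the estimate. The slope and intercept below are illustrative values roughly consistent with the capacity table, so substitute the ones from your own fit:

# Illustrative fit parameters (seconds); replace with the slope/intercept from your sizing run
slope, intercept = 6.9, 4.7
test_gpu_count = 8
target_users = 25
target_workflow_runtime = 50.0  # seconds

calculated_concurrency = (target_workflow_runtime - intercept) / slope
gpu_estimate = (target_users / calculated_concurrency) * test_gpu_count
print(f"calculated_concurrency={calculated_concurrency:.2f}, gpu_estimate={gpu_estimate:.1f}")
# With these illustrative numbers the estimate lands near the roughly 30 GPUs reported for this scenario.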

The following is the output from the sizing estimation:

Targets: LLM Latency ≤ 0.0s, Workflow Runtime ≤ 50.0s, Users = 25
Test parameters: GPUs = 8
Per concurrency results:
Alerts!: W = Workflow interrupted, L = LLM latency outlier, R = Workflow runtime outlier
| Alerts | Concurrency | p95 LLM Latency | p95 WF Runtime | Total Runtime | GPUs (WF Runtime, Rough) |
|--------|-------------|-----------------|----------------|---------------|--------------------------|
| | 1 | 11.8317 | 21.3647 | 33.2416 | 85.4587 |
| | 2 | 19.3583 | 26.2694 | 36.931 | 52.5388 |
| | 4 | 25.728 | 32.4711 | 61.13 | 32.4711 |
| | 8 | 38.314 | 57.1838 | 89.8716 | |
| | 12 | 55.1766 | 72.0581 | 130.691 | |
| | 20 | 103.68 | 131.003 | 202.791 | |
| !R | 24 | 135.785 | 189.656 | 221.721 | |
| !R | 28 | 125.729 | 146.322 | 245.654 | |
| | 32 | 169.057 | 233.785 | 293.562 | |

=== GPU ESTIMATES ===
Estimated GPU count (Workflow Runtime): 30.5

Production agent deployment to Amazon Bedrock AgentCore
After evaluating, profiling, and optimizing your agent, deploy it to production. Although running the agent locally is sufficient for testing, enterprise deployment requires an agent runtime that helps provide security, scalability, and robust memory management without the overhead of managing infrastructure. This is where Amazon Bedrock AgentCore Runtime shines—providing enterprise-grade serverless agent runtime without the infrastructure overhead. Refer to the step-by-step deployment guide in the NeMo Agent Toolkit Repository. By packaging your optimized agent in a container and deploying it to the serverless Bedrock AgentCore Runtime, you elevate your prototype agent to a resilient application for long-running tasks and concurrent user requests. After you deploy the agent, visibility becomes critical. This integration creates a unified observability experience, transforming opaque black-box execution into deep visibility. You gain exact traces, spans, and latency breakdowns for every interaction in production, integrated into Bedrock AgentCore Observability using OpenTelemetry.
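As a rough sketch of what such a containerized entrypoint can look like, the following assumes the bedrock-agentcore Python SDK wrapping a Strands agent; the prompt and payload shape are illustrative, and the repository's deployment guide remains the reference:

from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands import Agent

app = BedrockAgentCoreApp()
agent = Agent(system_prompt="You are a knowledge-base assistant that answers from web sources.")

@app.entrypoint
def invoke(payload):
    """Handle an invocation request and return the agent's answer."""
    result = agent(payload.get("prompt", ""))  # the "prompt" key is an assumption for this sketch
    return {"result": str(result)}

if __name__ == "__main__":
    app.run()  # serves the AgentCore Runtime contract on port 8080 inside the container
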
The following screenshot shows the Amazon CloudWatch dashboard displaying Amazon Bedrock AgentCore traces and spans, visualizing the execution path and latency of the deployed Strands agent.

Amazon Bedrock AgentCore services extend well beyond agent runtime management and observability. Your deployed agents can seamlessly use additional Bedrock AgentCore services, including Amazon Bedrock AgentCore Identity for authentication and authorization, Amazon Bedrock AgentCore Gateway for tools access, Amazon Bedrock AgentCore Memory for context-awareness, Amazon Bedrock AgentCore Code Interpreter for secure code execution, and Amazon Bedrock AgentCore Browser for web interactions, to create enterprise-ready agents.
Conclusion
Production AI agents need performance visibility, optimization, and reliable infrastructure. For the example use case, this integration delivered on all three fronts: 20% token efficiency gains, 35% accuracy improvements, and performance-tuned GPU infrastructure calibrated for target concurrency. By combining Strands Agents for foundational agent development and orchestration, the NVIDIA NeMo Agent Toolkit for deep agent profiling, optimization, and right-sizing production GPU infrastructure, and Amazon Bedrock AgentCore for secure, scalable agent infrastructure, developers have an end-to-end solution that helps provide predictable outcomes. You can now build, evaluate, optimize, and deploy agents at scale on AWS with this integrated solution. To get started, check out the Strands Agents and NeMo Agent Toolkit integration example and the guide to deploying Strands Agents and NeMo Agent Toolkit to Amazon Bedrock AgentCore Runtime.

About the authors
Kosti Vasilakakis is a Principal PM at AWS on the Agentic AI team, where he has led the design and development of several Bedrock AgentCore services from the ground up, including Runtime, Browser, Code Interpreter, and Identity. He previously worked on Amazon SageMaker since its early days, launching AI/ML capabilities now used by thousands of companies worldwide. Earlier in his career, Kosti was a data scientist. Outside of work, he builds personal productivity automations, plays tennis, and enjoys life with his wife and kids.
Sagar Murthy is an agentic AI GTM leader at AWS, where he collaborates with frontier foundation model partners, agentic frameworks, startups, and enterprise customers to evangelize AI and data innovations, open-source solutions, and scale impactful partnerships. With collaboration experiences spanning data, cloud and AI, he brings a blend of technical solutions background and business outcomes focus to delight developers and customers.
Chris Smith is a Solutions Architect at AWS specializing in AI-powered automation and enterprise AI agent orchestration. With over a decade of experience architecting solutions at the intersection of generative AI, cloud computing, and systems integration, he helps organizations design and deploy agent systems that transform emerging technologies into measurable business outcomes. His work spans technical architecture, security-first implementation, and cross-functional team leadership.
Ranjit Rajan is a Senior Solutions Architect at NVIDIA, where he helps customers design and build solutions spanning generative AI, agentic AI, and accelerated multi-modal data processing pipelines for pre-training and fine-tuning foundation models.
Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open-source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.
Abhishek Sawarkar is a product manager on the NVIDIA AI Enterprise team working on agentic AI. He focuses on product strategy and the roadmap for integrating the agentic AI library into partner platforms and on enhancing the user experience of accelerated computing for AI agents.

Bi-directional streaming for real-time agent interactions now availabl …

Building natural voice conversations with AI agents requires complex infrastructure and lots of code from engineering teams. Text-based agent interactions follow a turn-based pattern: a user sends a complete request, waits for the agent to process it, and receives a full response before continuing. Bi-directional streaming removes this constraint by establishing a persistent connection that carries data in both directions simultaneously.
Amazon Bedrock AgentCore Runtime supports bi-directional streaming for real-time, two-way communication between users and AI agents. With this capability, agents can simultaneously listen to user input while generating responses, creating a more natural conversational flow. This is particularly well-suited for multimodal interactions, such as voice and vision agent conversations. The agent can begin responding while still receiving user input, handle mid-conversation interruptions, and adjust its responses based on real-time feedback.
A bi-directional voice chat agent can conduct spoken conversations with the fluidity of human dialogue so that users can interrupt, clarify, or change topics naturally. These agents process streaming audio input and output simultaneously while maintaining conversational state. Building this infrastructure requires managing persistent low-latency connections, handling concurrent audio streams, preserving context across exchanges, and scaling multiple conversations. Implementing these capabilities from scratch demands months of engineering effort and specialized real-time systems expertise. Amazon Bedrock AgentCore Runtime addresses these challenges by providing a secure, serverless, and purpose-built hosting environment for deploying and running AI agents, without requiring developers to build and maintain complex streaming infrastructure themselves.
In this post, you will learn about bi-directional streaming on AgentCore Runtime and the prerequisites to create a WebSocket implementation. You will also learn how to use Strands Agents to implement a bi-directional streaming solution for voice agents.
AgentCore Runtime bi-directional streaming
Bi-directional streaming uses the WebSocket protocol. WebSocket provides full-duplex communication over a single TCP connection, establishing a persistent channel where data flows continuously in both directions. This protocol has broad client support across browsers, mobile applications, and server environments, making it accessible for diverse implementation scenarios.
When a connection is established, the agent can receive user input as a stream while simultaneously sending response chunks back to the user. The AgentCore Runtime manages the underlying infrastructure that handles connections, message ordering, and conversational state across the bi-directional exchange. This alleviates the need for developers to build custom streaming infrastructure or manage the complexities of concurrent data flows.
Voice conversations differ from text-based interactions in their expectation of natural flow. When speaking with a voice agent, users expect the same conversational dynamics they experience with humans: the ability to interrupt when they need to correct themselves, to interject clarification mid-response, or to redirect the conversation without awkward pauses.
With bi-directional streaming, voice agents can process incoming audio while generating responses, detecting interruptions, and adjusting behavior in real time. The agent maintains conversational context throughout these interactions, preserving the thread of dialogue even as the conversation shifts direction. This capability helps transform voice agents from turn-based systems into responsive conversational partners.
Beyond voice conversations, bi-directional streaming enables several other interaction patterns. Interactive debugging sessions allow developers to guide agents through problem-solving in real time, providing feedback as the agent explores solutions. Collaborative agents can work alongside users on shared tasks, receiving continuous input as the work progresses rather than waiting for complete instructions. Multi-modal agents can process streaming video or sensor data while simultaneously providing analysis and recommendations. Async long-running agent operations can process tasks over minutes or hours while streaming incremental results to clients.
WebSocket implementation
To create a WebSocket implementation in AgentCore Runtime, you should follow a few patterns. First, your container must implement a WebSocket endpoint on port 8080 at the /ws path, which aligns with standard WebSocket server practices. This WebSocket endpoint enables a single agent container to serve both the traditional InvokeAgentRuntime API and the new InvokeAgentRuntimeWithWebsocketStream API. Additionally, you must provide a /ping endpoint for health checks.
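The following is a minimal sketch of that container contract using FastAPI and Uvicorn; a trivial echo handler stands in for the real audio or agent streaming logic:

import uvicorn
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.get("/ping")
async def ping():
    # Health check endpoint required alongside the WebSocket endpoint
    return {"status": "healthy"}

@app.websocket("/ws")
async def ws_endpoint(websocket: WebSocket):
    # Accept the connection, then stream messages in both directions
    await websocket.accept()
    try:
        while True:
            message = await websocket.receive_text()
            await websocket.send_text(f"echo: {message}")
    except WebSocketDisconnect:
        pass

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
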
Bi-directional streaming using WebSockets on AgentCore Runtime supports applications built with any WebSocket client library. The client must connect to the service endpoint with a WebSocket protocol connection:

wss://bedrock-agentcore.<region>.amazonaws.com/runtimes/<agentRuntimeArn>/ws

You also need to use one of the supported authentication methods (SigV4 headers, SigV4 pre-signed URL, or OAuth 2.0) and make sure that the agent application implements the WebSocket service contract as specified in the HTTP protocol contract.
Strands bi-directional agent: Simplified voice agent development
Amazon Nova Sonic unifies speech understanding and generation into a single model, delivering human-like conversational AI with low latency, leading accuracy, and strong price performance. Its integrated architecture provides expressive speech generation and real-time transcription in one model, dynamically adapting responses based on input speech prosody, pace, and timbre.
With bi-directional streaming now also available in AgentCore Runtime, you have several ways to host a voice agent: one is a direct implementation in which you manage WebSocket connections, parse protocol events, handle audio chunks, and orchestrate async tasks yourself; another is the Strands bi-directional agent implementation, which abstracts this complexity and implements these steps for you.
Example Implementation
In this post, refer to the Amazon Bedrock AgentCore bi-directional code, which implements bi-directional communication with Amazon Bedrock AgentCore. The repository has two implementations: one that uses the native Amazon Nova Sonic Python implementation deployed directly to AgentCore Runtime, and a high-level framework implementation that uses the Strands bi-directional agent for simplified real-time audio conversations.

The following diagram shows the native Amazon Nova Sonic Python WebSocket server deployed directly to AgentCore Runtime. It provides full control over the Nova Sonic protocol with direct event handling for complete visibility into session management, audio streaming, and response generation.

The Strands bi-directional agent framework for real-time audio conversations with Amazon Nova Sonic provides a high-level abstraction that simplifies bi-directional streaming, automatic session management, and tool integration. The code snippet below is an example of this simplification.

# app (a FastAPI instance), WebSocket, and the receive_and_convert input helper
# are assumed to be defined elsewhere in the repository example.
from strands.experimental.bidi.agent import BidiAgent
from strands.experimental.bidi.models.nova_sonic import BidiNovaSonicModel
from strands_tools import calculator

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket, model_name: str):
    # Define a Nova Sonic BidiModel
    model = BidiNovaSonicModel(
        region="us-east-1",
        model_id="amazon.nova-sonic-v1:0",
        provider_config={
            "audio": {
                "input_sample_rate": 16000,
                "output_sample_rate": 24000,
                "voice": "matthew",
            }
        }
    )
    # Create a Strands Agent with tools and system prompt
    agent = BidiAgent(
        model=model,
        tools=[calculator],
        system_prompt="You are a helpful assistant with access to a calculator tool.",
    )
    # Start streaming conversation
    await agent.run(inputs=[receive_and_convert], outputs=[websocket.send_json])

This implementation demonstrates the simplicity of Strands: instantiate a model, create an agent with tools and a system prompt, and run it with input/output streams. The framework handles protocol complexity internally.
The following is the agent declaration section in the code:

agent = BidiAgent(
    model=model,
    tools=[calculator, weather_api, database_query],
    system_prompt="You are a helpful assistant..."
)

Tools are passed directly to the agent’s constructor, and Strands handles function calling orchestration automatically. In summary, a native WebSocket implementation of the same functionality requires approximately 150 lines of code, whereas the Strands implementation reduces this to approximately 20 lines focused on business logic. Developers can focus on defining agent behavior, integrating tools, and crafting system prompts rather than managing WebSocket connections, parsing events, handling audio chunks, or orchestrating async tasks. This makes bi-directional streaming accessible to developers without specialized real-time systems expertise while maintaining full access to the audio conversation capabilities of Nova Sonic. The Strands bi-directional feature is currently supported only in the Python SDK. If you need more flexibility in your voice agent implementation, the native Amazon Nova Sonic approach lets you control every step of the process, which matters when your agent uses multiple different patterns of communication with the model. The framework approach, in contrast, lets the SDK manage dependencies and provides consistency across systems: the same Strands bi-directional agent code structure works with Nova Sonic, the OpenAI Realtime API, and Google Gemini Live, so developers simply swap the model implementation while keeping the rest of their code unchanged.
Conclusion
The bi-directional streaming capability of Amazon Bedrock AgentCore Runtime transforms how developers can build conversational AI agents. By providing WebSocket-based real-time communication infrastructure, AgentCore removes months of engineering effort required to implement streaming systems from scratch. The framework runtime enables developers to deploy multiple types of voice agents—from native protocol implementations using Amazon Nova Sonic to high-level frameworks like the Strands bi-directional agent—within the same secure, serverless environment.

About the authors
Lana Zhang is a Senior Specialist Solutions Architect for Generative AI at AWS within the Worldwide Specialist Organization. She specializes in AI/ML, with a focus on use cases such as AI voice assistants and multimodal understanding. She works closely with customers across diverse industries, including media and entertainment, gaming, sports, advertising, financial services, and healthcare, to help them transform their business solutions through AI.
Phelipe Fabres is a Senior Specialist Solutions Architect for Generative AI at AWS for Startups. He specializes in AI/ML with a focus on Agentic systems and the full process of training/inference. He has more than 10 years of working with software development, from monolith to event-driven architectures with a Ph.D. in Graph Theory.
Evandro Franco is a Sr. Data Scientist at Amazon Web Services. He is part of the Global GTM team that helps AWS customers overcome business challenges related to AI/ML on top of AWS, mainly on Amazon Bedrock AgentCore and Strands Agents. He has more than 18 years of experience working with technology, from software development, infrastructure, and serverless to machine learning. In his free time, Evandro enjoys playing with his son, mainly building some funny Lego bricks.

Meta AI Releases SAM Audio: A State-of-the-Art Unified Model that Uses …

Meta has released SAM Audio, a prompt-driven audio separation model that targets a common editing bottleneck: isolating one sound from a real-world mix without building a custom model per sound class. Meta released three main sizes: sam-audio-small, sam-audio-base, and sam-audio-large. The model is available to download and to try in the Segment Anything Playground.

Architecture

SAM Audio uses separate encoders for each conditioning signal: an audio encoder for the mixture, a text encoder for the natural language description, a span encoder for time anchors, and a visual encoder that consumes a visual prompt derived from video plus an object mask. The encoded streams are concatenated into time-aligned features and processed by a diffusion transformer that applies self-attention over the time-aligned representation and cross-attention to the textual features; a DACVAE decoder then reconstructs waveforms and emits two outputs, target audio and residual audio.

https://ai.meta.com/blog/sam-audio/

What does SAM Audio do, and what does ‘segment’ mean here?

SAM Audio takes an input recording that contains multiple overlapping sources, for example speech plus traffic plus music, and separates out a target source based on a prompt. In the public inference API, the model produces two outputs, result.target and result.residual. The research team describes target as the isolated sound, and residual as everything else.

That target plus residual interface maps directly to editor operations. If you want to remove a dog bark across a podcast track, you can treat the bark as the target, then subtract it by keeping only residual. If you want to extract a guitar part from a concert clip, you keep the target waveform instead. Meta uses these exact kinds of examples to explain what the model is meant to enable.
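
As a small illustration of that mapping, here is a sketch only; it assumes the separated target and residual waveforms are already available as PyTorch tensors of shape (channels, samples) from the inference API, along with the sample rate:

import torchaudio

def apply_edit(target, residual, sample_rate, mode="remove"):
    # "remove" keeps everything except the isolated sound; "extract" keeps only it
    edited = residual if mode == "remove" else target
    torchaudio.save(f"{mode}_output.wav", edited.cpu(), sample_rate)
    return edited

# Removing a dog bark keeps the residual; extracting a guitar stem keeps the target, e.g.:
# apply_edit(result.target, result.residual, sample_rate=44100, mode="remove")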

The 3 prompt types Meta is shipping

Meta positions SAM Audio as a single unified model that supports 3 prompt types, and it says these prompts can be used alone or combined.

Text prompting: You describe the sound in natural language, for example “dog barking” or “singing voice”, and the model separates that sound from the mixture. Meta lists text prompts as one of the core interaction modes, and the open source repo includes an end-to-end example using SAMAudioProcessor and model.separate.

Visual prompting: You click the person or object in a video and ask the model to isolate the audio associated with that visual object. The Meta team describes visual prompting as selecting the sounding object in the video. In the released code path, visual prompting is implemented by passing video frames plus masks into the processor via masked_videos.

Span prompting: The Meta team calls span prompting an industry first. You mark time segments where the target sound occurs, then the model uses those spans to guide separation. This matters for ambiguous cases, for example when the same instrument appears in multiple passages, or when a sound is present only briefly and you want to prevent the model from over-separating.

https://ai.meta.com/blog/sam-audio/

Results

The Meta team positions SAM Audio as achieving cutting-edge performance across diverse, real-world scenarios and frames it as a unified alternative to single-purpose audio tools. The team publishes a subjective evaluation table across categories (General, SFX, Speech, Speaker, Music, Instr(wild), Instr(pro)), with General scores of 3.62 for sam-audio-small, 3.28 for sam-audio-base, and 3.50 for sam-audio-large, and Instr(pro) scores reaching 4.49 for sam-audio-large.

Key Takeaways

SAM Audio is a unified audio separation model that segments sound from complex mixtures using text prompts, visual prompts, and time span prompts.

The core API produces two waveforms per request, target for the isolated sound and residual for everything else, which maps cleanly to common edit operations like remove noise, extract stem, or keep ambience.

Meta released multiple checkpoints and variants, including sam-audio-small, sam-audio-base, and sam-audio-large, plus tv variants that the repo says perform better for visual prompting. The repo also publishes a subjective evaluation table by category.

The release includes tooling beyond inference: Meta provides a sam-audio-judge model that scores separation results against a text description with overall quality, recall, precision, and faithfulness.

Check out the Technical details and GitHub Page.
The post Meta AI Releases SAM Audio: A State-of-the-Art Unified Model that Uses Intuitive and Multimodal Prompts for Audio Separation appeared first on MarkTechPost.

How to Orchestrate a Fully Autonomous Multi-Agent Research and Writing …

In this tutorial, we show how to build a small but powerful two-agent CrewAI system that collaborates using the Gemini Flash model. We set up our environment, authenticate securely, define specialized agents, and orchestrate tasks that flow from research to structured writing. As we run the crew, we observe how each component works together in real time, giving us a hands-on understanding of modern agentic workflows powered by LLMs. With these steps, we clearly see how multi-agent pipelines become practical, modular, and developer-friendly. Check out the FULL CODES HERE.

import os
import sys
import getpass
from textwrap import dedent

print("Installing CrewAI and tools... (this may take 1-2 mins)")
!pip install -q crewai crewai-tools

from crewai import Agent, Task, Crew, Process, LLM

We set up our environment and install the required CrewAI packages so we can run everything smoothly in Colab. We import the necessary modules and lay the foundation for our multi-agent workflow. This step ensures that our runtime is clean and ready for the agents we create next.

print("\n--- API Authentication ---")
api_key = None

try:
    from google.colab import userdata
    api_key = userdata.get('GEMINI_API_KEY')
    print("Found GEMINI_API_KEY in Colab Secrets.")
except Exception:
    pass

if not api_key:
    print("Key not found in Secrets.")
    api_key = getpass.getpass("Enter your Google Gemini API Key: ")

os.environ["GEMINI_API_KEY"] = api_key

if not api_key:
    sys.exit("Error: No API Key provided. Please restart and enter a key.")

We authenticate ourselves securely by retrieving or entering the Gemini API key. We ensure the key is securely stored in the environment so the model can operate without interruption. This step gives us confidence that our agent framework can communicate reliably with the LLM.

gemini_flash = LLM(
    model="gemini/gemini-2.0-flash",
    temperature=0.7
)

We configure the Gemini Flash model that our agents rely on for reasoning and generation. We choose the temperature and model variant to balance creativity and precision. This configuration becomes the shared intelligence that drives all agent tasks ahead.

researcher = Agent(
    role='Tech Researcher',
    goal='Uncover cutting-edge developments in AI Agents',
    backstory=dedent("""You are a veteran tech analyst with a knack for finding emerging trends before they become mainstream. You specialize in Autonomous AI Agents and Large Language Models."""),
    verbose=True,
    allow_delegation=False,
    llm=gemini_flash
)

writer = Agent(
    role='Technical Writer',
    goal="Write a concise, engaging blog post about the researcher's findings",
    backstory=dedent("""You transform complex technical concepts into compelling narratives. You write for a developer audience who wants practical insights without fluff."""),
    verbose=True,
    allow_delegation=False,
    llm=gemini_flash
)

We define two specialized agents, a researcher and a writer, each with a clear role and backstory. We design them so they complement one another, allowing one to discover insights while the other transforms them into polished writing. Here, we begin to see how multi-agent collaboration takes shape.

research_task = Task(
    description=dedent("""Conduct a simulated research analysis on 'The Future of Agentic AI in 2025'. Identify three key trends: 1. Multi-Agent Orchestration 2. Neuro-symbolic AI 3. On-device Agent execution Provide a summary for each based on your 'expert knowledge'."""),
    expected_output="A structured list of 3 key AI trends with brief descriptions.",
    agent=researcher
)

write_task = Task(
    description=dedent("""Using the researcher's findings, write a short blog post (approx 200 words). The post should have: - A catchy title - An intro - The three bullet points - A conclusion on why developers should care."""),
    expected_output="A markdown-formatted blog post.",
    agent=writer,
    context=[research_task]
)

We create two tasks that assign specific responsibilities to our agents. We let the researcher generate structured insights and then pass the output to the writer to create a complete blog post. This step shows how we orchestrate sequential task dependencies cleanly within CrewAI.

tech_crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential,
    verbose=True
)

print("\n--- Starting the Crew ---")
result = tech_crew.kickoff()

from IPython.display import Markdown, display
print("\n\n########################")
print("## FINAL OUTPUT ##")
print("########################\n")
display(Markdown(str(result)))

We assemble the agents and tasks into a crew and run the entire multi-agent workflow. We watch how the system executes step by step, producing the final markdown output. This is where everything comes together, and we see our agents collaborating in real time.

In conclusion, we appreciate how seamlessly CrewAI allows us to create coordinated agent systems that think, research, and write together. We experience firsthand how defining roles, tasks, and process flows lets us modularize complex work and achieve coherent outputs with minimal code. This framework empowers us to build richer, more autonomous agentic applications, and we walk away confident in extending this foundation into larger multi-agent systems, production pipelines, or more creative AI collaborations.

Check out the FULL CODES HERE.
The post How to Orchestrate a Fully Autonomous Multi-Agent Research and Writing Pipeline Using CrewAI and Gemini for Real-Time Intelligent Collaboration appeared first on MarkTechPost.