Amazon SageMaker AI introduces EAGLE based adaptive speculative decoding

Generative AI models continue to expand in scale and capability, increasing the demand for faster and more efficient inference. Applications need low latency and consistent performance without compromising output quality. Amazon SageMaker AI introduces new enhancements to its inference optimization toolkit that bring EAGLE based adaptive speculative decoding to more model architectures. These updates make it easier to accelerate decoding, optimize performance using your own data and deploy higher-throughput models using the familiar SageMaker AI workflow.
EAGLE, short for Extrapolation Algorithm for Greater Language-model Efficiency, is a technique that speeds up large language model decoding by predicting future tokens directly from the hidden layers of the model. When you guide optimization using your own application data, the improvements align with the actual patterns and domains you serve, producing faster inference that reflects your real workloads rather than generic benchmarks. Based on the model architecture, SageMaker AI trains EAGLE 3 or EAGLE 2 heads.
This training and optimization is not limited to a one-time operation. You can start with the datasets provided by SageMaker for the initial training, and as you continue to gather your own data you can fine-tune with a curated dataset for highly adaptive, workload-specific performance. For example, you can use a tool such as Data Capture to build a dataset over time from the real-time requests hitting your hosted model. Training can be iterative, with multiple cycles that continuously improve performance.
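As a concrete example of that capture step, the following minimal Boto3 sketch enables Data Capture on an endpoint configuration so request and response payloads accumulate in S3 and can later be curated into a fine-tuning dataset. The configuration name, model name, instance type, and S3 destination are placeholders, not values from this post.

import boto3

sm = boto3.client("sagemaker")

# Minimal sketch: enable Data Capture so request/response pairs from your
# hosted LLM land in S3 and can later be curated into an EAGLE training set.
# All names and the S3 URI below are placeholders.
sm.create_endpoint_config(
    EndpointConfigName="my-llm-config-with-capture",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "<your-model-name>",
            "InstanceType": "ml.p5.48xlarge",
            "InitialInstanceCount": 1,
        }
    ],
    DataCaptureConfig={
        "EnableCapture": True,
        "InitialSamplingPercentage": 100,
        "DestinationS3Uri": "s3://<bucket>/llm-data-capture/",
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
    },
)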
In this post, we explain how to use EAGLE 2 and EAGLE 3 speculative decoding in Amazon SageMaker AI.
Solution overview
SageMaker AI now offers native support for both EAGLE 2 and EAGLE 3 speculative decoding, enabling each model architecture to apply the technique that best matches its internal design. For your base LLM, you can utilize either SageMaker JumpStart models or bring your own model artifacts to S3 from other model hubs, such as HuggingFace.
Speculative decoding is a widely employed technique for accelerating inference in LLMs without compromising quality. This method involves using a smaller draft model to generate preliminary tokens, which are then verified by the target LLM. The extent of the speedup achieved through speculative decoding is heavily dependent on the selection of the draft model.

The sequential nature of modern LLMs makes them expensive and slow, and speculative decoding has proven to be an effective solution to this problem. Methods like EAGLE improve upon this by reusing features from the target model, leading to better results. However, a current trend in the LLM community is to increase training data to boost model intelligence without adding inference costs. Unfortunately, this approach has limited benefits for EAGLE. This limitation is due to EAGLE’s constraints on feature prediction. To address this, EAGLE-3 is introduced, which predicts tokens directly instead of features and combines features from multiple layers using a technique called training-time testing. These changes significantly improve performance and allow the model to fully benefit from increased training data.

To give customers maximum flexibility, SageMaker supports every major workflow for building or refining an EAGLE model. You can train an EAGLE model entirely from scratch using the SageMaker curated open dataset, or train it from scratch with your own data to align speculative behavior with your traffic patterns. You can also start from an existing EAGLE base model: either retraining it with the default open dataset for a fast, high-quality baseline, or fine-tuning that base model with your own dataset for highly adaptive, workload-specific performance. In addition, SageMaker JumpStart provides fully pre-trained EAGLE models so you can begin optimizing immediately without preparing any artifacts.
The solution spans six supported architectures and includes a pre-trained, pre-cached EAGLE base to accelerate experimentation. SageMaker AI also supports widely used training data formats, specifically ShareGPT and OpenAI chat and completions, so existing corpora can be used directly. Customers can also provide data captured from their own SageMaker AI endpoints, provided it is in one of the formats specified above. Whether you rely on the SageMaker open dataset or bring your own, optimization jobs typically deliver around a 2.5x throughput improvement over standard decoding while adapting naturally to the nuances of your specific use case.
All optimization jobs automatically produce benchmark results giving you clear visibility into latency and throughput improvements. You can run the entire workflow using SageMaker Studio or the AWS CLI and you deploy the optimized model through the same interface you already use for standard SageMaker AI inference.
SageMaker AI currently supports LlamaForCausalLM, Qwen3ForCausalLM, Qwen3MoeForCausalLM, Qwen2ForCausalLM and GptOssForCausalLM with EAGLE 3, and Qwen3NextForCausalLM with EAGLE 2. You can use one optimization pipeline across a mix of architectures while still gaining the benefits of model-specific behavior.
How EAGLE works inside the model
Speculative decoding can be thought of as a seasoned chief scientist guiding the flow of discovery. In traditional setups, a smaller “assistant” model runs ahead, quickly sketching out several possible token continuations, while the larger model examines and corrects those suggestions. This pairing reduces the number of slow, sequential steps by verifying multiple drafts at once.
EAGLE streamlines this process even further. Instead of depending on an external assistant, the model effectively becomes its own lab partner: it inspects its internal hidden-layer representations to anticipate several future tokens in parallel. Because these predictions arise from the model’s own learned structure, they tend to be more accurate upfront, leading to deeper speculative steps, fewer rejections, and smoother throughput.
By removing the overhead of coordinating a secondary model and enabling highly parallel verification, this approach alleviates memory bandwidth bottlenecks and delivers notable speedups, often around 2.5x, while maintaining the same output quality the baseline model would produce.
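To make the draft-and-verify loop concrete, here is a toy, self-contained Python sketch of the control flow. The draft and verification functions are stand-ins for the EAGLE head and the base model, not SageMaker or EAGLE APIs.

import random

# Toy stand-ins for the draft head and the target model. In EAGLE the draft
# predictions come from the target model's own hidden layers; here both are
# faked with simple samplers purely to illustrate the accept/verify loop.
VOCAB = ["the", "cat", "sat", "on", "the", "mat"]

def draft_tokens(context, k):
    # Cheap draft step: propose k candidate continuations.
    return [random.choice(VOCAB) for _ in range(k)]

def target_accepts(context, token):
    # A real verifier compares draft and target probabilities in one batched
    # forward pass; this toy version accepts a token 70 percent of the time.
    return random.random() < 0.7

def speculative_step(context, k=4):
    """Draft k tokens cheaply, keep the longest prefix the target accepts."""
    accepted = []
    for token in draft_tokens(context, k):
        if target_accepts(context + accepted, token):
            accepted.append(token)
        else:
            break  # the first rejection ends this speculative run
    # A real implementation also samples one corrected token from the target
    # here, so every step emits at least one new token.
    return accepted

print(speculative_step(["the"]))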
Running optimization jobs from the SDK or CLI
You can interface with the optimization toolkit using the AWS Boto3 Python SDK or the Studio UI. In this section we use the AWS CLI; the same API calls map directly to the Boto3 SDK. The core API calls for endpoint creation remain the same: create_model, create_endpoint_config, and create_endpoint. The workflow we showcase here begins with model registration using the create_model API call, where you specify your serving container and stack. Alternatively, you can skip creating a SageMaker Model object and specify the model data directly in the optimization job API call.
For the EAGLE heads optimization, we specify the model data in the ModelDataSource parameter; specifying a Hugging Face Hub model ID is not currently supported. Pull your artifacts, upload them to an S3 bucket, and reference that location in the ModelDataSource parameter. By default, checks verify that the appropriate files are uploaded, so you need the standard model data expected for LLMs:

# traditional model data needed
model/
config.json
tokenizer.json
tokenizer_config.json
special_tokens_map.json
generation_config.json
vocab.json
model.safetensors
model.safetensors.index.json
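As one way to stage these artifacts, the sketch below downloads a model from the Hugging Face Hub and syncs it to S3. The model ID, bucket, and prefix are placeholders, and the AWS CLI is assumed to be installed and configured.

import subprocess
from huggingface_hub import snapshot_download

# Download the base model artifacts locally (placeholder model ID).
local_dir = snapshot_download(repo_id="Qwen/Qwen3-32B", local_dir="./qwen3-32b")

# Sync the artifacts to S3 so they can be referenced in ModelDataSource
# (assumes the AWS CLI is installed and configured).
subprocess.run(
    ["aws", "s3", "sync", local_dir, "s3://<your-bucket>/models/qwen3-32b/"],
    check=True,
)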

Let’s look at a few paths here:

Using your own model data with your own EAGLE curated dataset
Bringing your own trained EAGLE that you may want to train more
Bring your own model data and use SageMaker AI built-in datasets

1. Using your own model data with your own EAGLE curated dataset
We can start an optimization job with the create-optimization-job API call. Here is an example with a Qwen3 32B model. Note that you can bring your own data or use the built-in SageMaker provided datasets. First, we create a SageMaker Model object that specifies the S3 bucket with our model artifacts:

aws sagemaker create-model \
  --region us-west-2 \
  --model-name <target-model-name> \
  --primary-container '{ "Image": "763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}",
    "ModelDataSource": { "S3DataSource": { "S3Uri": "Enter model path",
    "S3DataType": "S3Prefix", "CompressionType": "None" } } }' \
  --execution-role-arn "Enter Execution Role ARN"

Our optimization call then pulls down these model artifacts when you specify the SageMaker Model and a TrainingDataSource parameter, as follows:

aws sagemaker create-optimization-job \
  --region us-west-2 \
  --optimization-job-name <job-name> \
  --account-id <account-id> \
  --deployment-instance-type ml.p5.48xlarge \
  --max-instance-count 10 \
  --model-source '{
    "SageMakerModel": { "ModelName": "Created Model name" }
  }' \
  --optimization-configs '{
    "ModelSpeculativeDecodingConfig": {
      "Technique": "EAGLE",
      "TrainingDataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": "Enter custom train data location"
      }
    }
  }' \
  --output-config '{
    "S3OutputLocation": "Enter optimization output location"
  }' \
  --stopping-condition '{"MaxRuntimeInSeconds": 432000}' \
  --role-arn "Enter Execution Role ARN"
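The same job can be launched with the Boto3 SDK via create_optimization_job. The sketch below assumes the Boto3 request shape mirrors the CLI options above, so treat the parameter names as illustrative rather than definitive; all bracketed values are placeholders.

import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

# Sketch of the same optimization job via Boto3, mirroring the CLI options
# shown above. Parameter shapes are assumed, not confirmed.
sm.create_optimization_job(
    OptimizationJobName="<job-name>",
    RoleArn="<execution-role-arn>",
    DeploymentInstanceType="ml.p5.48xlarge",
    ModelSource={"SageMakerModel": {"ModelName": "<created-model-name>"}},
    OptimizationConfigs={
        "ModelSpeculativeDecodingConfig": {
            "Technique": "EAGLE",
            "TrainingDataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://<bucket>/<custom-train-data>/",
            },
        }
    },
    OutputConfig={"S3OutputLocation": "s3://<bucket>/<optimization-output>/"},
    StoppingCondition={"MaxRuntimeInSeconds": 432000},
)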

2. Bringing your own trained EAGLE that you may want to train more
If you have already trained an EAGLE model, you can specify an additional parameter in the create_model API call that points to your EAGLE artifacts. Optionally, you can also specify a SageMaker JumpStart model ID to pull down the packaged model artifacts.

# Enable additional model data source with EAGLE artifacts
aws sagemaker create-model \
  --region us-west-2 \
  --model-name <target-model-name> \
  --primary-container '{ "Image": "763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}",
    "ModelDataSource": { "S3DataSource": { "S3Uri": "<model path>",
    "S3DataType": "S3Prefix", "CompressionType": "None" } },
    "AdditionalModelDataSources": [ { "ChannelName": "eagle_model",
    "S3DataSource": { "S3Uri": "<pre-trained EAGLE path>",
    "S3DataType": "S3Prefix", "CompressionType": "None" } } ] }' \
  --execution-role-arn "Enter Execution Role ARN"

Similarly, the optimization job then inherits this model object along with the necessary model data:

aws sagemaker create-optimization-job \
  --region us-west-2 \
  --account-id <account-id> \
  --optimization-job-name <job-name> \
  --deployment-instance-type ml.p5.48xlarge \
  --max-instance-count 10 \
  --model-source '{
    "SageMakerModel": {
      "ModelName": "Created Model Name"
    }
  }' \
  --optimization-configs '{
    "ModelSpeculativeDecodingConfig": {
      "Technique": "EAGLE",
      "TrainingDataSource": {
        "S3Uri": "Enter training data path",
        "S3DataType": "S3Prefix"
      }
    }
  }' \
  --output-config '{
    "SageMakerModel": {
      "ModelName": "Model Name"
    },
    "S3OutputLocation": "Enter output data location"
  }' \
  --stopping-condition '{"MaxRuntimeInSeconds": 432000}' \
  --role-arn "Enter Execution Role ARN"

3. Bring your own model data and use SageMaker built-in datasets
Optionally, we can utilize the SageMaker provided datasets:

# SageMaker Provided Optimization Datasets
gsm8k_training.jsonl (https://huggingface.co/datasets/openai/gsm8k)
magicoder.jsonl (https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K)
opencodeinstruct.jsonl (https://huggingface.co/datasets/nvidia/OpenCodeInstruct)
swebench_oracle_train.jsonl (https://huggingface.co/datasets/nvidia/OpenCodeInstruct)
ultrachat_0_8k_515292.jsonl (https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)

After completion, SageMaker AI stores evaluation metrics in S3 and records the optimization lineage in Studio. You can deploy the optimized model to an inference endpoint with either the create_endpoint API call or in the UI.
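For a code-based deployment, a minimal Boto3 sketch follows; it assumes you have a SageMaker Model object for the optimized artifacts, and the config name, endpoint name, model name, and instance type are placeholders.

import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

# Minimal sketch: stand up an endpoint for the optimized model.
sm.create_endpoint_config(
    EndpointConfigName="eagle-optimized-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "<optimized-model-name>",
            "InstanceType": "ml.p5.48xlarge",
            "InitialInstanceCount": 1,
        }
    ],
)

sm.create_endpoint(
    EndpointName="eagle-optimized-endpoint",
    EndpointConfigName="eagle-optimized-config",
)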
Benchmarks
To benchmark this further, we compared three configurations:

No EAGLE: Base model without EAGLE as a baseline
Base EAGLE: EAGLE training using built-in datasets provided by SageMaker AI
Trained EAGLE: EAGLE training using built-in datasets provided by SageMaker AI, followed by retraining with our own custom dataset

The numbers displayed below are for Qwen3 32B across metrics such as Time to First Token (TTFT) and overall throughput.

| Configuration | Concurrency | TTFT (ms) | TPOT (ms) | ITL (ms) | Request Throughput | Output Throughput (tokens/sec) | OTPS per request (tokens/sec) |
|---|---|---|---|---|---|---|---|
| No EAGLE | 4 | 168.04 | 45.95 | 45.95 | 0.04 | 86.76 | 21.76 |
| No EAGLE | 8 | 219.53 | 51.02 | 51.01 | 0.08 | 156.46 | 19.6 |
| Base EAGLE | 1 | 89.76 | 21.71 | 53.01 | 0.02 | 45.87 | 46.07 |
| Base EAGLE | 2 | 132.15 | 20.78 | 50.75 | 0.05 | 95.73 | 48.13 |
| Base EAGLE | 4 | 133.06 | 20.11 | 49.06 | 0.1 | 196.67 | 49.73 |
| Base EAGLE | 8 | 154.44 | 20.58 | 50.15 | 0.19 | 381.86 | 48.59 |
| Trained EAGLE | 1 | 83.6 | 17.32 | 46.37 | 0.03 | 57.63 | 57.73 |
| Trained EAGLE | 2 | 129.07 | 18 | 48.38 | 0.05 | 110.86 | 55.55 |
| Trained EAGLE | 4 | 133.11 | 18.46 | 49.43 | 0.1 | 214.27 | 54.16 |
| Trained EAGLE | 8 | 151.19 | 19.15 | 51.5 | 0.2 | 412.25 | 52.22 |

Pricing considerations
Optimization jobs run on SageMaker AI training instances; you are billed based on the instance type and job duration. Deployment of the resulting optimized model uses standard SageMaker AI inference pricing.
Conclusion
EAGLE based adaptive speculative decoding gives you a faster and more effective path to improve generative AI inference performance on Amazon SageMaker AI. By working inside the model rather than relying on a separate draft network, EAGLE accelerates decoding, increases throughput and maintains generation quality. When you optimize using your own dataset, the improvements reflect the unique behavior of your applications, resulting in better end-to-end performance. With built-in dataset support, benchmark automation and streamlined deployment, the inference optimization toolkit helps you deliver low-latency generative applications at scale.

About the authors
Kareem Syed-Mohammed is a Product Manager at AWS. He focuses on enabling generative AI model development and governance on SageMaker HyperPod. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads for Expedia, and as a management consultant at McKinsey.
Xu Deng is a Software Engineer Manager with the SageMaker team. He focuses on helping customers build and optimize their AI/ML inference experience on Amazon SageMaker. In his spare time, he loves traveling and snowboarding.
Ram Vegiraju is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on SageMaker. In his spare time, he loves traveling and writing.
Vinay Arora is a Specialist Solutions Architect for Generative AI at AWS, where he collaborates with customers to design cutting-edge AI solutions leveraging AWS technologies. Prior to AWS, Vinay spent over two decades in finance, including roles at banks and hedge funds, where he built risk models, trading systems, and market data platforms. Vinay holds a master’s degree in computer science and business management.
Siddharth Shah is a Principal Engineer at AWS SageMaker, specializing in large-scale model hosting and optimization for Large Language Models. He previously worked on the launch of Amazon Textract, performance improvements in the model-hosting platform, and expedited retrieval systems for Amazon S3 Glacier. Outside of work, he enjoys hiking, video games, and hobby robotics.
Andy Peng is a builder with curiosity, motivated by scientific research and product innovation. He helped build key initiatives that span AWS SageMaker and Bedrock, Amazon S3, AWS App Runner, AWS Fargate, Alexa Health & Wellness, and AWS Payments, from 0-1 incubation to 10x scaling. Open-source enthusiast.
Johna Liu is a Software Development Engineer on the Amazon SageMaker team, where she builds and explores AI/LLM-powered tools that enhance efficiency and enable new capabilities. Outside of work, she enjoys tennis, basketball and baseball.
Anisha Kolla is a Software Development Engineer on the SageMaker Inference team with more than 10 years of industry experience. She is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. Anisha thrives on tackling complex technical challenges and contributing to innovative AI capabilities. Outside of work, she enjoys exploring new Seattle restaurants, traveling, and spending time with family and friends.

Train custom computer vision defect detection model using Amazon SageMaker AI

On October 10, 2024, Amazon announced the discontinuation of the Amazon Lookout for Vision service, with a scheduled shut down date of October 31, 2025 (see Exploring alternatives and seamlessly migrating data from Amazon Lookout for Vision blog post). As part of our transition guidance for customers, we recommend the use of Amazon SageMaker AI tools to build applications for customers who are interested in AI/ML computer vision models for automated quality inspection use cases. To support that effort, AWS has made a pre-trained computer vision defect detection model available on AWS Marketplace that can be fine-tuned using Amazon SageMaker AI for a customer’s specific use case. If run in the cloud, this model only requires paying for infrastructure costs for training or inference. This approach provides the tools to accelerate solution development while facilitating complete flexibility to build a solution that integrates with any existing hardware and software infrastructure.
In this blog post, you will learn how to migrate your computer vision workloads from Amazon Lookout for Vision to Amazon SageMaker AI by following our step-by-step guidance.
AWS is sharing the main underlying models used for the service to end users in the AWS Marketplace. You can use the two main types of models, binary classification and semantic segmentation, when you train in your own AWS accounts for deployment on AWS or at the edge.
This model helps customers continue to use AWS defect detection technology at their own pace with greater flexibility. For example, you can train your models with larger instance types for faster training times. With access to set hyperparameters, you can also adjust model behavior in ways that were not previously available on the AWS console. For example, you can set the multi-head model for semantic segmentation to disable the binary classifier head. This can make the model more tolerant of changing background and lighting conditions. You can also personalize the maximum training time, which was fixed at a non-changeable 24-hour limit on Amazon Lookout for Vision (L4V).
The GitHub repository for Amazon Lookout for Vision has been updated with a Jupyter Notebook to help you train datasets with these two model types and package them up. From there you can deploy the models by using a SageMaker endpoint, or edge devices.
To label the images beyond the sample data, you can use Amazon SageMaker Ground Truth to enable crowdsourcing or allow private teams to label the data, or use a partner solution such as Edge Impulse, Roboflow, or SuperbAI to do so. When you have the manifest file of the labeled data, the marketplace models can be used for training. You will lose a thumbnail-based dataset management tool like the Amazon Lookout for Vision console, so consider one of the previously mentioned partner solutions to help manage datasets. You can also export your existing data from the Lookout For Vision service using this guide.
Prerequisites
Before you begin, make sure you have the following components and permissions in place:

Amazon SageMaker Studio or Amazon SageMaker Unified Studio for integrated development environment (IDE)
AWS Identity and Access Management (IAM) role with these permissions to follow the principle of least privilege

Amazon S3

s3:GetObject
s3:PutObject
s3:DeleteObject
s3:ListBucket

SageMaker

sagemaker:CreateTrainingJob
sagemaker:CreateModel
sagemaker:CreateEndpoint
sagemaker:CreateEndpointConfig
sagemaker:CreateTransformJob
sagemaker:DescribeTrainingJob
sagemaker:DescribeModel
sagemaker:DescribeEndpoint
sagemaker:DescribeEndpointConfig
sagemaker:DescribeTransformJob
sagemaker:InvokeEndpoint
sagemaker:DeleteEndpoint
sagemaker:DeleteEndpointConfig
sagemaker:DeleteModel

Model subscription:

An AWS account with a subscription to Computer Vision Defect Detection Model or
An IAM role with these three permissions to make AWS Marketplace subscriptions in the AWS account you use:

aws-marketplace:ViewSubscriptions
aws-marketplace:Unsubscribe
aws-marketplace:Subscribe

Labeled data (you can use the cookie data sample in Github) or label your own data with SageMaker Ground Truth or an AWS Partner tool
Basic knowledge of creating a SageMaker notebook instance and running Jupyter notebook

Architecture overview
The following diagram illustrates the end-to-end flow, from image acquisition to inferencing at the edge. This blog focuses on steps 2 and 3.

Use an edge application to configure cameras or sensors and capture training images.
Use SageMaker GroundTruth or AWS Partner platforms to export and label images.
Use Amazon SageMaker AI for model training.
Use REST, PLC, or digital input for image acquisition and processing.
Run real-time inference using the trained and deployed model.
Publish inference results to analytics and monitoring for alerts and analytics.
Perform automated action on the machine of concern or notify plant personnel of anomalies from inspection station component using OPC-UA or digital output.
Line operators and plant managers receive notifications for action.

Set up the labeling process
This section covers the steps to set up the labeling process using Amazon SageMaker Ground Truth, including creating a private labeling team and configuring the labeling job.

Configure Amazon SageMaker Ground Truth private team:

Select Amazon SageMaker AI, Ground Truth, Labeling workforces.
Select Private, then Create Private Team.
Enter a team name.
Leave other values as their defaults.
Select Create a new Amazon Cognito user group.
Select Create private Team.

On the Workers tab, select Invite New Workers.
Enter your team members’ email addresses to send sign-up invitations.

Label the dataset
After successfully completing the workforce setup for labelling, the next step is to label the dataset. This section explains how to prepare the dataset by uploading the images to an Amazon Simple Storage Service (Amazon S3) bucket, then create and run the SageMaker Ground Truth labeling job to label the images as normal or anomaly.

Upload the image datasets to an Amazon S3 bucket that SageMaker Ground Truth can access. If you don’t have a dataset, you can use either the cookie-dataset or aliens-dataset.

Copy all of the images from the "normal" and "anomaly" folders into a single directory for SageMaker Ground Truth to access, or you will get an error message in the next step.
To use AWS CloudShell, run the following script:

#!/bin/bash
# Clone the repository
git clone https://github.com/aws-samples/amazon-lookout-for-vision.git
cd amazon-lookout-for-vision/aliens-dataset
# Remove the existing "all" directory if it exists
rm -rf all
# Create a new "all" directory
mkdir -p all
# Copy normal images to the "all" directory
cp normal/*.png all/
# Copy anomaly images with a .anomaly.png suffix to avoid filename clashes
for file in anomaly/*.png; do
  if [ -f "$file" ]; then
    filename=$(basename "$file")
    cp "$file" "all/${filename}.anomaly.png"
  fi
done
# Count files to verify
echo "Normal images: $(find normal -name '*.png' | wc -l)"
echo "Anomaly images: $(find anomaly -name '*.png' | wc -l)"
echo "Total images in all directory: $(find all -type f | wc -l)"
# Upload to S3
aws s3 cp all/ s3://<BUCKET_NAME>/aliens-dataset-all/ --recursive
# Clean up - remove the cloned repository
cd ../..
rm -rf amazon-lookout-for-vision

Alternatively, if you have the AWS CLI installed, you can copy the files with the following commands (see Setting up the AWS CLI for how to do this):

sh-4.2$ git clone https://github.com/aws-samples/amazon-lookout-for-vision.git
sh-4.2$ cd amazon-lookout-for-vision/aliens-dataset ## keep in mind the filenames here clash; the following script helps fix this
sh-4.2$ mkdir all
sh-4.2$ cp normal/*.png all/
sh-4.2$ aws s3 cp s3://aws-blogs-artifacts-public/artifacts/ML-19308/copy_conflicts.sh .

sh-4.2$ bash copy_conflicts.sh

sh-4.2$ ls -al all/

-rwxrwxr-x 1 ec2-user ec2-user 120035 Feb 17 16:39 59.png
-rwxrwxr-x 1 ec2-user ec2-user 93407 Feb 17 16:39 5.png
-rwxrwxr-x 1 ec2-user ec2-user 125477 Feb 17 16:39 5.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 123679 Feb 17 16:39 60.png
-rwxrwxr-x 1 ec2-user ec2-user 96330 Feb 17 16:39 6.png
-rwxrwxr-x 1 ec2-user ec2-user 126014 Feb 17 16:39 6.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 81051 Feb 17 16:39 7.png
-rwxrwxr-x 1 ec2-user ec2-user 128985 Feb 17 16:39 7.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 94216 Feb 17 16:39 8.png
-rwxrwxr-x 1 ec2-user ec2-user 128002 Feb 17 16:39 8.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 110814 Feb 17 16:39 9.png
-rwxrwxr-x 1 ec2-user ec2-user 131385 Feb 17 16:39 9.png.anomaly.png

sh-4.2$ aws s3 cp all/ s3://<BUCKET_NAME>/aliens-dataset-all/ --recursive
Note: To prevent filename clashes between the two folders, an anomaly suffix was added. The uploaded files should be in your <BUCKET_NAME>/aliens-dataset-all bucket for the Ground Truth job.

In the AWS Console, navigate to Amazon SageMaker AI, Ground Truth, Labeling Jobs, Create labeling job.

There are several options here to fill in; the most important fields to fill or select are:

Input data setup: Select Automated data setup
S3 location for input datasets: <Full path where your dataset exists>
S3 location data output datasets: <Same location as input dataset>
Data type: Select Image
IAM Role – Select Create new role if you do not have one set up to allow Ground Truth to interact with SageMaker services.

Choose Complete data setup. An Input data connection successful message displays. If you get an error, check your IAM role to make sure S3 access is enabled, and the directory has image files in it, as it will not recurse through sub-directories.

Select the task type. These models support Image Classification (Single Label), which is binary classification (think good or bad), or Semantic segmentation. You cannot use a bounding box type with these models. You can change your selection later.
Choose Next.
For Worker types, select Private. You can read more about Amazon Mechanical Turk or labeling subscriptions in the Developer Guide.
Under Private teams, select the private team you created in the previous steps.
For Task timeout and Task expiration time, leave the default values.
Leave Enable automated data labeling unselected. You can read more about automated data labeling here; however, it is not compatible with semantic segmentation.
On the Image classification screen, add two new labels: normal and anomaly. You can fill in the rest as needed. Choose Preview to see a preview of what it will look like to the end user.
Choose Create.
Select Ground Truth, and then select the Private tab.

Open the labeling portal sign-in URL in a new tab in your browser and then sign in to see your assigned tasks.
Select an assigned task and choose Start working to label the data.
Select normal or anomaly.

When the job is complete, make note of the output dataset location. You will need this for the training step.

If you need to add workers to the labelling job:

On the Amazon SageMaker AI Ground Truth page, select Labeling workforces.
Select the Private tab.
Click on the private team that was created earlier (CV-team).
Select the Workers tab
Select the desired worker from the list and choose Add workers to team.

You will then be redirected to the Amazon SageMaker AI labeling workforces page with a confirmation message that the worker has been added.

After you complete the labeling task, the output of the task is used to train the Computer Vision Detection model from the AWS Marketplace.
Train the model
This section discusses training the computer vision model using the AWS Marketplace Computer Vision Detection model and the labeled dataset from the previous step.

Go to the AWS Marketplace to subscribe to the model, https://aws.amazon.com/marketplace/pp/prodview-j72hhmlt6avp6.
Choose Continue to Subscribe.
Choose Continue to configuration.
Select the latest software version, your Region, and make sure Create a training job is selected.

Note: Copy the Product Arn and store in a text editor or notepad for later use.

Go to SageMaker AI, Notebook instances, Create notebook instance.

Note: A GPU-enabled notebook instance is not required. Amazon SageMaker training jobs spin up the GPU instances needed during training, so most basic instances will be sufficient.

Select an ml.m5.2xlarge instance, JupyterLab 4, and a volume size of 128 GB. The default is 5 GB, which is too small.
Select an IAM role to allow the notebook to access resources in your account. You will need access to S3.
In the Git Repositories – optional section, select Clone a public Git repository to this notebook instance only.
Enter the Git repository URL. Leave all the other fields as their default, then choose Create notebook instance to start the instance.
After the instance starts (the status displays as InService), select the Open JupyterLab action for the new notebook instance.

JupyterLab opens:

On the left navigation pane, open the computer-vision-defect-detection folder.

In the AWS Console, go to Marketplace, Manage subscriptions, and then copy the ARN of your model subscription.

In the Jupyter notebook, locate the snippet below and update the placeholder value for algorithm_name variable with the Product Arn you copied in the previous step.

# TODO: change this to use subscribed SageMaker algorithm
algorithm_name = "<Customer to specify the algorithm name after subscription>"

The bucket used for this step is created automatically and named in the format sagemaker-<REGION>-<ACCOUNT_ID>.

# Initialize SageMaker session and get execution role
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
#bucket = sagemaker_session.default_bucket()
role = get_execution_role()
# Project name would be used as part of s3 output path
project = "ComputerVisionDefectDetection"

In the AWS Console, navigate to Amazon SageMaker AI, Ground Truth, Labeling jobs and select the job that was completed.
Identify and take note of the output images folder (Output dataset location)

Note: To start the training job, look at the path for the output manifest in <BUCKET NAME>/aliens-dataset/all/aliensv2/manifests/output/output.manifest—this will be the training manifest for the next step.

Set the bucket variable to the images bucket name you used previously, and set the object key to the path of your manifest:

bucket: where to store the manifest file
classification_manifest_key: where the output manifest file is stored (for example, aliens-dataset-all/[job-name]/manifests/output/output.manifest)
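For example, a small sketch of setting these variables (the bucket name and job name are placeholders for your own values) and assembling the S3 path that the TrainingInput consumes later in the notebook:

# Placeholders: replace with your own bucket name and labeling job output path.
bucket = "<BUCKET_NAME>"
classification_manifest_key = "aliens-dataset-all/<job-name>/manifests/output/output.manifest"

# Augmented manifest path consumed by the TrainingInput later in the notebook.
classification_s3_path = f"s3://{bucket}/{classification_manifest_key}"
print(classification_s3_path)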

Review the model training configuration in the Classification Model with Algorithm Estimator section.

# Create AlgorithmEstimator for classification
classification_estimator = AlgorithmEstimator(
    algorithm_arn=algorithm_name,
    role=role,
    instance_count=1,
    instance_type='ml.g4dn.2xlarge',
    volume_size=20,
    max_run=7200,
    input_mode='Pipe',  # REQUIRED: Algorithm only supports Pipe mode
    sagemaker_session=sagemaker_session,
    enable_network_isolation=True
)

# Set hyperparameters
classification_estimator.set_hyperparameters(
    ModelType='classification',
    TestInputDataAttributeNames='source-ref,anomaly-label-metadata,anomaly-label',
    TrainingInputDataAttributeNames='source-ref,anomaly-label-metadata,anomaly-label'
)

print("Classification estimator configured successfully")
# Define training input using TrainingInput class
classification_training_input = TrainingInput(
    s3_data=classification_s3_path,
    s3_data_type='AugmentedManifestFile',
    attribute_names=[
        'source-ref',
        'anomaly-label-metadata',
        'anomaly-label'
    ],
    record_wrapping='RecordIO',
    input_mode='Pipe'  # Must match the estimator's input_mode
)

# Start training job
classification_job_name = f'defect-detection-classification-{datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")}'
print(f"Starting classification training job: {classification_job_name}")

classification_estimator.fit(
    inputs={'training': classification_training_input},
    job_name=classification_job_name,
    wait=True,
    logs=True
)

Note: The job uses NVIDIA G4DN instances. They can be sized up to a larger instance type to decrease training time, but only with a single instance. The image dataset training finishes in less than 10 minutes on a g4dn.2xlarge. You can experiment with other instance types; however, results may vary because the models were extensively tested on G4DN instances.

Validate the values of TestInputDataAttributeNames and TrainingInputDataAttributeNames in the Hyperparameters section, as well as AttributeNames in the TrainingInput section. The labels on all three must match the structure of your manifest file. Here is a sample manifest:

{
  "source-ref": "s3://[bucketname]/getting-started/training-images/anomaly-1.jpg",
  "anomaly-label-metadata": {
    "job-name": "anomaly-label",
    "class-name": "anomaly",
    "human-annotated": "yes",
    "creation-date": "2022-08-22T20:52:51.851Z",
    "type": "groundtruth/image-classification"
  },
  "anomaly-label": 1
}
{
  "source-ref": "s3://[bucketname]/getting-started/training-images/anomaly-2.jpg",
  "anomaly-label-metadata": {
    "job-name": "anomaly-label",
    "class-name": "anomaly",
    "human-annotated": "yes",
    "creation-date": "2022-08-22T21:11:39.545Z",
    "type": "groundtruth/image-classification"
  },
  "anomaly-label": 1
}

Note: Two of the three values include the labelling job name.

response = sagemaker.create_training_job(
    TrainingJobName=classification_training_job_name,
    HyperParameters={
        'ModelType': 'classification',
        'TestInputDataAttributeNames': 'source-ref,aliens-v3,aliens-v3-metadata',
        'TrainingInputDataAttributeNames': 'source-ref,aliens-v3,aliens-v3-metadata'
    }
)
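Before running the cells, one quick way to double-check that these attribute names match your manifest is to inspect the keys of the first manifest record. This is a small hypothetical helper, not part of the notebook, and assumes you have downloaded output.manifest locally.

import json

# Hypothetical helper: print the attribute names in the first record of a
# locally downloaded output.manifest so you can compare them with the
# TestInputDataAttributeNames / TrainingInputDataAttributeNames values above.
with open("output.manifest") as f:
    first_record = json.loads(f.readline())

print(list(first_record.keys()))
# For the sample manifest above this prints:
# ['source-ref', 'anomaly-label-metadata', 'anomaly-label']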

Run all the cells or blocks listed in the Classification Model with Algorithm Estimator section to start the training job.
If you want to train a segmentation model as well, follow the steps in the Segmentation Model with Algorithm Estimator section.

Note: After the training is completed, you are ready to test it. There are a few inference options available for this:

Real-time inference using Amazon SageMaker endpoints
Amazon SageMaker AI Batch Transform inference.
Edge deployment

Deploy the model
Amazon SageMaker AI endpoints and Amazon SageMaker AI Batch Transform inference are both used for inference but serve different purposes.
Amazon SageMaker AI endpoints
Amazon SageMaker AI endpoints are used for real-time inference, providing low-latency predictions suitable for applications requiring immediate responses. Endpoints remain active while they’re deployed, making them better suited for continuous and steady traffic, but potentially more costly due to ongoing resource usage.

In the Jupyter notebook, navigate to the (Optional) Running real-time inference using Amazon SageMaker endpoints section.
Run the following cell blocks to set up and invoke the endpoint:

classification_training_job_name = "<provide training job name here>"

# Create estimator from training job
estimator = AlgorithmEstimator.attach(classification_training_job_name)

# Deploy endpoint using SageMaker SDK v2
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.2xlarge'
)

print(f"Endpoint deployed: {predictor.endpoint_name}")

# Invoke the endpoint using the predictor
# (image_data and local_file are prepared in earlier notebook cells)
result = predictor.predict(image_data)

# Clean up the temporary file
os.remove(local_file)

# Print the result
print("\nEndpoint Response:")
print(json.dumps(result, indent=2))

Validate the inference, then delete the endpoint by running the following block:

# Delete the endpoint

predictor.delete_endpoint()
print("Endpoint deleted")

Note: If you start an endpoint, keep in mind you will be billed while it is running until you turn it off.
Amazon SageMaker AI Batch Transform
Batch Transform is designed for offline inference and making predictions on large datasets stored in S3, and is ideal for bulk processing where low latency is not critical. After the job is complete, the resources are released, making it cost-effective for sporadic workloads.

Navigate to the (Optional) Run Batch Transform Inference using SageMaker SDK v2 section.
Define the s3_input_data and s3_output_path parameters.

# Run batch transform job

#############################################
# Change to your input/output data S3 path  #
#############################################

s3_input_data = "s3://<Specify-s3-path-to-test-images>"
s3_output_path = f"s3://{bucket}/{project}/batch-transform-output"
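Those cells typically wrap the trained estimator in a transformer. A hedged sketch is shown below; the instance type and content type are assumptions, and the notebook's own cells may differ in details.

# Sketch only: create a transformer from the trained estimator and run batch
# inference over the images in s3_input_data. Instance type and content type
# are assumptions; adjust them to match the notebook.
transformer = classification_estimator.transformer(
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path=s3_output_path,
)

transformer.transform(
    data=s3_input_data,
    content_type="application/x-image",
    wait=True,
)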

Run all the cells and blocks in the (Optional) Run Batch Transform Inference using SageMaker SDK v2 section to complete the batch inference.
Validate the batch transform job after completion by navigating to the s3_output_path folder. The following is a sample inference output file:

{
  "Source": {
    "Type": "direct"
  },
  "IsAnomalous": true,
  "Confidence": 0.92744799389183
}

Clean up
To avoid incurring unnecessary charges, delete the following resources when you no longer need them:

Delete SageMaker endpoints.

Navigate to the Amazon SageMaker Console.
Select Endpoints.
Select the endpoint you created.
Choose Delete.

Delete SageMaker Notebook instances.

Navigate to the Amazon SageMaker Console.
Select Notebook instances.
Select the notebook instance you created.
Choose Stop if the instance is running.
Once stopped, choose Delete.

Delete S3 objects and buckets.

Navigate to the Amazon S3 Console.
Delete all objects in the buckets you created for this tutorial.
Delete the empty buckets.

Delete the Ground Truth labeling team.

Navigate to Ground Truth.
Select Labeling workforces.
Select the Private tab.
Select the private team you created.
Choose Delete team.

Conclusion
In this blog post, we’ve demonstrated how to transition from Amazon Lookout for Vision to using the underlying Computer Vision Detection models available through the AWS Marketplace, showing the step-by-step process of setting up labeling, training the model, and running inference through batch transformation. The transition provides customers with greater flexibility in terms of training options, hyperparameter adjustments, and deployment choices while continuing to use AWS defect detection technology at their own pace. Also be sure to check out our edge-based open source integrated Defect Detection Application on GitHub if you would like to combine what you have learned here.

About the authors
Ryan Vanderwerf is a senior partner solutions architect at Amazon Web Services specializing in smart manufacturing, vision, and machine learning. Ryan previously provided Java virtual machine-focused consulting and project development as a software engineer at OCI on the Grails and Micronaut team. He was chief architect/director of products at ReachForce, with a focus on software and system architecture for AWS Cloud SaaS solutions for marketing data management. Ryan has built SaaS solutions in domains such as financial, media, telecom, and e-learning companies since 1996.
Lu Min is a Software Development Engineer for AWS Edge ML services, focused on developing machine learning solutions that operate at the edge for AWS customers. With expertise in optimizing ML models for resource-constrained environments, Lu helps customers implement efficient inference capabilities on edge devices and cloud communication, as well as manage model lifecycle using AWS SageMaker.
Tim Westman is the Product Manager and Go-to-Market Lead for Edge Machine Learning, AWS. Tim leads the Product Management and Business Development for the Edge Machine Learning business at Amazon Web Services. In this role, he works with customers to help build computer vision solutions at the edge to solve complex operational challenges. Tim has more than 30 years of experience in sales, business development and product management roles for leading hardware and software companies, with the last 8 years specializing in AI and computer vision for IoT applications.
Kunle Adeleke is an enterprise solutions architect, helping large AWS commercial customers in diverse industries craft their technology strategy. Kunle has led enterprise architecture teams and software development teams in both government and commercial sectors. His deep expertise spans software development, solution architecture, enterprise architecture, security, and data and AI/ML.

Practical implementation considerations to close the AI value gap

Artificial Intelligence (AI) is changing how businesses operate. Gartner® predicts at least 15% of day-to-day work decisions will be made autonomously through agentic AI by 2028. And 92% of companies are boosting their AI spending, according to McKinsey.
But here’s the problem: most companies have yet to realize a positive impact of AI on their profit and loss (P&L). According to analysis from S&P Global Market Intelligence,

“The share of companies abandoning most of their AI initiatives jumped to 42%, up from 17% last year [2024]” in the first half of 2025.

According to Gartner,

“Over 40% of agentic AI projects will be canceled by the end of 2027.”

The gap between spending and results is clear. To make AI work, companies need to stop running scattered experiments and start building enterprise-wide programs. As McKinsey puts it:

“The organizations that are building a genuine and lasting competitive advantage from their AI efforts are the ones that are thinking in terms of holistic transformative change that stands to alter their business models, cost structures, and revenue streams—rather than proceeding incrementally.”

The AWS Customer Success Center of Excellence (CS COE) helps customers get tangible value from their AWS investments. We’ve seen a pattern: customers who build AI strategies that address people, process, and technology together succeed more often.
In this post, we share practical considerations that can help close the AI value gap.
Implementation considerations
The following sections include practical implementation considerations for aligning leadership, redesigning incentives, building governance frameworks, and measuring outcomes—all grounded in real-world examples from organizations that have successfully closed their AI value gap. These practical insights can help you avoid common pitfalls and accelerate your path from AI investment to measurable business impact.
Figure 1: Six considerations for successful AI transformation and sustained value realization
Business leaders — not just tech leaders — need to drive your AI agenda
AI transformation requires translating vision into specific business outcomes with clear tracking mechanisms—and this demands broad cross-functional leadership from day one.
Roles like Chief Revenue Officers and line-of-business leaders need a seat at the decision-making table alongside technology leaders right from the start. These leaders have typically joined digital or cloud transformations much later in the process, but AI is different. The most impactful AI use cases come from two sources: line-of-business leaders who understand customer pain points and industry opportunities intimately, and employees across business functions who are willing to change their mindsets and fundamentally alter their operating models. Consider a large global institutional investment organization that embarked on an AI transformation program. They started by defining and creating relevant data and AI technical and business professions. Then, the organization designed and implemented the mechanisms and operating model needed to create data and AI products. Ultimately, they launched a new data and AI organization that helps them create new products, better serve customers, and monetize data assets by addressing new business opportunities. While engineering and product management remained at its core, their entire leadership team treated this as a business development initiative and partnered to make it possible.
Redesign incentives to reward AI-first operations
Transform organizational behavior to reward actual AI adoption, not just theoretical interest. Restructure career pathways to create advancement opportunities tied to effective AI use and measurable business outcomes. Critical to success is defining what outcomes matter. AI can generate voluminous output with little business impact, making measurement of outcomes essential.
One organization introduced standardized definitions for business processes and automation levels. They then redesigned their performance management framework to incorporate automation achievement as a key metric for Product Managers. This approach shifted focus from traditional input metrics toward measurable automation outcomes. It encouraged leaders to prioritize AI-augmented structures and intelligent process redesign over manual operations.
This alignment demonstrates how organizations must clearly define and measure desired outcomes—and tie individual rewards directly to tangible AI-driven business results.
Put people first and have HR lead the change as a strategic partner
HR serves as the cornerstone for aligning culture, talent, and incentives with AI transformation goals. Success requires HR to partner with executives in communicating the rationale for AI initiatives, addressing employee concerns, and fostering organizational buy-in through coaching and thought leadership.
Build AI fluency through tailored learning pathways. Provide focused training with practical tools like pre-populated prompt catalogs and quick-start demonstrations. Strengthen employee engagement through continuous feedback loops, celebrate AI learning participation across teams, and invest in retention strategies that value AI-skilled talent. HR champions adoption by collaborating with business and operations teams to develop role-based “What’s in it for me” content and current versus future process comparisons. For example, HR at a global financial institution took a leadership role to accelerate adoption of a reimagined product operating model. After the institution had invested significantly in a bottom-up transformation, HR designed and led—in partnership with AWS—a top-down approach. They empowered business leaders from lines of business, operations, and technology with extensive executive-level training to help them lead product teams, not just operate them. These leaders worked with technology teams to build mechanisms that helped accelerate adoption of their product operating model. The resulting mechanisms enabled them to create AI solutions focused on industry opportunities and customer needs.
HR support is key to transforming resistance into enthusiasm by embedding AI-first behaviors into the cultural DNA.
Set guardrails that help protect—without slowing down
Establish AI governance frameworks from day one that balance centralization and federation. This facilitates compliance alignment and integration while enabling rapid innovation at the edge. Pure centralization offers simpler governance but slows innovation. Complete federation creates integration challenges and compliance gaps.
For both centralized and federated models, create cross-functional AI governance councils with representation from legal, risk, IT, and business units. Define clear guardrails, approval thresholds, and escalation paths. This approach accelerates AI delivery by creating clear paths to production and reducing bureaucratic friction while maintaining enterprise-wide coherence and risk management.
One financial services customer implemented a three-layered AI governance approach. At the enterprise level, they automated security and compliance policies through policy as code. At the line-of-business level, they created data policies that support AI solutions within the value stream. At the solution level, they addressed individual AI model risks and performance thresholds. This approach facilitated necessary guardrails and policy adherence while allowing builders to focus on value-added AI solution features. It unlocked true innovation at the edge while maintaining compliance alignment with critical policies.
Work with the right partners to move faster on AI
According to Gartner,

“Scaling AI solutions across the enterprise is challenging and requires intentional plans to address AI skills, infrastructure, governance policies and forums to facilitate collaboration, integration, and shared best practices.”

Organizations achieve higher success rates when working with partners who provide AI innovation, cloud expertise, and industry-specific knowledge at the right time. Effective AI transformation partners serve three roles: industry advisors who reimagine existing value streams and workflows to uncover high-value use cases, technical experts who bring leading experience building scalable AI solutions, and change champions who manage cultural shifts through training and governance frameworks.
A global insurance company engaged an AI transformation partner for a long-term engagement focused on building durable capabilities. The partner established business case frameworks and assets to prioritize use cases and baseline KPIs. They developed detailed adoption strategies using train-the-trainer methodologies. They implemented measurement systems to continuously track productivity impact. Together, they established governance models for ongoing AI agent creation and enterprise-wide deployment. This “teach to fish” model meant the insurance company could independently sustain and expand their AI transformation beyond the partnership engagement.
Track results that matter—not just what AI costs
Traditional cost prediction models struggle with AI’s continuously changing pricing and capabilities. Success requires anchoring to one or two measurable business outcomes that can be baselined and tracked—such as customer conversations handled entirely by AI agents or revenue uplift per recommendation accepted.
Build adaptive ROI frameworks that can be seamlessly adjusted to changes in token pricing, inference efficiency, and model capabilities rather than fixed cost projections. Focus on outcome-based metrics that demonstrate clear business value as use cases scale. With these metrics, executives can make informed investment decisions despite technological uncertainty. This approach transforms AI economics from unpredictable cost centers into measurable value drivers, providing the financial clarity needed for confident scaling decisions.

A marketing team implemented generative AI for long-form content creation and quality assurance. They analyzed their end-to-end process to determine the distribution of their production capacity and identify the costliest failure point: localization errors. They anchored against measurable baselines of 150+ annual localization errors and 300 monthly QA hours across 150 assets. The solution delivered immediate impact by catching errors earlier, minimizing costly localization rework while accelerating production speed. Return on investment in the solution was measured through localization cost savings and top-line value through increased content output, providing a clear path to assess the impact of scaling the solution.
Conclusion
Becoming an AI-first organization requires synchronized transformation across seven critical dimensions: Data and AI Vision and Strategy that establishes a data-driven foundation while embedding AI into core business objectives; Business Process Redesign to optimize human-AI collaboration; Culture & Change Management to drive adoption top-down and bottom-up change; Infrastructure and Operations for scalable, self-healing systems; AI Skills and Talent development with continuous learning to build core AI capabilities beyond basic awareness; Security, Governance, and Ethics to facilitate responsible AI deployment; and AI Industrialization for seamless integration and automation.

Figure 2: Seven dimensions of AI-First organizational transformation
These dimensions provide a framework for systematically evaluating and implementing AI transformation. But here's what matters most: technology alone delivers marginal gains. When orchestrated with organizational change and process redesign, it creates measurable business value. Organizations that succeed see dramatic results compared to those that do not: 45% more in cost savings and 60% more in revenue growth, according to the Boston Consulting Group (BCG).
The AWS Customer Success Center of Excellence collaborates with AWS partners to define programmatic implementation plans that can help customers embed AI into their operations, product development, business processes, and go-to-market strategies. Because becoming AI-first isn’t about isolated technology initiatives—it requires synchronized evolution across people, process, and technology, with comprehensive change management as the enabler.
For more information about becoming an AI-first company, contact your AWS account team. For more information on delivering agents see the AWS Artificial Intelligence blog.

About the authors
Bhargs Srivathsan leads the Customer Success Center of Excellence for Amazon Web Services (AWS), where she is responsible for defining and executing on the strategic vision for customer success across AWS’ services. In this role, she focuses on ensuring AWS customers and partners realize maximum value from their technology investments, particularly as the pace of innovation accelerates with AI and other emerging technologies. She works closely with the field, specialist GTM leaders, and partners across AWS to build and scale customer success capabilities that drive adoption and business outcomes for customers.
Sergio Klarreich is a Senior Manager of Customer Success at AWS, within the Customer Success Center of Excellence. Sergio leads a team focused on enabling enterprises to realize tangible business outcomes from AI investments. With hands-on experience leading Fortune 500 companies through successful AI-first transformation journeys and over 20 years driving technology innovation across global markets, he specializes in bridging the gap between AI strategy and measurable business results.
Joseph Badalamenti is a Senior Customer Success AI Specialist at AWS, within the Customer Success Center of Excellence. As a Customer Success Specialist, he partners with enterprise customers to accelerate their AI transformation journeys. Joseph specializes in Generative AI and Agentic AI implementations, helping organizations realize measurable business value through strategic AI adoption. Joseph has 20+ years experience supporting customers with Digital, Cloud, and AI Transformation journeys.

Microsoft AI Releases Fara-7B: An Efficient Agentic Model for Computer Use

How do we safely let an AI agent handle real web tasks like booking, searching, and form filling directly on our own devices without sending everything to the cloud? Microsoft Research has released Fara-7B, a 7 billion parameter agentic small language model designed specifically for computer use. It is an open weight Computer Use Agent that runs from screenshots, predicts mouse and keyboard actions, and is small enough to execute on a single user device, which reduces latency and keeps browsing data local.

Fara-7B: An Efficient Agentic Model for Computer Use

From Chatbots to Computer Use Agents

Conventional chat oriented LLMs return text. Computer Use Agents such as Fara-7B instead control the browser or desktop user interface to complete tasks like filling forms, booking travel, or comparing prices. They perceive the screen, reason about the page layout, then emit low level actions such as click, scroll, type, web_search, or visit_url.

Many existing systems rely on large multimodal models wrapped in complex scaffolding that parses accessibility trees and orchestrates multiple tools. This increases latency and often requires server side deployment. Fara-7B compresses the behavior of such multi agent systems into a single multimodal decoder only model built on Qwen2.5-VL-7B. It consumes browser screenshots and text context, then directly outputs thought text followed by a tool call with grounded arguments such as coordinates, text, or URLs.

FaraGen, Synthetic Trajectories for Web Interaction

The key bottleneck for Computer Use Agents is data. High quality logs of human web interaction with multi step actions are rare and expensive to collect. The Fara project introduces FaraGen, a synthetic data engine that generates and filters web trajectories on live sites.

FaraGen uses a three stage pipeline. Task Proposal starts from seed URLs drawn from public corpora such as ClueWeb22 and Tranco, which are categorized into domains like e-commerce, travel, entertainment, or forums. Large language models convert each URL into realistic tasks that users might attempt on that page, for example booking specific movie tickets or creating a shopping list with constraints on reviews and materials. Tasks must be achievable without login or paywall, fully specified, useful, and automatically verifiable.


Task Solving runs a multi agent system based on Magentic-One and Magentic-UI. An Orchestrator agent plans the high level strategy and keeps a ledger of task state. A WebSurfer agent receives accessibility trees and Set-of-Marks screenshots, then emits browser actions through Playwright, such as click, type, scroll, visit_url, or web_search. A UserSimulator agent supplies follow up instructions when the task needs clarification.

Trajectory Verification uses three LLM based verifiers. An Alignment Verifier checks that the actions and final answer match the task intent. A Rubric Verifier generates a rubric of subgoals and scores partial completion. A Multimodal Verifier inspects screenshots plus the final answer to catch hallucinations and confirm that visible evidence supports success. These verifiers agree with human labels on 83.3 percent of cases, with reported false positive and false negative rates around 17 to 18 percent.

After filtering, FaraGen yields 145,603 trajectories with 1,010,797 steps over 70,117 unique domains. The trajectories range from 3 to 84 steps, with an average of 6.9 steps and about 0.5 unique domains per trajectory, which indicates that many tasks involve sites not seen elsewhere in the dataset. Generating data with premium models such as GPT-5 and o3 costs roughly 1 dollar per verified trajectory.

Source: https://www.microsoft.com/en-us/research/wp-content/uploads/2025/11/Fara-7B-An-Efficient-Agentic-Model-for-Computer-Use.pdf

Model Architecture

Fara-7B is a multimodal decoder only model that uses Qwen2.5-VL-7B as the base. It takes as input a user goal, the latest screenshots from the browser, and the full history of previous thoughts and actions. The context window is 128,000 tokens. At each step the model first generates a chain of thought describing the current state and the plan, then outputs a tool call that specifies the next action and its arguments.

The tool space matches the Magentic-UI computer_use interface. It includes key, type, mouse_move, left_click, scroll, visit_url, web_search, history_back, pause_and_memorize_fact, wait, and terminate. Coordinates are predicted directly as pixel positions on the screenshot, which allows the model to operate without access to the accessibility tree at inference time.
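To make the observe, think, act loop concrete, here is a hypothetical sketch of what a single parsed step could look like in Python. The field names, coordinates, and wording are invented for illustration; the actual serialization follows the Magentic-UI computer_use interface and may differ.

# Hypothetical, simplified view of one agent step (illustrative only; the real
# payload format is defined by the Magentic-UI computer_use interface).
step = {
    "thought": (
        "The search results page is visible. The 'Dates' filter sits near the "
        "top right, so I will click it before choosing the check-in date."
    ),
    "tool_call": {
        "name": "left_click",
        "arguments": {"x": 1184, "y": 212},  # pixel coordinates on the screenshot
    },
}

# A later step might record a fact, then finish the task:
final_step = {
    "thought": "The total price including taxes is visible, so the task is done.",
    "tool_call": {"name": "terminate", "arguments": {"status": "success"}},
}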

Training uses supervised finetuning over approximately 1.8 million samples that mix multiple data sources. These include the FaraGen trajectories broken into observe think act steps, grounding and UI localization tasks, screenshot based visual question answering and captioning, and safety and refusal datasets.

Source: https://www.microsoft.com/en-us/research/wp-content/uploads/2025/11/Fara-7B-An-Efficient-Agentic-Model-for-Computer-Use.pdf

Benchmarks and Efficiency

Microsoft evaluates Fara-7B on four live web benchmarks: WebVoyager, Online-Mind2Web, DeepShop, and the new WebTailBench, which focuses on under represented segments such as restaurant reservations, job applications, real estate search, comparison shopping, and multi site compositional tasks.

On these benchmarks, Fara-7B achieves 73.5 percent success on WebVoyager, 34.1 percent on Online-Mind2Web, 26.2 percent on DeepShop, and 38.4 percent on WebTailBench. This outperforms the 7B Computer Use Agent baseline UI-TARS-1.5-7B, which scores 66.4, 31.3, 11.6, and 19.5 respectively, and compares favorably to larger systems like OpenAI computer-use-preview and SoM Agent configurations built on GPT-4o.

On WebVoyager, Fara-7B uses on average 124,000 input tokens and 1,100 output tokens per task, with about 16.5 actions. Using market token prices, the research team estimates an average cost of 0.025 dollars per task, versus around 0.30 dollars for SoM agents backed by proprietary reasoning models such as GPT-5 and o3. Fara-7B uses a similar number of input tokens but about one tenth the output tokens of these SoM agents.

Key Takeaways

Fara-7B is a 7B parameter, open weight Computer Use Agent built on Qwen2.5-VL-7B that operates directly from screenshots and text, then outputs grounded actions such as clicks, typing and navigation, without relying on accessibility trees at inference time.

The model is trained with 145,603 verified browser trajectories and 1,010,797 steps generated by the FaraGen pipeline, which uses multi agent task proposal, solving, and LLM based verification on live websites across 70,117 domains.

Fara-7B achieves 73.5 percent success on WebVoyager, 34.1 percent on Online-Mind2Web, 26.2 percent on DeepShop, and 38.4 percent on WebTailBench, improving substantially over the 7B UI-TARS-1.5 baseline on all four benchmarks.

On WebVoyager, Fara-7B uses about 124,000 input tokens and 1,100 output tokens per task, with an average of 16.5 actions, yielding an estimated cost of around 0.025 dollars per task, which is around an order of magnitude cheaper in output token usage than SoM agents backed by GPT 5 class models.

Editorial Notes

Fara-7B is a useful step toward practical Computer Use Agents that can run on local hardware with lower inference cost while preserving privacy. The combination of Qwen2.5-VL-7B, FaraGen synthetic trajectories, and WebTailBench gives a clear and well-instrumented path from multi agent data generation to a single compact model that matches or exceeds larger systems on key benchmarks while enforcing Critical Point and refusal safeguards.


NVIDIA AI Releases Nemotron-Elastic-12B: A Single AI Model that Gives …

Why are AI dev teams still training and storing multiple large language models for different deployment needs when one elastic model can generate several sizes at the same cost? NVIDIA is collapsing the usual ‘model family’ stack into a single training job. The NVIDIA AI team has released Nemotron-Elastic-12B, a 12B parameter reasoning model that embeds nested 9B and 6B variants in the same parameter space, so all three sizes come from one elastic checkpoint with no extra distillation runs per size.

Many in one model family

Most production systems need several model sizes: a larger model for server-side workloads, a mid-size model for strong edge GPUs, and a smaller model for tight latency or power budgets. The usual pipeline trains or distills each size separately, so token cost and checkpoint storage scale with the number of variants.

Nemotron Elastic takes a different route. It starts from the Nemotron Nano V2 12B reasoning model and trains an elastic hybrid Mamba Attention network that exposes multiple nested submodels. The released Nemotron-Elastic-12B checkpoint can be sliced into 9B and 6B variants, Nemotron-Elastic-9B and Nemotron-Elastic-6B, using a provided slicing script, without any extra optimization.

All variants share weights and routing metadata, so training cost and deployment memory are tied to the largest model, not to the number of sizes in the family.

Source: https://arxiv.org/pdf/2511.16664v1

Hybrid Mamba Transformer with elastic masks

Architecturally, Nemotron Elastic is a Mamba-2 Transformer hybrid. The base network follows the Nemotron-H style design, where most layers are Mamba-2 based sequence state space blocks plus MLP, and a small set of attention layers preserve global receptive field.

Elasticity is implemented by turning this hybrid into a dynamic model controlled by masks:

Width: embedding channels, Mamba heads and head channels, attention heads, and FFN intermediate size can be reduced through binary masks.

Depth: layers can be dropped according to a learned importance ordering, with residual paths preserving signal flow.

A router module outputs discrete configuration choices per budget. These choices are converted to masks with Gumbel Softmax, then applied to embeddings, Mamba projections, attention projections, and FFN matrices. The research team adds several details to keep the SSM structure valid:

Group aware SSM elastification that respects Mamba head and channel grouping.

Heterogeneous MLP elastification where different layers can have distinct intermediate sizes.

Normalized MSE based layer importance to decide which layers stay when depth is reduced.

Smaller variants are always prefix selections in the ranked component lists, which makes the 6B and 9B models true nested subnetworks of the 12B parent.
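As a rough illustration of how a router choice can become a nested prefix mask, here is a minimal PyTorch-style sketch. The candidate widths, function names, and shapes are invented for illustration and are not taken from the Nemotron Elastic code.

import torch
import torch.nn.functional as F

def width_prefix_mask(router_logits: torch.Tensor, max_units: int) -> torch.Tensor:
    # Candidate widths for this component, smallest to largest (invented values).
    widths = [max_units // 2, 3 * max_units // 4, max_units]
    # Prefix masks over importance-ranked units, so smaller budgets nest inside larger ones.
    prefix_masks = torch.stack([
        torch.cat([torch.ones(w), torch.zeros(max_units - w)]) for w in widths
    ])                                                # (num_options, max_units)
    # Gumbel-Softmax with hard=True yields a one-hot choice with a
    # straight-through gradient back to the router.
    choice = F.gumbel_softmax(router_logits, tau=1.0, hard=True)
    return choice @ prefix_masks                      # (max_units,) binary mask

# Example: mask the FFN intermediate dimension for one sampled budget.
mask = width_prefix_mask(torch.randn(3), max_units=4096)
# ffn_hidden = ffn_up(x) * mask  # zero out pruned channels before the down projection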

Source: https://arxiv.org/pdf/2511.16664v1

Two stage training for reasoning workloads

Nemotron Elastic is trained as a reasoning model with a frozen teacher. The teacher is the original Nemotron-Nano-V2-12B reasoning model. The elastic-12B student is optimized jointly for all three budgets, 6B, 9B, 12B, using knowledge distillation plus language modeling loss.

Training runs in two stages:

Stage 1: short context, sequence length 8192, batch size 1536, around 65B tokens, with uniform sampling over the three budgets.

Stage 2: extended context, sequence length 49152, batch size 512, around 45B tokens, with non uniform sampling that favors the full 12B budget.

Source: https://arxiv.org/pdf/2511.16664v1

The second stage is important for reasoning tasks. The paper's results show that for AIME 2025, the 6B model improves from 56.88 to 68.13, a 19.8 percent relative gain, while the 9B model gains 9.7 percent and the 12B model gains 4.0 percent after extended context training.

Budget sampling is also tuned. In Stage 2, non uniform weights of 0.5, 0.3, 0.2 for 12B, 9B, 6B avoid degradation of the largest model and keep all variants competitive on Math 500, AIME 2025, and GPQA.
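As a small illustration of that sampling step, a training loop might draw the budget for each optimization step roughly like this. This is a sketch using the reported Stage 2 weights, not NVIDIA's actual code.

import numpy as np

rng = np.random.default_rng()
budgets = ["12B", "9B", "6B"]
# Stage 2 weights reported in the paper: favor the full 12B budget.
budget_for_step = rng.choice(budgets, p=[0.5, 0.3, 0.2])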

Benchmark results

Nemotron Elastic is evaluated on reasoning-heavy benchmarks: MATH 500, AIME 2024, AIME 2025, GPQA, LiveCodeBench v5, and MMLU Pro. The results below summarize pass@1 accuracy.

Source: https://arxiv.org/pdf/2511.16664v1

The 12B elastic model matches the NanoV2-12B baseline on average, 77.41 versus 77.38, while also providing 9B and 6B variants from the same run. The 9B elastic model tracks the NanoV2-9B baseline closely, 75.95 versus 75.99. The 6B elastic model reaches 70.61, slightly below Qwen3-8B at 72.68 but still strong for its parameter count given that it is not trained separately.

Training token and memory savings

Nemotron Elastic targets the cost problem directly. The following comparison shows the token budgets needed to derive 6B and 9B models from a 12B parent:

NanoV2 pretraining for 6B and 9B, 40T tokens total.

NanoV2 Compression with Minitron SSM, 480B exploratory plus 270B final, 750B tokens.

Nemotron Elastic, 110B tokens in a single elastic distillation run.

Source: https://arxiv.org/pdf/2511.16664v1

The research team reports that this gives around 360 times reduction versus training the two extra models from scratch, and around 7 times reduction versus the compression baseline.

Deployment memory is reduced as well. The paper reports that storing Nemotron Elastic 6B, 9B, and 12B together requires 24GB of BF16 weights, while storing NanoV2 9B plus 12B requires 42GB. This is a 43 percent memory reduction while also exposing an extra 6B size.

Source: https://arxiv.org/pdf/2511.16664v1

Comparison

| System | Sizes (B) | Avg reasoning score* | Tokens for 6B + 9B | BF16 memory |
| --- | --- | --- | --- | --- |
| Nemotron Elastic | 6, 9, 12 | 70.61 / 75.95 / 77.41 | 110B | 24GB |
| NanoV2 Compression | 9, 12 | 75.99 / 77.38 | 750B | 42GB |
| Qwen3 | 8 | 72.68 | n/a | n/a |

Key Takeaways

Nemotron Elastic trains one 12B reasoning model that contains nested 9B and 6B variants which can be extracted zero shot without extra training.

The elastic family uses a hybrid Mamba-2 and Transformer architecture plus a learned router that applies structured masks over width and depth to define each submodel.

The approach needs 110B training tokens to derive 6B and 9B from the 12B parent which is about 7 times fewer tokens than the 750B token Minitron SSM compression baseline and about 360 times fewer than training extra models from scratch.

On reasoning benchmarks such as MATH 500, AIME 2024 and 2025, GPQA, LiveCodeBench, and MMLU Pro, the 6B, 9B, and 12B elastic models reach average scores of about 70.61, 75.95, and 77.41, which are on par with or close to the NanoV2 baselines and competitive with Qwen3-8B.

All three sizes share one 24GB BF16 checkpoint, so deployment memory stays constant for the family, compared with around 42GB for separate NanoV2-9B and 12B models. This gives about 43 percent memory savings while adding a 6B option.

Editorial Comments

Nemotron-Elastic-12B is a practical step toward making reasoning model families cheaper to build and operate. One elastic checkpoint produces 6B, 9B, and 12B variants with a hybrid Mamba-2 and Transformer architecture, a learned router, and structured masks that preserve reasoning performance. The approach cuts token cost relative to separate compression or pretraining runs and keeps deployment memory at 24GB for all sizes, which simplifies fleet management for multi tier LLM deployments. Overall, Nemotron-Elastic-12B turns a multi-size reasoning LLM family into a single elastic system design problem.


AI Interview Series #3: Explain Federated Learning

Question:

“You’re an ML engineer at a fitness company like Fitbit or Apple Health.

Millions of users generate sensitive sensor data every day — heart rate, sleep cycles, step counts, workout patterns, etc.

You want to build a model that predicts health risk or recommends personalized workouts.

But due to privacy laws (GDPR, HIPAA), none of this raw data can ever leave the user’s device.

How would you train such a model?”

Training a model in this scenario seems impossible at first—after all, you can’t collect or centralize any of the user’s sensor data. But the trick is this: instead of bringing the data to the model, you bring the model to the data.

Using techniques like federated learning, the model is sent to each user’s device, trained locally on their private data, and only the model updates (not the raw data) are sent back. These updates are then securely aggregated to improve the global model while keeping every user’s data fully private.

This approach allows you to leverage massive, real-world datasets without ever violating privacy laws.

What is Federated Learning

Federated Learning is a technique for training machine learning models without ever collecting user data centrally. Instead of uploading private data (like heart rate, sleep cycles, or workout logs), the model is sent to each device, trained locally, and only the model updates are returned. These updates are securely aggregated to improve the global model—ensuring privacy and compliance with laws like GDPR and HIPAA.

There are multiple variants:

Centralized FL: A central server coordinates training and aggregates updates.

Decentralized FL: Devices share updates with each other directly—no single point of failure.

Heterogeneous FL: Designed for devices with different compute capabilities (phones, watches, IoT sensors).

The workflow is simple:

A global model is sent to user devices.

Each device trains on its private data (e.g., a user’s fitness and health metrics).

Only the model updates—not the data—are encrypted and sent back.

The server aggregates all updates into a new global model.
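As a minimal sketch of this loop, the toy example below simulates a few rounds of Federated Averaging for a linear model. The model, the simulated devices, and the hyperparameters are all invented for illustration; a real deployment would run the local step on each device and encrypt updates in transit.

import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    # Client-side step: train a copy of the global linear model on-device.
    # Only the updated weights (never X or y) leave the device.
    w = w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # MSE gradient
        w -= lr * grad
    return w, len(y)

def fed_avg(results):
    # Federated Averaging: combine client updates, weighted by sample count.
    total = sum(n for _, n in results)
    return sum(w * (n / total) for w, n in results)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(3):                           # three simulated devices with private data
    X = rng.normal(size=(50, 3))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

global_w = np.zeros(3)
for _ in range(10):                          # ten federated rounds
    updates = [local_update(global_w, X, y) for X, y in clients]
    global_w = fed_avg(updates)

print(global_w)                              # approaches [1.0, -2.0, 0.5]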

Challenges in Federated Learning

Device Constraints: User devices (phones, smartwatches, fitness trackers) have limited CPU/GPU power, small RAM, and rely on battery. Training must be lightweight, energy-efficient, and scheduled intelligently so it doesn’t interfere with normal device usage.

Model Aggregation: Even after training locally on thousands or millions of devices, we still need to combine all these model updates into a single global model. Techniques like Federated Averaging (FedAvg) help, but updates can be delayed, incomplete, or inconsistent depending on device participation.

Skewed Local Data (Non-IID Data):

Each user’s fitness data reflects personal habits and lifestyle:

Some users run daily; others never run.

Some have high resting heart rates; others have low.

Sleep cycles vary drastically by age, culture, work pattern.

Workout types differ—yoga, strength training, cycling, HIIT, etc.

This leads to non-uniform, biased local datasets, making it harder for the global model to learn generalized patterns.

Intermittent Client Availability: Many devices may be offline, locked, low on battery, or not connected to Wi-Fi. Training must only happen under safe conditions (charging, idle, Wi-Fi), reducing the number of active participants at any moment.

Communication Efficiency: Sending model updates frequently can drain bandwidth and battery. Updates must be compressed, sparse, or limited to smaller subsets of parameters.

Security & Privacy Guarantees: Even though raw data never leaves the device, updates must be encrypted. Additional protections like differential privacy or secure aggregation may be required to prevent reconstructing sensitive patterns from gradients.
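As a rough sketch of the client-side protection idea, the snippet below clips an update's L2 norm and adds Gaussian noise before it is sent. The clip norm and noise multiplier are illustrative values; a production system would pair this with secure aggregation and a proper privacy accountant.

import numpy as np

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    # Clip the update's L2 norm, then add Gaussian noise so a single user's
    # contribution is harder to reconstruct from aggregated gradients.
    rng = rng or np.random.default_rng()
    scale = min(1.0, clip_norm / (np.linalg.norm(update) + 1e-12))
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=update.shape)
    return update * scale + noise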



Accelerate generative AI innovation in Canada with Amazon Bedrock cros …

Generative AI has created unprecedented opportunities for Canadian organizations to transform their operations and customer experiences. We are excited to announce that customers in Canada can now access advanced foundation models including Anthropic’s Claude Sonnet 4.5 and Claude Haiku 4.5 on Amazon Bedrock through cross-Region inference (CRIS).
This post explores how Canadian organizations can use cross-Region inference profiles from the Canada (Central) Region to access the latest foundation models to accelerate AI initiatives. We will demonstrate how to get started with these new capabilities, provide guidance for migrating from older models, and share recommended practices for quota management.
Canadian cross-Region inference: Your gateway to global AI innovation
To help customers scale their generative AI applications, Amazon Bedrock offers cross-Region inference (CRIS) profiles, a feature that lets organizations seamlessly distribute inference processing across multiple AWS Regions. This capability delivers higher throughput while building at scale, helping to ensure your generative AI applications remain responsive and reliable even under heavy load.
Amazon Bedrock provides two types of cross-Region Inference profiles:

Geographic CRIS: Amazon Bedrock automatically selects the optimal commercial Region within a given geography to process your inference request.
Global CRIS: Routes inference requests to supported commercial Regions worldwide, optimizing available resources and enabling higher model throughput.

Cross-Region Inference operates through the secure AWS network with end-to-end encryption for both data in transit and at rest. When a customer submits an inference request from the Canada (Central) Region, CRIS intelligently routes the request to one of the destination regions configured for the inference profile (US or Global profiles).
The key distinction is that while inference processing (the transient computation) may occur in another Region, all data at rest—including logs, knowledge bases, and any stored configurations—remains exclusively within the Canada (Central) Region. The inference request travels over the AWS Global Network, never traversing the public internet, and responses are returned encrypted to your application in Canada.

Cross-Region inference configuration for Canada
With CRIS, Canadian organizations gain earlier access to foundation models, including cutting-edge models like Claude Sonnet 4.5 with enhanced reasoning capabilities, providing a faster path to innovation. CRIS also delivers enhanced capacity and performance by providing access to capacity across multiple Regions. This enables higher throughput during peak periods such as tax season, Black Friday, and holiday shopping, automatic burst handling without manual intervention, and greater resiliency by serving requests from a larger pool of resources.
Canadian customers can choose between two inference profile types based on their requirements:

| CRIS profile | Source Region | Destination Regions | Description |
| --- | --- | --- | --- |
| US cross-Region inference | ca-central-1 | Multiple US Regions | Requests from Canada (Central) can be routed to supported US Regions with capacity. |
| Global inference | ca-central-1 | Global AWS Regions | Requests from Canada (Central) can be routed to a Region in the AWS global CRIS profile. |

Getting started with CRIS from Canada
To begin using cross-Region Inference from Canada, follow these steps:
Configure AWS Identity and Access Management (IAM) permissions
First, verify your IAM role or user has the necessary permissions to invoke Amazon Bedrock models using cross-Region inference profiles.
Here’s an example of a policy for US cross-Region inference:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel*"
      ],
      "Resource": [
        "arn:aws:bedrock:ca-central-1::inference-profile/us.anthropic.claude-sonnet-4-5-20250929-v1:0"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel*"
      ],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0"
      ],
      "Condition": {
        "StringLike": {
          "bedrock:InferenceProfileArn": "arn:aws:bedrock:ca-central-1::inference-profile/us.anthropic.claude-sonnet-4-5-20250929-v1:0"
        }
      }
    }
  ]
}

For global CRIS refer to the blog post, Unlock global AI inference scalability using new global cross-Region inference on Amazon Bedrock with Anthropic’s Claude Sonnet 4.5.
Use cross-Region inference profiles
Configure your application to use the relevant inference profile ID. The profiles use prefixes to indicate their routing scope:

| Model | Routing scope | Inference profile ID |
| --- | --- | --- |
| Claude Sonnet 4.5 | US Regions | us.anthropic.claude-sonnet-4-5-20250929-v1:0 |
| Claude Sonnet 4.5 | Global | global.anthropic.claude-sonnet-4-5-20250929-v1:0 |
| Claude Haiku 4.5 | US Regions | us.anthropic.claude-haiku-4-5-20251001-v1:0 |
| Claude Haiku 4.5 | Global | global.anthropic.claude-haiku-4-5-20251001-v1:0 |

Example code
Here’s how to use the Amazon Bedrock Converse API with a US CRIS inference profile from Canada:

import boto3

# Initialize Bedrock Runtime client
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="ca-central-1"  # Canada (Central) Region
)

# Define the inference profile ID
inference_profile_id = "us.anthropic.claude-sonnet-4-5-20250929-v1:0"

# Prepare the conversation
response = bedrock_runtime.converse(
    modelId=inference_profile_id,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "text": "What are the benefits of using Amazon Bedrock for Canadian organizations?"
                }
            ]
        }
    ],
    inferenceConfig={
        "maxTokens": 512,
        "temperature": 0.7
    }
)

# Print the response
print(f"Response: {response['output']['message']['content'][0]['text']}")

Quota management for Canadian workloads
When using CRIS from Canada, quota management is performed at the source Region level (ca-central-1). This means quota increases requested for the Canada (Central) Region apply to all inference requests originating from Canada, regardless of where they’re processed.
Understanding quota calculations
Important: When calculating your required quota increases, you need to take into account the burndown rate, defined as the rate at which input and output tokens are converted into token quota usage for the throttling system. The following models have a 5x burndown rate for output tokens (1 output token consumes 5 tokens from your quota):

Anthropic Claude Opus 4
Anthropic Claude Sonnet 4.5
Anthropic Claude Sonnet 4
Anthropic Claude 3.7 Sonnet

For other models, the burndown rate is 1:1 (1 output token consumes 1 token from your quota). For input tokens, the token to quota ratio is 1:1. The calculation for the total number of tokens per request is as follows:
Input token count + Cache write input tokens + (Output token count x Burndown rate)
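As a quick illustration of this formula, the small helper below (hypothetical, not an AWS API) estimates how many tokens a single request consumes against a per-minute quota:

def tokens_against_quota(input_tokens, cache_write_tokens, output_tokens, burndown_rate=1):
    # Tokens counted against the per-minute quota for one request.
    return input_tokens + cache_write_tokens + output_tokens * burndown_rate

# Example: a Claude Sonnet 4.5 request, which has a 5x output burndown rate.
print(tokens_against_quota(input_tokens=2000, cache_write_tokens=500,
                           output_tokens=800, burndown_rate=5))
# 2000 + 500 + (800 x 5) = 6,500 tokens counted against the quota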
Requesting quota increases
To request quota increases for CRIS in Canada:

Navigate to the AWS Service Quotas console in the Canada (Central) Region
Search for the specific model quota (for example, “Claude Sonnet 4.5 tokens per minute”)
Submit an increase request based on your projected usage

Migrating from older Claude models to Claude 4.5
Organizations currently using older Claude models should plan their migration to Claude 4.5 to leverage the latest model capabilities.
To plan your migration strategy, incorporate the following elements:

Benchmark current performance: Establish baseline metrics for your existing models.
Test with representative workloads and optimize prompts: Validate Claude 4.5 performance with your specific use cases, adjust prompts to leverage Claude 4.5’s enhanced capabilities, and make use of the Amazon Bedrock prompt optimizer tool.
Implement gradual rollout: Transition traffic progressively.
Monitor and adjust: Track performance metrics and adjust quotas as needed.

Choosing between US and Global inference profiles
When implementing CRIS from Canada, organizations can choose between US and Global inference profiles based on their specific requirements.
US cross-Region inference is recommended for organizations with existing US data processing agreements, high throughput and resilience requirements, and development and testing environments.
Conclusion
Cross-Region inference for Amazon Bedrock represents an opportunity for Canadian organizations that want to use AI while maintaining data governance. By distinguishing between transient inference processing and persistent data storage, CRIS provides faster access to the latest foundation models without compromising compliance requirements.
With CRIS, Canadian organizations get access to new models within days instead of months. The system scales automatically during peak business periods while maintaining complete audit trails within Canada. This helps you meet compliance requirements and use the same advanced AI capabilities as organizations worldwide. To get started, review your data governance requirements and configure IAM permissions. Then test with the inference profile that matches your needs—US for lower latency to US Regions, or Global for maximum capacity.

About the authors
Daniel Duplessis is a Principal Generative AI Specialist Solutions Architect at Amazon Web Services (AWS), where he guides enterprises in crafting comprehensive AI implementation strategies and establishing the foundational capabilities essential for scaling AI across the enterprise.
Dan MacKay is the Financial Services Compliance Specialist for AWS Canada. He advises customers on recommended practices and practical solutions for cloud-related governance, risk, and compliance. Dan specializes in helping AWS customers navigate financial services and privacy regulations applicable to the use of cloud technology in Canada with a focus on third-party risk management and operational resilience.
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Serge Malikov is a Senior Solutions Architect Manager based out of Canada. His focus is on the financial services industry.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and Amazon SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Sharadha Kandasubramanian is a Senior Technical Program Manager for Amazon Bedrock. She drives cross-functional GenAI programs for Amazon Bedrock, enabling customers to grow and scale their GenAI workloads. Outside of work, she’s an avid runner and biker who loves spending time outdoors in the sun.

Power up your ML workflows with interactive IDEs on SageMaker HyperPod

Amazon SageMaker HyperPod clusters with Amazon Elastic Kubernetes Service (EKS) orchestration now support creating and managing interactive development environments such as JupyterLab and open source Visual Studio Code, streamlining the ML development lifecycle by providing data scientists with managed environments for familiar tools. This feature introduces a new add-on called Amazon SageMaker Spaces that lets AI developers create and manage self-contained environments for running notebooks. Organizations can now maximize their GPU investments by running both interactive workloads and their training jobs on the same infrastructure, with support for fractional GPU allocations to improve cost efficiency. The feature reduces the complexity of managing multiple development environments so teams can focus on building and deploying their AI and ML models.
This post shows how HyperPod administrators can configure Spaces for their clusters, and how data scientists can create and connect to these Spaces. You’ll also learn how to connect directly from your local VS Code environment to Spaces created in HyperPod.
Solution overview
The following diagram showcases the different components involved in creating and managing Spaces on HyperPod clusters.

Here’s how the feature works:

Cluster administrator installs the Spaces add-on from the SageMaker AI console. The administrator can either use a Quick install or a Custom install option to install the add-on.
Once the cluster is set up, data scientists and AI developers can create Spaces using HyperPod Command Line Interface, or kubectl.
Once the Space is created, the user can connect to a running Space through one of the following two options:

Access Space Web UI: This requires setting up an AWS Application Load Balancer (ALB) and setting up or registering your own custom Domain Name System (DNS) in Amazon Route 53. Once the custom domain is set up, the user will be able to connect to the JupyterLab or Code Editor space securely using a presigned URL through their web browser.
Remote IDE connection (connect to the Space remotely from local Visual Studio Code): SSH-over-SSM tunneling is used under the hood to securely connect remote IDEs to SageMaker Spaces pods without requiring customers to manage SSH keys or exposing port 22.

Prerequisites
To follow along, you need the following prerequisites:

An AWS account with permissions to create IAM roles, SageMaker resources such as HyperPod, and access to EKS cluster resources. If you are creating a new SageMaker HyperPod cluster, you will also need permissions to create networking and storage resources, see IAM permissions for cluster creation.
A SageMaker HyperPod cluster orchestrated using EKS, running Kubernetes version 1.30 or later. If you do not have one, you can create one by following the instructions in Creating a SageMaker HyperPod cluster with Amazon EKS orchestration. This workflow will create a HyperPod cluster, an EKS cluster, and the associated resources such as an Amazon Virtual Private Cloud (VPC) and an Amazon FSx for Lustre volume for storage.
HyperPod CLI installed (or kubectl).
A local IDE such as VS Code, with the AWS Toolkit for VS Code installed, to connect to the Spaces.

Step 1: Install the Spaces add-on
To get started, first install the Spaces add-on to your SageMaker cluster. This add-on allows users to run JupyterLab and Code Editor applications directly on cluster compute. The Quick install option is the fastest way to get started. With a single click, SageMaker AI automatically creates and configures the required AWS resources with optimized defaults. Here’s how to install it:

In the SageMaker AI console, choose Clusters on the left pane and navigate to your HyperPod cluster
Choose the IDE and Notebooks tab
Choose Quick install

Review the dependencies that will be automatically installed and choose Install.

The Quick install will create the associated dependencies for your Spaces add-on with default settings. They are listed below:

IAM roles for SageMaker Spaces:

Controller pod role for AWS API calls and AWS Systems Manager Session Manager (SSM) operations.
In-cluster router role for AWS Key Management Service (KMS) operations and JWT signing.
SSM managed instance role for remote access to Spaces. A list of the IAM roles and the required permissions are available in Set up permissions.

Remote access components:

Enables SSH connectivity to Spaces including SSM activation and session documents. This activates Systems Manager Advanced tier which includes additional per-instance charges.

Dependent EKS add-ons:

Cert-manager for certificate management.
Amazon Elastic Block Store (EBS) CSI driver for persistent storage volumes.
AWS Load Balancer Controller to manage AWS Elastic Load Balancers.

SageMaker Spaces add-on:

Deploys the Spaces controller and in-cluster router for managing Space lifecycle operations.

The Quick install option does not install web UI configurations such as Route 53 DNS records and SSL certificates for accessing Spaces through the web browser. Administrators can either use the Custom install option or configure these properties after installation of the add-on. For instructions on configuring web browser access, see Operator installing – helm/Console.
The installation typically takes 2-5 minutes, depending on whether pre-existing dependencies are available or the Spaces add-on needs to provision completely new resources. After installation completes, administrators can perform the following actions:

View the Spaces created by data scientists in the Spaces table
Configure namespaces to organize Spaces by team or project
Create Space templates with pre-configured settings for common use cases
Edit the configuration as needed to enable or disable Spaces features or change your configuration settings

For production use cases, we recommend using the Custom install option, where admins can set up fine-grained IAM policies that apply principle of least-privilege. For the full set of configurations that can be set up using the Custom install option, including namespaces and default templates, see Installation.
Step 2: Create or update EKS access entries
To give your users access to create and manage Spaces, grant them access through EKS access entries. The following two access entry policies are required:

AmazonSagemakerHyperpodSpacePolicy
AmazonSagemakerHyperpodSpaceTemplatePolicy

For instructions on creating and editing access entries, see Create access entries and Update access entries.
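If you prefer to script this step, the sketch below shows how the access entry and policy association could look with boto3. The cluster name and principal ARN are placeholders, and the access policy ARN format is an assumption based on standard EKS access policy naming, so verify both against the linked documentation before using it.

import boto3

eks = boto3.client("eks", region_name="us-east-1")
cluster_name = "my-hyperpod-eks-cluster"                              # placeholder
principal_arn = "arn:aws:iam::111122223333:role/DataScientistRole"    # placeholder

# Create an access entry for the data scientist principal.
eks.create_access_entry(clusterName=cluster_name, principalArn=principal_arn)

# Associate the two HyperPod Spaces policies (ARN format assumed, verify in the docs).
for policy in ["AmazonSagemakerHyperpodSpacePolicy",
               "AmazonSagemakerHyperpodSpaceTemplatePolicy"]:
    eks.associate_access_policy(
        clusterName=cluster_name,
        principalArn=principal_arn,
        policyArn=f"arn:aws:eks::aws:cluster-access-policy/{policy}",
        accessScope={"type": "namespace", "namespaces": ["default"]},
    )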
Step 3: Create and manage Spaces
Data scientists can create JupyterLab and Code Editor Spaces on the cluster using kubectl or the HyperPod CLI. For detailed instructions on creating and managing Spaces, see Hyperpod CLI.
To create a Space, run the following commands:

# set cluster context using hyp CLI
hyp set-cluster-context --cluster-name <your-hyperpod-cluster-name>

# create a space
hyp create hyp-space \
    --name "data-science-space" \
    --display-name "Data Science Workspace" \
    --namespace "default"

The hyp create hyp-space command will create a Space with the default settings. To create a Code Editor space, use the command below:

hyp create hyp-space \
    --name code-editor-demo \
    --display-name "code-editor space" \
    --memory 8Gi \
    --template-ref name=sagemaker-code-editor-template,namespace=jupyter-k8s-system

You can modify the settings when creating the Space as well, see example below:

hyp create hyp-space \
    --name test-space \
    --display-name "test space" \
    --memory 8Gi \
    --volume name=vol,mountPath=/home/,persistentVolumeClaimName=pvcname

Once the Space is created, you can access the Space from either the web UI, or from your local VS Code. To open the Space in VS Code, run:

hyp create hyp-space-access \
    --name data-science-space \
    --connection-type vscode-remote

If you have set up the custom domain following our documentation, you can get the Space access URL as shown below. This will open your space on your browser.

hyp create hyp-space-access \
    --name data-science-space \
    --connection-type web-ui

Alternatively, you can connect to the Space from your local VS Code using the AWS toolkit. From your VS Code IDE, open the AWS toolkit panel. From the toolkit, under SageMaker AI, choose HyperPod. Here, you can list, start, stop, and connect to Spaces.

The Spaces need to be created using the HyperPod CLI or kubectl.
HyperPod CLI supports additional CRUD operations to Spaces such as updating, describing and deleting Spaces. For a list of the operations, see HyperPod CLI on Github.
For practitioners familiar with kubectl, they can also create, update and delete Spaces using kubectl. For example, you can create a Space using kubectl as shown below:

kubectl apply -f - <<EOF
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: training-workspace-1
  namespace: hyperpod-training-team
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-training-team-localqueue
    kueue.x-k8s.io/priority-class: ide-priority
spec:
  displayName: "Training Team Workspace 1"
  image: jupyter/minimal-notebook:latest
  desiredStatus: Running
  resources:
    requests:
      cpu: 3
      memory: 12Gi
    limits:
      cpu: 3
      memory: 12Gi
EOF

Best practices
We recommend the following best practices when using SageMaker Spaces.
User management, RBAC, and collaboration
SageMaker Spaces identifies users through Amazon EKS Access Entries, which are derived from your IAM identity when you interact with a Space using either the HyperPod CLI or kubectl. Your EKS captured identity may appear as an IAM user or as an assumed-role session ARN. For assumed roles, the session name can represent the actual user when admin applies IAM policy to enforce assumed role session names that reflect individual identities. If session names are not enforced or do not uniquely map to users, SageMaker Spaces access control falls back to role-based access control, causing all users sharing the same role to be treated as the same identity. For more details see Add users and set up service accounts.
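To illustrate the session-name enforcement mentioned above, an administrator could attach a condition like the following to the policy that grants sts:AssumeRole on the shared role. This is a generic IAM pattern shown as a hypothetical Python dict, not the exact policy SageMaker requires; the account ID and role name are placeholders.

# Hypothetical identity-based policy statement: the STS session name must match
# the caller's IAM user name, so EKS access entries can map to individual users.
enforce_session_name = {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "arn:aws:iam::111122223333:role/DataScientistSharedRole",  # placeholder
    "Condition": {
        "StringEquals": {"sts:RoleSessionName": "${aws:username}"}
    },
}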
Spaces can either be private, accessible only by the user who created the Spaces, or public, accessible by any user who has access to the hosting Kubernetes namespace. Spaces are public by default. The creator and the administrator group still retain full control, including the ability to update or delete the Space. A Space becomes private only when access is restricted to the creator and the admin group. This model gives teams a flexible foundation: public Spaces support open collaboration within a shared environment, while private Spaces provide isolation.
Multiple users can collaborate on the same Space if it is configured to be shared. When enabled with SageMaker Distribution images for JupyterLab environments, we also support real time collaboration (RTC) which enables multiple users to collaborate on the interactive ML experiments and workloads.
Admin defaults and controls
Templates set up by admins help data scientists quickly use pre-configured Space settings for their use case. SageMaker provides two pre-created system templates, one for JupyterLab and one for Code Editor, so that data scientists can get started without additional configuration. Admins can also set up custom templates for data scientists with custom configurations such as image, storage, and compute. Templates can be used by data scientists in the cluster and are flexible depending on the needs of admins. Admins can create multiple templates based on specific use cases, projects, or dependency requirements.
Customizing Spaces
Administrators and developers can customize their Spaces using custom images and lifecycle scripts. Use lifecycle scripts for minimal customization such as installing additional packages, setting up default variables, or running clean up tasks, while still using the SageMaker Distribution image capabilities. For organizations that have a standardized image for development and training, SageMaker Spaces also supports custom images and entry points for users. For custom image specifications, see Customization.
Shutdown idle compute
Spaces by default support automatic shutdown of idle workspaces to optimize resource usage. When idle shutdown is enabled, the system periodically checks the Space for activity and if the workspace is idle for the specified timeout duration, the workspace automatically stops, freeing up the compute resources for other tasks. Administrators can set default timeouts and optionally avoid overrides to defaults to enforce the idle shutdown.
Integration with other HyperPod add-ons
For guardrails against excess resource usage, set up HyperPod task governance, which provides comprehensive resource management controls. To help prevent workspaces from being evicted due to changes in unrelated workloads, configure task governance to set interactive ML workloads as the highest priority or schedule them in task governance namespaces with eviction turned off.
Set up the HyperPod observability plugin to monitor the resource usage of Spaces running within the cluster. With a one-click install, the observability plugin provides insight into how many resources Spaces are using over time, allowing admins to observe and tune their compute allocations.
Fractional GPU support
SageMaker Spaces support fractional GPU configurations, specifically the MIG technology provided by NVIDIA GPUs. Fractional GPU support with MIG means that users can share GPU instances, optimizing compute usage, while still providing isolation between workloads. This means that experiments running on a fractional GPU profile are unlikely to interfere with other workloads running on the same GPU.
To check if an instance in your cluster supports fractional GPU, run the command:

hyp list-accelerator-partition-type --instance-type <instance type>

If your cluster contains instance groups that support fractional GPU, you can create a space with fractional GPU as shown below:

hyp create hyp-space \
    --name test-space \
    --display-name "mig-testing" \
    --accelerator-partition-type mig-3g.20gb \
    --accelerator-partition-count 1 \
    --memory 8Gi \
    --template-ref sagemaker-code-editor-template

Clean up
To avoid incurring unnecessary charges, clean up the resources you created in this walkthrough.

Delete all spaces you created. Run this command for each space you created:

hyp delete hyp-space \
    --name <space-name>

Remove the SageMaker HyperPod Spaces add-on: From the cluster details page, navigate to the IDE and Notebooks tab, and choose Remove.
If you created a HyperPod cluster for the purposes of this blog, delete the cluster to avoid being charged for unused compute. To delete the cluster, follow the instructions in Deleting a SageMaker HyperPod cluster. Additionally, if you used the console to create the cluster, go to the AWS CloudFormation console and delete the parent stack to remove the additional resources such as storage and networking resources created for the cluster. The parent stack will be in the format sagemaker-<your-hyperpod-cluster-name>-<unique-id>

Conclusion
Spaces in SageMaker HyperPod boost data scientist and AI developer productivity by providing more secure, managed development environments on purpose-built compute. We walked through the setup steps for administrators and data scientists, showing how teams can quickly create and connect to Spaces. With this feature, teams can reduce time spent on environment setup and focus on model development, while maintaining consistent development environments. By integrating with HyperPod task governance features, administrators can optimize for cost and equitable compute allocation.

About the authors
Durga Sury is a Senior Solutions Architect at Amazon SageMaker, helping enterprise customers build secure and scalable AI/ML systems. When she’s not architecting solutions, you can find her enjoying sunny walks with her dog, immersing herself in murder mystery books, or catching up on her favorite Netflix shows.
 Edward Sun is a Senior SDE working for SageMaker Studio at Amazon Web Services. He is focused on building interactive ML solutions and simplifying the customer experience to integrate SageMaker Studio with popular technologies in data engineering and ML landscape. In his spare time, Edward is big fan of camping, hiking, and fishing, and enjoys spending time with his family.
Josh Dunne is a Senior UX Designer at SageMaker AI at Amazon Web Services. He has 7+ years of experience across UX and product management, with a focus on ML/AI and cloud computing creating practical, straightforward to use workflows for machine learning builders across SageMaker AI, including HyperPod, SageMaker Studio, SageMaker Unified Studio, and interactive IDEs.  Outside of work, he enjoys exploring the Pacific Northwest and traveling with his wife and their dog and trying new restaurants.
Joshua Towner is a Senior SDE working for SageMaker AI at Amazon Web Services, where he is currently working on building and improving interactive ML solutions for SageMaker Studio and HyperPod. Outside of work, he enjoys traveling, skiing, and watching movies.
Khushboo Srivastava is a Product Manager for Amazon SageMaker, AWS. She enjoys building products that simplify machine learning workflows for users. With over 7+ years in software engineering and data science, and 7+ years in product management, Khushboo has launched several products and services that have helped accelerate speed of AI/ML development for customers. With her background in generative AI and distributed computing, and her passion for democratizing AI, she is committed to sharing insights and empowering others in their AI and open source journey.
Prayag Singh is a Senior SDE working for SageMaker AI at Amazon Web Services. With 10+ years of software development experience, he focuses on integrating customers’ preferred ML tools and IDEs on SageMaker Studio and HyperPod. Outside of work, Prayag enjoys traveling and all things comedy, from stand-up specials to sitcoms. You can find him on LinkedIn.

Claude Opus 4.5 now in Amazon Bedrock

Anthropic’s newest foundation model, Claude Opus 4.5, is now available in Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models from leading AI companies. Opus 4.5 is a meaningful step forward in what AI systems can do and sets a new standard across coding, agents, computer use, and office tasks. It outperforms both Sonnet 4.5 and Opus 4.1 while providing Opus-level intelligence at one-third the cost.
In this post, I’ll show you what makes this model different, walk through key business applications, and demonstrate how to use Opus 4.5’s new tool use capabilities on Amazon Bedrock. By the end, you’ll understand how to use this model’s capabilities for production agent deployments.
Claude Opus 4.5: What makes this model different
Opus 4.5 is Anthropic’s most advanced model offering in the Opus class, designed for developers building sophisticated AI agents that can reason, plan, and execute complex tasks with minimal oversight. It upgrades Sonnet 4.5 with better performance on existing use cases and adds new capabilities for complex workflows.
The model excels at professional software engineering, achieving 80.9% on SWE-bench Verified and helping to transform multi-day development projects into hours-long tasks. It works more independently, with improved multilingual coding capabilities and enhanced behaviors such as more efficient code, better test coverage, and cleaner architecture choices. For office productivity, the model handles complex projects end-to-end. It powers agents that create PowerPoint presentations, Excel spreadsheets, and Word documents with professional polish, including document redlining for contracts and NDAs. The model also produces higher quality React and HTML artifacts. It maintains consistency and accuracy, which matters for finance and other industries where precision is critical, and it keeps context across files throughout long projects.
This is Anthropic’s best vision model yet, achieving 80.7% on MMMU. It suits workflows that depend on complex visual interpretation and multi-step navigation, such as analyzing design mockups, processing documents with complex layouts, or automating browser-based tasks, with computer use performance improving further still.
The model introduces two key improvements for agent developers. The tool search tool lets agents work with hundreds of tools by dynamically discovering and loading only what they need instead of loading all definitions upfront—potentially saving tens of thousands of tokens and preventing schema confusion when scaling to large tool libraries. Tool use examples lets you provide sample tool calls directly in the tool definition, improving accuracy for complex schemas with nested objects or arrays.

Opus 4.5 performance benchmarks (source: https://www.anthropic.com/news/claude-opus-4-5)

Business applications and use cases
Opus 4.5 excels in the following use cases:

Software development: Build agents that write and refactor code across entire projects, manage full-stack architectures, or design agentic systems that break down high-level goals into executable steps. This generation of Claude spans the full development lifecycle: Opus 4.5 for production code and sophisticated agents (those using 10+ tools in workflows like end-to-end software engineering, cybersecurity, or financial analysis), Sonnet 4.5 for rapid iteration and scaled user experiences, Haiku 4.5 for sub-agents and free-tier products. Opus 4.5 can analyze technical documentation, plan a software implementation, write the required code, and iteratively refine it—while tracking requirements and architectural context throughout the process.
Enterprise operations and office tasks: Manage complex projects from start to finish. Opus 4.5 uses memory to maintain context and consistency across files, alongside improvements in creating spreadsheets, slides, and documents. The model handles ongoing enterprise projects, automating manual workflows.
Financial analysis: Work across complex information systems—regulatory filings, market reports, internal data—enabling predictive modeling and proactive compliance. The model’s consistency and accuracy make it useful for finance and other industries where precision matters.
Cybersecurity: Bring professional-grade analysis to security workflows, correlating logs, security issue databases, and security intelligence for security event detection and automated incident response.

Integration with Amazon Bedrock AgentCore
Amazon Bedrock provides the enterprise foundation for deploying Opus 4.5 in production. The fully managed service provides a unified API for foundation models with enterprise-grade security, compliance, and governance.
Opus 4.5 integrates with Amazon Bedrock AgentCore, which provides the infrastructure and primitives for building production agents. AgentCore includes persistent memory for maintaining context across sessions, Tool Gateway for converting your APIs and Lambda functions into agent-compatible tools, and built-in identity and access management for secure resource access. You can deploy and monitor agents with complete session isolation, long-running workflow support (up to 8 hours), and observability features—so you can focus on building agents instead of managing infrastructure.
Amazon Bedrock AgentCore provides additional capabilities for production deployments. The Tool Gateway converts your existing APIs and Lambda functions into agent-compatible tools with minimal code—working with the model’s tool search feature to orchestrate hundreds of tools. Built-in observability through Amazon CloudWatch tracks token usage, latency, and error rates across your agent workflows.
Getting started
Access the Opus 4.5 model today through Amazon Bedrock. I’ll demonstrate the model’s tool search capability—a feature that lets agents work with hundreds of tools without loading all definitions into context upfront. First, I import the required modules and set up the Amazon Bedrock client:

# Import required libraries
import boto3
import json

# Create a session and Bedrock client
session = boto3.Session()
bedrock_client = session.client(
    service_name="bedrock-runtime",
    region_name="us-east-1"
)

For this example, I’ll define multiple tools with defer_loading to enable tool search. This lets the model discover and load only the tools it needs instead of loading all definitions upfront:

# Define tools with tool search enabled
tools = [
    # Enable tool search - allows dynamic tool discovery
    {
        "type": "tool_search_tool_regex",
        "name": "tool_search_tool_regex"
    },
    # Tools marked with defer_loading are discovered on-demand
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        },
        "defer_loading": True,
        # Provide example inputs to improve accuracy for complex schemas
        "input_examples": [
            {"location": "San Francisco, CA", "unit": "fahrenheit"},
            {"location": "Tokyo, Japan", "unit": "celsius"}
        ]
    },
    {
        "name": "search_documentation",
        "description": "Search AWS documentation",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "service": {"type": "string"}
            },
            "required": ["query"]
        },
        "defer_loading": True,
        "input_examples": [
            {"query": "Lambda pricing", "service": "lambda"},
            {"query": "S3 bucket policies"}
        ]
    },
    {
        "name": "analyze_logs",
        "description": "Analyze application logs for errors",
        "input_schema": {
            "type": "object",
            "properties": {
                "log_file": {"type": "string"},
                "time_range": {"type": "string"}
            },
            "required": ["log_file"]
        },
        "defer_loading": True,
        "input_examples": [
            {"log_file": "/var/log/app.log", "time_range": "last 24 hours"},
            {"log_file": "/var/log/error.log"}
        ]
    }
]

Now I call the model using the invoke_model API with the effort parameter set to medium:

# Construct the request with beta features enabled
request_body = {
    "anthropic_version": "bedrock-2023-05-31",
    # Enable beta features: tool search, tool examples, and effort parameter
    "anthropic_beta": ["tool-search-tool-2025-10-19", "tool-examples-2025-10-29", "effort-2025-11-24"],
    "max_tokens": 4096,
    "temperature": 0.7,
    # Set effort to "medium" for balanced token usage
    "output_config": {
        "effort": "medium"
    },
    "messages": [
        {
            "role": "user",
            "content": "What's the weather in Seattle?"
        }
    ],
    "tools": tools
}

# Invoke the model
response = bedrock_client.invoke_model(
    modelId="global.anthropic.claude-opus-4-5-20251101-v1:0",
    body=json.dumps(request_body)
)

# Parse the response
response_body = json.loads(response["body"].read())

The model uses tool search to find the relevant tool (get_weather) from the library without loading all tool definitions upfront. The effort parameter, available in beta, controls how liberally the model spends tokens across thinking, tool calls, and responses. You can set effort to high for best results, medium for balanced usage, or low for conservative token usage.
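To see which tool the model selected, you can walk the returned content blocks. The sketch below assumes the standard Anthropic Messages response shape on Amazon Bedrock (content blocks of type text or tool_use); the handling is illustrative rather than a complete agent loop.

# Inspect the response content blocks (assumes the standard Anthropic Messages shape)
for block in response_body.get("content", []):
    if block.get("type") == "text":
        print("Model text:", block["text"])
    elif block.get("type") == "tool_use":
        # The model discovered and selected a tool (here, get_weather) via tool search
        print("Tool requested:", block["name"])
        print("Tool input:", json.dumps(block["input"], indent=2))
        # A full agent loop would execute the tool here and send back a
        # tool_result content block in a follow-up message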
Key features for agent development
Opus 4.5 has several capabilities that make it well-suited for building production agents. The model maintains coherence across extended workflows, enabling consistent decision-making for agents that run multi-step processes over hours or days. Better tool handling means agents interact more reliably with external systems, APIs, and software interfaces—the model chooses the right tools and interprets results more accurately. Opus 4.5 also tracks information across conversation turns and maintains context, helping agents accumulate knowledge over time and make decisions based on history.
The effort parameter, available in beta, gives you control over token usage. You can set it to high for best results when quality matters most, medium for balanced performance, or low for conservative token usage. Opus 4.5 adjusts token spending across thinking, tool calls, and responses based on this setting. For production deployments, Amazon Bedrock AgentCore provides monitoring and observability through CloudWatch integration, tracking token usage in real-time (useful when tuning the effort parameter), along with latency metrics, session duration, and error rates to help optimize agent performance and manage costs.
Pricing
The model is priced at $5 per million input tokens and $25 per million output tokens, making Opus-level intelligence accessible at one-third the cost of previous offerings.
Availability and access
This model is available today in Amazon Bedrock through cross-Region inference, which automatically routes requests to available capacity across AWS Regions for higher throughput during peak demand.
Use this model for agents that handle long-running tasks, coordinate multiple tools, or maintain context across extended sessions.
For detailed information about availability, pricing, and model specifications, visit the Amazon Bedrock documentation.
Conclusion
This post showed you how to get started with Claude Opus 4.5 in Amazon Bedrock. Opus 4.5 excels at complex, long-running workflows like software development and enterprise operations, and its strengths in tool handling, context management, and decision-making make it valuable for building agents that operate reliably in production environments. The model works well for agents in software engineering, research synthesis, and enterprise workflow automation.
I encourage you to experiment with Opus 4.5 for your own agent workflows. Consider how its capabilities could improve manual processes in your organization, or support new types of automation. The combination of Opus 4.5’s capabilities with Amazon Bedrock’s enterprise features provides a foundation for production AI agents.
To get started, try the model in the Amazon Bedrock console, explore the technical documentation, and check out Anthropic’s Claude model detail page for more information about its capabilities. To deploy agents at scale, explore Opus 4.5 in Amazon Bedrock AgentCore for managed infrastructure with tool orchestration and monitoring.
I’d love to hear about what you build with this model—share your experiences and agent use cases in the comments below!

About the authors
Jonathan Evans is a Worldwide Solutions Architect for Generative AI at AWS, where he helps customers leverage cutting-edge AI technologies with Anthropic’s Claude models on Amazon Bedrock, to solve complex business challenges. With a background in AI/ML engineering and hands-on experience supporting machine learning workflows in the cloud, Jonathan is passionate about making advanced AI accessible and impactful for organizations of all sizes.

Moonshot AI Researchers Introduce Seer: An Online Context Learning System for Fast Synchronous Reinforcement Learning RL Rollouts

How do you keep reinforcement learning for large reasoning models from stalling on a few very long, very slow rollouts while GPUs sit underused? A team of researchers from Moonshot AI and Tsinghua University introduces 'Seer', a new online context learning system that targets a specific systems bottleneck in reinforcement learning for large language models. In synchronous on policy setups, the rollout phase dominates the cost of each iteration. Seer restructures this phase and reports rollout throughput gains of 74 percent to 97 percent and tail latency reductions of 75 percent to 93 percent compared with a strong synchronous baseline called veRL.

https://arxiv.org/pdf/2511.14617

Why is synchronous rollout slow for reasoning models?

Modern reasoning RL workloads use long chain of thought style outputs. In the Seer experiments, the researchers apply GRPO to three different models, Moonlight, Qwen2 VL 72B and Kimi K2. These workloads run on 32 compute nodes with 8 H800 GPUs per node. The three tasks use 32, 128 and 256 GPUs respectively, with 400, 600 and 800 prompts per iteration and 8 or 16 responses per prompt.

Maximum generation length is large. Moonlight is configured for 65,536 tokens, Qwen2 VL 72B for 40,960 tokens and Kimi K2 for 98,304 tokens. A single long chain of thought request can grow from a few hundred megabytes of KVCache to tens of gigabytes as decoding progresses. This memory growth forces instances to reduce concurrency or to preempt requests, which triggers expensive re decoding.
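To make the KVCache growth concrete, here is a back-of-the-envelope estimate in Python; the layer count, head configuration, and cache precision below are assumed values for a large model, not figures from the paper.

# Rough KVCache size estimate per request (assumed dimensions, BF16 cache)
num_layers = 61        # assumption for illustration, not from the paper
num_kv_heads = 8       # assumption (grouped-query attention)
head_dim = 128         # assumption
bytes_per_elem = 2     # BF16
per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V

for seq_len in (512, 98_304):   # short prefix vs Kimi K2's configured maximum
    size_gb = per_token_bytes * seq_len / 1e9
    print(f"{seq_len:>6} tokens -> {size_gb:6.2f} GB of KVCache")
# A short prefix costs on the order of 0.1 GB, while a fully decoded long
# chain of thought reaches tens of gigabytes, consistent with the article.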

The research team defines tail requests as the last 10 percent of requests to finish in a rollout. For Moonlight and Qwen2 VL 72B, this tail alone can consume up to 50 percent of the total rollout time in the baseline system. Rollout already dominates iteration time, so this tail effect directly slows RL.


Seer architecture on top of Mooncake and vLLM

Seer keeps the RL algorithm identical to synchronous veRL. Each training iteration uses only data from the current rollout iteration, so the system preserves on policy behavior. The training phase uses Megatron for distributed optimization. The rollout phase uses an in house implementation of vLLM as the inference engine.

To support aggressive request scheduling, Seer relies on a Global KVCache Pool built on the Mooncake disaggregated KVCache architecture used in production for Kimi. Mooncake provides a two tier DRAM and SSD KV cache store shared across inference nodes, which allows Seer to migrate requests without recomputing prefills.

On top of this substrate, Seer introduces three key mechanisms:

Divided Rollout

Context Aware Scheduling

Adaptive Grouped Speculative Decoding

These are orchestrated by a Request Buffer, a Context Manager and an Inference Engine Pool connected to the Global KVCache Pool.


Divided Rollout, fine grained scheduling and migration

Conventional synchronous rollout assigns whole GRPO groups to inference instances. A group is a set of requests that share one prompt. Once assigned, a group stays on the same instance until all responses finish. Due to large variance in output lengths, this leads to load imbalance and long running stragglers.

Seer breaks groups down in two steps. It first decomposes each group into individual requests. It then divides each request into multiple chunks based on generation length. When the scheduler dispatches a request from the Request Buffer, it sets a small max tokens value such as 8,000 tokens for that chunk. After each chunk, the request is re enqueued until it reaches an end of sequence token or its original max tokens limit.

Because KVCache is stored in the Global KVCache Pool, divided requests can move between instances at chunk boundaries without re running the prefill. The scheduler maintains a concurrency level that keeps memory utilization high while avoiding preemption. This reduces waste and smooths KVCache usage across the iteration.
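As a rough illustration of this chunked scheduling loop, the sketch below uses hypothetical helper objects and callables; the request fields, engine pool, and buffer are placeholders for exposition, not Seer's actual implementation.

# Illustrative sketch of divided rollout (hypothetical names, not Seer's code)
from collections import deque
from dataclasses import dataclass

CHUNK_TOKENS = 8_000  # per-chunk max_tokens value, as described above

@dataclass
class RolloutRequest:
    prompt_id: int
    max_tokens: int
    generated_tokens: int = 0
    hit_eos: bool = False

def divided_rollout(requests, generate_chunk, pick_instance):
    """generate_chunk(instance, req, budget) and pick_instance() stand in for the
    inference engine pool; KVCache lives in the shared Global KVCache Pool, so a
    request can resume on any instance without recomputing the prefill."""
    buffer = deque(requests)                            # the Request Buffer
    finished = []
    while buffer:
        req = buffer.popleft()
        instance = pick_instance()                      # least-loaded engine instance
        budget = min(CHUNK_TOKENS, req.max_tokens - req.generated_tokens)
        generate_chunk(instance, req, budget)           # decode up to `budget` tokens
        if req.hit_eos or req.generated_tokens >= req.max_tokens:
            finished.append(req)
        else:
            buffer.append(req)                          # re-enqueue remaining chunks
    return finished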

Context Aware Scheduling using group length statistics

The research team observe that different requests in the same group tend to have correlated output lengths. Seer uses this structure as online context. For each prompt group, it designates one request as the speculative request. The scheduler keeps speculative requests in a high priority queue and serves them with a smallest first policy based on generated tokens so far. Short requests complete quickly and exit. Long requests remain and identify groups that are potential tail candidates.

The Context Manager maintains a length estimate for each group. It updates this estimate to the maximum generated length among completed requests in the group. If no request has finished, it uses the original max tokens as a conservative bound. Once speculative requests are in flight or done, Seer schedules remaining requests with an approximate longest first policy at group level. This design achieves throughput and tail behavior close to an oracle scheduler that knows all output lengths in advance.
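The length estimate and the resulting ordering can be summarized with a small sketch; the group and request containers are hypothetical placeholders, not the actual Context Manager API.

# Simplified sketch of the group length estimate behind context aware scheduling
def group_length_estimate(group):
    completed = [r.generated_tokens for r in group.requests if r.finished]
    if completed:
        return max(completed)      # longest completed response seen so far
    return group.max_tokens        # conservative bound before any request finishes

def schedule_remaining_groups(groups):
    # Approximate longest-first ordering at group level, as described above
    return sorted(groups, key=group_length_estimate, reverse=True)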


Adaptive Grouped Speculative Decoding

Seer adds Adaptive Grouped Speculative Decoding on top of the previous two components to accelerate decoding, especially for long requests in the tail. It introduces a Distributed Grouped Draft Server, or DGDS. DGDS maintains a Compressed Suffix Tree for each group and aggregates token sequences from all requests in that group. Instances asynchronously append generated tokens to DGDS, periodically fetch updated suffix trees and perform local speculative decoding based on the shared pattern statistics.

The system adjusts draft length and the number of paths according to model architecture, batch size and measured acceptance length. For dense and Mixture of Experts models, it pre-computes different speculation thresholds and uses them to bound draft depth for each batch. In late tail stages, concurrency is low, so Seer increases draft depth and enables multi path drafting to raise accepted tokens per step.
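The following toy sketch conveys the idea of drafting from shared group statistics; a plain n-gram table stands in for the Compressed Suffix Tree, so this is a conceptual illustration, not the DGDS data structure or protocol.

# Toy illustration of grouped drafting from shared statistics
from collections import defaultdict

class GroupDraftStats:
    def __init__(self, context_len=4):
        self.context_len = context_len
        self.next_token_counts = defaultdict(lambda: defaultdict(int))

    def append(self, tokens):
        # Aggregate token sequences generated by any request in the group
        for i in range(len(tokens) - self.context_len):
            ctx = tuple(tokens[i:i + self.context_len])
            self.next_token_counts[ctx][tokens[i + self.context_len]] += 1

    def propose_draft(self, suffix, depth):
        # Greedily extend the most frequent continuation, up to `depth` tokens;
        # the target model then verifies the draft, as in speculative decoding
        draft = []
        ctx = tuple(suffix[-self.context_len:])
        for _ in range(depth):
            candidates = self.next_token_counts.get(ctx)
            if not candidates:
                break
            token = max(candidates, key=candidates.get)
            draft.append(token)
            ctx = ctx[1:] + (token,)
        return draft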

Ablation results show that divided rollout yields up to 35 percent throughput improvement over the baseline. Adding Context Aware Scheduling increases this to up to 47 percent over baseline. Enabling grouped speculative decoding raises the total speedup to 77 percent to 87 percent over the baseline in the evaluated iteration.

End to end impact on RL training

The research team evaluate Seer on three RL tasks built on Moonlight, Qwen2 VL 72B and Kimi K2. They run 10 rollout iterations per task and measure output tokens per second and completion time for each rollout. Seer improves rollout throughput by 74 percent to 97 percent across these workloads relative to veRL with the same RL algorithm and vLLM based inference engine.

Tail latency is reduced by 75 percent to 93 percent. For memory constrained tasks, the baseline system spends up to half of its time on the last 10 percent of requests. Seer removes most of this tail by combining divided rollout, Context Aware Scheduling and Adaptive Grouped Speculative Decoding on top of the Mooncake based Global KVCache Pool.

Key Takeaways

Rollout bottleneck: Seer targets the rollout phase of synchronous RL, which accounts for about 63% to 87% of iteration time and is dominated by long tail requests and KV cache fragmentation.

Three core mechanisms: Seer combines divided rollout, context aware scheduling and adaptive grouped speculative decoding to exploit output length and pattern similarity among GRPO responses that share a prompt.

Fine grained scheduling on a global KV cache: Requests are split into chunks and migrated across a Mooncake style Global KVCache Pool, which preserves synchronous on policy RL while keeping GPU memory utilization high and reducing preemptions.

Online context for tail latency reduction: Group level length statistics from speculative requests drive context aware scheduling that approximates an oracle longest first scheduler and sharply reduces the time spent on the last 10 percent of requests.

Measured end to end gains: On production grade RL workloads with Moonlight, Qwen2 VL 72B and Kimi K2, Seer improves rollout throughput by 74% to 97% and reduces long tail latency by 75% to 93% relative to a state of the art synchronous vLLM based baseline.

Editorial Comments

Seer is an important systems contribution because it optimizes the rollout phase in synchronous RL without changing the underlying GRPO algorithm, so it preserves on policy guarantees and reproducibility while fixing a real infrastructure bottleneck. The combination of divided rollout, context aware scheduling and adaptive grouped speculative decoding offers a practical template for other RL stacks that rely on long chain of thought reasoning models and large KVCache footprints. Overall, Seer shows that online context learning at the systems level is now as critical as model architecture for scaling reasoning RL efficiently.

Check out the Paper here.
The post Moonshot AI Researchers Introduce Seer: An Online Context Learning System for Fast Synchronous Reinforcement Learning RL Rollouts appeared first on MarkTechPost.

How to Design a Mini Reinforcement Learning Environment-Acting Agent with Intelligent Local Feedback, Adaptive Decision-Making, and Multi-Agent Coordination

In this tutorial, we code a mini reinforcement learning setup in which a multi-agent system learns to navigate a grid world through interaction, feedback, and layered decision-making. We build everything from scratch and bring together three agent roles: an Action Agent, a Tool Agent, and a Supervisor, so we can observe how simple heuristics, analysis, and oversight combine to produce more intelligent behavior. Also, we observe how the agents collaborate, refine their strategies, and gradually learn to reach the goal while overcoming obstacles and uncertainty. Check out the FULL CODES here.

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
import time
from collections import defaultdict

class GridWorld:
    def __init__(self, size=8):
        self.size = size
        self.agent_pos = [0, 0]
        self.goal_pos = [size-1, size-1]
        self.obstacles = self._generate_obstacles()
        self.visited = set()
        self.step_count = 0
        self.max_steps = size * size * 2

    def _generate_obstacles(self):
        obstacles = set()
        n_obstacles = self.size
        while len(obstacles) < n_obstacles:
            pos = (np.random.randint(1, self.size-1),
                   np.random.randint(1, self.size-1))
            if pos != (0, 0) and pos != (self.size-1, self.size-1):
                obstacles.add(pos)
        return obstacles

    def reset(self):
        self.agent_pos = [0, 0]
        self.visited = {tuple(self.agent_pos)}
        self.step_count = 0
        return self._get_state()

    def _get_state(self):
        return {
            'position': tuple(self.agent_pos),
            'goal': self.goal_pos,
            'distance_to_goal': abs(self.agent_pos[0] - self.goal_pos[0]) +
                                abs(self.agent_pos[1] - self.goal_pos[1]),
            'visited_count': len(self.visited),
            'steps': self.step_count,
            'can_move': self._get_valid_actions()
        }

    def _get_valid_actions(self):
        valid = []
        moves = {'up': [-1, 0], 'down': [1, 0], 'left': [0, -1], 'right': [0, 1]}
        for action, delta in moves.items():
            new_pos = [self.agent_pos[0] + delta[0], self.agent_pos[1] + delta[1]]
            if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size and
                    tuple(new_pos) not in self.obstacles):
                valid.append(action)
        return valid

We set up the entire GridWorld environment and define how the agent, goal, and obstacles exist in it. We establish the structure for state representation and valid movements, and we prepare the environment so we can interact with it dynamically. As we run this part, we see the world taking shape and becoming ready for the agents to explore. Check out the FULL CODES here.

class GridWorld(GridWorld):
    def step(self, action):
        self.step_count += 1
        moves = {'up': [-1, 0], 'down': [1, 0], 'left': [0, -1], 'right': [0, 1]}

        if action not in moves:
            return self._get_state(), -1, False, "Invalid action"

        delta = moves[action]
        new_pos = [self.agent_pos[0] + delta[0], self.agent_pos[1] + delta[1]]

        if not (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size):
            return self._get_state(), -1, False, "Hit wall"

        if tuple(new_pos) in self.obstacles:
            return self._get_state(), -1, False, "Hit obstacle"

        self.agent_pos = new_pos
        pos_tuple = tuple(self.agent_pos)
        reward = -0.1
        if pos_tuple not in self.visited:
            reward += 0.5
            self.visited.add(pos_tuple)

        done = False
        info = "Moved"
        if self.agent_pos == self.goal_pos:
            reward += 10
            done = True
            info = "Goal reached!"
        elif self.step_count >= self.max_steps:
            done = True
            info = "Max steps reached"

        return self._get_state(), reward, done, info

    def render(self, agent_thoughts=None):
        grid = np.zeros((self.size, self.size, 3))
        for pos in self.visited:
            grid[pos[0], pos[1]] = [0.7, 0.9, 1.0]
        for obs in self.obstacles:
            grid[obs[0], obs[1]] = [0.2, 0.2, 0.2]
        grid[self.goal_pos[0], self.goal_pos[1]] = [0, 1, 0]
        grid[self.agent_pos[0], self.agent_pos[1]] = [1, 0, 0]

        plt.figure(figsize=(10, 8))
        plt.imshow(grid, interpolation='nearest')
        plt.title(f"Step: {self.step_count} | Visited: {len(self.visited)}/{self.size*self.size}")
        for i in range(self.size + 1):
            plt.axhline(i - 0.5, color='gray', linewidth=0.5)
            plt.axvline(i - 0.5, color='gray', linewidth=0.5)
        if agent_thoughts:
            plt.text(0.5, -1.5, agent_thoughts, ha='center', fontsize=9,
                     bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
                     wrap=True, transform=plt.gca().transData)
        plt.axis('off')
        plt.tight_layout()
        plt.show()

We define how each step in the environment works and how the world is visually rendered. We calculate rewards, detect collisions, track progress, and display everything through a clean grid visualization. As we execute this logic, we watch the agent’s journey unfold in real time with clear feedback. Check out the FULL CODES here.

class ActionAgent:
    def __init__(self):
        self.q_values = defaultdict(lambda: defaultdict(float))
        self.epsilon = 0.3
        self.learning_rate = 0.1
        self.discount = 0.95

    def choose_action(self, state):
        valid_actions = state['can_move']
        if not valid_actions:
            return None
        pos = state['position']
        if np.random.random() < self.epsilon:
            action = np.random.choice(valid_actions)
            reasoning = f"Exploring randomly: chose '{action}'"
        else:
            action_values = {a: self.q_values[pos][a] for a in valid_actions}
            action = max(action_values, key=action_values.get)
            reasoning = f"Exploiting: chose '{action}' (Q={self.q_values[pos][action]:.2f})"
        return action, reasoning

    def learn(self, state, action, reward, next_state):
        pos = state['position']
        next_pos = next_state['position']
        current_q = self.q_values[pos][action]
        next_max_q = max([self.q_values[next_pos][a] for a in next_state['can_move']], default=0)
        new_q = current_q + self.learning_rate * (
            reward + self.discount * next_max_q - current_q)
        self.q_values[pos][action] = new_q

class ToolAgent:
    def analyze(self, state, action_taken, reward, history):
        suggestions = []
        distance = state['distance_to_goal']
        if distance <= 3:
            suggestions.append("Very close to goal! Prioritize direct path.")
        exploration_rate = state['visited_count'] / (state['steps'] + 1)
        if exploration_rate < 0.5 and distance > 5:
            suggestions.append("Low exploration rate. Consider exploring more.")
        if len(history) >= 5:
            recent_rewards = [h[2] for h in history[-5:]]
            avg_reward = np.mean(recent_rewards)
            if avg_reward < -0.5:
                suggestions.append("Negative reward trend. Try different strategy.")
            elif avg_reward > 0.3:
                suggestions.append("Good progress! Current strategy working.")
        if len(state['can_move']) <= 2:
            suggestions.append("Limited movement options. Be careful.")
        return suggestions

We implement the Action Agent and Tool Agent, giving the system both learning capability and analytical feedback. We observe how the Action Agent chooses actions through a balance of exploration and exploitation, while the Tool Agent evaluates performance and suggests improvements. Together, they create a learning loop that evolves with experience. Check out the FULL CODES here.

class SupervisorAgent:
    def decide(self, state, proposed_action, tool_suggestions):
        if not proposed_action:
            return None, "No valid actions available"

        decision = proposed_action
        reasoning = f"Approved action '{proposed_action}'"

        for suggestion in tool_suggestions:
            if "goal" in suggestion.lower() and "close" in suggestion.lower():
                goal_direction = self._get_goal_direction(state)
                if goal_direction in state['can_move']:
                    decision = goal_direction
                    reasoning = f"Override: Moving '{goal_direction}' toward goal"
                break

        return decision, reasoning

    def _get_goal_direction(self, state):
        pos = state['position']
        goal = state['goal']
        if goal[0] > pos[0]:
            return 'down'
        elif goal[0] < pos[0]:
            return 'up'
        elif goal[1] > pos[1]:
            return 'right'
        else:
            return 'left'

We introduce the Supervisor Agent, which acts as the final decision-maker in the system. We see how it interprets suggestions, overrides risky choices, and ensures that actions remain aligned with overall goals. As we use this component, we experience a coordinated multi-agent decision flow. Check out the FULL CODES here.

def train_multi_agent(episodes=5, visualize=True):
    env = GridWorld(size=8)
    action_agent = ActionAgent()
    tool_agent = ToolAgent()
    supervisor = SupervisorAgent()

    episode_rewards = []
    episode_steps = []

    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        done = False
        history = []

        print(f"\n{'='*60}")
        print(f"EPISODE {episode + 1}/{episodes}")
        print(f"{'='*60}")

        while not done:
            action_result = action_agent.choose_action(state)
            if action_result is None:
                break
            proposed_action, action_reasoning = action_result

            suggestions = tool_agent.analyze(state, proposed_action, total_reward, history)
            final_action, supervisor_reasoning = supervisor.decide(state, proposed_action, suggestions)

            if final_action is None:
                break

            next_state, reward, done, info = env.step(final_action)
            total_reward += reward
            action_agent.learn(state, final_action, reward, next_state)
            history.append((state, final_action, reward, next_state))

            if visualize:
                clear_output(wait=True)
                thoughts = (f"Action Agent: {action_reasoning}\n"
                            f"Supervisor: {supervisor_reasoning}\n"
                            f"Tool Agent: {', '.join(suggestions[:2]) if suggestions else 'No suggestions'}\n"
                            f"Reward: {reward:.2f} | Total: {total_reward:.2f}")
                env.render(thoughts)
                time.sleep(0.3)

            state = next_state

        episode_rewards.append(total_reward)
        episode_steps.append(env.step_count)

        print(f"\nEpisode {episode+1} Complete!")
        print(f"Total Reward: {total_reward:.2f}")
        print(f"Steps Taken: {env.step_count}")
        print(f"Cells Visited: {len(env.visited)}/{env.size**2}")

    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(episode_rewards, marker='o')
    plt.title('Episode Rewards')
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.grid(True, alpha=0.3)

    plt.subplot(1, 2, 2)
    plt.plot(episode_steps, marker='s', color='orange')
    plt.title('Episode Steps')
    plt.xlabel('Episode')
    plt.ylabel('Steps to Complete')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

    return action_agent, tool_agent, supervisor

if __name__ == "__main__":
    print("Multi-Agent RL System: Grid World Navigation")
    print("=" * 60)
    print("Components:")
    print(" • Action Agent: Proposes actions using Q-learning")
    print(" • Tool Agent: Analyzes performance and suggests improvements")
    print(" • Supervisor Agent: Makes final decisions")
    print("=" * 60)

    trained_agents = train_multi_agent(episodes=5, visualize=True)

We run the full training loop where all agents collaborate inside the environment across multiple episodes. We track rewards, observe movement patterns, and visualize learning progression with each trial. As we complete this loop, we see the multi-agent system improving and becoming more efficient at navigating the grid world.

In conclusion, we see how a multi-agent RL system emerges from clean components and how each layer contributes to smarter navigation: the Action Agent learns via Q-updates, the Tool Agent guides improvements, and the Supervisor ensures safe, goal-oriented action selection. We appreciate how this simple yet dynamic grid world helps us visualize learning, exploration, and decision-making in real time.

Check out the FULL CODES here.
The post How to Design a Mini Reinforcement Learning Environment-Acting Agent with Intelligent Local Feedback, Adaptive Decision-Making, and Multi-Agent Coordination appeared first on MarkTechPost.

Google DeepMind Introduces Nano Banana Pro: the Gemini 3 Pro Image Model for Text Accurate and Studio Grade Visuals

Nano Banana Pro, also called Gemini 3 Pro Image, is Google DeepMind’s new image generation and editing model built on Gemini 3 Pro. It is positioned as a state of the art system for creating and editing images that must respect structure, world knowledge and text layout, not only style. Nano Banana Pro follows Nano Banana, which was based on Gemini 2.5 Flash Image and focused on fast, casual image editing such as restoring photos and generating figurines.

From Gemini 2.5 Flash Image to Gemini 3 Pro Image

The earlier Nano Banana model targeted quick creative edits for casual creators. It helped restore old photos and build stylized 3D mini figurines with a simple prompt. Nano Banana Pro keeps that editing flow but runs on top of Gemini 3 Pro, which brings stronger reasoning and real world knowledge into the image stack.

The model can turn prototypes, data tables and handwritten notes into diagrams and infographics that reflect the underlying information, rather than producing only decorative art.

Reasoning Guided, Search Grounded Visuals

A core design point for Nano Banana Pro is reasoning guided generation. Using Gemini 3 Pro, the model can consume text, structured content and references and then plan the image as an explanation of that content. Nano Banana Pro can also connect to Google Search, using the search index as a real time knowledge source.

Clear Text and Multilingual Layouts

Text inside images is a long standing failure mode for many diffusion based generators. Nano Banana Pro addresses this explicitly. Google states that it is the best model in the Gemini family for producing images with correctly rendered and legible text, for both short taglines and full paragraphs.

Gemini 3 Pro’s multilingual reasoning flows into the image model. Nano Banana Pro can render text in multiple languages and also translate text that already appears in products or posters. The documentation shows beverage cans where English text is translated into Korean while the visual design and layout stay unchanged.

Studio Level Control, Consistency and Upscaling

Nano Banana Pro exposes a set of controls aimed at design and production workflows rather than single shot art prompts. On the composition side, the model can use up to 14 input images and maintain the consistency and resemblance of up to 5 people in one workflow. This supports tasks such as combining reference photos into a single fashion editorial, transforming sketches into product shots or keeping the same cast across multiple scenes.

The studio control section of the model page lists several families of controls. Users can vary camera angle and shot type, including wide shot, panoramic and close up, while controlling depth of field and focus on specific subjects in the image. Color and lighting can be adjusted, for example changing day to night, replacing volumetric lighting with bokeh or applying a strong chiaroscuro effect without losing subject identity.

Nano Banana Pro supports explicit upscaling. The official Google blog states that it can generate crisp visuals at 1k, 2k or 4k resolution, and provides examples of progressive zoom in operations that keep detail and composition. Aspect ratio is also programmable. Prompts can convert between ratios such as 1:1, 4:3, 16:9 and cinematic formats while keeping the main character locked in place and adjusting only the background.

Key Takeaways

Nano Banana Pro is Gemini 3 Pro Image, an upgraded image generation and editing model that succeeds Nano Banana, which was based on Gemini 2.5 Flash Image, and is optimized for higher quality and control.

The model integrates Gemini 3 Pro reasoning and Google Search grounding so it can turn factual content, documents and real time data into infographics, recipes, process diagrams and other information dense visuals.

It provides strong text rendering and multilingual support, producing legible typography in images and enabling translation or localization of existing on image text while preserving layout and design.

Nano Banana Pro supports up to 14 input images and maintains resemblance for up to 5 people, with studio style controls for camera angle, depth of field, lighting, aspect ratios and upscaling to 1k, 2k and 4k resolutions.

The model is being deployed across Gemini app, AI Mode in Search, NotebookLM, Google Ads, Workspace apps, Gemini API, Google AI Studio, Vertex AI, Antigravity and Flow, with all outputs watermarked using SynthID plus tier specific visible watermarks.

Editorial Comments

Nano Banana Pro positions Gemini 3 Pro Image as a production oriented image system that links Gemini 3 Pro reasoning, Google Search grounding and structured controls for layout, text and upscaling. It directly addresses long standing issues in text rendering, multilingual localization and subject consistency, while keeping SynthID and visible watermarks as default provenance signals across tiers and surfaces. This launch moves Google’s image stack closer to an integrated, API first visual platform for developers and enterprises.

Check out the Technical details.
The post Google DeepMind Introduces Nano Banana Pro: the Gemini 3 Pro Image Model for Text Accurate and Studio Grade Visuals appeared first on MarkTechPost.

Perplexity AI Releases TransferEngine and pplx garden to Run Trillion Parameter LLMs on Existing GPU Clusters

How can teams run trillion parameter language models on existing mixed GPU clusters without costly new hardware or deep vendor lock in? Perplexity’s research team has released TransferEngine and the surrounding pplx garden toolkit as open source infrastructure for large language model systems. This provides a way to run models with up to 1 trillion parameters across mixed GPU clusters, without locking into a single cloud provider or buying new GB200 class hardware.

https://arxiv.org/pdf/2510.27656

The real bottleneck, network fabrics not FLOPs

Modern deployments of Mixture of Experts models such as DeepSeek V3 with 671 billion parameters and Kimi K2 with 1 trillion parameters no longer fit on a single 8 GPU server. They must span multiple nodes, so the main constraint becomes the network fabric between GPUs.

Here the hardware landscape is fragmented. NVIDIA ConnectX 7 typically uses Reliable Connection transport with in order delivery. AWS Elastic Fabric Adapter uses Scalable Reliable Datagram transport that is reliable but out of order, and a single GPU may need 4 network adapters at 100 Gbps, or 2 at 200 Gbps, to reach 400 Gbps.

Existing libraries such as DeepEP, NVSHMEM, MoonCake and NIXL tend to optimize for one vendor and degrade or lack support on the other side. Perplexity’s research team directly states in the research paper that there was no viable cross provider solution for LLM inference before this work.

TransferEngine, a portable RDMA layer for LLM systems

TransferEngine addresses this by targeting only the intersection of guarantees across Network Interface Controllers. It assumes that the underlying RDMA transport is reliable, but does not assume any ordering of messages. On top of this, it exposes one sided WriteImm operations and an ImmCounter primitive for completion notification.

The library provides a minimal API in Rust. It offers two sided Send and Recv for control messages, and three main one sided operations, submit_single_write, submit_paged_writes, and submit_scatter, plus a submit_barrier primitive for synchronization across a group of peers. A NetAddr structure identifies peers and an MrDesc structure describes registered memory regions. An alloc_uvm_watcher call creates a device side watcher for CPU GPU synchronization in advanced pipelines.
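As a rough illustration of how these primitives compose, the sketch below shows a hypothetical usage flow in Python-style pseudocode; the function and method names mirror the paper's terminology but are not the actual pplx garden Rust or Python bindings.

# Hypothetical usage flow for the primitives described above (not the real API)
def push_shard(engine, peer_addr, local_desc, remote_desc, seq_no):
    # One sided WriteImm: push a registered region to the peer, tagged with an
    # immediate value so the receiver's ImmCounter can track completions
    engine.submit_single_write(peer_addr, local_desc, remote_desc, seq_no)

def sync_peers(engine, peer_addrs):
    # Barrier built on scatter plus ImmCounter, as described in the paper
    engine.submit_barrier(peer_addrs)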

Internally, TransferEngine spawns one worker thread per GPU and builds a DomainGroup per GPU that coordinates between 1 and 4 RDMA Network Interface Controllers. A single ConnectX 7 provides 400 Gbps. On EFA, the DomainGroup aggregates 4 network adapters at 100 Gbps, or 2 at 200 Gbps, to reach the same bandwidth. The sharding logic knows about all Network Interface Controllers and can split a transfer across them.

Across hardware, the research team reports peak throughput of 400 Gbps on both NVIDIA ConnectX 7 and AWS EFA. This matches single platform solutions and confirms that the abstraction layer does not leave large performance on the table.


pplx garden, the open source package

TransferEngine ships as part of the pplx garden repository on GitHub under an MIT license. The directory structure is straightforward. fabric-lib contains the RDMA TransferEngine library, p2p-all-to-all implements a Mixture of Experts all to all kernel, python-ext provides the Python extension module from the Rust core, and python/pplx_garden contains the Python package code.

The system requirements reflect a modern GPU cluster. Perplexity research team recommends Linux kernel 5.12 or newer for DMA BUF support, CUDA 12.8 or newer, libfabric, libibverbs, GDRCopy, and an RDMA fabric with GPUDirect RDMA enabled. Each GPU should have at least one dedicated RDMA Network Interface Controller.

Disaggregated prefill and decode

The first production use case is disaggregated inference. Prefill and decode run on separate clusters, so the system must stream KvCache from prefill GPUs to decode GPUs at high speed.

TransferEngine uses alloc_uvm_watcher to track progress in the model. During prefill, the model increments a watcher value after each layer’s attention output projection. When the worker observes a change, it issues paged writes for the KvCache pages of that layer, followed by a single write for the remaining context. This approach allows layer by layer streaming of cache pages without fixed world membership, and it avoids the strict ordering constraints of collectives.
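The layer by layer streaming can be pictured with the following simplified loop; the watcher, engine, and page-lookup helpers are hypothetical placeholders used only to illustrate the described flow, not the actual Perplexity implementation.

# Simplified sketch of layer by layer KvCache streaming from prefill to decode
def stream_kvcache(engine, watcher, kv_pages_for_layer, context_desc, decode_peer, num_layers):
    streamed = 0
    while streamed < num_layers:
        progress = watcher.read()   # incremented after each layer's attention output projection
        for layer in range(streamed, progress):
            # Push this layer's KvCache pages as soon as the layer finishes
            engine.submit_paged_writes(decode_peer, kv_pages_for_layer(layer), imm=layer)
        streamed = progress
    # Remaining request context goes out as a single write once prefill completes
    engine.submit_single_write(decode_peer, context_desc, imm=num_layers)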


Fast weight transfer for reinforcement learning

The second system is asynchronous reinforcement learning fine tuning, where training and inference run on separate GPU pools. Traditional designs gather updated parameters to a single rank then broadcast them, which limits throughput to one Network Interface Controller.

Perplexity research team instead uses TransferEngine to perform point to point weight transfer. Each training GPU writes its parameter shard directly into the corresponding inference GPUs using one sided writes. A pipelined execution splits each tensor into stages, host to device copy when Fully Sharded Data Parallel offloads weights, reconstruction and optional quantization, RDMA transfer, and a barrier implemented through scatter and ImmCounter.

In production, this setup delivers weight updates for models such as Kimi K2 at 1 trillion parameters and DeepSeek V3 at 671 billion parameters in about 1.3 seconds from 256 training GPUs to 128 inference GPUs.


Mixture of Experts routing across ConnectX and EFA

The third piece in pplx garden is a point to point Mixture of Experts dispatch and combine kernel. It uses NVLink for intra node traffic and RDMA for inter node traffic. Dispatch and combine are split into separate send and receive phases so that the decoder can micro batch and overlap communication with grouped general matrix multiply.

A host proxy thread polls GPU state and calls TransferEngine when send buffers are ready. Routes are exchanged first, then each rank computes contiguous receive offsets for each expert and writes tokens into private buffers that can be reused between dispatch and combine. This reduces memory footprint and keeps writes large enough to use the full link bandwidth.

On ConnectX 7, Perplexity research team reports state of the art decode latency that is competitive with DeepEP across expert counts. On AWS EFA, the same kernel delivers the first viable MoE decode latencies with higher but still practical values.

In multi node tests with DeepSeek V3 and Kimi K2 on AWS H200 instances, distributing the model across nodes reduces latency at medium batch sizes, which is the common regime for production serving.

Comparison Table

| Key point | TransferEngine (pplx garden) | DeepEP | NVSHMEM (generic MoE use) | Mooncake |
| --- | --- | --- | --- | --- |
| Primary role | Portable RDMA point to point for LLM systems | MoE all to all dispatch and combine | General GPU shared memory and collectives | Distributed KV cache for LLM inference |
| Hardware focus | NVIDIA ConnectX 7 and AWS EFA, multi NIC per GPU | NVIDIA ConnectX with GPU initiated RDMA IBGDA | NVIDIA GPUs on RDMA fabrics including EFA | RDMA NICs in KV centric serving stacks |
| EFA status | Full support, peak 400 Gbps reported | No support, requires IBGDA on ConnectX | API works but MoE use shows severe degradation on EFA | Paper reports no EFA support in its RDMA engine |
| Portability for LLM systems | Cross vendor, single API across ConnectX 7 and EFA | Vendor specific and ConnectX focused | NVIDIA centric, not viable for EFA MoE routing | Focused on KV sharing, no cross provider support |

Key Takeaways

TransferEngine gives a single RDMA point to point abstraction that works on both NVIDIA ConnectX 7 and AWS EFA, and manages multiple Network Interface Controllers per GPU transparently.

The library exposes one sided WriteImm with ImmCounter, and achieves peak 400 Gbps throughput on both NIC families, which lets it match single vendor stacks while remaining portable.

Perplexity team uses TransferEngine in three production systems, disaggregated prefill decode with KvCache streaming, reinforcement learning weight transfer that updates trillion parameter models in about 1.3 seconds, and Mixture of Experts dispatch combine for large models like Kimi K2.

On ConnectX 7, pplx garden’s MoE kernels provide state of the art decode latency and exceed DeepEP on the same hardware, while on EFA they deliver the first practical MoE latencies for trillion parameter workloads.

Because TransferEngine is open source in pplx garden under an MIT license, teams can run very large Mixture of Experts and dense models on heterogeneous H100 or H200 clusters across cloud providers, without rewriting for each vendor specific networking stack.

Editorial Comments

Perplexity’s release of TransferEngine and pplx garden is a practical contribution for LLM infra teams who are blocked by vendor specific networking stacks and expensive fabric upgrades. A portable RDMA abstraction that reaches peak 400 Gbps on both NVIDIA ConnectX 7 and AWS EFA, supports KvCache streaming, fast reinforcement learning weight transfer, and Mixture of Experts routing, directly addresses trillion parameter serving constraints for real systems.

Check out the Paper and Repo.
The post Perplexity AI Releases TransferEngine and pplx garden to Run Trillion Parameter LLMs on Existing GPU Clusters appeared first on MarkTechPost.

An Implementation of Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows

In this tutorial, we implement a complete workflow for building, tracing, and evaluating an LLM pipeline using Opik. We structure the system step-by-step, beginning with a lightweight model, adding prompt-based planning, creating a dataset, and finally running automated evaluations. As we move through each snippet, we see how Opik helps us track every function span, visualize the pipeline’s behavior, and measure output quality with clear, reproducible metrics. By the end, we have a fully instrumented QA system that we can extend, compare, and monitor with ease. Check out the FULL CODES here.

!pip install -q opik transformers accelerate torch

import torch
from transformers import pipeline
import textwrap

import opik
from opik import Opik, Prompt, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio

device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda" if device == 0 else "cpu")

opik.configure()
PROJECT_NAME = "opik-hf-tutorial"

We set up our environment by installing the required libraries and initializing Opik. We load the core modules, detect the device, and configure our project so that every trace flows into the correct workspace. We lay the foundation for the rest of the tutorial. Check out the FULL CODES here.

llm = pipeline(
    "text-generation",
    model="distilgpt2",
    device=device,
)

def hf_generate(prompt: str, max_new_tokens: int = 80) -> str:
    result = llm(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.3,
        pad_token_id=llm.tokenizer.eos_token_id,
    )[0]["generated_text"]
    return result[len(prompt):].strip()

We load a lightweight Hugging Face model and create a small helper function to generate text cleanly. We prepare the LLM to operate locally without external APIs. This gives us a reliable and reproducible generation layer for the rest of the pipeline. Check out the FULL CODES here.

plan_prompt = Prompt(
    name="hf_plan_prompt",
    prompt=textwrap.dedent("""
        You are an assistant that creates a plan to answer a question
        using ONLY the given context.

        Context:
        {{context}}

        Question:
        {{question}}

        Return exactly 3 bullet points as a plan.
    """).strip(),
)

answer_prompt = Prompt(
    name="hf_answer_prompt",
    prompt=textwrap.dedent("""
        You answer based only on the given context.

        Context:
        {{context}}

        Question:
        {{question}}

        Plan:
        {{plan}}

        Answer the question in 2–4 concise sentences.
    """).strip(),
)

We define two structured prompts using Opik’s Prompt class. We control the planning phase and answering phase through clear templates. This helps us maintain consistency and observe how structured prompting impacts model behavior. Check out the FULL CODES here.

DOCS = {
    "overview": """
    Opik is an open-source platform for debugging, evaluating,
    and monitoring LLM and RAG applications. It provides tracing,
    datasets, experiments, and evaluation metrics.
    """,
    "tracing": """
    Tracing in Opik logs nested spans, LLM calls, token usage,
    feedback scores, and metadata to inspect complex LLM pipelines.
    """,
    "evaluation": """
    Opik evaluations are defined by datasets, evaluation tasks,
    scoring metrics, and experiments that aggregate scores,
    helping detect regressions or issues.
    """,
}

@track(project_name=PROJECT_NAME, type="tool", name="retrieve_context")
def retrieve_context(question: str) -> str:
    q = question.lower()
    if "trace" in q or "span" in q:
        return DOCS["tracing"]
    if "metric" in q or "dataset" in q or "evaluate" in q:
        return DOCS["evaluation"]
    return DOCS["overview"]

We construct a tiny document store and a retrieval function that Opik tracks as a tool. We let the pipeline select context based on the user’s question. This allows us to simulate a minimal RAG-style workflow without needing an actual vector database. Check out the FULL CODES here.

@track(project_name=PROJECT_NAME, type="llm", name="plan_answer")
def plan_answer(context: str, question: str) -> str:
    rendered = plan_prompt.format(context=context, question=question)
    return hf_generate(rendered, max_new_tokens=80)

@track(project_name=PROJECT_NAME, type="llm", name="answer_from_plan")
def answer_from_plan(context: str, question: str, plan: str) -> str:
    rendered = answer_prompt.format(
        context=context,
        question=question,
        plan=plan,
    )
    return hf_generate(rendered, max_new_tokens=120)

@track(project_name=PROJECT_NAME, type="general", name="qa_pipeline")
def qa_pipeline(question: str) -> str:
    context = retrieve_context(question)
    plan = plan_answer(context, question)
    answer = answer_from_plan(context, question, plan)
    return answer

print("Sample answer:\n", qa_pipeline("What does Opik help developers do?"))

We bring together planning, reasoning, and answering in a fully traced LLM pipeline. We capture each step with Opik’s decorators so we can analyze spans in the dashboard. By testing the pipeline, we confirm that all components integrate smoothly. Check out the FULL CODES here.

client = Opik()

dataset = client.get_or_create_dataset(
    name="HF_Opik_QA_Dataset",
    description="Small QA dataset for HF + Opik tutorial",
)

dataset.insert([
    {
        "question": "What kind of platform is Opik?",
        "context": DOCS["overview"],
        "reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
    },
    {
        "question": "What does tracing in Opik log?",
        "context": DOCS["tracing"],
        "reference": "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata.",
    },
    {
        "question": "What are the components of an Opik evaluation?",
        "context": DOCS["evaluation"],
        "reference": "An Opik evaluation uses datasets, evaluation tasks, scoring metrics and experiments that aggregate scores.",
    },
])

We create and populate a dataset inside Opik that our evaluation will use. We insert multiple question–answer pairs that cover different aspects of Opik. This dataset will serve as the ground truth for our QA evaluation later. Check out the FULL CODES here.

equals_metric = Equals()
lev_metric = LevenshteinRatio()

def evaluation_task(item: dict) -> dict:
    output = qa_pipeline(item["question"])
    return {
        "output": output,
        "reference": item["reference"],
    }

We define the evaluation task and select two metrics—Equals and LevenshteinRatio—to measure model quality. We ensure the task produces outputs in the exact format required for scoring. This connects our pipeline to Opik’s evaluation engine. Check out the FULL CODES here.

evaluation_result = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[equals_metric, lev_metric],
    experiment_name="HF_Opik_QA_Experiment",
    project_name=PROJECT_NAME,
    task_threads=1,
)

print("\nExperiment URL:", evaluation_result.experiment_url)

We run the evaluation experiment using Opik’s evaluate function. We keep the execution sequential for stability in Colab. Once complete, we receive a link to view the experiment details inside the Opik dashboard. Check out the FULL CODES here.

agg = evaluation_result.aggregate_evaluation_scores()

print("\nAggregated scores:")
for metric_name, stats in agg.aggregated_scores.items():
    print(metric_name, "=>", stats)

We aggregate and print the evaluation scores to understand how well our pipeline performs. We inspect the metric results to see where outputs align with references and where improvements are needed. This closes the loop on our fully instrumented LLM workflow.

In conclusion, we set up a small but fully functional LLM evaluation ecosystem powered entirely by Opik and a local model. We observe how traces, prompts, datasets, and metrics come together to give us transparent visibility into the model’s reasoning process. As we finalize our evaluation and review the aggregated scores, we appreciate how Opik lets us iterate quickly, experiment systematically, and validate improvements in a structured and reliable way.

Check out the FULL CODES here.
The post An Implementation of Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows appeared first on MarkTechPost.

Streamline AI operations with the Multi-Provider Generative AI Gateway …

As organizations increasingly adopt AI capabilities across their applications, centralized management, security, and cost control of AI model access become essential for scaling AI solutions. The Generative AI Gateway on AWS guidance addresses these needs with a unified gateway that supports multiple AI providers while offering comprehensive governance and monitoring capabilities.
The Generative AI Gateway is a reference architecture for enterprises looking to implement end-to-end generative AI solutions featuring multiple models, data-enriched responses, and agent capabilities in a self-hosted way. This guidance combines the broad model access of Amazon Bedrock, unified developer experience of Amazon SageMaker AI, and the robust management capabilities of LiteLLM, all while supporting customer access to models from external model providers in a more secure and reliable manner.
LiteLLM is an open source project that addresses common challenges faced by customers deploying generative AI workloads. LiteLLM simplifies multi-provider model access while standardizing production operational requirements including cost tracking, observability, prompt management, and more. In this post we’ll introduce how the Multi-Provider Generative AI Gateway reference architecture provides guidance for deploying LiteLLM into an AWS environment for production generative AI workload management and governance.
The challenge: Managing multi-provider AI infrastructure
Organizations building with generative AI face several complex challenges as they scale their AI initiatives:

Provider fragmentation: Teams often need access to different AI models from various providers—Amazon Bedrock, Amazon SageMaker AI, OpenAI, Anthropic, and others—each with different APIs, authentication methods, and billing models.
Decentralized governance model: Without a unified access point, organizations struggle to implement consistent security policies, usage monitoring, and cost controls across different AI services.
Operational complexity: Managing multiple access paradigms ranging from AWS Identity and Access Management roles to API keys, model-specific rate limits, and failover strategies across providers creates operational overhead and increases the risk of service disruptions.
Cost management: Understanding and controlling AI spending across multiple providers and teams becomes increasingly difficult, particularly as usage scales.
Security and compliance: Facilitating consistent security policies and audit trails across different AI providers presents significant challenges for enterprise governance.

Multi-Provider Generative AI Gateway reference architecture
This guidance addresses these common customer challenges by providing a centralized gateway that abstracts the complexity of multiple AI providers behind a single, managed interface.

Built on AWS services and using the open source LiteLLM project, organizations can use this solution to integrate with AI providers while maintaining centralized control, security, and observability.

Flexible deployment options on AWS
The Multi-Provider Generative AI Gateway supports multiple deployment patterns to meet diverse organizational needs:
Amazon ECS deployment
For teams preferring containerized applications with managed infrastructure, the ECS deployment provides serverless container orchestration with automatic scaling and integrated load balancing.
Amazon EKS deployment
Organizations with existing Kubernetes expertise can use the EKS deployment option, which provides full control over container orchestration while benefiting from a managed Kubernetes control plane. Customers can deploy a new cluster or leverage existing clusters for deployment.
The reference architecture provided for these deployment options is subject to additional security testing based on your organization’s specific security requirements. Conduct additional security testing and review as necessary before deploying anything into production.
Network architecture options
The Multi-Provider Generative AI Gateway supports multiple network architecture options:
Global Public-Facing Deployment
For AI services with global user bases, combine the gateway with Amazon CloudFront (CloudFront) and Amazon Route 53. This configuration provides:

Enhanced security with AWS Shield DDoS protection
Simplified HTTPS management with the Amazon CloudFront default certificates
Global edge caching for improved latency
Intelligent traffic routing across regions

Regional direct access
For single-Region deployments prioritizing low latency and cost optimization, direct access to the Application Load Balancer (ALB) removes the CloudFront layer while maintaining security through properly configured security groups and network ACLs.
Private internal access
Organizations requiring complete isolation can deploy the gateway within a private VPC without internet exposure. This configuration makes sure that the AI model access remains within your secure network perimeter, with ALB security groups restricting traffic to authorized private subnet CIDRs only.
Comprehensive AI governance and management
The Multi-Provider Generative AI Gateway is built to enable robust AI governance standards from a straightforward administrative interface. In addition to policy-based configuration and access management, users can configure advanced capabilities like load-balancing and prompt caching.
Centralized administration interface
The Generative AI Gateway includes a web-based administrative interface in LiteLLM that supports comprehensive management of LLM usage across your organization.
Key capabilities include:
User and team management: Configure access controls at granular levels, from individual users to entire teams, with role-based permissions that align with your organizational structure.
API key management: Centrally manage and rotate API keys for the connected AI providers while maintaining audit trails of key usage and access patterns.
Budget controls and alerting: Set spending limits across providers, teams, and individual users with automated alerts when thresholds are approached or exceeded.
Comprehensive cost controls: Costs are influenced by AWS infrastructure and LLM providers. While it is the customer’s responsibility to configure this solution to meet their cost requirements, customers may review the existing cost settings for additional guidance.
Supports multiple model providers: Compatible with Boto3, OpenAI, and LangGraph SDK, allowing customers to use the best model for the workload regardless of the provider.
Support for Amazon Bedrock Guardrails: Customers can leverage guardrails created on Amazon Bedrock Guardrails for their generative AI workloads, regardless of the model provider.
Intelligent routing and resilience
Common considerations around model deployment include model and prompt resiliency. These factors determine how failures are handled when responding to a prompt or accessing data stores.
Load balancing and failover: The gateway implements sophisticated routing logic that distributes requests across multiple model deployments and automatically fails over to backup providers when issues are detected.
Retry logic: Built-in retry mechanisms with exponential back-off facilitate reliable service delivery even when individual providers experience transient issues.
Prompt caching: Intelligent caching helps reduce costs by avoiding duplicate requests to expensive AI models while maintaining response accuracy.
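To illustrate the kind of retry behavior described above, here is a minimal, generic sketch of exponential backoff with jitter around a gateway call; the callable passed in is a placeholder for your gateway request, not the actual LiteLLM API.

import random
import time

def call_with_retries(invoke_gateway, max_retries=4, base_delay=0.5):
    """Generic exponential backoff with jitter around a gateway request.
    `invoke_gateway` is a placeholder callable, not a LiteLLM-specific API."""
    for attempt in range(max_retries + 1):
        try:
            return invoke_gateway()
        except Exception:  # in practice, catch provider or HTTP specific errors
            if attempt == max_retries:
                raise
            # Exponential backoff: 0.5s, 1s, 2s, 4s ... plus a small random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)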
Advanced policy management
Model deployment architecture can range from the simple to highly complex. The Multi-Provider Generative AI Gateway features the advanced policy management tools needed to maintain a strong governance posture.
Rate limiting: Configure sophisticated rate limiting policies that can vary by user, API key, model type, or time of day to facilitate fair resource allocation and help prevent abuse.
Model access controls: Restrict access to specific AI models based on user roles, making sure that sensitive or expensive models are only accessible to authorized personnel.
Custom routing rules: Implement business logic that routes requests to specific providers based on criteria such as request type, user location, or cost optimization requirements.
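Rate limits, budgets, and model allowlists like those above are typically attached to virtual keys or teams through the administrative interface. The equivalent API call might look like the following sketch, which assumes the LiteLLM proxy's /key/generate endpoint; the gateway URL, admin key, team name, and limit values are all hypothetical placeholders.

```python
import requests

GATEWAY_URL = "https://your-gateway.example.com"  # hypothetical gateway endpoint
ADMIN_KEY = "sk-litellm-master-key"               # admin (master) key configured at deployment

# Issue a virtual key scoped to one team, with a monthly budget,
# per-minute rate limits, and an allowlist of models.
resp = requests.post(
    f"{GATEWAY_URL}/key/generate",
    headers={"Authorization": f"Bearer {ADMIN_KEY}"},
    json={
        "team_id": "claims-processing",
        "models": ["bedrock-claude", "gpt-4o"],
        "max_budget": 250.0,          # USD
        "budget_duration": "30d",
        "rpm_limit": 60,              # requests per minute
        "tpm_limit": 100_000,         # tokens per minute
    },
    timeout=30,
)
print(resp.json())
```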
Monitoring and observability
As AI workloads grow to include more components, so too do observability needs. The Multi-Provider Generative AI Gateway architecture integrates with Amazon CloudWatch, which lets users configure a range of monitoring and observability solutions, including open-source tools such as Langfuse.
Comprehensive logging and analytics
The gateway interactions are automatically logged to CloudWatch, providing detailed insights into:

Request patterns and usage trends across providers and teams
Performance metrics including latency, error rates, and throughput
Cost allocation and spending patterns by user, team, and model type
Security events and access patterns for compliance reporting
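These logged dimensions can be explored directly from CloudWatch. For example, a quick boto3 CloudWatch Logs Insights query can surface request volume over the last hour; the log group name below is a hypothetical placeholder that depends on how you deployed the gateway.

```python
import time
import boto3

logs = boto3.client("logs")

# Hypothetical placeholder: use the log group created by your gateway deployment.
LOG_GROUP = "/ecs/litellm-gateway"

# Count gateway log events in five-minute buckets over the last hour.
query = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | stats count() by bin(5m)",
)

# Logs Insights queries are asynchronous, so poll until the query finishes.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)
print(results["results"])
```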

Built-in troubleshooting
The administrative interface provides real-time log viewing capabilities so administrators can quickly diagnose and resolve usage issues without needing to access CloudWatch directly.

Amazon SageMaker integration for expanded model access
Amazon SageMaker enhances the Multi-Provider Generative AI Gateway guidance by providing a comprehensive machine learning platform that integrates seamlessly with the gateway's architecture. By using the Amazon SageMaker managed infrastructure for model training, deployment, and hosting, organizations can develop custom foundation models or fine-tune existing ones and access them through the gateway alongside models from other providers. This integration removes the need for separate infrastructure management while facilitating consistent governance across both custom and third-party models. SageMaker AI model hosting capabilities expand the gateway's model access to include self-hosted models, as well as those available on Amazon Bedrock, OpenAI, and other providers.
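Because LiteLLM includes a SageMaker provider, a self-hosted endpoint can be addressed the same way as any other model. The endpoint name below is a hypothetical placeholder, and in the gateway this mapping would live in the proxy's model configuration rather than in application code; AWS credentials are assumed to be available from the environment.

```python
import litellm

# Hypothetical placeholder: a SageMaker endpoint already deployed in your account.
response = litellm.completion(
    model="sagemaker/my-custom-llama-endpoint",
    messages=[{"role": "user", "content": "Hello from the gateway"}],
)
print(response.choices[0].message.content)
```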
Our open source contributions
This reference architecture builds upon our contributions to the LiteLLM open source project, enhancing its capabilities for enterprise deployment on AWS. Our enhancements include improved error handling, enhanced security features, and optimized performance for cloud-native deployments.
Getting started
The Multi-Provider Generative AI Gateway reference architecture is available today through our GitHub repository, complete with:

Infrastructure-as-Code: AWS CloudFormation and AWS Cloud Development Kit (AWS CDK) templates for automated deployment into an Amazon ECS cluster
Comprehensive documentation: Step-by-step deployment guides and configuration examples
Interactive workshop: Hands-on learning experience to explore the gateway capabilities
Detailed deployment guide: Deployment blog on AWS Builder Center

The code repository describes several flexible deployment options to get started.
Public gateway with global CloudFront distribution
Use CloudFront to provide a globally distributed, low-latency access point for your generative AI services. The CloudFront edge locations deliver content quickly to users around the world, while AWS Shield Standard helps protect against DDoS attacks. This is the recommended configuration for public-facing AI services with a global user base.
Custom domain with CloudFront
For a more branded experience, you can configure the gateway to use your own custom domain name, while still benefiting from the performance and security features of CloudFront. This option is ideal if you want to maintain consistency with your company’s online presence.
Direct access via public Application Load Balancer
Customers who prioritize low latency over global distribution can opt for a direct-to-ALB deployment without the CloudFront layer. This simplified architecture can offer cost savings, though it requires extra consideration for web application firewall protection.
Private VPC-only access
For a high level of security, you can deploy the gateway entirely within a private VPC, isolated from the public internet. This configuration is well-suited for processing sensitive data or deploying internal-facing generative AI services. Access is restricted to trusted networks reached through VPN, AWS Direct Connect, VPC peering, or AWS Transit Gateway.
Learn more and deploy today
Ready to simplify your multi-provider AI infrastructure? Access the complete solution package to explore an interactive learning experience with step-by-step guidance through the deployment and management process.
Conclusion
The Multi-Provider Generative AI Gateway is a solution guidance intended to help customers get started building generative AI solutions in a well-architected manner, while taking advantage of the AWS environment of services and complementary open-source packages. Customers can work with models from Amazon Bedrock, Amazon SageMaker JumpStart, or third-party model providers. Operations and management of workloads are conducted through the LiteLLM management interface, and customers can choose to host on Amazon ECS or Amazon EKS based on their preference.
In addition, we have published a sample that integrates the gateway into an agentic customer service application. The agentic system is orchestrated using LangGraph and deployed on Amazon Bedrock AgentCore. LLM calls are routed through the gateway, providing the flexibility to test agents with different models, whether hosted on AWS or by another provider.
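In that sample, pointing the agent's chat model at the gateway is the main integration step. A minimal sketch using LangChain's ChatOpenAI wrapper is shown below; the URL, key, and model alias are hypothetical placeholders, and swapping the alias is enough to test the same agent against a different provider behind the gateway.

```python
from langchain_openai import ChatOpenAI

# Hypothetical placeholders: gateway URL, virtual key, and model alias.
llm = ChatOpenAI(
    base_url="https://your-gateway.example.com/v1",
    api_key="sk-litellm-virtual-key",
    model="bedrock-claude",
)

# The llm object is then passed into the LangGraph agent graph as usual.
print(llm.invoke("Which models can I reach through the gateway?").content)
```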
This guidance is just one part of a mature generative AI foundation on AWS. For deeper reading on the components of a generative AI system on AWS, see Architect a mature generative AI foundation on AWS, which describes additional components of a generative AI system.

About the authors
Dan Ferguson is a Sr. Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.
Bobby Lindsey is a Machine Learning Specialist at Amazon Web Services. He’s been in technology for over a decade, spanning various technologies and multiple roles. He is currently focused on combining his background in software engineering, DevOps, and machine learning to help customers deliver machine learning workflows at scale. In his spare time, he enjoys reading, research, hiking, biking, and trail running.
Nick McCarthy is a Generative AI Specialist at AWS. He has worked with AWS clients across various industries including healthcare, finance, sports, telecoms and energy to accelerate their business outcomes through the use of AI/ML. Outside of work he loves to spend time traveling, trying new cuisines and reading about science and technology. Nick has a Bachelors degree in Astrophysics and a Masters degree in Machine Learning.
Chaitra Mathur is a GenAI Specialist Solutions Architect at AWS. She works with customers across industries in building scalable generative AI platforms and operationalizing them. Throughout her career, she has shared her expertise at numerous conferences and has authored several blogs in the Machine Learning and Generative AI domains.
Sreedevi Velagala is a Solution Architect within the World-Wide Specialist Organization Technology Solutions team at Amazon Web Services, based in New Jersey. She has been focused on delivering tailored solutions and guidance aligned with the unique needs of diverse clientele across AI/ML, Compute, Storage, Networking and Analytics domains. She has been instrumental in helping customers learn how AWS can lower the compute costs for machine learning workloads using Graviton, Inferentia and Trainium. She leverages her deep technical knowledge and industry expertise to deliver tailored solutions that align with each client’s unique business needs and requirements.