Accelerate generative AI innovation in Canada with Amazon Bedrock cross-Region inference

Generative AI has created unprecedented opportunities for Canadian organizations to transform their operations and customer experiences. We are excited to announce that customers in Canada can now access advanced foundation models including Anthropic’s Claude Sonnet 4.5 and Claude Haiku 4.5 on Amazon Bedrock through cross-Region inference (CRIS).
This post explores how Canadian organizations can use cross-Region inference profiles from the Canada (Central) Region to access the latest foundation models to accelerate AI initiatives. We will demonstrate how to get started with these new capabilities, provide guidance for migrating from older models, and share recommended practices for quota management.
Canadian cross-Region inference: Your gateway to global AI innovation
To help customers scale their generative AI applications, Amazon Bedrock offers cross-Region inference (CRIS) profiles, a feature that lets organizations distribute inference processing across multiple AWS Regions. This capability helps you achieve higher throughput while building at scale, helping to ensure your generative AI applications remain responsive and reliable even under heavy load.
Amazon Bedrock provides two types of cross-Region inference profiles:

Geographic CRIS: Amazon Bedrock automatically selects the optimal commercial Region within your geography to process your inference request.
Global CRIS: Further enhances cross-Region inference by routing inference requests to supported commercial Regions worldwide, optimizing available resources and enabling higher model throughput.

Cross-Region inference operates over the secure AWS network with end-to-end encryption for data in transit and at rest. When a customer submits an inference request from the Canada (Central) Region, CRIS intelligently routes the request to one of the destination Regions configured for the inference profile (US or Global).
The key distinction is that while inference processing (the transient computation) may occur in another Region, all data at rest—including logs, knowledge bases, and any stored configurations—remains exclusively within the Canada (Central) Region. The inference request travels over the AWS Global Network, never traversing the public internet, and responses are returned encrypted to your application in Canada.

Cross-Region inference configuration for Canada
With CRIS, Canadian organizations gain earlier access to foundation models, including cutting-edge models like Claude Sonnet 4.5 with enhanced reasoning capabilities, providing a faster path to innovation. CRIS also delivers enhanced capacity and performance by pooling capacity across multiple Regions. This enables higher throughput during peak periods such as tax season, Black Friday, and holiday shopping, automatic burst handling without manual intervention, and greater resiliency by serving requests from a larger pool of resources.
Canadian customers can choose between two inference profile types based on their requirements:

US cross-Region inference (source Region: ca-central-1; destination Regions: multiple US Regions): Requests from Canada (Central) can be routed to supported US Regions with available capacity.
Global inference (source Region: ca-central-1; destination Regions: global AWS Regions): Requests from Canada (Central) can be routed to any Region in the AWS global CRIS profile.

Getting started with CRIS from Canada
To begin using cross-Region inference from Canada, follow these steps:
Configure AWS Identity and Access Management (IAM) permissions
First, verify that your IAM role or user has the necessary permissions to invoke Amazon Bedrock models using cross-Region inference profiles.
Here’s an example of a policy for US cross-Region inference:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel*"
      ],
      "Resource": [
        "arn:aws:bedrock:ca-central-1::inference-profile/us.anthropic.claude-sonnet-4-5-20250929-v1:0"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel*"
      ],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0"
      ],
      "Condition": {
        "StringLike": {
          "bedrock:InferenceProfileArn": "arn:aws:bedrock:ca-central-1::inference-profile/us.anthropic.claude-sonnet-4-5-20250929-v1:0"
        }
      }
    }
  ]
}

For global CRIS refer to the blog post, Unlock global AI inference scalability using new global cross-Region inference on Amazon Bedrock with Anthropic’s Claude Sonnet 4.5.
Use cross-Region inference profiles
Configure your application to use the relevant inference profile ID. The profiles use prefixes to indicate their routing scope:

Claude Sonnet 4.5, US Regions: us.anthropic.claude-sonnet-4-5-20250929-v1:0
Claude Sonnet 4.5, global: global.anthropic.claude-sonnet-4-5-20250929-v1:0
Claude Haiku 4.5, US Regions: us.anthropic.claude-haiku-4-5-20251001-v1:0
Claude Haiku 4.5, global: global.anthropic.claude-haiku-4-5-20251001-v1:0

Example code
Here’s how to use the Amazon Bedrock Converse API with a US CRIS inference profile from Canada:

import boto3

# Initialize Bedrock Runtime client
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="ca-central-1"  # Canada (Central) Region
)

# Define the inference profile ID
inference_profile_id = "us.anthropic.claude-sonnet-4-5-20250929-v1:0"

# Prepare the conversation
response = bedrock_runtime.converse(
    modelId=inference_profile_id,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "text": "What are the benefits of using Amazon Bedrock for Canadian organizations?"
                }
            ]
        }
    ],
    inferenceConfig={
        "maxTokens": 512,
        "temperature": 0.7
    }
)

# Print the response
print(f"Response: {response['output']['message']['content'][0]['text']}")

Quota management for Canadian workloads
When using CRIS from Canada, quota management is performed at the source Region level (ca-central-1). This means quota increases requested for the Canada (Central) Region apply to all inference requests originating from Canada, regardless of where they’re processed.
Understanding quota calculations
Important: When calculating your required quota increases, you need to take into account the burndown rate, defined as the rate at which input and output tokens are converted into token quota usage for the throttling system. The following models have a 5x burndown rate for output tokens (1 output token consumes 5 tokens from your quota):

Anthropic Claude Opus 4
Anthropic Claude Sonnet 4.5
Anthropic Claude Sonnet 4
Anthropic Claude 3.7 Sonnet

For other models, the burndown rate is 1:1 (1 output token consumes 1 token from your quota). For input tokens, the token to quota ratio is 1:1. The calculation for the total number of tokens per request is as follows:
Input token count + Cache write input tokens + (Output token count x Burndown rate)
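For example, here is a quick way to estimate the quota impact of a single request (illustrative token counts, not from this post):

def quota_tokens(input_tokens, cache_write_tokens, output_tokens, burndown_rate):
    """Tokens counted against the tokens-per-minute quota for one request."""
    return input_tokens + cache_write_tokens + output_tokens * burndown_rate

# Claude Sonnet 4.5 has a 5x output burndown rate, so a request with
# 2,000 input tokens, 500 cache-write tokens, and 1,000 output tokens consumes:
print(quota_tokens(2000, 500, 1000, burndown_rate=5))  # 7500 tokens against the quota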
Requesting quota increases
To request quota increases for CRIS in Canada:

Navigate to the AWS Service Quotas console in the Canada (Central) Region
Search for the specific model quota (for example, “Claude Sonnet 4.5 tokens per minute”)
Submit an increase request based on your projected usage
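If you prefer to automate this, the following boto3 sketch shows one way to look up the quota and submit an increase request from ca-central-1. The quota name match and desired value are assumptions for illustration; confirm the exact quota names in the Service Quotas console before relying on them.

import boto3

# Service Quotas requests must be made in the source Region (ca-central-1)
quotas = boto3.client("service-quotas", region_name="ca-central-1")

# Find the tokens-per-minute quota for the model (the name filter below is a placeholder)
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="bedrock"):
    for quota in page["Quotas"]:
        if "Claude Sonnet 4.5" in quota["QuotaName"] and "tokens per minute" in quota["QuotaName"]:
            print(quota["QuotaName"], quota["QuotaCode"], quota["Value"])
            # Submit an increase request based on your projected usage (example value)
            quotas.request_service_quota_increase(
                ServiceCode="bedrock",
                QuotaCode=quota["QuotaCode"],
                DesiredValue=quota["Value"] * 2,
            )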

Migrating from older Claude models to Claude 4.5
Organizations currently using older Claude models should plan their migration to Claude 4.5 to leverage the latest model capabilities.
To plan your migration strategy, incorporate the following elements:

Benchmark current performance: Establish baseline metrics for your existing models.
Test with representative workloads and optimize prompts: Validate Claude 4.5 performance on your specific use cases, adjust prompts to take advantage of Claude 4.5’s enhanced capabilities, and make use of the Amazon Bedrock prompt optimizer.
Implement gradual rollout: Transition traffic progressively.
Monitor and adjust: Track performance metrics and adjust quotas as needed.

Choosing between US and Global inference profiles
When implementing CRIS from Canada, organizations can choose between US and Global inference profiles based on their specific requirements.
US cross-Region inference is recommended for organizations with existing US data processing agreements, high throughput and resilience requirements, or development and testing environments.
Conclusion
Cross-Region inference for Amazon Bedrock represents an opportunity for Canadian organizations that want to use AI while maintaining data governance. By distinguishing between transient inference processing and persistent data storage, CRIS provides faster access to the latest foundation models without compromising compliance requirements.
With CRIS, Canadian organizations get access to new models within days instead of months. The system scales automatically during peak business periods while maintaining complete audit trails within Canada. This helps you meet compliance requirements and use the same advanced AI capabilities as organizations worldwide. To get started, review your data governance requirements and configure IAM permissions. Then test with the inference profile that matches your needs—US for lower latency to US Regions, or Global for maximum capacity.

About the authors
Daniel Duplessis is a Principal Generative AI Specialist Solutions Architect at Amazon Web Services (AWS), where he guides enterprises in crafting comprehensive AI implementation strategies and establishing the foundational capabilities essential for scaling AI across the enterprise.
Dan MacKay is the Financial Services Compliance Specialist for AWS Canada. He advises customers on recommended practices and practical solutions for cloud-related governance, risk, and compliance. Dan specializes in helping AWS customers navigate financial services and privacy regulations applicable to the use of cloud technology in Canada with a focus on third-party risk management and operational resilience.
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Serge Malikov is a Senior Solutions Architect Manager based out of Canada. His focus is on the financial services industry.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and Amazon SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Sharadha Kandasubramanian is a Senior Technical Program Manager for Amazon Bedrock. She drives cross-functional GenAI programs for Amazon Bedrock, enabling customers to grow and scale their GenAI workloads. Outside of work, she’s an avid runner and biker who loves spending time outdoors in the sun.

Power up your ML workflows with interactive IDEs on SageMaker HyperPod

Amazon SageMaker HyperPod clusters with Amazon Elastic Kubernetes Service (Amazon EKS) orchestration now support creating and managing interactive development environments such as JupyterLab and open source Visual Studio Code, streamlining the ML development lifecycle by giving data scientists managed environments for familiar tools. This feature introduces a new add-on, Amazon SageMaker Spaces, that AI developers use to create and manage self-contained environments for running notebooks. Organizations can now maximize their GPU investments by running both interactive workloads and training jobs on the same infrastructure, with support for fractional GPU allocations to improve cost efficiency. This feature reduces the complexity of managing multiple development environments and lets teams focus on building and deploying their AI and ML models.
This post shows how HyperPod administrators can configure Spaces for their clusters, and how data scientists can create and connect to these Spaces. You’ll also learn how to connect directly from your local VS Code environment to Spaces created in HyperPod.
Solution overview
The following diagram showcases the different components involved in creating and managing Spaces on HyperPod clusters.

Here’s how the feature works:

The cluster administrator installs the Spaces add-on from the SageMaker AI console, using either the Quick install or the Custom install option.
Once the cluster is set up, data scientists and AI developers can create Spaces using the HyperPod command line interface (CLI) or kubectl.
Once the Space is created, the user can connect to a running Space through one of the following two options:

Access Space Web UI: This requires setting up an AWS Application Load Balancer (ALB) and setting up or registering your own custom Domain Name System (DNS) in Amazon Route 53. Once the custom domain is set up, the user will be able to connect to the JupyterLab or Code Editor space securely using a presigned URL through their web browser.
Remote IDE connection (connect to the Space remotely from local Visual Studio Code): SSH-over-SSM tunneling is used under the hood to securely connect remote IDEs to SageMaker Spaces pods without requiring customers to manage SSH keys or exposing port 22.

Prerequisites
To follow along, you need the following prerequisites:

An AWS account with permissions to create IAM roles, SageMaker resources such as HyperPod, and access to EKS cluster resources. If you are creating a new SageMaker HyperPod cluster, you will also need permissions to create networking and storage resources, see IAM permissions for cluster creation.
A SageMaker HyperPod cluster orchestrated using EKS, running Kubernetes version 1.30 or later. If you do not have one, you can create one by following the instructions in Creating a SageMaker HyperPod cluster with Amazon EKS orchestration. This workflow creates a HyperPod cluster, an EKS cluster, and the associated resources such as an Amazon Virtual Private Cloud (VPC) and an Amazon FSx for Lustre volume for storage.
HyperPod CLI installed (or kubectl).
A local IDE such as VS Code, with the AWS Toolkit for VS Code installed, to connect to the Spaces.

Step 1: Install the Spaces add-on
To get started, first install the Spaces add-on to your SageMaker cluster. This add-on allows users to run JupyterLab and Code Editor applications directly on cluster compute. The Quick install option is the fastest way to get started. With a single click, SageMaker AI automatically creates and configures the required AWS resources with optimized defaults. Here’s how to install it:

In the SageMaker AI console, choose Clusters on the left pane and navigate to your HyperPod cluster
Choose the IDE and Notebooks tab
Choose Quick install

Review the dependencies that will be automatically installed and choose Install.

The Quick install will create the associated dependencies for your Spaces add-on with default settings. They are listed below:

IAM roles for SageMaker Spaces:

Controller pod role for AWS API calls and AWS Systems Manager Session Manager (SSM) operations.
In-cluster router role for AWS Key Management Service (KMS) operations and JWT signing.
SSM managed instance role for remote access to Spaces. A list of the IAM roles and the required permissions are available in Set up permissions.

Remote access components:

Enables SSH connectivity to Spaces including SSM activation and session documents. This activates Systems Manager Advanced tier which includes additional per-instance charges.

Dependent EKS add-ons:

Cert-manager for certificate management.
Amazon Elastic Block Store (EBS) CSI driver for persistent storage volumes.
AWS Load Balancer Controller to manage AWS Elastic Load Balancers.

SageMaker Spaces add-on

Deploys the Spaces controller and in-cluster router for managing Space lifecycle operations.

The Quick install option does not install web UI configurations such as Route 53 DNS records and SSL certificates for accessing Spaces through the web browser. Administrators can either use the Custom install option or configure these properties after installation of the add-on. For instructions on configuring web browser access, see Operator installing – helm/Console.
The installation typically takes 2-5 minutes, depending on whether pre-existing dependencies are available or the Spaces add-on needs to provision new resources. After installation completes, administrators can perform the following actions:

View the Spaces created by data scientists in the Spaces table
Configure namespaces to organize Spaces by team or project
Create Space templates with pre-configured settings for common use cases
Edit the configuration as needed to enable or disable Spaces features or change your configuration settings

For production use cases, we recommend the Custom install option, where admins can set up fine-grained IAM policies that apply the principle of least privilege. For the full set of configurations that can be set up using the Custom install option, including namespaces and default templates, see Installation.
Step 2: Create or update EKS access entries
To give your users access to create and manage Spaces, grant them access through EKS access entries. The following two access entry policies are required:

AmazonSagemakerHyperpodSpacePolicy
AmazonSagemakerHyperpodSpaceTemplatePolicy

For instructions on creating and editing access entries, see Create access entries and Update access entries.
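For teams that script their cluster setup, the following boto3 sketch shows how these access entries and policies could be associated. The cluster name, principal ARN, access scope, and the access policy ARN format are assumptions for illustration; verify the exact policy ARNs and scopes in the documentation linked above.

import boto3

eks = boto3.client("eks", region_name="us-west-2")  # use your cluster's Region

cluster_name = "my-hyperpod-eks-cluster"            # hypothetical cluster name
principal_arn = "arn:aws:iam::111122223333:role/DataScientistRole"  # hypothetical principal

# Create the access entry for the principal
eks.create_access_entry(clusterName=cluster_name, principalArn=principal_arn)

# Associate the two HyperPod Spaces policies; the ARN prefix below is assumed to
# follow the standard EKS access policy format -- confirm in the documentation.
for policy in ["AmazonSagemakerHyperpodSpacePolicy", "AmazonSagemakerHyperpodSpaceTemplatePolicy"]:
    eks.associate_access_policy(
        clusterName=cluster_name,
        principalArn=principal_arn,
        policyArn=f"arn:aws:eks::aws:cluster-access-policy/{policy}",
        accessScope={"type": "namespace", "namespaces": ["default"]},  # illustrative scope
    )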
Step 3: Create and manage Spaces
Data scientists can create JupyterLab and Code Editor Spaces on the cluster using kubectl or the HyperPod CLI. For detailed instructions on creating and managing Spaces, see Hyperpod CLI.
To create a Space, run the following commands:

# set cluster context using hyp CLI
hyp set-cluster-context --cluster-name <your-hyperpod-cluster-name>

# create a space
hyp create hyp-space \
    --name "data-science-space" \
    --display-name "Data Science Workspace" \
    --namespace "default"

The hyp create hyp-space command will create a Space with the default settings. To create a Code Editor space, use the command below:

hyp create hyp-space \
    --name code-editor-demo \
    --display-name "code-editor space" \
    --memory 8Gi \
    --template-ref name=sagemaker-code-editor-template,namespace=jupyter-k8s-system

You can also modify settings when creating the Space, as shown in the following example:

hyp create hyp-space \
    --name test-space \
    --display-name "test space" \
    --memory 8Gi \
    --volume name=vol,mountPath=/home/,persistentVolumeClaimName=pvcname

Once the Space is created, you can access it from either the web UI or your local VS Code. To open the Space in VS Code, run:

hyp create hyp-space-access \
    --name data-science-space \
    --connection-type vscode-remote

If you have set up the custom domain following our documentation, you can get the Space access URL as shown below. Opening the URL launches your Space in the browser.

hyp create hyp-space-access \
    --name data-science-space \
    --connection-type web-ui

Alternatively, you can connect to the Space from your local VS Code using the AWS toolkit. From your VS Code IDE, open the AWS toolkit panel. From the toolkit, under SageMaker AI, choose HyperPod. Here, you can list, start, stop, and connect to Spaces.

The Spaces need to be created using the HyperPod CLI or kubectl.
The HyperPod CLI supports additional CRUD operations on Spaces, such as updating, describing, and deleting them. For a list of the operations, see HyperPod CLI on GitHub.
Practitioners familiar with kubectl can also create, update, and delete Spaces with it. For example, you can create a Space using kubectl as shown below:

kubectl apply -f - <<EOF
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: training-workspace-1
  namespace: hyperpod-training-team
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-training-team-localqueue
    kueue.x-k8s.io/priority-class: ide-priority
spec:
  displayName: "Training Team Workspace 1"
  image: jupyter/minimal-notebook:latest
  desiredStatus: Running
  resources:
    requests:
      cpu: 3
      memory: 12Gi
    limits:
      cpu: 3
      memory: 12Gi
EOF

Best practices
We recommend the following best practices when using SageMaker Spaces.
User management, RBAC, and collaboration
SageMaker Spaces identifies users through Amazon EKS access entries, which are derived from your IAM identity when you interact with a Space using either the HyperPod CLI or kubectl. Your captured EKS identity may appear as an IAM user or as an assumed-role session ARN. For assumed roles, the session name can represent the actual user when an administrator applies an IAM policy that enforces assumed-role session names reflecting individual identities. If session names are not enforced or do not uniquely map to users, SageMaker Spaces access control falls back to role-based access control, and all users sharing the same role are treated as the same identity. For more details, see Add users and set up service accounts.
Spaces can either be private, accessible only by the user who created the Space, or public, accessible by any user who has access to the hosting Kubernetes namespace. Spaces are public by default. The creator and the administrator group still retain full control, including the ability to update or delete the Space. A Space becomes private only when access is restricted to the creator and the admin group. This model gives teams a flexible foundation: public Spaces support open collaboration within a shared environment, while private Spaces provide isolation.
Multiple users can collaborate on the same Space if it is configured to be shared. When enabled with SageMaker Distribution images for JupyterLab environments, we also support real time collaboration (RTC) which enables multiple users to collaborate on the interactive ML experiments and workloads.
Admin defaults and controls
Templates set up by admins help data scientists quickly use pre-configured Space settings for their use case. SageMaker provides two pre-created system templates, one for JupyterLab and one for Code Editor, so that data scientists can get started without additional configuration. Admins can also set up custom templates for data scientists with custom configurations such as image, storage, and compute. Templates can be used by data scientists in the cluster and are flexible depending on the needs of admins. Admins can create multiple templates based on specific use cases, projects, or dependency requirements.
Customizing Spaces
Administrators and developers can customize their Spaces using custom images and lifecycle scripts. Use lifecycle scripts for minimal customization such as installing additional packages, setting up default variables, or running clean up tasks, while still using the SageMaker Distribution image capabilities. For organizations that have a standardized image for development and training, SageMaker Spaces also supports custom images and entry points for users. For custom image specifications, see Customization.
Shutdown idle compute
Spaces support automatic shutdown of idle workspaces by default to optimize resource usage. When idle shutdown is enabled, the system periodically checks the Space for activity, and if the workspace is idle for the specified timeout duration, it automatically stops, freeing up compute resources for other tasks. Administrators can set default timeouts and optionally prevent overrides of those defaults to enforce idle shutdown.
Integration with other HyperPod add-ons
For guardrails against excess resource usage, set up HyperPod task governance, which provides comprehensive resource management controls. To help prevent workspaces from being evicted due to changes in unrelated workloads, configure task governance to set interactive ML workloads as the highest priority or schedule them in task governance namespaces with eviction turned off.
Set up the HyperPod observability plugin to monitor the resource usage of Spaces running within the cluster. With a one-click install, the observability plugin provides insight into how many resources Spaces are using over time, allowing admins to observe and tune their compute allocations.
Fractional GPU support
SageMaker Spaces support fractional GPU configurations, specifically NVIDIA’s Multi-Instance GPU (MIG) technology. Fractional GPU support with MIG means that users can share GPU instances, optimizing compute usage, while still providing isolation between workloads. This means that experiments running on a fractional GPU profile are unlikely to interfere with other workloads running on the same GPU.
To check if an instance in your cluster supports fractional GPU, run the command:

hyp list-accelerator-partition-type --instance-type <instance type>

If your cluster contains instance groups that support fractional GPU, you can create a space with fractional GPU as shown below:

hyp create hyp-space \
    --name test-space \
    --display-name "mig-testing" \
    --accelerator-partition-type mig-3g.20gb \
    --accelerator-partition-count 1 \
    --memory 8Gi \
    --template-ref sagemaker-code-editor-template

Clean up
To avoid incurring unnecessary charges, clean up the resources you created in this walkthrough.

Delete the Spaces you created. Run this command for each Space:

hyp delete hyp-space \
    --name <space-name>

Remove the SageMaker HyperPod Spaces add-on: From the cluster details page, navigate to the IDE and Notebooks tab, and choose Remove.
If you created a HyperPod cluster for the purposes of this blog, delete the cluster to avoid being charged for unused compute. To delete the cluster, follow the instructions in Deleting a SageMaker HyperPod cluster. Additionally, if you used the console to create the cluster, go to the AWS CloudFormation console and delete the parent stack to remove the additional resources such as storage and networking resources created for the cluster. The parent stack will be in the format sagemaker-<your-hyperpod-cluster-name>-<unique-id>

Conclusion
Spaces in SageMaker HyperPod boost data scientist and AI developer productivity by providing more secure, managed development environments on purpose-built compute. We walked through the setup steps for administrators and data scientists, showing how teams can quickly create and connect to Spaces. With this feature, teams can reduce time spent on environment setup and focus on model development, while also maintaining consistent development environments. By integrating with HyperPod task governance features, administrators can optimize for cost and equitable compute allocation.

About the authors
Durga Sury is a Senior Solutions Architect at Amazon SageMaker, helping enterprise customers build secure and scalable AI/ML systems. When she’s not architecting solutions, you can find her enjoying sunny walks with her dog, immersing herself in murder mystery books, or catching up on her favorite Netflix shows.
 Edward Sun is a Senior SDE working for SageMaker Studio at Amazon Web Services. He is focused on building interactive ML solutions and simplifying the customer experience to integrate SageMaker Studio with popular technologies in data engineering and ML landscape. In his spare time, Edward is big fan of camping, hiking, and fishing, and enjoys spending time with his family.
Josh Dunne is a Senior UX Designer at SageMaker AI at Amazon Web Services. He has 7+ years of experience across UX and product management, with a focus on ML/AI and cloud computing creating practical, straightforward to use workflows for machine learning builders across SageMaker AI, including HyperPod, SageMaker Studio, SageMaker Unified Studio, and interactive IDEs.  Outside of work, he enjoys exploring the Pacific Northwest and traveling with his wife and their dog and trying new restaurants.
Joshua Towner is a Senior SDE working for SageMaker AI at Amazon Web Services, where he is currently working on building and improving interactive ML solutions for SageMaker Studio and HyperPod. Outside of work, he enjoys traveling, skiing, and watching movies.
Khushboo Srivastava is a Product Manager for Amazon SageMaker, AWS. She enjoys building products that simplify machine learning workflows for users. With over 7+ years in software engineering and data science, and 7+ years in product management, Khushboo has launched several products and services that have helped accelerate speed of AI/ML development for customers. With her background in generative AI and distributed computing, and her passion for democratizing AI, she is committed to sharing insights and empowering others in their AI and open source journey.
Prayag Singh is a Senior SDE working for SageMaker AI at Amazon Web Services. With 10+ years of software development experience, he focuses on integrating customers’ preferred ML tools and IDEs on SageMaker Studio and HyperPod. Outside of work, Prayag enjoys traveling and all things comedy, from stand-up specials to sitcoms. You can find him on LinkedIn.

Claude Opus 4.5 now in Amazon Bedrock

Anthropic’s newest foundation model, Claude Opus 4.5, is now available in Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models from leading AI companies. Opus 4.5 is a meaningful step forward in what AI systems can do and sets a new standard across coding, agents, computer use, and office tasks. It outperforms both Sonnet 4.5 and Opus 4.1 while providing Opus-level intelligence at one-third the cost.
In this post, I’ll show you what makes this model different, walk through key business applications, and demonstrate how to use Opus 4.5’s new tool use capabilities on Amazon Bedrock. By the end, you’ll understand how to use this model’s capabilities for production agent deployments.
Claude Opus 4.5: What makes this model different
Opus 4.5 is Anthropic’s most advanced model offering in the Opus class, designed for developers building sophisticated AI agents that can reason, plan, and execute complex tasks with minimal oversight. It upgrades Sonnet 4.5 with better performance on existing use cases and adds new capabilities for complex workflows.
The model excels in professional software engineering, achieving 80.9% on SWE-bench Verified, helping to transform multi-day development projects into hours-long tasks. It works more independently, adds improved multilingual coding capabilities, and exhibits enhanced behaviors such as more efficient code, better test coverage, and cleaner architecture choices. For office productivity, the model handles complex projects end-to-end. It powers agents that create PowerPoint presentations, Excel spreadsheets, and Word documents with professional polish, including document redlining for contracts and NDAs. The model also produces higher quality React and HTML artifacts. It maintains consistency and accuracy—important for finance and other industries where precision matters—and maintains context across files throughout long projects.
This is Anthropic’s best vision model yet, achieving 80.7% on MMMU, and it is well suited for workflows that depend on complex visual interpretation and multi-step navigation—such as analyzing design mockups, processing documents with complex layouts, or automating browser-based tasks—with computer use performance improving further still.
The model introduces two key improvements for agent developers. The tool search tool lets agents work with hundreds of tools by dynamically discovering and loading only what they need instead of loading all definitions upfront—potentially saving tens of thousands of tokens and preventing schema confusion when scaling to large tool libraries. Tool use examples lets you provide sample tool calls directly in the tool definition, improving accuracy for complex schemas with nested objects or arrays.

Figure: Opus 4.5 performance benchmarks (source: https://www.anthropic.com/news/claude-opus-4-5)

Business applications and use cases
Opus 4.5 excels in the following use cases:

Software development: Build agents that write and refactor code across entire projects, manage full-stack architectures, or design agentic systems that break down high-level goals into executable steps. This generation of Claude spans the full development lifecycle: Opus 4.5 for production code and sophisticated agents (those using 10+ tools in workflows like end-to-end software engineering, cybersecurity, or financial analysis), Sonnet 4.5 for rapid iteration and scaled user experiences, Haiku 4.5 for sub-agents and free-tier products. Opus 4.5 can analyze technical documentation, plan a software implementation, write the required code, and iteratively refine it—while tracking requirements and architectural context throughout the process.
Enterprise operations and office tasks: Manage complex projects from start to finish. Opus 4.5 uses memory to maintain context and consistency across files, alongside improvements in creating spreadsheets, slides, and documents. The model handles ongoing enterprise projects, automating manual workflows.
Financial analysis: Work across complex information systems—regulatory filings, market reports, internal data—enabling predictive modeling and proactive compliance. The model’s consistency and accuracy make it useful for finance and other industries where precision matters.
Cybersecurity: Bring professional-grade analysis to security workflows, correlating logs, security issue databases, and security intelligence for security event detection and automated incident response.

Integration with Amazon Bedrock AgentCore
Amazon Bedrock provides the enterprise foundation for deploying Opus 4.5 in production. The fully managed service provides a unified API for foundation models with enterprise-grade security, compliance, and governance.
Opus 4.5 integrates with Amazon Bedrock AgentCore, which provides the infrastructure and primitives for building production agents. AgentCore includes persistent memory for maintaining context across sessions, Tool Gateway for converting your APIs and Lambda functions into agent-compatible tools, and built-in identity and access management for secure resource access. You can deploy and monitor agents with complete session isolation, long-running workflow support (up to 8 hours), and observability features—so you can focus on building agents instead of managing infrastructure.
Amazon Bedrock AgentCore provides additional capabilities for production deployments. The Tool Gateway converts your existing APIs and Lambda functions into agent-compatible tools with minimal code—working with the model’s tool search feature to orchestrate hundreds of tools. Built-in observability through Amazon CloudWatch tracks token usage, latency, and error rates across your agent workflows.
Getting started
Access the Opus 4.5 model today through Amazon Bedrock. I’ll demonstrate the model’s tool search capability—a feature that lets agents work with hundreds of tools without loading all definitions into context upfront. First, I import the required modules and set up the Amazon Bedrock client:

# Import required libraries
import boto3
import json

# Create a session and Bedrock client
session = boto3.Session()
bedrock_client = session.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'
)

For this example, I’ll define multiple tools with defer_loading to enable tool search. This lets the model discover and load only the tools it needs instead of loading all definitions upfront:

# Define tools with tool search enabled
tools = [
    # Enable tool search - allows dynamic tool discovery
    {
        "type": "tool_search_tool_regex",
        "name": "tool_search_tool_regex"
    },
    # Tools marked with defer_loading are discovered on-demand
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        },
        "defer_loading": True,
        # Provide example inputs to improve accuracy for complex schemas
        "input_examples": [
            {"location": "San Francisco, CA", "unit": "fahrenheit"},
            {"location": "Tokyo, Japan", "unit": "celsius"}
        ]
    },
    {
        "name": "search_documentation",
        "description": "Search AWS documentation",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "service": {"type": "string"}
            },
            "required": ["query"]
        },
        "defer_loading": True,
        "input_examples": [
            {"query": "Lambda pricing", "service": "lambda"},
            {"query": "S3 bucket policies"}
        ]
    },
    {
        "name": "analyze_logs",
        "description": "Analyze application logs for errors",
        "input_schema": {
            "type": "object",
            "properties": {
                "log_file": {"type": "string"},
                "time_range": {"type": "string"}
            },
            "required": ["log_file"]
        },
        "defer_loading": True,
        "input_examples": [
            {"log_file": "/var/log/app.log", "time_range": "last 24 hours"},
            {"log_file": "/var/log/error.log"}
        ]
    }
]

Now I call the model using the invoke_model API with the effort parameter set to medium:

# Construct the request with beta features enabled
request_body = {
    "anthropic_version": "bedrock-2023-05-31",
    # Enable beta features: tool search, tool examples, and effort parameter
    "anthropic_beta": ["tool-search-tool-2025-10-19", "tool-examples-2025-10-29", "effort-2025-11-24"],
    "max_tokens": 4096,
    "temperature": 0.7,
    # Set effort to "medium" for balanced token usage
    "output_config": {
        "effort": "medium"
    },
    "messages": [
        {
            "role": "user",
            "content": "What's the weather in Seattle?"
        }
    ],
    "tools": tools
}

# Invoke the model
response = bedrock_client.invoke_model(
    modelId="global.anthropic.claude-opus-4-5-20251101-v1:0",
    body=json.dumps(request_body)
)

# Parse the response
response_body = json.loads(response['body'].read())
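Once you have the parsed response, you can check whether the model produced text or requested a tool call. The following snippet is a minimal sketch that assumes the standard Anthropic Messages response shape returned by invoke_model:

# Inspect the response content blocks (Anthropic Messages API shape)
for block in response_body.get("content", []):
    if block["type"] == "text":
        print("Model text:", block["text"])
    elif block["type"] == "tool_use":
        # The model requested a tool call; your application runs the tool
        # and returns a tool_result block in the next request.
        print("Tool requested:", block["name"])
        print("Tool input:", block["input"])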

The model uses tool search to find the relevant tool (get_weather) from the library without loading all tool definitions upfront. The effort parameter, available in beta, controls how liberally the model spends tokens across thinking, tool calls, and responses. You can set effort to high for best results, medium for balanced usage, or low for conservative token usage.
Key features for agent development
Opus 4.5 has several capabilities that make it well-suited for building production agents. The model maintains coherence across extended workflows, enabling consistent decision-making for agents that run multi-step processes over hours or days. Better tool handling means agents interact more reliably with external systems, APIs, and software interfaces—the model chooses the right tools and interprets results more accurately. Opus 4.5 also tracks information across conversation turns and maintains context, helping agents accumulate knowledge over time and make decisions based on history.
The effort parameter, available in beta, gives you control over token usage. You can set it to high for best results when quality matters most, medium for balanced performance, or low for conservative token usage. Opus 4.5 adjusts token spending across thinking, tool calls, and responses based on this setting. For production deployments, Amazon Bedrock AgentCore provides monitoring and observability through CloudWatch integration, tracking token usage in real-time (useful when tuning the effort parameter), along with latency metrics, session duration, and error rates to help optimize agent performance and manage costs.
Pricing
The model is priced at $5 per million input tokens and $25 per million output tokens, making Opus-level intelligence accessible at one-third the cost of previous offerings.
Availability and access
This model is available today in Amazon Bedrock through cross-Region inference, which automatically routes requests to available capacity across AWS Regions for higher throughput during peak demand.
Use this model for agents that handle long-running tasks, coordinate multiple tools, or maintain context across extended sessions.
For detailed information about availability, pricing, and model specifications, visit the Amazon Bedrock documentation.
Conclusion
This post showed you how to get started with Claude Opus 4.5 in Amazon Bedrock. Opus 4.5 excels at complex, long-running workflows like software development and enterprise operations. Opus 4.5’s capabilities in tool handling, context management, and decision-making make it valuable for building agents that operate reliably in production environments. The model works well for agents in software engineering, research synthesis, and enterprise workflow automation.
I encourage you to experiment with Opus 4.5 for your own agent workflows. Consider how its capabilities could improve manual processes in your organization, or support new types of automation. The combination of Opus 4.5’s capabilities with Amazon Bedrock’s enterprise features provides a foundation for production AI agents.
To get started, try the model in the Amazon Bedrock console, explore the technical documentation, and check out Anthropic’s Claude model detail page for more information about its capabilities. To deploy agents at scale, explore Opus 4.5 in Amazon Bedrock AgentCore for managed infrastructure with tool orchestration and monitoring.
I’d love to hear about what you build with this model—share your experiences and agent use cases in the comments below!

About the authors
Jonathan Evans is a Worldwide Solutions Architect for Generative AI at AWS, where he helps customers leverage cutting-edge AI technologies with Anthropic’s Claude models on Amazon Bedrock, to solve complex business challenges. With a background in AI/ML engineering and hands-on experience supporting machine learning workflows in the cloud, Jonathan is passionate about making advanced AI accessible and impactful for organizations of all sizes.

Moonshot AI Researchers Introduce Seer: An Online Context Learning System for Fast Synchronous Reinforcement Learning (RL) Rollouts

How do you keep reinforcement learning for large reasoning models from stalling on a few very long, very slow rollouts while GPUs sit underused? A team of researchers from Moonshot AI and Tsinghua University introduces 'Seer', a new online context learning system that targets a specific systems bottleneck in reinforcement learning for large language models. In synchronous on policy setups, the rollout phase dominates the cost of each iteration. Seer restructures this phase and reports rollout throughput gains of 74 percent to 97 percent and tail latency reductions of 75 percent to 93 percent compared with a strong synchronous baseline called veRL.

https://arxiv.org/pdf/2511.14617

Why is synchronous rollout slow for reasoning models?

Modern reasoning RL workloads use long chain of thought style outputs. In the Seer experiments, the researchers apply GRPO to three different models, Moonlight, Qwen2 VL 72B and Kimi K2. These workloads run on 32 compute nodes with 8 H800 GPUs per node. The three tasks use 32, 128 and 256 GPUs respectively, with 400, 600 and 800 prompts per iteration and 8 or 16 responses per prompt.

Maximum generation length is large. Moonlight is configured for 65,536 tokens, Qwen2 VL 72B for 40,960 tokens and Kimi K2 for 98,304 tokens. A single long chain of thought request can grow from a few hundred megabytes of KVCache to tens of gigabytes as decoding progresses. This memory growth forces instances to reduce concurrency or to preempt requests, which triggers expensive re-decoding.
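As a rough illustration of why the KVCache reaches that size, the following back-of-the-envelope calculation uses hypothetical model dimensions (not the actual configurations of Moonlight, Qwen2 VL 72B, or Kimi K2):

# Illustrative KV cache growth for one request (hypothetical model dimensions)
layers, kv_heads, head_dim = 60, 8, 128      # assumed values, not from the paper
bytes_per_value = 2                          # FP16/BF16
per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V

for seq_len in (1_000, 16_000, 98_304):
    gib = seq_len * per_token_bytes / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.2f} GiB of KV cache")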

The research team defines tail requests as the last 10 percent of requests to finish in a rollout. For Moonlight and Qwen2 VL 72B, this tail alone can consume up to 50 percent of the total rollout time in the baseline system. Rollout already dominates iteration time, so this tail effect directly slows RL.

https://arxiv.org/pdf/2511.14617

Seer architecture on top of Mooncake and vLLM

Seer keeps the RL algorithm identical to synchronous veRL. Each training iteration uses only data from the current rollout iteration, so the system preserves on policy behavior. The training phase uses Megatron for distributed optimization. The rollout phase uses an in house implementation of vLLM as the inference engine.

To support aggressive request scheduling, Seer relies on a Global KVCache Pool built on the Mooncake disaggregated KVCache architecture used in production for Kimi. Mooncake provides a two tier DRAM and SSD KV cache store shared across inference nodes, which allows Seer to migrate requests without recomputing prefills.

On top of this substrate, Seer introduces three key mechanisms:

Divided Rollout

Context Aware Scheduling

Adaptive Grouped Speculative Decoding

These are orchestrated by a Request Buffer, a Context Manager and an Inference Engine Pool connected to the Global KVCache Pool.

https://arxiv.org/pdf/2511.14617

Divided Rollout, fine grained scheduling and migration

Conventional synchronous rollout assigns whole GRPO groups to inference instances. A group is a set of requests that share one prompt. Once assigned, a group stays on the same instance until all responses finish. Due to large variance in output lengths, this leads to load imbalance and long running stragglers.

Seer breaks groups down in two steps. It first decomposes each group into individual requests. It then divides each request into multiple chunks based on generation length. When the scheduler dispatches a request from the Request Buffer, it sets a small max tokens value such as 8,000 tokens for that chunk. After each chunk, the request is re enqueued until it reaches an end of sequence token or its original max tokens limit.

Because KVCache is stored in the Global KVCache Pool, divided requests can move between instances at chunk boundaries without re running the prefill. The scheduler maintains a concurrency level that keeps memory utilization high while avoiding preemption. This reduces waste and smooths KVCache usage across the iteration.
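The following Python sketch illustrates the divided rollout idea, generation in fixed-size chunks with re-enqueueing between chunks. It is a conceptual approximation for intuition, not Seer's implementation; only the 8,000-token chunk size comes from the paper's example.

import random
from dataclasses import dataclass

CHUNK_TOKENS = 8_000  # per-chunk max_tokens, following the example in the paper

@dataclass
class Request:
    rid: int
    total: int        # true output length (unknown to the scheduler in practice)
    generated: int = 0

def divided_rollout(requests):
    """Conceptual sketch: each request is generated in chunks and re-enqueued."""
    buffer = list(requests)
    finished = []
    while buffer:
        req = buffer.pop(0)
        # A chunk can run on any instance because the KV cache lives in the
        # shared Global KVCache Pool, so migration needs no prefill recompute.
        req.generated += min(CHUNK_TOKENS, req.total - req.generated)
        if req.generated >= req.total:
            finished.append(req)        # reached EOS or its max_tokens budget
        else:
            buffer.append(req)          # back to the Request Buffer for the next chunk
    return finished

reqs = [Request(rid=i, total=random.randint(1_000, 65_536)) for i in range(16)]
print(f"{len(divided_rollout(reqs))} requests finished in chunks of {CHUNK_TOKENS} tokens")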

Context Aware Scheduling using group length statistics

The research team observe that different requests in the same group tend to have correlated output lengths. Seer uses this structure as online context. For each prompt group, it designates one request as the speculative request. The scheduler keeps speculative requests in a high priority queue and serves them with a smallest first policy based on generated tokens so far. Short requests complete quickly and exit. Long requests remain and identify groups that are potential tail candidates.

The Context Manager maintains a length estimate for each group. It updates this estimate to the maximum generated length among completed requests in the group. If no request has finished, it uses the original max tokens as a conservative bound. Once speculative requests are in flight or done, Seer schedules remaining requests with an approximate longest first policy at group level. This design achieves throughput and tail behavior close to an oracle scheduler that knows all output lengths in advance.
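A small sketch of the group length bookkeeping makes the policy concrete. Again, this is a conceptual approximation of the described behavior, not the authors' code:

# Conceptual sketch of the Context Manager's group length estimate
class GroupLengthEstimator:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.estimates = {}          # group_id -> estimated output length

    def update(self, group_id, finished_length=None):
        # Use the max generated length among finished requests in the group;
        # fall back to max_tokens as a conservative bound if none has finished.
        if finished_length is None:
            self.estimates.setdefault(group_id, self.max_tokens)
        else:
            self.estimates[group_id] = max(self.estimates.get(group_id, 0), finished_length)

    def longest_first(self, pending_groups):
        # Approximate longest-first ordering for the remaining (non-speculative) requests
        return sorted(pending_groups, key=lambda g: self.estimates.get(g, self.max_tokens), reverse=True)

est = GroupLengthEstimator(max_tokens=65_536)
est.update("g1", finished_length=12_000)
est.update("g2")                        # no finished request yet -> conservative bound
print(est.longest_first(["g1", "g2"]))  # ['g2', 'g1']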

https://arxiv.org/pdf/2511.14617

Adaptive Grouped Speculative Decoding

Seer adds Adaptive Grouped Speculative Decoding on top of the previous two components to accelerate decoding, especially for long requests in the tail. It introduces a Distributed Grouped Draft Server, or DGDS. DGDS maintains a Compressed Suffix Tree for each group and aggregates token sequences from all requests in that group. Instances asynchronously append generated tokens to DGDS, periodically fetch updated suffix trees and perform local speculative decoding based on the shared pattern statistics.

The system adjusts draft length and the number of paths according to model architecture, batch size and measured acceptance length. For dense and Mixture of Experts models, it pre-computes different speculation thresholds and uses them to bound draft depth for each batch. In late tail stages, concurrency is low, so Seer increases draft depth and enables multi path drafting to raise accepted tokens per step.

Ablation results show that divided rollout yields up to 35 percent throughput improvement over the baseline. Adding Context Aware Scheduling increases this to up to 47 percent over baseline. Enabling grouped speculative decoding raises the total speedup to 77 percent to 87 percent over the baseline in the evaluated iteration.

End to end impact on RL training

The research team evaluate Seer on three RL tasks built on Moonlight, Qwen2 VL 72B and Kimi K2. They run 10 rollout iterations per task and measure output tokens per second and completion time for each rollout. Seer improves rollout throughput by 74 percent to 97 percent across these workloads relative to veRL with the same RL algorithm and vLLM based inference engine.

Tail latency is reduced by 75 percent to 93 percent. For memory constrained tasks, the baseline system spends up to half of its time on the last 10 percent of requests. Seer removes most of this tail by combining divided rollout, Context Aware Scheduling and Adaptive Grouped Speculative Decoding on top of the Mooncake based Global KVCache Pool.

Key Takeaways

Rollout bottleneck: Seer targets the rollout phase of synchronous RL, which accounts for about 63% to 87% of iteration time and is dominated by long tail requests and KV cache fragmentation.

Three core mechanisms: Seer combines divided rollout, context aware scheduling and adaptive grouped speculative decoding to exploit output length and pattern similarity among GRPO responses that share a prompt.

Fine grained scheduling on a global KV cache: Requests are split into chunks and migrated across a Mooncake style Global KVCache Pool, which preserves synchronous on policy RL while keeping GPU memory utilization high and reducing preemptions.

Online context for tail latency reduction: Group level length statistics from speculative requests drive context aware scheduling that approximates an oracle longest first scheduler and sharply reduces the time spent on the last 10 percent of requests.

Measured end to end gains: On production grade RL workloads with Moonlight, Qwen2 VL 72B and Kimi K2, Seer improves rollout throughput by 74% to 97% and reduces long tail latency by 75% to 93% relative to a state of the art synchronous vLLM based baseline.

Editorial Comments

Seer is an important systems contribution because it optimizes the rollout phase in synchronous RL without changing the underlying GRPO algorithm, so it preserves on policy guarantees and reproducibility while fixing a real infrastructure bottleneck. The combination of divided rollout, context aware scheduling and adaptive grouped speculative decoding offers a practical template for other RL stacks that rely on long chain of thought reasoning models and large KVCache footprints. Overall, Seer shows that online context learning at the systems level is now as critical as model architecture for scaling reasoning RL efficiently.


How to Design a Mini Reinforcement Learning Environment-Acting Agent w …

In this tutorial, we code a mini reinforcement learning setup in which a multi-agent system learns to navigate a grid world through interaction, feedback, and layered decision-making. We build everything from scratch and bring together three agent roles: an Action Agent, a Tool Agent, and a Supervisor, so we can observe how simple heuristics, analysis, and oversight combine to produce more intelligent behavior. Also, we observe how the agents collaborate, refine their strategies, and gradually learn to reach the goal while overcoming obstacles and uncertainty. Check out the FULL CODES here.

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
import time
from collections import defaultdict

class GridWorld:
    def __init__(self, size=8):
        self.size = size
        self.agent_pos = [0, 0]
        self.goal_pos = [size-1, size-1]
        self.obstacles = self._generate_obstacles()
        self.visited = set()
        self.step_count = 0
        self.max_steps = size * size * 2

    def _generate_obstacles(self):
        obstacles = set()
        n_obstacles = self.size
        while len(obstacles) < n_obstacles:
            pos = (np.random.randint(1, self.size-1),
                   np.random.randint(1, self.size-1))
            if pos != (0, 0) and pos != (self.size-1, self.size-1):
                obstacles.add(pos)
        return obstacles

    def reset(self):
        self.agent_pos = [0, 0]
        self.visited = {tuple(self.agent_pos)}
        self.step_count = 0
        return self._get_state()

    def _get_state(self):
        return {
            'position': tuple(self.agent_pos),
            'goal': self.goal_pos,
            'distance_to_goal': abs(self.agent_pos[0] - self.goal_pos[0]) +
                                abs(self.agent_pos[1] - self.goal_pos[1]),
            'visited_count': len(self.visited),
            'steps': self.step_count,
            'can_move': self._get_valid_actions()
        }

    def _get_valid_actions(self):
        valid = []
        moves = {'up': [-1, 0], 'down': [1, 0], 'left': [0, -1], 'right': [0, 1]}
        for action, delta in moves.items():
            new_pos = [self.agent_pos[0] + delta[0], self.agent_pos[1] + delta[1]]
            if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size and
                    tuple(new_pos) not in self.obstacles):
                valid.append(action)
        return valid

We set up the entire GridWorld environment and define how the agent, goal, and obstacles exist in it. We establish the structure for state representation and valid movements, and we prepare the environment so we can interact with it dynamically. As we run this part, we see the world taking shape and becoming ready for the agents to explore. Check out the FULL CODES here.

class GridWorld(GridWorld):
    def step(self, action):
        self.step_count += 1
        moves = {'up': [-1, 0], 'down': [1, 0], 'left': [0, -1], 'right': [0, 1]}

        if action not in moves:
            return self._get_state(), -1, False, "Invalid action"

        delta = moves[action]
        new_pos = [self.agent_pos[0] + delta[0], self.agent_pos[1] + delta[1]]

        if not (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size):
            return self._get_state(), -1, False, "Hit wall"

        if tuple(new_pos) in self.obstacles:
            return self._get_state(), -1, False, "Hit obstacle"

        self.agent_pos = new_pos
        pos_tuple = tuple(self.agent_pos)
        reward = -0.1
        if pos_tuple not in self.visited:
            reward += 0.5
            self.visited.add(pos_tuple)

        done = False
        info = "Moved"
        if self.agent_pos == self.goal_pos:
            reward += 10
            done = True
            info = "Goal reached!"
        elif self.step_count >= self.max_steps:
            done = True
            info = "Max steps reached"

        return self._get_state(), reward, done, info

    def render(self, agent_thoughts=None):
        grid = np.zeros((self.size, self.size, 3))
        for pos in self.visited:
            grid[pos[0], pos[1]] = [0.7, 0.9, 1.0]
        for obs in self.obstacles:
            grid[obs[0], obs[1]] = [0.2, 0.2, 0.2]
        grid[self.goal_pos[0], self.goal_pos[1]] = [0, 1, 0]
        grid[self.agent_pos[0], self.agent_pos[1]] = [1, 0, 0]

        plt.figure(figsize=(10, 8))
        plt.imshow(grid, interpolation='nearest')
        plt.title(f"Step: {self.step_count} | Visited: {len(self.visited)}/{self.size*self.size}")
        for i in range(self.size + 1):
            plt.axhline(i - 0.5, color='gray', linewidth=0.5)
            plt.axvline(i - 0.5, color='gray', linewidth=0.5)
        if agent_thoughts:
            plt.text(0.5, -1.5, agent_thoughts, ha='center', fontsize=9,
                     bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
                     wrap=True, transform=plt.gca().transData)
        plt.axis('off')
        plt.tight_layout()
        plt.show()

We define how each step in the environment works and how the world is visually rendered. We calculate rewards, detect collisions, track progress, and display everything through a clean grid visualization. As we execute this logic, we watch the agent’s journey unfold in real time with clear feedback. Check out the FULL CODES here.
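Before moving on, we can run a quick smoke test of the environment. This snippet is an addition for illustration and simply exercises the GridWorld class defined above:

# Quick sanity check of the environment (assumes the GridWorld cells above were run)
env = GridWorld(size=8)
state = env.reset()
print("Start state:", state['position'], "valid moves:", state['can_move'])

next_state, reward, done, info = env.step('down')
print("After moving down:", next_state['position'], "reward:", reward, "info:", info)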

class ActionAgent:
    def __init__(self):
        self.q_values = defaultdict(lambda: defaultdict(float))
        self.epsilon = 0.3
        self.learning_rate = 0.1
        self.discount = 0.95

    def choose_action(self, state):
        valid_actions = state['can_move']
        if not valid_actions:
            return None
        pos = state['position']
        if np.random.random() < self.epsilon:
            action = np.random.choice(valid_actions)
            reasoning = f"Exploring randomly: chose '{action}'"
        else:
            action_values = {a: self.q_values[pos][a] for a in valid_actions}
            action = max(action_values, key=action_values.get)
            reasoning = f"Exploiting: chose '{action}' (Q={self.q_values[pos][action]:.2f})"
        return action, reasoning

    def learn(self, state, action, reward, next_state):
        pos = state['position']
        next_pos = next_state['position']
        current_q = self.q_values[pos][action]
        next_max_q = max([self.q_values[next_pos][a] for a in next_state['can_move']], default=0)
        new_q = current_q + self.learning_rate * (
            reward + self.discount * next_max_q - current_q)
        self.q_values[pos][action] = new_q

class ToolAgent:
    def analyze(self, state, action_taken, reward, history):
        suggestions = []
        distance = state['distance_to_goal']
        if distance <= 3:
            suggestions.append("Very close to goal! Prioritize direct path.")
        exploration_rate = state['visited_count'] / (state['steps'] + 1)
        if exploration_rate < 0.5 and distance > 5:
            suggestions.append("Low exploration rate. Consider exploring more.")
        if len(history) >= 5:
            recent_rewards = [h[2] for h in history[-5:]]
            avg_reward = np.mean(recent_rewards)
            if avg_reward < -0.5:
                suggestions.append("Negative reward trend. Try different strategy.")
            elif avg_reward > 0.3:
                suggestions.append("Good progress! Current strategy working.")
        if len(state['can_move']) <= 2:
            suggestions.append("Limited movement options. Be careful.")
        return suggestions

We implement the Action Agent and Tool Agent, giving the system both learning capability and analytical feedback. We observe how the Action Agent chooses actions through a balance of exploration and exploitation, while the Tool Agent evaluates performance and suggests improvements. Together, they create a learning loop that evolves with experience. Check out the FULL CODES here.

class SupervisorAgent:
    def decide(self, state, proposed_action, tool_suggestions):
        if not proposed_action:
            return None, "No valid actions available"

        decision = proposed_action
        reasoning = f"Approved action '{proposed_action}'"

        for suggestion in tool_suggestions:
            if "goal" in suggestion.lower() and "close" in suggestion.lower():
                goal_direction = self._get_goal_direction(state)
                if goal_direction in state['can_move']:
                    decision = goal_direction
                    reasoning = f"Override: Moving '{goal_direction}' toward goal"
                break

        return decision, reasoning

    def _get_goal_direction(self, state):
        pos = state['position']
        goal = state['goal']
        if goal[0] > pos[0]:
            return 'down'
        elif goal[0] < pos[0]:
            return 'up'
        elif goal[1] > pos[1]:
            return 'right'
        else:
            return 'left'

We introduce the Supervisor Agent, which acts as the final decision-maker in the system. We see how it interprets suggestions, overrides risky choices, and ensures that actions remain aligned with overall goals. As we use this component, we experience a coordinated multi-agent decision flow. Check out the FULL CODES here.

def train_multi_agent(episodes=5, visualize=True):
    env = GridWorld(size=8)
    action_agent = ActionAgent()
    tool_agent = ToolAgent()
    supervisor = SupervisorAgent()

    episode_rewards = []
    episode_steps = []

    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        done = False
        history = []

        print(f"\n{'='*60}")
        print(f"EPISODE {episode + 1}/{episodes}")
        print(f"{'='*60}")

        while not done:
            action_result = action_agent.choose_action(state)
            if action_result is None:
                break
            proposed_action, action_reasoning = action_result

            suggestions = tool_agent.analyze(state, proposed_action, total_reward, history)
            final_action, supervisor_reasoning = supervisor.decide(state, proposed_action, suggestions)

            if final_action is None:
                break

            next_state, reward, done, info = env.step(final_action)
            total_reward += reward
            action_agent.learn(state, final_action, reward, next_state)
            history.append((state, final_action, reward, next_state))

            if visualize:
                clear_output(wait=True)
                thoughts = (f"Action Agent: {action_reasoning}\n"
                            f"Supervisor: {supervisor_reasoning}\n"
                            f"Tool Agent: {', '.join(suggestions[:2]) if suggestions else 'No suggestions'}\n"
                            f"Reward: {reward:.2f} | Total: {total_reward:.2f}")
                env.render(thoughts)
                time.sleep(0.3)

            state = next_state

        episode_rewards.append(total_reward)
        episode_steps.append(env.step_count)

        print(f"\nEpisode {episode+1} Complete!")
        print(f"Total Reward: {total_reward:.2f}")
        print(f"Steps Taken: {env.step_count}")
        print(f"Cells Visited: {len(env.visited)}/{env.size**2}")

    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(episode_rewards, marker='o')
    plt.title('Episode Rewards')
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.grid(True, alpha=0.3)

    plt.subplot(1, 2, 2)
    plt.plot(episode_steps, marker='s', color='orange')
    plt.title('Episode Steps')
    plt.xlabel('Episode')
    plt.ylabel('Steps to Complete')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

    return action_agent, tool_agent, supervisor


if __name__ == "__main__":
    print("Multi-Agent RL System: Grid World Navigation")
    print("=" * 60)
    print("Components:")
    print("  • Action Agent: Proposes actions using Q-learning")
    print("  • Tool Agent: Analyzes performance and suggests improvements")
    print("  • Supervisor Agent: Makes final decisions")
    print("=" * 60)

    trained_agents = train_multi_agent(episodes=5, visualize=True)

We run the full training loop where all agents collaborate inside the environment across multiple episodes. We track rewards, observe movement patterns, and visualize learning progression with each trial. As we complete this loop, we see the multi-agent system improving and becoming more efficient at navigating the grid world.

In conclusion, we see how a multi-agent RL system emerges from clean components and how each layer contributes to smarter navigation: the Action Agent learns via Q-updates, the Tool Agent guides improvements, and the Supervisor ensures safe, goal-oriented action selection. We appreciate how this simple yet dynamic grid world helps us visualize learning, exploration, and decision-making in real time.
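
To make the Q-update concrete, here is a minimal standalone sketch of a single update using the same hyperparameters as the ActionAgent above (learning rate 0.1, discount 0.95); the reward and Q-values are illustrative numbers rather than outputs from an actual run.

# One Q-learning update, mirroring ActionAgent.learn with illustrative values.
learning_rate, discount = 0.1, 0.95

current_q = 0.0    # Q(state, 'right') before the update
reward = 0.4       # step cost -0.1 plus the +0.5 bonus for a newly visited cell
next_max_q = 1.2   # best Q-value reachable from the next state

new_q = current_q + learning_rate * (reward + discount * next_max_q - current_q)
print(round(new_q, 3))  # prints 0.154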

Check out the FULL CODES here.
The post How to Design a Mini Reinforcement Learning Environment-Acting Agent with Intelligent Local Feedback, Adaptive Decision-Making, and Multi-Agent Coordination appeared first on MarkTechPost.

Google DeepMind Introduces Nano Banana Pro: the Gemini 3 Pro Image Mod …

Nano Banana Pro, also called Gemini 3 Pro Image, is Google DeepMind’s new image generation and editing model built on Gemini 3 Pro. It is positioned as a state of the art system for creating and editing images that must respect structure, world knowledge and text layout, not only style. Nano Banana Pro follows Nano Banana, which was based on Gemini 2.5 Flash Image and focused on fast, casual image editing such as restoring photos and generating figurines.

From Gemini 2.5 Flash Image to Gemini 3 Pro Image

The earlier Nano Banana model targeted quick creative edits for casual creators. It helped restore old photos and build stylized 3D mini figurines with a simple prompt. Nano Banana Pro keeps that editing flow but runs on top of Gemini 3 Pro, which brings stronger reasoning and real world knowledge into the image stack.

The model can turn prototypes, data tables and handwritten notes into diagrams and infographics that reflect the underlying information, rather than producing only decorative art.

Reasoning Guided, Search Grounded Visuals

A core design point for Nano Banana Pro is reasoning guided generation. Using Gemini 3 Pro, the model can consume text, structured content and references and then plan the image as an explanation of that content. Nano Banana Pro can also connect to Google Search, using the search index as a real time knowledge source.

Clear Text and Multilingual Layouts

Text inside images is a long standing failure mode for many diffusion based generators. Nano Banana Pro addresses this explicitly. Google states that it is the best model in the Gemini family for producing images with correctly rendered and legible text, for both short taglines and full paragraphs.

Gemini 3 Pro’s multilingual reasoning flows into the image model. Nano Banana Pro can render text in multiple languages and also translate text that already appears in products or posters. The documentation shows beverage cans where English text is translated into Korean while the visual design and layout stay unchanged.

Studio Level Control, Consistency and Upscaling

Nano Banana Pro exposes a set of controls aimed at design and production workflows rather than single shot art prompts. On the composition side, the model can use up to 14 input images and maintain the consistency and resemblance of up to 5 people in one workflow. This supports tasks such as combining reference photos into a single fashion editorial, transforming sketches into product shots or keeping the same cast across multiple scenes.

The studio control section of the model page lists several families of controls. Users can vary camera angle and shot type, including wide shot, panoramic and close up, while controlling depth of field and focus on specific subjects in the image. Color and lighting can be adjusted, for example changing day to night, replacing volumetric lighting with bokeh or applying a strong chiaroscuro effect without losing subject identity.

Nano Banana Pro supports explicit upscaling. The official Google blog states that it can generate crisp visuals at 1k, 2k or 4k resolution, and provides examples of progressive zoom in operations that keep detail and composition. Aspect ratio is also programmable. Prompts can convert between ratios such as 1:1, 4:3, 16:9 and cinematic formats while keeping the main character locked in place and adjusting only the background.
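
As a rough illustration of how such controls might be exercised programmatically, the sketch below uses the google-genai Python SDK pattern for image-capable Gemini models. The model identifier, API key, and prompt are assumptions (the article does not specify an API model ID for Nano Banana Pro), so treat this as a sketch rather than official usage.

# Hypothetical sketch using the google-genai SDK. The model ID below is an
# assumption; look up the actual Nano Banana Pro (Gemini 3 Pro Image) identifier
# in the Gemini API model list before using it.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-3-pro-image-preview",  # assumed identifier
    contents="A product hero shot of a ceramic mug on a walnut desk, soft morning "
             "light, 16:9 aspect ratio, keep the mug centered while extending the background.",
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# Save any returned image parts to disk.
for i, part in enumerate(response.candidates[0].content.parts):
    if part.inline_data:
        with open(f"mug_{i}.png", "wb") as f:
            f.write(part.inline_data.data)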

Key Takeaways

Nano Banana Pro is Gemini 3 Pro Image, an upgraded image generation and editing model that succeeds Nano Banana, which was based on Gemini 2.5 Flash Image, and is optimized for higher quality and control.

The model integrates Gemini 3 Pro reasoning and Google Search grounding so it can turn factual content, documents and real time data into infographics, recipes, process diagrams and other information dense visuals.

It provides strong text rendering and multilingual support, producing legible typography in images and enabling translation or localization of existing on image text while preserving layout and design.

Nano Banana Pro supports up to 14 input images and maintains resemblance for up to 5 people, with studio style controls for camera angle, depth of field, lighting, aspect ratios and upscaling to 1k, 2k and 4k resolutions.

The model is being deployed across Gemini app, AI Mode in Search, NotebookLM, Google Ads, Workspace apps, Gemini API, Google AI Studio, Vertex AI, Antigravity and Flow, with all outputs watermarked using SynthID plus tier specific visible watermarks.

Editorial Comments

Nano Banana Pro positions Gemini 3 Pro Image as a production oriented image system that links Gemini 3 Pro reasoning, Google Search grounding and structured controls for layout, text and upscaling. It directly addresses long standing issues in text rendering, multilingual localization and subject consistency, while keeping SynthID and visible watermarks as default provenance signals across tiers and surfaces. This launch moves Google’s image stack closer to an integrated, API first visual platform for developers and enterprises.

Check out the Technical details.
The post Google DeepMind Introduces Nano Banana Pro: the Gemini 3 Pro Image Model for Text Accurate and Studio Grade Visuals appeared first on MarkTechPost.

An Implementation of Fully Traced and Evaluated Local LLM Pipeline Usi …

In this tutorial, we implement a complete workflow for building, tracing, and evaluating an LLM pipeline using Opik. We structure the system step-by-step, beginning with a lightweight model, adding prompt-based planning, creating a dataset, and finally running automated evaluations. As we move through each snippet, we see how Opik helps us track every function span, visualize the pipeline’s behavior, and measure output quality with clear, reproducible metrics. By the end, we have a fully instrumented QA system that we can extend, compare, and monitor with ease. Check out the FULL CODES here.

!pip install -q opik transformers accelerate torch

import torch
from transformers import pipeline
import textwrap

import opik
from opik import Opik, Prompt, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio

device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda" if device == 0 else "cpu")

opik.configure()
PROJECT_NAME = "opik-hf-tutorial"

We set up our environment by installing the required libraries and initializing Opik. We load the core modules, detect the device, and configure our project so that every trace flows into the correct workspace. We lay the foundation for the rest of the tutorial. Check out the FULL CODES here.

llm = pipeline(
    "text-generation",
    model="distilgpt2",
    device=device,
)

def hf_generate(prompt: str, max_new_tokens: int = 80) -> str:
    result = llm(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.3,
        pad_token_id=llm.tokenizer.eos_token_id,
    )[0]["generated_text"]
    return result[len(prompt):].strip()

We load a lightweight Hugging Face model and create a small helper function to generate text cleanly. We prepare the LLM to operate locally without external APIs. This gives us a reliable and reproducible generation layer for the rest of the pipeline. Check out the FULL CODES here.

plan_prompt = Prompt(
    name="hf_plan_prompt",
    prompt=textwrap.dedent("""
        You are an assistant that creates a plan to answer a question
        using ONLY the given context.

        Context:
        {{context}}

        Question:
        {{question}}

        Return exactly 3 bullet points as a plan.
    """).strip(),
)

answer_prompt = Prompt(
    name="hf_answer_prompt",
    prompt=textwrap.dedent("""
        You answer based only on the given context.

        Context:
        {{context}}

        Question:
        {{question}}

        Plan:
        {{plan}}

        Answer the question in 2-4 concise sentences.
    """).strip(),
)

We define two structured prompts using Opik’s Prompt class. We control the planning phase and answering phase through clear templates. This helps us maintain consistency and observe how structured prompting impacts model behavior. Check out the FULL CODES here.

DOCS = {
    "overview": """
        Opik is an open-source platform for debugging, evaluating,
        and monitoring LLM and RAG applications. It provides tracing,
        datasets, experiments, and evaluation metrics.
    """,
    "tracing": """
        Tracing in Opik logs nested spans, LLM calls, token usage,
        feedback scores, and metadata to inspect complex LLM pipelines.
    """,
    "evaluation": """
        Opik evaluations are defined by datasets, evaluation tasks,
        scoring metrics, and experiments that aggregate scores,
        helping detect regressions or issues.
    """,
}

@track(project_name=PROJECT_NAME, type="tool", name="retrieve_context")
def retrieve_context(question: str) -> str:
    q = question.lower()
    if "trace" in q or "span" in q:
        return DOCS["tracing"]
    if "metric" in q or "dataset" in q or "evaluate" in q:
        return DOCS["evaluation"]
    return DOCS["overview"]

We construct a tiny document store and a retrieval function that Opik tracks as a tool. We let the pipeline select context based on the user’s question. This allows us to simulate a minimal RAG-style workflow without needing an actual vector database. Check out the FULL CODES here.

@track(project_name=PROJECT_NAME, type="llm", name="plan_answer")
def plan_answer(context: str, question: str) -> str:
    rendered = plan_prompt.format(context=context, question=question)
    return hf_generate(rendered, max_new_tokens=80)

@track(project_name=PROJECT_NAME, type="llm", name="answer_from_plan")
def answer_from_plan(context: str, question: str, plan: str) -> str:
    rendered = answer_prompt.format(
        context=context,
        question=question,
        plan=plan,
    )
    return hf_generate(rendered, max_new_tokens=120)

@track(project_name=PROJECT_NAME, type="general", name="qa_pipeline")
def qa_pipeline(question: str) -> str:
    context = retrieve_context(question)
    plan = plan_answer(context, question)
    answer = answer_from_plan(context, question, plan)
    return answer

print("Sample answer:\n", qa_pipeline("What does Opik help developers do?"))

We bring together planning, reasoning, and answering in a fully traced LLM pipeline. We capture each step with Opik’s decorators so we can analyze spans in the dashboard. By testing the pipeline, we confirm that all components integrate smoothly. Check out the FULL CODES here.

client = Opik()

dataset = client.get_or_create_dataset(
    name="HF_Opik_QA_Dataset",
    description="Small QA dataset for HF + Opik tutorial",
)

dataset.insert([
    {
        "question": "What kind of platform is Opik?",
        "context": DOCS["overview"],
        "reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
    },
    {
        "question": "What does tracing in Opik log?",
        "context": DOCS["tracing"],
        "reference": "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata.",
    },
    {
        "question": "What are the components of an Opik evaluation?",
        "context": DOCS["evaluation"],
        "reference": "An Opik evaluation uses datasets, evaluation tasks, scoring metrics and experiments that aggregate scores.",
    },
])

We create and populate a dataset inside Opik that our evaluation will use. We insert multiple question–answer pairs that cover different aspects of Opik. This dataset will serve as the ground truth for our QA evaluation later. Check out the FULL CODES here.

equals_metric = Equals()
lev_metric = LevenshteinRatio()

def evaluation_task(item: dict) -> dict:
    output = qa_pipeline(item["question"])
    return {
        "output": output,
        "reference": item["reference"],
    }

We define the evaluation task and select two metrics—Equals and LevenshteinRatio—to measure model quality. We ensure the task produces outputs in the exact format required for scoring. This connects our pipeline to Opik’s evaluation engine. Check out the FULL CODES here.
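
For intuition about what the Levenshtein-style ratio rewards, the short check below compares a close answer and an unrelated answer against one of our references using difflib. This is only an approximation for illustration, not Opik's internal LevenshteinRatio implementation.

# Rough intuition check: closer strings score nearer to 1.0. This uses difflib
# as a stand-in and is not Opik's internal LevenshteinRatio implementation.
from difflib import SequenceMatcher

reference = "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata."
close_output = "Tracing in Opik logs nested spans, LLM calls, token usage, and metadata."
unrelated_output = "Opik is a vector database for storing embeddings."

for label, output in [("close", close_output), ("unrelated", unrelated_output)]:
    print(label, round(SequenceMatcher(None, reference, output).ratio(), 2))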

evaluation_result = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[equals_metric, lev_metric],
    experiment_name="HF_Opik_QA_Experiment",
    project_name=PROJECT_NAME,
    task_threads=1,
)

print("\nExperiment URL:", evaluation_result.experiment_url)

We run the evaluation experiment using Opik’s evaluate function. We keep the execution sequential for stability in Colab. Once complete, we receive a link to view the experiment details inside the Opik dashboard. Check out the FULL CODES here.

agg = evaluation_result.aggregate_evaluation_scores()

print("\nAggregated scores:")
for metric_name, stats in agg.aggregated_scores.items():
    print(metric_name, "=>", stats)

We aggregate and print the evaluation scores to understand how well our pipeline performs. We inspect the metric results to see where outputs align with references and where improvements are needed. This closes the loop on our fully instrumented LLM workflow.

In conclusion, we set up a small but fully functional LLM evaluation ecosystem powered entirely by Opik and a local model. We observe how traces, prompts, datasets, and metrics come together to give us transparent visibility into the model’s reasoning process. As we finalize our evaluation and review the aggregated scores, we appreciate how Opik lets us iterate quickly, experiment systematically, and validate improvements in a structured and reliable way.

Check out the FULL CODES here.
The post An Implementation of Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows appeared first on MarkTechPost.

Perplexity AI Releases TransferEngine and pplx garden to Run Trillion …

How can teams run trillion parameter language models on existing mixed GPU clusters without costly new hardware or deep vendor lock in? Perplexity’s research team has released TransferEngine and the surrounding pplx garden toolkit as open source infrastructure for large language model systems. This provides a way to run models with up to 1 trillion parameters across mixed GPU clusters, without locking into a single cloud provider or buying new GB200 class hardware.

https://arxiv.org/pdf/2510.27656

The real bottleneck, network fabrics not FLOPs

Modern deployments of Mixture of Experts models such as DeepSeek V3 with 671 billion parameters and Kimi K2 with 1 trillion parameters no longer fit on a single 8 GPU server. They must span multiple nodes, so the main constraint becomes the network fabric between GPUs.

Here the hardware landscape is fragmented. NVIDIA ConnectX 7 typically uses Reliable Connection transport with in order delivery. AWS Elastic Fabric Adapter uses Scalable Reliable Datagram transport that is reliable but out of order, and a single GPU may need 4 network adapters at 100 Gbps, or 2 at 200 Gbps, to reach 400 Gbps.

Existing libraries such as DeepEP, NVSHMEM, MoonCake and NIXL tend to optimize for one vendor and degrade or lack support on the other side. Perplexity’s research team directly states in the research paper that there was no viable cross provider solution for LLM inference before this work.

TransferEngine, a portable RDMA layer for LLM systems

TransferEngine addresses this by targeting only the intersection of guarantees across Network Interface Controllers. It assumes that the underlying RDMA transport is reliable, but does not assume any ordering of messages. On top of this, it exposes one sided WriteImm operations and an ImmCounter primitive for completion notification.

The library provides a minimal API in Rust. It offers two sided Send and Recv for control messages, and three main one sided operations, submit_single_write, submit_paged_writes, and submit_scatter, plus a submit_barrier primitive for synchronization across a group of peers. A NetAddr structure identifies peers and an MrDesc structure describes registered memory regions. An alloc_uvm_watcher call creates a device side watcher for CPU GPU synchronization in advanced pipelines.

Internally, TransferEngine spawns one worker thread per GPU and builds a DomainGroup per GPU that coordinates between 1 and 4 RDMA Network Interface Controllers. A single ConnectX 7 provides 400 Gbps. On EFA, the DomainGroup aggregates 4 network adapters at 100 Gbps, or 2 at 200 Gbps, to reach the same bandwidth. The sharding logic knows about all Network Interface Controllers and can split a transfer across them.
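
As a purely conceptual illustration of that sharding idea (TransferEngine itself is written in Rust, and this is not its code), the snippet below splits one transfer across a DomainGroup's NICs in proportion to their line rates, matching the 4 x 100 Gbps and 2 x 200 Gbps EFA configurations mentioned above.

# Conceptual illustration only, not TransferEngine code: split one transfer
# across a DomainGroup's NICs in proportion to each NIC's line rate.
def shard_transfer(total_bytes, nic_gbps):
    total_rate = sum(nic_gbps)
    shards = [total_bytes * rate // total_rate for rate in nic_gbps]
    shards[-1] += total_bytes - sum(shards)  # assign any rounding remainder to the last NIC
    return shards

# A 1 GiB transfer on an EFA host with 4 x 100 Gbps adapters: four equal shards.
print(shard_transfer(1 << 30, [100, 100, 100, 100]))
# The same transfer on a host with 2 x 200 Gbps adapters: two equal shards.
print(shard_transfer(1 << 30, [200, 200]))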

Across hardware, the research team reports peak throughput of 400 Gbps on both NVIDIA ConnectX 7 and AWS EFA. This matches single platform solutions and confirms that the abstraction layer does not leave large performance on the table.

https://arxiv.org/pdf/2510.27656

pplx garden, the open source package

TransferEngine ships as part of the pplx garden repository on GitHub under an MIT license. The directory structure is straightforward. fabric-lib contains the RDMA TransferEngine library, p2p-all-to-all implements a Mixture of Experts all to all kernel, python-ext provides the Python extension module from the Rust core, and python/pplx_garden contains the Python package code.

The system requirements reflect a modern GPU cluster. The Perplexity research team recommends Linux kernel 5.12 or newer for DMA BUF support, CUDA 12.8 or newer, libfabric, libibverbs, GDRCopy, and an RDMA fabric with GPUDirect RDMA enabled. Each GPU should have at least one dedicated RDMA Network Interface Controller.

Disaggregated prefill and decode

The first production use case is disaggregated inference. Prefill and decode run on separate clusters, so the system must stream KvCache from prefill GPUs to decode GPUs at high speed.

TransferEngine uses alloc_uvm_watcher to track progress in the model. During prefill, the model increments a watcher value after each layer’s attention output projection. When the worker observes a change, it issues paged writes for the KvCache pages of that layer, followed by a single write for the remaining context. This approach allows layer by layer streaming of cache pages without fixed world membership, and it avoids the strict ordering constraints of collectives.

https://arxiv.org/pdf/2510.27656

Fast weight transfer for reinforcement learning

The second system is asynchronous reinforcement learning fine tuning, where training and inference run on separate GPU pools. Traditional designs gather updated parameters to a single rank then broadcast them, which limits throughput to one Network Interface Controller.

The Perplexity research team instead uses TransferEngine to perform point to point weight transfer. Each training GPU writes its parameter shard directly into the corresponding inference GPUs using one sided writes. A pipelined execution splits each tensor into stages: host to device copy when Fully Sharded Data Parallel offloads weights, reconstruction and optional quantization, RDMA transfer, and a barrier implemented through scatter and ImmCounter.

In production, this setup delivers weight updates for models such as Kimi K2 at 1 trillion parameters and DeepSeek V3 at 671 billion parameters in about 1.3 seconds from 256 training GPUs to 128 inference GPUs.

https://arxiv.org/pdf/2510.27656

Mixture of Experts routing across ConnectX and EFA

The third piece in pplx garden is a point to point Mixture of Experts dispatch and combine kernel. It uses NVLink for intra node traffic and RDMA for inter node traffic. Dispatch and combine are split into separate send and receive phases so that the decoder can micro batch and overlap communication with grouped general matrix multiply.

A host proxy thread polls GPU state and calls TransferEngine when send buffers are ready. Routes are exchanged first, then each rank computes contiguous receive offsets for each expert and writes tokens into private buffers that can be reused between dispatch and combine. This reduces memory footprint and keeps writes large enough to use the full link bandwidth.

On ConnectX 7, the Perplexity research team reports state of the art decode latency that is competitive with DeepEP across expert counts. On AWS EFA, the same kernel delivers the first viable MoE decode latencies, with higher but still practical values.

In multi node tests with DeepSeek V3 and Kimi K2 on AWS H200 instances, distributing the model across nodes reduces latency at medium batch sizes, which is the common regime for production serving.

Comparison Table

Primary role
TransferEngine (pplx garden): Portable RDMA point to point for LLM systems
DeepEP: MoE all to all dispatch and combine
NVSHMEM (generic MoE use): General GPU shared memory and collectives
Mooncake: Distributed KV cache for LLM inference

Hardware focus
TransferEngine (pplx garden): NVIDIA ConnectX 7 and AWS EFA, multi NIC per GPU
DeepEP: NVIDIA ConnectX with GPU initiated RDMA IBGDA
NVSHMEM (generic MoE use): NVIDIA GPUs on RDMA fabrics including EFA
Mooncake: RDMA NICs in KV centric serving stacks

EFA status
TransferEngine (pplx garden): Full support, peak 400 Gbps reported
DeepEP: No support, requires IBGDA on ConnectX
NVSHMEM (generic MoE use): API works but MoE use shows severe degradation on EFA
Mooncake: Paper reports no EFA support in its RDMA engine

Portability for LLM systems
TransferEngine (pplx garden): Cross vendor, single API across ConnectX 7 and EFA
DeepEP: Vendor specific and ConnectX focused
NVSHMEM (generic MoE use): NVIDIA centric, not viable for EFA MoE routing
Mooncake: Focused on KV sharing, no cross provider support

Key Takeaways

TransferEngine gives a single RDMA point to point abstraction that works on both NVIDIA ConnectX 7 and AWS EFA, and manages multiple Network Interface Controllers per GPU transparently.

The library exposes one sided WriteImm with ImmCounter, and achieves peak 400 Gbps throughput on both NIC families, which lets it match single vendor stacks while remaining portable.

Perplexity team uses TransferEngine in three production systems, disaggregated prefill decode with KvCache streaming, reinforcement learning weight transfer that updates trillion parameter models in about 1.3 seconds, and Mixture of Experts dispatch combine for large models like Kimi K2.

On ConnectX 7, pplx garden’s MoE kernels provide state of the art decode latency and exceed DeepEP on the same hardware, while on EFA they deliver the first practical MoE latencies for trillion parameter workloads.

Because TransferEngine is open source in pplx garden under an MIT license, teams can run very large Mixture of Experts and dense models on heterogeneous H100 or H200 clusters across cloud providers, without rewriting for each vendor specific networking stack.

Editorial Comments

Perplexity’s release of TransferEngine and pplx garden is a practical contribution for LLM infra teams who are blocked by vendor specific networking stacks and expensive fabric upgrades. A portable RDMA abstraction that reaches peak 400 Gbps on both NVIDIA ConnectX 7 and AWS EFA, supports KvCache streaming, fast reinforcement learning weight transfer, and Mixture of Experts routing, directly addresses trillion parameter serving constraints for real systems.

Check out the Paper and Repo.
The post Perplexity AI Releases TransferEngine and pplx garden to Run Trillion Parameter LLMs on Existing GPU Clusters appeared first on MarkTechPost.

Streamline AI operations with the Multi-Provider Generative AI Gateway …

As organizations increasingly adopt AI capabilities across their applications, the need for centralized management, security, and cost control of AI model access is a required step in scaling AI solutions. The Generative AI Gateway on AWS guidance addresses these challenges by providing guidance for a unified gateway that supports multiple AI providers while offering comprehensive governance and monitoring capabilities.
The Generative AI Gateway is a reference architecture for enterprises looking to implement end-to-end generative AI solutions featuring multiple models, data-enriched responses, and agent capabilities in a self-hosted way. This guidance combines the broad model access of Amazon Bedrock, unified developer experience of Amazon SageMaker AI, and the robust management capabilities of LiteLLM, all while supporting customer access to models from external model providers in a more secure and reliable manner.
LiteLLM is an open source project that addresses common challenges faced by customers deploying generative AI workloads. LiteLLM simplifies multi-provider model access while standardizing production operational requirements including cost tracking, observability, prompt management, and more. In this post we’ll introduce how the Multi-Provider Generative AI Gateway reference architecture provides guidance for deploying LiteLLM into an AWS environment for production generative AI workload management and governance.
The challenge: Managing multi-provider AI infrastructure
Organizations building with generative AI face several complex challenges as they scale their AI initiatives:

Provider fragmentation: Teams often need access to different AI models from various providers—Amazon Bedrock, Amazon SageMaker AI, OpenAI, Anthropic, and others—each with different APIs, authentication methods, and billing models.
Decentralized governance model: Without a unified access point, organizations struggle to implement consistent security policies, usage monitoring, and cost controls across different AI services.
Operational complexity: Managing multiple access paradigms ranging from AWS Identity and Access Management roles to API keys, model-specific rate limits, and failover strategies across providers creates operational overhead and increases the risk of service disruptions.
Cost management: Understanding and controlling AI spending across multiple providers and teams becomes increasingly difficult, particularly as usage scales.
Security and compliance: Facilitating consistent security policies and audit trails across different AI providers presents significant challenges for enterprise governance.

Multi-Provider Generative AI Gateway reference architecture
This guidance addresses these common customer challenges by providing a centralized gateway that abstracts the complexity of multiple AI providers behind a single, managed interface.

Built on AWS services and using the open source LiteLLM project, organizations can use this solution to integrate with AI providers while maintaining centralized control, security, and observability.
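
Because LiteLLM exposes an OpenAI-compatible API, applications can typically call the gateway with the standard OpenAI SDK. The sketch below is illustrative: the base URL, virtual key, and model alias are placeholders your gateway administrator would configure, not values defined by this guidance.

# Hypothetical client call against a deployed gateway. The base_url, virtual key,
# and model alias are placeholders configured by your gateway administrator.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-gateway.example.com/v1",  # gateway CloudFront or ALB endpoint
    api_key="sk-your-litellm-virtual-key",           # virtual key issued in the LiteLLM admin UI
)

response = client.chat.completions.create(
    model="bedrock-claude-sonnet",  # alias the gateway maps to a provider model
    messages=[{"role": "user", "content": "Summarize our Q3 support ticket themes."}],
)
print(response.choices[0].message.content)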

Flexible deployment options on AWS
The Multi-Provider Generative AI Gateway supports multiple deployment patterns to meet diverse organizational needs:
Amazon ECS deployment
For teams preferring containerized applications with managed infrastructure, the ECS deployment provides serverless container orchestration with automatic scaling and integrated load balancing.
Amazon EKS deployment
Organizations with existing Kubernetes expertise can use the EKS deployment option, which provides full control over container orchestration while benefiting from a managed Kubernetes control plane. Customers can deploy a new cluster or leverage existing clusters for deployment.
The reference architecture provided for these deployment options should undergo security testing based on your organization's specific security requirements. Conduct additional security testing and review as necessary before deploying anything into production.
Network architecture options
The Multi-Provider Generative AI Gateway supports multiple network architecture options:
Global Public-Facing Deployment
For AI services with global user bases, combine the gateway with Amazon CloudFront (CloudFront) and Amazon Route 53. This configuration provides:

Enhanced security with AWS Shield DDoS protection
Simplified HTTPS management with the Amazon CloudFront default certificates
Global edge caching for improved latency
Intelligent traffic routing across regions

Regional direct access
For single-Region deployments prioritizing low latency and cost optimization, direct access to the Application Load Balancer (ALB) removes the CloudFront layer while maintaining security through properly configured security groups and network ACLs.
Private internal access
Organizations requiring complete isolation can deploy the gateway within a private VPC without internet exposure. This configuration makes sure that the AI model access remains within your secure network perimeter, with ALB security groups restricting traffic to authorized private subnet CIDRs only.
Comprehensive AI governance and management
The Multi-Provider Generative AI Gateway is built to enable robust AI governance standards from a straightforward administrative interface. In addition to policy-based configuration and access management, users can configure advanced capabilities like load-balancing and prompt caching.
Centralized administration interface
The Generative AI Gateway includes a web-based administrative interface in LiteLLM that supports comprehensive management of LLM usage across your organization.
Key capabilities include:
User and team management: Configure access controls at granular levels, from individual users to entire teams, with role-based permissions that align with your organizational structure.
API key management: Centrally manage and rotate API keys for the connected AI providers while maintaining audit trails of key usage and access patterns.
Budget controls and alerting: Set spending limits across providers, teams, and individual users with automated alerts when thresholds are approached or exceeded.
Comprehensive cost controls: Costs are influenced by AWS infrastructure and LLM providers. While it is the customer’s responsibility to configure this solution to meet their cost requirements, customers may review the existing cost settings for additional guidance.
Supports multiple model providers: Compatible with Boto3, OpenAI, and LangGraph SDK, allowing customers to use the best model for the workload regardless of the provider.
Support for Amazon Bedrock Guardrails: Customers can leverage guardrails created on Amazon Bedrock Guardrails for their generative AI workloads, regardless of the model provider.
Intelligent routing and resilience
Common considerations around model deployment include model and prompt resiliency. These factors determine how failures are handled when the system responds to a prompt or accesses data stores.
Load balancing and failover: The gateway implements sophisticated routing logic that distributes requests across multiple model deployments and automatically fails over to backup providers when issues are detected.
Retry logic: Built-in retry mechanisms with exponential back-off facilitate reliable service delivery even when individual providers experience transient issues.
Prompt caching: Intelligent caching helps reduce costs by avoiding duplicate requests to expensive AI models while maintaining response accuracy.
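
To illustrate the kind of routing policy described above, the sketch below uses LiteLLM's Router to spread one logical model name across two deployments with retries; the model identifiers and policy are illustrative assumptions, not the configuration shipped with this guidance.

# Illustrative LiteLLM Router sketch: two deployments behind one model name,
# with retries. Model IDs and keys are placeholders, not this guidance's config.
from litellm import Router

router = Router(
    model_list=[
        {  # primary deployment on Amazon Bedrock
            "model_name": "claude-sonnet",
            "litellm_params": {"model": "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0"},
        },
        {  # secondary deployment through the Anthropic API
            "model_name": "claude-sonnet",
            "litellm_params": {"model": "anthropic/claude-3-5-sonnet-20240620", "api_key": "sk-..."},
        },
    ],
    num_retries=2,
)

resp = router.completion(
    model="claude-sonnet",
    messages=[{"role": "user", "content": "Classify this support ticket by urgency."}],
)
print(resp.choices[0].message.content)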
Advanced policy management
Model deployment architecture can range from the simple to highly complex. The Multi-Provider Generative AI Gateway features the advanced policy management tools needed to maintain a strong governance posture.
Rate limiting: Configure sophisticated rate limiting policies that can vary by user, API key, model type, or time of day to facilitate fair resource allocation and help prevent abuse.
Model access controls: Restrict access to specific AI models based on user roles, making sure that sensitive or expensive models are only accessible to authorized personnel.
Custom routing rules: Implement business logic that routes requests to specific providers based on criteria such as request type, user location, or cost optimization requirements.
Monitoring and observability
As AI workloads grow to include more components, so too do observability needs. The Multi-Provider Generative AI Gateway architecture integrates with Amazon CloudWatch, which enables users to configure a range of monitoring and observability solutions, including open-source tools such as Langfuse.
Comprehensive logging and analytics
The gateway interactions are automatically logged to CloudWatch, providing detailed insights into:

Request patterns and usage trends across providers and teams
Performance metrics including latency, error rates, and throughput
Cost allocation and spending patterns by user, team, and model type
Security events and access patterns for compliance reporting

Built-in troubleshooting
The administrative interface provides real-time log viewing capabilities so administrators can quickly diagnose and resolve usage issues without needing to access CloudWatch directly.

Amazon SageMaker integration for expanded model access
Amazon SageMaker enhances the Multi-Provider Generative AI Gateway guidance by providing a comprehensive machine learning platform that integrates with the gateway's architecture. By using the Amazon SageMaker managed infrastructure for model training, deployment, and hosting, organizations can develop custom foundation models or fine-tune existing ones that can be accessed through the gateway alongside models from other providers. This integration removes the need for separate infrastructure management while facilitating consistent governance across both custom and third-party models. SageMaker AI model hosting capabilities expand the gateway's model access to include self-hosted models, as well as those available on Amazon Bedrock, OpenAI, and other providers.
Our open source contributions
This reference architecture builds upon our contributions to the LiteLLM open source project, enhancing its capabilities for enterprise deployment on AWS. Our enhancements include improved error handling, enhanced security features, and optimized performance for cloud-native deployments.
Getting started
The Multi-Provider Generative AI Gateway reference architecture is available today through our GitHub repository, complete with:

Infrastructure-as-Code: Amazon CloudFormation and AWS Cloud Development Kit (CDK) templates for automated deployment into an Amazon ECS cluster
Comprehensive documentation: Step-by-step deployment guides and configuration examples
Interactive workshop: Hands-on learning experience to explore the gateway capabilities
Detailed deployment guide: Deployment blog on AWS Builder Center

The code repository describes several flexible deployment options to get started.
Public gateway with global CloudFront distribution
Use CloudFront to provide a globally distributed, low-latency access point for your generative AI services. The CloudFront edge locations deliver content quickly to users around the world, while AWS Shield Standard helps protect against DDoS attacks. This is the recommended configuration for public-facing AI services with a global user base.
Custom domain with CloudFront
For a more branded experience, you can configure the gateway to use your own custom domain name, while still benefiting from the performance and security features of CloudFront. This option is ideal if you want to maintain consistency with your company’s online presence.
Direct access via public Application Load Balancer
Customers who prioritize low-latency over global distribution can opt for a direct-to-ALB deployment, without the CloudFront layer. This simplified architecture can offer cost savings, though it requires extra consideration for web application firewall protection.
Private VPC-only access
For a high level of security, you can deploy the gateway entirely within a private VPC, isolated from the public internet. This configuration is well-suited for processing sensitive data or deploying internal-facing generative AI services. Access is restricted to trusted networks like VPN, Direct Connect, VPC peering, or AWS Transit Gateway.
Learn more and deploy today
Ready to simplify your multi-provider AI infrastructure? Access the complete solution package to explore an interactive learning experience with step-by-step guidance describing each step of the deployment and management process.
Conclusion
The Multi-Provider Generative AI Gateway is a solution guidance intended to help customers get started with generative AI solutions in a well-architected manner, while taking advantage of the AWS ecosystem of services and complementary open-source packages. Customers can work with models from Amazon Bedrock, Amazon SageMaker JumpStart, or third-party model providers. Operations and management of workloads are conducted through the LiteLLM management interface, and customers can choose to host on ECS or EKS based on their preference.
In addition, we have published a sample that integrates the gateway into an agentic customer service application. The agentic system is orchestrated using LangGraph and deployed on Amazon Bedrock AgentCore. LLM calls are routed through the gateway, providing the flexibility to test agents with different models–whether hosted on AWS or another provider.
This guidance is just one part of a mature generative AI foundation on AWS. For deeper reading on the components of a generative AI system on AWS, see Architect a mature generative AI foundation on AWS, which describes additional components of a generative AI system.

About the authors
Dan Ferguson is a Sr. Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.
Bobby Lindsey is a Machine Learning Specialist at Amazon Web Services. He’s been in technology for over a decade, spanning various technologies and multiple roles. He is currently focused on combining his background in software engineering, DevOps, and machine learning to help customers deliver machine learning workflows at scale. In his spare time, he enjoys reading, research, hiking, biking, and trail running.
Nick McCarthy is a Generative AI Specialist at AWS. He has worked with AWS clients across various industries including healthcare, finance, sports, telecoms and energy to accelerate their business outcomes through the use of AI/ML. Outside of work he loves to spend time traveling, trying new cuisines and reading about science and technology. Nick has a Bachelors degree in Astrophysics and a Masters degree in Machine Learning.
Chaitra Mathur is a GenAI Specialist Solutions Architect at AWS. She works with customers across industries in building scalable generative AI platforms and operationalizing them. Throughout her career, she has shared her expertise at numerous conferences and has authored several blogs in the Machine Learning and Generative AI domains.
Sreedevi Velagala is a Solution Architect within the World-Wide Specialist Organization Technology Solutions team at Amazon Web Services, based in New Jersey. She has been focused on delivering tailored solutions and guidance aligned with the unique needs of diverse clientele across AI/ML, Compute, Storage, Networking and Analytics domains. She has been instrumental in helping customers learn how AWS can lower the compute costs for machine learning workloads using Graviton, Inferentia and Trainium. She leverages her deep technical knowledge and industry expertise to deliver tailored solutions that align with each client’s unique business needs and requirements.

Deploy geospatial agents with Foursquare Spatial H3 Hub and Amazon Sag …

Organizations have used geospatial machine learning (ML) for property risk assessment, disaster response, and infrastructure planning. These systems worked well but couldn’t scale beyond specialized use cases. Each question required multiple geospatial datasets, each with its own model and often its own workflow, limiting these capabilities to a handful of high-value use cases at the largest enterprises that could afford the investment. In this post, you’ll learn how to deploy geospatial AI agents that can answer complex spatial questions in minutes instead of months. By combining Foursquare Spatial H3 Hub’s analysis-ready geospatial data with reasoning models deployed on Amazon SageMaker AI, you can build agents that enable nontechnical domain experts to perform sophisticated spatial analysis through natural language queries—without requiring geographic information system (GIS) expertise or custom data engineering pipelines.
Geospatial intelligence adoption barriers
Two technical barriers have prevented these specialized geospatial systems from achieving broader adoption. First, geospatial data arrives in a bewildering array of formats—satellite imagery stored as GeoTIFF rasters, administrative boundaries stored as shapefile vectors, weather models stored as NetCDF grids, and property records in proprietary cadastral formats—each requiring different parsing libraries and custom data pipelines. Second, joining datasets across spatial granularities is nontrivial: property insurance data geocoded to individual addresses must combine with climate risk data at 1 km grid cells and census demographics aggregated to block groups, requiring organizations to spend months building custom processing pipelines before answering their first business question. In short, there is no universal join key to combine these datasets. This means organizations can’t experiment with geospatial intelligence without first building data engineering pipelines to normalize diverse formats, implement spatial processing for coordinate transformations and resolution resampling, and deploy specialized computing infrastructure.
Solving technical barriers alone wasn’t sufficient. Earlier systems still required 6–12 month implementations with specialized GIS teams. Five enterprise requirements remained unaddressed: making geospatial analysis accessible to nontechnical domain experts, showing how AI reaches conclusions, supporting flexible analysis, delivering interactive response times, and offering cost predictability at scale.
Three technologies converging to address adoption challenges
Addressing these technical and enterprise barriers requires a fundamentally different approach. This architecture combines three technologies to address those gaps:

Foursquare Spatial H3 Hub for analysis-ready data – This service transforms inaccessible raster and vector geospatial data into analysis-ready features, indexed to the H3 hierarchical grid system, in tabular format that data scientists can query using familiar tools such as Spark, Python, and DuckDB. Datasets containing latitude and longitude coordinates, city names, or zip codes can be easily enriched by joining on a common H3 cell, eliminating months of data preparation and specialized GIS expertise.
Reasoning models and agentic AI for adaptive workflows – Models such as DeepSeek-R1 and Llama 3 break down complex problems, reason through multistep workflows, and orchestrate actions across data sources. They dynamically determine which datasets to combine and plan analytical sequences that previously required GIS expertise—transforming static, preconfigured workflows into adaptive reasoning systems.
Amazon SageMaker AI for cost-effective generative AI inference – This Amazon SageMaker AI capability provides managed infrastructure for deploying open source models with optimized inference runtimes, auto scaling, and operational tooling. Teams can focus on building geospatial intelligence capabilities rather than managing underlying infrastructure.

Together, these technologies enable organizations to access analysis-ready geospatial data, deploy adaptive reasoning agents, and run production inference without building specialized infrastructure. In this post, we demonstrate a production geospatial agent that combines Foursquare Spatial H3 Hub with reasoning models deployed on Amazon SageMaker AI.
Analysis-ready geospatial data with Foursquare Spatial H3 Hub
Foursquare’s Spatial H3 Hub eliminates traditional geospatial adoption barriers through a proprietary H3 indexing engine. This engine has transformed dozens of disparate geospatial datasets into an Iceberg catalog ready for immediate analysis, replacing months of data engineering with instant access to analysis-ready geospatial features.
The H3 indexing engine addresses the root cause of geospatial complexity: the vast array of formats and coordinate systems that have historically limited access to geographic information. The engine converts spatial data, raster imagery, or vector datasets by indexing it into the H3 hierarchical spatial grid at global scale. H3 divides the entire Earth into nested hexagonal cells, creating a universal grid system where every location has a standardized identifier. The engine extracts data from raster images or diverse vector shapes such as census tract polygons and converts them into features attached to H3 cell IDs in tabular format, where the cell ID becomes a universal join key that abstracts away format complexity and coordinate systems. An insurance company’s property data, National Oceanic and Atmospheric Administration (NOAA) climate projections, census demographics, and infrastructure networks can all be combined because they share this common spatial index.

The engine also handles the methodological complexities that traditionally required GIS expertise. It can index data to H3 cells at any precision from resolution 0 (about 1,000 km hexagons covering continents) down to resolution 15 (about 1 meter hexagons covering individual buildings). You can choose the appropriate resolution for each use case—coarser resolutions for regional climate analysis, finer resolutions for property-level assessment. When boundaries don’t align perfectly—like a census tract overlapping multiple H3 hexagons—the engine intelligently handles partial overlaps through either fast centroid-based approximation or exact proportional allocation based on intersection areas. It also automatically aggregates or disaggregates data when combining datasets at different scales, eliminating the manual preprocessing that traditionally consumed months of GIS specialist time.
Built on this indexing foundation, Foursquare Spatial H3 Hub delivers an Iceberg catalog containing datasets spanning energy infrastructure, environmental conditions, and natural hazards all originally in diverse raster and vector formats, now pre-indexed to H3 cells at resolution 8 (with additional resolutions available on demand). You can query this data with familiar tools such as SQL, Python, Spark, Snowflake, and Databricks without proprietary GIS software. H3 cell identifiers become straightforward column values that join like any other attribute, so you can rapidly validate geospatial hypotheses by joining their proprietary data with Foursquare’s H3 catalog.
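
A minimal sketch of that join pattern, assuming the h3-py (v4) and pandas libraries and a hypothetical flood-risk table standing in for an H3 Hub dataset already indexed at resolution 8:

# Sketch of the H3 join pattern: index lat/lng records to resolution 8 cells,
# then join on the cell ID like any other column. The flood_risk table and its
# columns are hypothetical stand-ins for an H3 Hub dataset.
import h3
import pandas as pd

properties = pd.DataFrame({
    "property_id": ["p1", "p2"],
    "lat": [34.0522, 33.9416],
    "lng": [-118.2437, -118.4085],
})

# Index each property to the same H3 resolution as the hub dataset.
properties["h3_cell"] = [
    h3.latlng_to_cell(lat, lng, 8) for lat, lng in zip(properties["lat"], properties["lng"])
]

flood_risk = pd.DataFrame({  # stand-in for an analysis-ready H3 Hub table
    "h3_cell": list(properties["h3_cell"]),
    "flood_risk_score": [0.82, 0.35],
})

# The H3 cell ID acts as an ordinary join key.
enriched = properties.merge(flood_risk, on="h3_cell", how="left")
print(enriched[["property_id", "h3_cell", "flood_risk_score"]])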
Reasoning models for spatial intelligence
Reasoning models such as DeepSeek-R1 change how AI handles geospatial intelligence. Traditional geospatial systems operated as collections of static, purpose-built models, with separate models for flood risk, wildfire exposure, and earthquake vulnerability. Each model was trained on specific datasets and incapable of answering questions outside its narrow domain. When requirements shifted or new data emerged, organizations faced months of retraining. Reasoning models change this paradigm by decomposing complex problems, planning multistep workflows, and orchestrating actions across data sources dynamically. Rather than requiring pre-trained models for every question, these systems reason through novel scenarios by combining available data in ways never explicitly programmed. Asked “which neighborhoods face compounding climate and economic risks?”, a reasoning agent determines it needs flood exposure data, household income, property density, and neighborhood boundaries and then executes that analytical pipeline by calling appropriate tools and data sources. The agent understands spatial relationships conceptually: point data aggregates to polygons, grid cells map to administrative boundaries, proximity requires appropriate distance metrics. At each step, it reasons about what information comes next and adjusts when data reveals unexpected patterns, transforming geospatial analysis from pre-scripted queries into adaptive investigation.
Deploying agents on Amazon SageMaker AI
Analysis-ready geospatial data and reasoning-capable models solve critical parts of the puzzle, but production deployment creates new challenges. Geospatial agents need sustained inference capacity to process queries, execute reasoning chains, retrieve data, and generate visualizations. Organizations face a choice: build custom inference infrastructure with GPU clusters, load balancers, and auto scaling policies, or rely on commercial large language model (LLM) APIs where costs scale unpredictably with usage and data governance becomes complex.
Amazon SageMaker AI provides managed infrastructure for deploying and operating open source generative AI models in production. You can deploy models from Hugging Face or Amazon SageMaker AI JumpStart—including reasoning models such as DeepSeek-R1, Llama 3, or Qwen—to SageMaker AI real-time or asynchronous inference endpoints without managing underlying infrastructure. Amazon SageMaker AI Inference handles instance provisioning, supports optimized serving runtimes like vLLM and SGLang, and provides auto scaling based on traffic patterns.
Amazon SageMaker AI Inference capabilities address several operational challenges specific to agent architectures. Geospatial agents handling variable query loads throughout the day benefit from automatic scaling on GPU instances such as G5, P4d, and P5 based on request volume or custom metrics. Long-running spatial analyses that exceed typical API timeouts can route to asynchronous inference endpoints where SageMaker AI queues requests, processes them, and delivers results to Amazon Simple Storage Service (Amazon S3), enabling complex multi-dataset analyses without client-side timeout issues. For architectures employing multiple models, multi-container endpoints host different models on shared infrastructure with independent scaling policies and traffic routing. Built-in integration with Amazon CloudWatch for monitoring, AWS Identity and Access Management (IAM) for access control, and Amazon Virtual Private Cloud (Amazon VPC) for network isolation simplifies operational requirements.
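
As a hedged sketch of what deploying one of these open models might look like with the SageMaker Python SDK, assuming a JumpStart-hosted model (the model_id, IAM role, and instance type below are assumptions to replace with values from your account and the JumpStart catalog):

# Hedged sketch: deploy an open-weight model from SageMaker JumpStart to a
# real-time endpoint. The model_id, role ARN, and instance type are assumptions.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="deepseek-llm-r1-distill-qwen-7b",  # assumed JumpStart identifier
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder role
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)

# Payload shape follows the common text-generation container contract; adjust
# to the model's documented request format.
response = predictor.predict({
    "inputs": "Which datasets would you combine to assess coastal flood exposure?",
    "parameters": {"max_new_tokens": 256},
})
print(response)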
Foursquare Spatial H3 Hub and Amazon SageMaker AI together reduce operational complexity. Data scientists can focus on building agent capabilities, defining which H3 Hub datasets to query for specific questions, refining prompting strategies for spatial reasoning, and optimizing tool-calling patterns rather than managing underlying infrastructure. Organizations can also experiment with different open source models. Such initiatives, which previously required separate teams for data engineering, model development, and platform operations, have now become accessible to smaller teams without specialized infrastructure expertise.
Designing the Foursquare Spatial Agent
The Foursquare Spatial Agent architecture combines reasoning models deployed on SageMaker AI with tool-calling capabilities that query Foursquare Spatial H3 Hub directly. The agent orchestrates the complete workflow from natural language question to visualization without manual intervention.
Agent workflow
When a user poses a natural language question about spatial relationships—such as “Which neighborhoods in Los Angeles face both high flood risk and economic vulnerability?”—the agent executes a multistep reasoning process. The reasoning model first analyzes the question and identifies required information: flood risk scores, economic indicators like income and employment, and neighborhood boundaries. It then determines which H3 Hub datasets contain relevant information by reasoning over dataset descriptions. With datasets selected, the model calls H3 Hub query tools, constructing SQL queries that join datasets on H3 cell IDs. After executing these queries, the model analyzes results to identify spatial patterns and statistical relationships. Finally, it generates Vega specifications for charts and Kepler.gl specifications for maps that visualize the findings.
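To make the query step concrete, the following is the kind of SQL an agent might assemble for this question; the table and column names are hypothetical and do not reflect the actual H3 Hub schema:

# Illustrative query joining H3-indexed datasets on their shared cell ID.
query = """
SELECT f.h3_cell_id,
       f.flood_risk_score,
       e.median_household_income,
       n.neighborhood_name
FROM flood_risk AS f
JOIN economic_profile AS e ON e.h3_cell_id = f.h3_cell_id
JOIN neighborhoods AS n ON n.h3_cell_id = f.h3_cell_id
WHERE f.flood_risk_score > 0.7
  AND e.median_household_income < 50000
ORDER BY f.flood_risk_score DESC;
"""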
This workflow uses the reasoning model’s ability to plan, adapt, and recover from errors. If initial queries return unexpected results, the model can refine its approach, select additional datasets, or adjust spatial operations—capabilities that static, preprogrammed workflows lack.
Design decisions addressing enterprise requirements
Building a production geospatial agent required addressing the five enterprise requirements identified through deployment analysis. Three key design decisions illustrate how the architecture balances accessibility, transparency, and flexibility.
Insurance underwriters understand flood risk and property exposure but don’t write SQL or Python. The agent architecture makes geospatial analysis accessible by accepting natural language questions and translating them into appropriate H3 Hub queries. The reasoning model interprets domain-specific terminology like “vulnerable neighborhoods” or “high-risk areas” and maps these concepts to relevant datasets and analytical operations. This eliminates the bottleneck where domain experts must submit analysis requests to data teams, enabling self-service exploration.
Domain experts also need to understand how the agent arrived at conclusions, especially when analyses inform business decisions. The agent can log its reasoning process at each step: which datasets were considered and why, what spatial operations were planned, which queries were executed, and how results were interpreted. Every visualization includes metadata showing which H3 cells and source datasets contributed to the analysis. This transparency means users can validate the agent’s analytical approach and understand the data sources behind conclusions. If an insurance underwriter sees a high-risk assessment for a property, they can trace back through the reasoning chain to see it combined flood exposure data from Federal Emergency Management Agency (FEMA), wildfire risk from state forestry data, and property characteristics from local assessor records—building confidence in AI-generated insights. Implementation uses structured logging to capture reasoning steps, making the agent’s decision-making process inspectable and debuggable rather than a black box.
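A minimal sketch of this structured-logging idea is shown below; the logger name, field names, and dataset names are illustrative assumptions:

import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("spatial_agent")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_reasoning_step(step: str, detail: dict) -> None:
    """Emit one auditable reasoning step as a structured JSON log entry."""
    logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        **detail,
    }))

log_reasoning_step("dataset_selection", {
    "candidates": ["fema_flood_zones", "state_wildfire_risk"],  # hypothetical dataset names
    "selected": "fema_flood_zones",
    "reason": "question mentions flood exposure",
})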
Pre-built dashboards serve known questions but fail when analysts need to explore variations. The agent architecture provides flexibility by using tool-calling to dynamically compose analyses. Rather than predefining workflows for every scenario, the reasoning model determines which H3 Hub datasets to query and how to combine them based on the specific question. This enables the agent to handle unforeseen analytical questions without requiring new engineering work for each variation. The agent uses function calling APIs supported by models such as Llama 3 and DeepSeek-R1 to interact with H3 Hub. The model receives tool descriptions specifying available datasets, query parameters, and return formats, then constructs appropriate tool calls during reasoning. SageMaker AI endpoints handle the inference, while custom application logic manages tool execution and result assembly.
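The following sketch shows the general shape of a tool description and a dispatch step using a generic JSON function-calling convention; the tool name, schema, and parsing details are assumptions rather than the exact interface of the Foursquare Spatial Agent:

import json

# Tool description passed to the model so it can emit structured tool calls.
H3_HUB_TOOLS = [{
    "name": "query_h3_hub",
    "description": "Run a SQL query against H3 Hub datasets joined on H3 cell IDs.",
    "parameters": {
        "type": "object",
        "properties": {
            "datasets": {"type": "array", "items": {"type": "string"}},
            "sql": {"type": "string"},
        },
        "required": ["datasets", "sql"],
    },
}]

def execute_tool_call(tool_call: dict) -> dict:
    """Dispatch a model-emitted tool call to the matching implementation."""
    if tool_call["name"] == "query_h3_hub":
        args = tool_call["arguments"]
        # Placeholder: run args["sql"] against the selected H3 Hub datasets.
        return {"rows": [], "datasets": args["datasets"]}
    raise ValueError(f"Unknown tool: {tool_call['name']}")

# Handling a tool call the model might emit as JSON text:
raw = '{"name": "query_h3_hub", "arguments": {"datasets": ["flood_risk"], "sql": "SELECT 1"}}'
print(execute_tool_call(json.loads(raw)))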
SageMaker AI deployment architecture
The Foursquare Spatial Agent deploys on SageMaker AI real-time inference endpoints with configuration optimized for production geospatial workloads. The deployment uses G5 instances such as g5.2xlarge for development and g5.12xlarge for production, providing cost-effective GPU inference for models in the 7B–70B parameter range commonly used for agent reasoning. A target tracking scaling policy based on the InvocationsPerInstance metric maintains response times during variable load while minimizing costs during low-traffic periods. Spatial analyses involving large geographic extents or many dataset joins route to asynchronous inference endpoints, allowing queries that can take 60 seconds or more to complete without exceeding typical API timeout limits while maintaining responsive behavior for more straightforward queries.
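A target tracking policy of this kind can be attached through the Application Auto Scaling API; the endpoint name, variant name, capacity bounds, and target value below are illustrative:

import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/geospatial-agent-llm/variant/AllTraffic"  # hypothetical endpoint/variant

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

aas.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,  # desired invocations per instance (illustrative)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)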
CloudWatch metrics track inference latency, error rates, and token throughput across the deployment. Custom metrics log reasoning chain depth, number of tool calls per query, and dataset access patterns, enabling continuous optimization of agent performance. This deployment architecture provides production-grade reliability while maintaining flexibility for experimentation with different models and prompting strategies.
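Custom metrics like these can be published alongside the built-in endpoint metrics with a call of the following shape; the namespace, metric names, and values are illustrative:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_data(
    Namespace="SpatialAgent",  # hypothetical custom namespace
    MetricData=[
        {
            "MetricName": "ReasoningChainDepth",
            "Dimensions": [{"Name": "EndpointName", "Value": "geospatial-agent-llm"}],
            "Value": 4,
            "Unit": "Count",
        },
        {
            "MetricName": "ToolCallsPerQuery",
            "Dimensions": [{"Name": "EndpointName", "Value": "geospatial-agent-llm"}],
            "Value": 3,
            "Unit": "Count",
        },
    ],
)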
Foursquare Spatial Agent in action
The following demonstrations show how organizations across insurance, banking, and urban planning can use this capability to answer complex spatial questions in minutes—collapsing timelines that previously stretched across quarters into interactive workflows accessible to domain experts without specialized technical skills. In insurance risk assessment, the agent predicts which areas in the Los Angeles region are likely to see increased insurance rates by computing a composite risk score from the flood risk, fire hazard severity, crime rate, and FEMA National Risk Index datasets, which come in different spatial resolutions and formats but are now queryable through common H3 cell IDs. An underwriter asks the question in natural language, and the agent handles dataset selection, spatial joins, risk aggregation, and map visualization without requiring GIS expertise.

For banking market analysis, the agent provides a 360-degree view of Los Angeles’s bank network planning. It combines demographic data including population, income, and age distribution with healthcare facility locations, crime statistics, and points of interest to identify under-served markets and expansion opportunities. This analysis informs data-driven decisions for branch placement, product targeting, and financial inclusion initiatives. Previously, assembling these datasets and performing spatial analysis required weeks of GIS specialist time. Now, the agent delivers results in minutes through conversational interaction.

For urban infrastructure planning, the agent helps the city of Chandler, Arizona, plan sustainable urban development over the next decade. It combines population growth projections, housing development patterns, median income trends, and infrastructure data including buildings, power lines, and cell towers—all indexed to H3 cells. Urban planners explore scenarios by asking questions like “which areas will experience population growth but lack adequate infrastructure?” The agent reasons through the analytical requirements, executes appropriate spatial queries, and generates visualizations showing infrastructure gaps that need investment.

The democratization of geospatial intelligence
Foursquare Spatial H3 Hub, reasoning models, and Amazon SageMaker AI together remove these barriers. Organizations can now access standardized geospatial data, deploy reasoning agents with tool-calling capabilities, and run production inference without building specialized infrastructure.
To deploy geospatial AI agents:

Access Foursquare Spatial H3 Hub for analysis-ready datasets.
Deploy reasoning models on Amazon SageMaker AI with SageMaker JumpStart or Hugging Face.
Build agent capabilities that connect models to H3 Hub datasets through tool-calling.

About the authors
Vikram Gundeti currently serves as the Chief Technology Officer (CTO) of Foursquare, where he leads the technical strategy, decision making, and research for the company’s Geospatial Platform. Before joining Foursquare, Vikram held the position of Principal Engineer at Amazon, where he made his mark as a founding engineer on the Amazon Alexa team.
Amit Modi is a Senior Manager of Product Management at Amazon SageMaker AI, where he focuses on ModelOps and Inference. His analysis of enterprise adoption patterns and design of the SageMaker deployment approach described in this post emerged from work with enterprise customers.
Aditya Badhwar is a Senior Solutions Architect at AWS based out of New York. He works with customers providing technical assistance and architectural guidance on various AWS services. Prior to AWS, Aditya worked for over 16 years in software engineering and architecture roles for various large-scale enterprises.

How Wipro PARI accelerates PLC code generation using Amazon Bedrock

This post is co-written with Rejin Surendran from Wipro Enterprises Limited and Bakrudeen K from ShellKode.
In manufacturing environments, industrial automation engineers face a significant challenge: how to rapidly convert complex process requirements into Programmable Logic Controller (PLC) ladder text code. This traditional, manual process typically requires 3-4 days per query, creating bottlenecks in production workflows. The complexity stems from multiple factors: engineers must meticulously translate high-level requirements into precise machine instructions while managing multiple states and transitions, facilitate compliance with the international PLC programming standard IEC 61131-3, handle complex variable declarations, maintain detailed documentation for industrial compliance, and conduct thorough testing of safety protocols and execution paths.
Wipro PARI is one of the largest global automation companies with over 1,300 employees and three facilities worldwide, with its headquarters in Pune, India. Wipro PARI has the vision to utilize its expertise and resources to bring the best solutions in automation and robotics to its customers.
In this post, we share how Wipro implemented advanced prompt engineering techniques, custom validation logic, and automated code rectification to streamline the development of industrial automation code at scale using Amazon Bedrock. We walk through the architecture along with the key use cases, explain core components and workflows, and share real-world results that show the transformative impact on manufacturing operations.
Why Wipro PARI chose Amazon Bedrock
Wipro PARI partnered with AWS and ShellKode to develop an innovative solution that transforms this time-intensive PLC code generation process using AI. Using Amazon Bedrock and Anthropic’s Claude models, we have developed a system that:

Reduces PLC code generation time from 3–4 days to approximately 10 minutes per requirement
Achieves code accuracy of up to 85%
Automates validation against industry standards
Handles complex state management and transition logic automatically
Facilitates proper variable declarations and naming conventions
Maintains compliance documentation and audit trails
Provides a user-friendly interface for industrial engineers

Wipro PARI selected Amazon Bedrock as the foundation for this PLC code generation solution due to its unique combination of enterprise capabilities that align with industrial automation requirements. With the broad model choice available in Amazon Bedrock, the team can use Anthropic’s Claude 3.5 Sonnet for complex code generation while maintaining flexibility to switch models as newer, more capable versions become available without infrastructure changes. The fully managed service reduces the operational overhead of hosting and scaling machine learning (ML) infrastructure, helping Wipro PARI’s engineers focus on domain-specific automation logic rather than model deployment.
Critically for industrial applications, Amazon Bedrock makes sure that the customer data—including proprietary control logic and manufacturing specifications—remains within the AWS environment and is not used to train underlying foundation models (FMs), thereby maintaining strict data privacy and intellectual property protection. This security posture, combined with the AWS compliance certifications, provides the enterprise-grade governance required for manufacturing environments handling sensitive operational data.
Solution overview
In this section, we present the solution architecture and user workflow of the Wipro PLC Code Generator. The following diagram illustrates the end-to-end architecture.

Architecture components
The architecture consists of the following key components:

Frontend client layer – The frontend client layer consists of a React-based, responsive web application that makes it possible for industrial engineers to upload control logic spreadsheets, configure generation settings, and verify generated ladder code with full traceability.
Backend application services layer – The Wipro PARI solution implements a React and FastAPI microservices architecture with over 30 specialized APIs for industrial automation PLC code generation, deployed on load-balanced Amazon Elastic Compute Cloud (Amazon EC2) instances within a secure virtual private cloud (VPC), with plans to migrate to Amazon Elastic Container Service (Amazon ECS) in future iterations. The VPC configuration includes public and private subnet isolation with bastion server access control for secure remote management of the industrial control system development service. The backend application services layer is organized into distinct components, including controllers for request handling, core services for business logic, authentication modules for user management, file processing engines for spreadsheet handling, and spreadsheet parsers for extracting control logic specifications from industrial automation documentation.
AI/ML processing layer – The solution includes a dedicated AI/ML processing layer that integrates with Amazon Bedrock and uses multiple Anthropic Claude models depending on task complexity and requirements. The large language model (LLM) integration services transform control logic requirements into intermediate structured pseudo queries, which are then converted into standardized PLC ladder text code through multi-iteration processing. The system handles complex industrial automation scenarios, including parallel execution paths, fork/defork logic, and Boolean expressions commonly found in manufacturing control systems.
Data and storage layer – The generated PLC code undergoes intelligent rectification to fix syntax and logical errors specific to ladder logic programming, followed by systematic validation against predefined industrial guidelines to facilitate code quality and safety compliance. Amazon Simple Storage Service (Amazon S3) buckets store generated code artifacts, templates, and version history for industrial project management. The system uses Amazon Relational Database Service (Amazon RDS) for PostgreSQL databases for persistent state management, project tracking, and maintaining relationships between control logic specifications and generated code.

User workflow
The code generation workflow consists of the following steps:

User input and authentication – An industrial engineer logs in to the React web application, authenticates through role-based access controls, and uploads Excel spreadsheets.
Data processing and transformation – The system processes the uploaded spreadsheets containing control logic specifications for PLC programming requirements through Excel parsers. It extracts the control logic data, validates input specifications against industrial standards, and transforms raw data into structured format suitable for AI processing.
AI-powered code generation – LLM integration services send structured requirements to Amazon Bedrock using Anthropic’s Claude 3.5 Sonnet, which generates intermediate pseudo queries, converts them into standardized PLC ladder text code, and handles complex industrial automation scenarios including parallel execution paths and Boolean expressions. A pseudo query is an intermediate structured representation that translates human-readable control logic requirements from Excel spreadsheets into a standardized format that can be processed by AI models to generate PLC code. A minimal invocation sketch follows these steps.

Example specification – When temperature > 80°C AND pressure < 5 bar, turn on cooling pump
Pseudo query – IF (TEMP_SENSOR > 80) AND (PRESSURE_SENSOR < 5) THEN SET COOLING_PUMP = TRUE

Validation and storage – The generated PLC code undergoes automated quality validation against IEC 61131-3 standards, intelligent rectification fixes syntax and logical errors, and validated code artifacts are stored in Amazon S3 with version control and traceability.
Engineer review – The industrial engineer reviews the generated ladder code through the web interface, verifies code quality and safety compliance, downloads validated PLC code for deployment, and maintains project history with a full audit trail for industrial compliance requirements.
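To make the AI-powered generation step concrete, the following minimal sketch sends a structured requirement to Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock using the Converse API; the prompt wording and inference parameters are illustrative and are not Wipro PARI’s production prompts:

import boto3

bedrock = boto3.client("bedrock-runtime")

requirement = "When temperature > 80C AND pressure < 5 bar, turn on cooling pump"

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{
        "role": "user",
        "content": [{"text": f"Convert this control requirement into a structured "
                             f"pseudo query for PLC ladder code generation:\n{requirement}"}],
    }],
    inferenceConfig={"maxTokens": 2048, "temperature": 0.2},
)

pseudo_query = response["output"]["message"]["content"][0]["text"]
print(pseudo_query)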

The following GIF illustrates the complete user workflow from Excel upload to PLC code generation and download.

Security and compliance
User authentication and authorization are managed through Amazon Cognito, which validates user credentials and enforces role-based access controls to make sure only authorized personnel can access PLC code generation capabilities and sensitive industrial automation data. Security is implemented through AWS Identity and Access Management (IAM) based access controls managing engineer permissions and service-to-service authentication for industrial data protection. Amazon GuardDuty provides continuous threat detection, and AWS CloudTrail maintains comprehensive audit logging of the code generation activities for industrial compliance requirements.
In the following sections, we break down each functionality in detail. The modules used in the solution are integrated through a streamlined workflow to maximize automation and accuracy.
Data formatter
The solution begins with processing the pseudo query inputs, as shown in the following diagram. This crucial first step transforms various input formats into a standardized structure that can be effectively processed by the language model.

The workflow follows these steps:

Users upload the control logic, available in a spreadsheet, as input through the UI.
From the uploaded spreadsheet, the formatter intelligently extracts state definitions, transition numbers, associated actions, and forking/de-forking path relationships. This extracted information is useful in the downstream process to validate the PLC code.
The extracted information is stored in S3 buckets for persistence and future reference.
The data formatter constructs a comprehensive prompt containing the original spreadsheet data and specific processing instructions.
This prompt is sent to Anthropic’s Claude 3.5 Sonnet to convert the control logic into a structured pseudo query format. Lengthy descriptions are abbreviated to 20 characters to conform to PLC variable naming conventions.
The data formatter then passes control to the PLC code generator module.

The following code is a sample intermediate pseudo query (the output from the data formatter module). The pseudo query implements a safety monitoring system for industrial machinery that makes sure the machine only operates when the safety conditions are met. It monitors safety doors and emergency buttons, and includes proper reset procedures after a safety violation. Each state network contains the state numbers, the transition variables, and the actions to be performed for each transition.

State Number: 25
Description: Machine Safety Check
State Name: MchSafetyCheck
Action:
Transitions:
 – Condition: IF iSafetyDoorClosed & iEmergencyButtonReleased
   – Goto State Number: 28
 – Condition: IF !iSafetyDoorClosed | iEmergencyButtonPressed
   – Goto State Number: 26

State Number: 26
Description: Machine Safety Violation
State Name: MchSafetyViolation
Action:
  – SET oAlarmLight = TRUE
  – SET oMachineStop = TRUE
Transitions:
 – Condition: IF iAcknowledgeButton & iSafetyDoorClosed & iEmergencyButtonReleased
   – Goto State Number: 27

PLC code generator
To maximize the accuracy of ladder text generation, the solution employs sophisticated prompt engineering techniques and uses Anthropic’s Claude 3.5 Sonnet for code generation. The workflow steps for this part of the solution are shown in the following diagram.

Prompt creation
The prompt creation process consists of the following steps:

The intermediate pseudo query from the data formatter is passed to the PLC code generator module, which initiates the prompt creation process.
The prompt builder builds a detailed task prompt to generate the initial batch of PLC code and the subsequent batches as well. It includes:

PLC programming domain knowledge (state/transition variable naming conventions, network creation patterns for forking/de-forking, condition network structures).
Few-shot examples demonstrating pseudo query to ladder text conversion.
Explicit instructions for handling state transitions, variable declarations, and complex Boolean expressions.

The prompt builder also creates a continuation prompt that instructs the FM to continue generating the PLC code from where it has left off in the previous iteration.

Few-shot sampling
We used a few-shot learning strategy to generate domain-specific outputs by providing relevant examples in the prompt context. Pseudo queries and related metadata including structural characteristics (state transitions, actions, control flow patterns) were indexed in a vector store. At inference, a hybrid retrieval strategy combines semantic similarity and lexical matching with the metadata to fetch the most relevant structurally aligned examples and their corresponding PLC code, which are then dynamically injected into the prompt. See the following code:

PLC_PROMPT = """You are expert in writing code in PLC text ladder code ...
##DYNAMIC EXAMPLES
{retrieved_examples}
##DOMAIN VARIABLES
{business_specific_variables}
##USER INPUT
{user_pseudo_code}
##FUNCTIONAL GUIDELINES
{custom_instructions}
"""
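As a rough illustration of this hybrid retrieval step (not the production implementation), the sketch below combines a simple lexical overlap score with a metadata match on structural characteristics and returns the top examples for injection into {retrieved_examples}; a real system would use a vector store and embedding-based semantic similarity:

from dataclasses import dataclass

@dataclass
class Example:
    pseudo_query: str
    plc_code: str
    metadata: dict  # e.g. {"has_fork": True, "num_states": 4}

def lexical_score(query: str, text: str) -> float:
    """Fraction of query tokens that also appear in the stored example."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(query: str, query_meta: dict, examples: list, k: int = 2) -> list:
    """Rank stored examples by lexical overlap plus structural metadata match."""
    def score(ex: Example) -> float:
        meta_bonus = sum(1.0 for key, val in query_meta.items() if ex.metadata.get(key) == val)
        return lexical_score(query, ex.pseudo_query) + 0.5 * meta_bonus
    return sorted(examples, key=score, reverse=True)[:k]

# The selected examples (pseudo query plus PLC code) are then formatted into
# the {retrieved_examples} placeholder of PLC_PROMPT.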

PLC code generation
The PLC code generation process consists of the following steps (as numbered in the preceding diagram):

The task prompt is passed to Anthropic’s Claude 3.5 Sonnet, which processes the prompt to generate the initial ladder text code containing up to 4,096 tokens (the maximum output tokens limit for the FM).
Because ladder text typically exceeds this limit, our solution implements an iterative generation approach with specialized continuation prompting. The system checks whether generation is complete and issues continuation prompts as needed (a simplified continuation loop is sketched after this list).
This continuation method maintains context between sequential generations, facilitating consistency throughout the entire code base.
The process continues iteratively until the PLC ladder code is fully generated. The completed code segments are then consolidated and passed to the code rectifier module for further processing.
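A simplified version of this continuation loop, using the Amazon Bedrock Converse API, might look like the following; the completion marker, prompts, model ID, and iteration cap are assumptions for illustration:

import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def generate_ladder_code(task_prompt: str, max_iterations: int = 8) -> str:
    """Generate PLC ladder text in batches, asking the model to continue until done."""
    messages = [{"role": "user", "content": [{"text": task_prompt}]}]
    segments = []
    for _ in range(max_iterations):
        resp = bedrock.converse(
            modelId=MODEL_ID,
            messages=messages,
            inferenceConfig={"maxTokens": 4096, "temperature": 0.2},
        )
        text = resp["output"]["message"]["content"][0]["text"]
        segments.append(text)
        if "<GENERATION_COMPLETE>" in text:  # hypothetical completion marker
            break
        # Carry the conversation forward so the model keeps its context,
        # then ask it to resume exactly where the previous batch ended.
        messages.append({"role": "assistant", "content": [{"text": text}]})
        messages.append({"role": "user", "content": [{"text": "Continue the PLC ladder code exactly where you left off."}]})
    return "".join(segments)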

The following code block shows a sample PLC code generated:

FUNCTION_BLOCK "Machine_Safety_Monitoring"
{ S7_Optimized_Access := 'FALSE' }
VERSION : 0.1
   VAR_INPUT
      iSafetyDoorClosed : Bool;
      iEmergencyButtonReleased : Bool;
      iEmergencyButtonPressed : Bool;
      iAutoRunning : Bool;
      iReset_fault : Bool;
   END_VAR

   VAR
      s25_MchSafetyCheck : Bool;
      s25_MchSafetyCheck_T1 : Bool;
      s25_MchSafetyCheck_T2 : Bool;
      SEQ01_ResetComplete : Bool;
      sStWtResetRel_T1 : Bool;
   END_VAR

NETWORK
TITLE = Transition for STATE Num:25 Machine Safety Check
      A #s25_MchSafetyCheck;
      AN #sStWtResetRel;
      A #sSst;
      A #iSafetyDoorClosed;
      A #iEmergencyButtonReleased;
      = #s25_MchSafetyCheck_T1;
      A #s25_MchSafetyCheck;
      AN #sStWtResetRel;
      A #sSst;
      AN #iSafetyDoorClosed;
      O #iEmergencyButtonPressed;
      = #s25_MchSafetyCheck_T2;
NETWORK
TITLE = STATE Num:25 Machine Safety Check
      A(;
      O #s25_MchSafetyCheck;
      O #sStWtResetRel_T1;
      );
      AN #sStWtResetRel;
      AN #s25_MchSafetyCheck_T1;
      AN #s25_MchSafetyCheck_T2;
      = %L1.0;
      A %L1.0;
      BLD 102;
      = #s25_MchSafetyCheck;
      A %L1.0;
      JNB Label_25;
      L 25;
      T #StateNo;
Label_25:      NOP 0;

Code rectifier
Because PLC ladder logic is inherently complex, LLMs might miss critical functionalities during initial code generation. The solution incorporates a sophisticated rectification system to address these gaps and facilitate high-quality output. The rectification uses a hybrid approach that combines custom logic encoding business guidelines with an FM to perform the rectification task. The following diagram illustrates the workflow.

The rectifier module performs the following steps to help enhance code accuracy:

PLC code generated by the generator module is transferred to the rectifier module for enhancement.
The module facilitates proper handling of parallel execution paths, where sequences split into multiple branches and later re-converge, maintaining proper logic flow throughout the PLC program. This is done by invoking Anthropic’s Claude 3.7 Sonnet, which provides enhanced reasoning capabilities required for complex parallel execution path corrections, with a specialized prompt and the generated PLC code. Node/network mapping scripts are used to track state transitions and sequence tracking.
The module uses data extracted by the formatter (including transition variables’ source and destination states stored in Amazon S3) through the following phases:

Identification phase – Uses specialized Python algorithms to analyze the PLC code structure and cross-references transition variables against their declared source and destination states, flagging incorrect connections.
Remediation phase – Employs targeted Python routines to systematically remove incorrect connections while preserving the overall logic structure integrity.
Reconstruction phase – Implements custom Python logic to establish proper connections between states following correct sequential execution patterns.

The generated code might contain syntax errors, undeclared variables, or non-compliant naming. Using Anthropic’s Claude 3.5 Sonnet and custom logic, this process involves:

Identifying missing variables that are used within the code but not declared.
Adding missing variables to the declaration section.
Standardizing variable names to make sure the variables follow the Siemens S7-1517 PLC naming conventions.

The rectified PLC code and associated metadata are stored in Amazon S3.

Code evaluator
After rectification, the code undergoes a comprehensive validation process:

The validator module analyzes the rectified ladder text against the following critical guidelines (a simplified uniqueness check is sketched at the end of this section):

Unique state flags – Verifies that each state has a unique identifier with no duplicates.
Unique transition flags – Confirms the transition identifiers are unique throughout the code.
Proper connection verification – Validates that each transition connects to the correct destination state.
Input transition completeness – Makes sure every state has at least one input transition condition to trigger state changes.
Mutually exclusive conditions – Checks that transition variables within the same state are mutually exclusive to help prevent logic conflicts.

For each validation check, the system generates a detailed pass/fail result with specific information about the issues detected.
A comprehensive validation report is compiled, highlighting remaining issues that might require manual attention from engineers, with clear indicators of their location and nature in the code.
This multi-layered rectification and validation approach significantly helps improve the quality of the generated ladder text, reducing the need for manual intervention and accelerating the overall code development process.
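As a simplified illustration of the uniqueness checks described above, the following sketch scans the declaration section of the generated ladder text using the naming pattern visible in the earlier sample (state flags such as s25_MchSafetyCheck and transition flags such as s25_MchSafetyCheck_T1); the production validator covers additional rules such as connection and mutual-exclusion checks:

import re
from collections import Counter

def check_unique_declarations(ladder_text: str) -> dict:
    """Simplified check: state and transition flag declarations must be unique."""
    declared = re.findall(r"^\s*(\w+)\s*:\s*Bool;", ladder_text, re.M)
    transition_flags = [v for v in declared if re.search(r"_T\d+$", v)]
    state_flags = [v for v in declared
                   if re.match(r"s\d+_", v) and not re.search(r"_T\d+$", v)]
    duplicates = [name for name, count in Counter(declared).items() if count > 1]
    return {
        "state_flags": state_flags,
        "transition_flags": transition_flags,
        "duplicates": duplicates,
        "passed": not duplicates,
    }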

UI and user interaction
The solution provides an intuitive UI that helps engineers interact with the system efficiently. The workflow for this part of the solution follows these steps:

Users access the web-based interface to upload control logic spreadsheets or structured text inputs.
The interface provides options to select different models and adjust parameters to optimize generation.
Advanced users can edit the prompts directly to customize the generation process.
The system displays the generated ladder text, pseudo query, and validation report, allowing engineers to quickly assess the output quality.

The entire process from upload to validated code typically completes in 3–7 minutes, depending on the complexity of the input query. The following GIF demonstrates the settings interface where users can configure model parameters including temperature, Top-P, and Top-K values, select different models, and customize prompt settings for various projects.

Results and business impact
The solution improves upon Wipro PARI’s previous approach, demonstrating consistent performance across various test cases:

Average validation completion percentage across test cases was 85%
Processing time reduced from 3–4 days to approximately 10 minutes per query
Cost per query generation was approximately $0.40–$0.60
Perfect (100%) validation scores achieved on less complex queries such as “Conveyor controls”
Even complex queries with multiple state transitions achieved validation scores of 70–90%

This automation approach has transformed Wipro PARI’s PLC programming workflow, delivering measurable business impact including 5,000 work-hours saved across projects while minimizing manual coding errors. The solution helped their 200 engineers focus on high-value tasks like code design and application development while accelerating the code generation process. It also helped Wipro PARI win over key automotive clients and create a competitive advantage for complex automation projects. They plan to expand to other major PLC systems, including Rockwell Automation, Schneider Electric, and ABB in the future, helping Wipro PARI to scale their automotive industry expertise.
Conclusion
In this post, we explored how AWS collaborated with Wipro PARI to develop an AI-powered PLC Code Generator that transforms the time-intensive process of creating ladder text code from a given control logic. By using Amazon Bedrock with multiple Anthropic Claude models and a custom validation framework, the solution achieves an average accuracy of 85% while reducing code generation time from 3–4 days to approximately 10 minutes per query.
The Wipro PLC Code Generator represents a milestone in industrial automation programming, directly addressing the productivity challenges faced by Wipro PARI’s engineering consultants. The solution’s approach—combining prompt engineering, iterative code generation, automated rectification, and systematic validation—creates a robust framework that can be applied across various PLC programming scenarios.
Building on the current implementation, Wipro PARI is planning to expand the solution’s capabilities using additional Amazon Bedrock features. The team will implement Amazon Bedrock Guardrails to help enforce content filtering policies that help prevent generation of unsafe control logic and facilitate compliance with IEC 61131-3 standards at the model output level. The roadmap includes building multi-agent workflows using AWS Strands Agents, an open source SDK designed for autonomous AI agents, where specialized agents will handle distinct tasks: one agent for requirements analysis, another for code generation, and a third for automated documentation generation. To scale these agents in production, Wipro PARI will use Amazon Bedrock AgentCore, which provides serverless infrastructure for deploying and scaling agents with enterprise-grade security, session isolation, and built-in identity management. Amazon Bedrock AgentCore Memory will enable the system to maintain context across engineering sessions, allowing agents to remember previous interactions and build upon prior work, and an Amazon Bedrock AgentCore gateway will help securely connect agents to existing PLC validation tools and internal automation systems. Wipro PARI intends to build agents for automated testing, security scanning and automated document generation. In addition, Wipro PARI plans to expand this solution by incorporating additional validation rules, helping enhance the UI, and adding support for complex sequence types and integration with SIEMENS software for direct code deployment.
As industrial automation continues to evolve with increasing complexity, AI-assisted programming tools like the Wipro PLC Code Generator help accelerate development cycles and improve code quality. By reducing the manual burden of code generation and validation, engineers can focus on higher-value tasks such as system optimization and innovation, ultimately contributing to more efficient and reliable manufacturing operations across industries.
To learn more about the resources used in this solution, refer to the following additional resources:

Amazon Bedrock Documentation
Getting started with Amazon Bedrock
Claude by Anthropic in Amazon Bedrock
AWS Industrial Automation Solutions
AWS Blog: Generative AI for Industrial Applications

About the authors
Aparajithan Vaidyanathan is a Principal Enterprise Solutions Architect at AWS. He helps enterprise customers migrate and modernize their workloads on the AWS Cloud. He is a Cloud Architect with 25+ years of experience designing and developing enterprise, large-scale, and distributed software systems. He specializes in generative AI and machine learning, with a focus on moving enterprise generative AI/ML applications to production at scale.
Charu Dixit is a Solutions Architect at Amazon Web Services (AWS), helping GSI customers with cloud transformation strategies and solution design, focusing on containers, networking, and generative AI. With over 8 years of experience at AWS, she specializes in Amazon EKS and ELB, guiding customers through building and modernizing containerized applications at scale. Outside of work, Charu enjoys traveling, drawing and painting, and spending quality time with her family.
Debasish Mishra is a Senior Data Scientist at the AWS Generative AI Innovation Center, where he helps customers leverage AWS AI/ML services to solve complex business challenges through generative AI solutions. With experience spanning fintech, healthcare, sports, automotive, retail, and manufacturing, he brings cross-industry expertise to diverse use cases. His specializations include code generation, AI agent frameworks, fine-tuning vision language models and robot foundation models, RAG systems, and multimodal applications. Debasish is passionate about enabling organizations to implement practical, impactful AI solutions.
Divakaran Ullampuzha Mana is the Head of Solution Architecture for Global Service Integrators (GSI) & IT/ITeS at AWS India. He leads solution architects who advise enterprise customers on cloud transformation strategies, with expertise in cloud computing, AI/ML, Generative AI, and digital transformation. Prior to AWS, he held executive leadership positions at Kyndryl and IBM, where he established and scaled cloud migration practices. He is an active thought leader, regularly speaking at industry events and mentoring technologists.
Rejin Surendran is the Global CIO at Wipro Enterprises Limited, where he leads digital transformation initiatives across the enterprise. With over 25 years of experience in technology leadership, he has driven large-scale transformation projects across commercial, supply chain, people, and finance functions. He holds a Master of Management from IIT Bombay and a B.Tech in Electrical & Electronics Engineering from NIT Warangal.
Bakrudeen K is an AWS Ambassador and leads the AI/ML practice at ShellKode, driving innovation in Generative and Agentic AI. He builds advanced AI solutions and Agentic Assistants that enable enterprises to scale intelligent systems responsibly. In 2025, he became the first-ever recipient of the AWS Ambassador Golden Jacket for Agentic AI, a global first within the AWS Ambassador Program.

Allen Institute for AI (AI2) Introduces Olmo 3: An Open Source 7B and 32B LLM Family Built on the Dolma 3 and Dolci Stack

Allen Institute for AI (AI2) is releasing Olmo 3 as a fully open model family that exposes the entire ‘model flow’, from raw data and code to intermediate checkpoints and deployment-ready variants.

Olmo 3 is a dense transformer suite with 7B and 32B parameter models. The family includes Olmo 3-Base, Olmo 3-Think, Olmo 3-Instruct, and Olmo 3-RL Zero. Both 7B and 32B variants share a context length of 65,536 tokens and use the same staged training recipe.

https://allenai.org/blog/olmo3

Dolma 3 Data Suite

At the core of the training pipeline is Dolma 3, a new data collection designed for Olmo 3. Dolma 3 consists of Dolma 3 Mix, Dolma 3 Dolmino Mix, and Dolma 3 Longmino Mix. Dolma 3 Mix is a 5.9T-token pre-training dataset with web text, scientific PDFs, code repositories, and other natural data. The Dolmino and Longmino subsets are constructed from filtered, higher-quality slices of this pool.

Dolma 3 Mix supports the main pre-training stage for Olmo 3-Base. The AI2 research team then applies Dolma 3 Dolmino Mix, a 100B-token mid-training set that emphasizes math, code, instruction following, reading comprehension, and thinking-oriented tasks. Finally, Dolma 3 Longmino Mix adds 50B tokens for the 7B model and 100B tokens for the 32B model, with a strong focus on long documents and scientific PDFs processed with the olmOCR pipeline. This staged curriculum is what pushes the context limit to 65,536 tokens while maintaining stability and quality.

Large Scale Training on H100 Clusters

Olmo 3-Base 7B trains on Dolma 3 Mix using 1,024 H100 devices, reaching about 7,700 tokens per device per second. Later stages use 128 H100s for Dolmino mid-training and 256 H100s for Longmino long-context extension.

Base Model Performance Against Open Families

On standard capability benchmarks, Olmo 3-Base 32B is positioned as a leading fully open base model. The AI2 research team reports that it is competitive with prominent open-weight families such as Qwen 2.5 and Gemma 3 at similar sizes. Compared across a wide suite of tasks, Olmo 3-Base 32B ranks near or above these models while keeping the full data and training configuration open for inspection and reuse.

Reasoning Focused Olmo 3 Think

Olmo 3-Think 7B and Olmo 3-Think 32B sit on top of the base models as reasoning-focused variants. They use a three-stage post-training recipe that includes supervised fine-tuning, Direct Preference Optimization, and Reinforcement Learning with Verifiable Rewards within the OlmoRL framework. Olmo 3-Think 32B is described as the strongest fully open reasoning model, and it narrows the gap to Qwen 3 32B thinking models while using about six times fewer training tokens.

https://allenai.org/blog/olmo3

Olmo 3 Instruct for Chat and Tool Use

Olmo 3-Instruct 7B is tuned for fast instruction following, multi-turn chat, and tool use. It starts from Olmo 3-Base 7B and applies a separate Dolci Instruct data and training pipeline that covers supervised fine-tuning, DPO, and RLVR for conversational and function-calling workloads. The AI2 research team reports that Olmo 3-Instruct matches or outperforms open-weight competitors such as Qwen 2.5, Gemma 3, and Llama 3.1 and is competitive with Qwen 3 families at similar scales on several instruction and reasoning benchmarks.

RL Zero for Clean RL Research

Olmo 3-RL Zero 7B is designed for researchers who care about reinforcement learning on language models but need a clean separation between pre-training data and RL data. It is built as a fully open RL pathway on top of Olmo 3-Base and uses Dolci RL Zero datasets that are decontaminated with respect to Dolma 3.

Comparison Table

Model variant: Olmo 3 Base 7B
Training or post-training data: Dolma 3 Mix pre-training, Dolma 3 Dolmino Mix mid-training, Dolma 3 Longmino Mix long context
Primary use case: General foundation model, long-context reasoning, code, math
Reported position vs other open models: Strong fully open 7B base, designed as the foundation for Think, Instruct, and RL Zero; evaluated against leading open 7B-scale bases

Model variant: Olmo 3 Base 32B
Training or post-training data: Same Dolma 3 staged pipeline as 7B, with 100B Longmino tokens for long context
Primary use case: High-end base for research, long-context workloads, RL setups
Reported position vs other open models: Described as the best fully open 32B base, comparable to Qwen 2.5 32B and Gemma 3 27B and outperforming Marin, Apertus, LLM360

Model variant: Olmo 3 Think 7B
Training or post-training data: Olmo 3 Base 7B, plus Dolci Think SFT, Dolci Think DPO, and Dolci Think RL in the OlmoRL framework
Primary use case: Reasoning-focused 7B model with internal thinking traces
Reported position vs other open models: Fully open reasoning model at an efficient scale that enables chain-of-thought research and RL experiments on modest hardware

Model variant: Olmo 3 Think 32B
Training or post-training data: Olmo 3 Base 32B, plus the same Dolci Think SFT, DPO, RL pipeline
Primary use case: Flagship reasoning model with long thinking traces
Reported position vs other open models: Stated as the strongest fully open thinking model, competitive with Qwen 3 32B thinking models while training on about 6x fewer tokens

Model variant: Olmo 3 Instruct 7B
Training or post-training data: Olmo 3 Base 7B, plus Dolci Instruct SFT, Dolci Instruct DPO, and Dolci Instruct RL 7B
Primary use case: Instruction following, chat, function calling, tool use
Reported position vs other open models: Reported to outperform Qwen 2.5, Gemma 3, and Llama 3 and to narrow the gap to Qwen 3 families at similar scale

Model variant: Olmo 3 RL Zero 7B
Training or post-training data: Olmo 3 Base 7B, plus Dolci RLZero Math, Code, IF, and Mix datasets, decontaminated from Dolma 3
Primary use case: RLVR research on math, code, instruction following, mixed tasks
Reported position vs other open models: Introduced as a fully open RL pathway for benchmarking RLVR on top of a base model with fully open pre-training data

Key Takeaways

End-to-end transparent pipeline: Olmo 3 exposes the full ‘model flow’ from Dolma 3 data construction, through staged pre-training and post-training, to released checkpoints, evaluation suites, and tooling, enabling fully reproducible LLM research and fine-grained debugging.

Dense 7B and 32B models with 65K context: The family covers 7B and 32B dense transformers, all with a 65,536-token context window, trained via a three-stage Dolma 3 curriculum: Dolma 3 Mix for main pre-training, Dolma 3 Dolmino for mid-training, and Dolma 3 Longmino for long-context extension.

Strong open base and reasoning models: Olmo 3 Base 32B is positioned as a top fully open base model at its scale, competitive with Qwen 2.5 and Gemma 3, while Olmo 3 Think 32B is described as the strongest fully open thinking model and approaches Qwen 3 32B thinking models using about 6 times fewer training tokens.

Task-tuned Instruct and RL Zero variants: Olmo 3 Instruct 7B targets instruction following, multi-turn chat, and tool use using Dolci Instruct SFT, DPO, and RLVR data, and is reported to match or outperform Qwen 2.5, Gemma 3, and Llama 3.1 at similar scale. Olmo 3 RL Zero 7B provides a fully open RLVR pathway with Dolci RLZero datasets decontaminated from pre-training data for math, code, instruction following, and general chat.

Editorial Comments

Olmo 3 is an unusual release because it operationalizes openness across the full stack: Dolma 3 data recipes, staged pre-training, Dolci post-training, RLVR in OlmoRL, and evaluation with OLMES and OlmoBaseEval. This reduces ambiguity around data quality, long-context training, and reasoning-oriented RL, and it creates a concrete baseline for extending Olmo 3 Base, Olmo 3 Think, Olmo 3 Instruct, and Olmo 3 RL Zero in controlled experiments. Overall, Olmo 3 sets a rigorous reference point for transparent, research-grade LLM pipelines.


How to Build a Fully Offline Multi-Tool Reasoning Agent with Dynamic P …

In this tutorial, we explore how to build a fully offline, multi-step reasoning agent that uses the Instructor library to generate structured outputs and reliably orchestrate complex tool calls. In this implementation, we design an agent capable of choosing the right tool, validating inputs, planning multi-stage workflows, and recovering from errors. We bring together Instructor, Transformers, and carefully crafted Pydantic schemas to create an intelligent, adaptive system that mirrors real-world agentic AI behavior.

import subprocess
import sys

def install_dependencies():
    import torch
    packages = [
        "instructor",
        "transformers>=4.35.0",
        "torch",
        "accelerate",
        "pydantic>=2.0.0",
        "numpy",
        "pandas"
    ]
    if torch.cuda.is_available():
        packages.append("bitsandbytes")
        print(" GPU detected - installing quantization support")
    else:
        print(" No GPU detected - will use CPU (slower but works)")
    for package in packages:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

try:
    import instructor
except ImportError:
    print(" Installing dependencies...")
    install_dependencies()
    print(" Installation complete!")

from typing import Literal, Optional, List, Union, Dict, Any
from pydantic import BaseModel, Field, validator
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import instructor
import json
from datetime import datetime
import re

We set up our environment by installing all required dependencies and importing the core libraries. As we lay the foundation for the system, we ensure that everything, from the Instructor to the Transformers, is ready for offline execution. This lets us start with a clean and reliable base for building the agent.

class SQLQuery(BaseModel):
    """Complex SQL generation with validation"""
    table: str
    columns: List[str]
    where_conditions: Optional[Dict[str, Any]] = None
    joins: Optional[List[Dict[str, str]]] = None
    aggregations: Optional[Dict[str, str]] = None
    order_by: Optional[List[str]] = None

    @validator('columns')
    def validate_columns(cls, v):
        if not v:
            raise ValueError("Must specify at least one column")
        return v

class DataTransformation(BaseModel):
    """Schema for complex data pipeline operations"""
    operation: Literal["filter", "aggregate", "join", "pivot", "normalize"]
    source_data: str = Field(description="Reference to data source")
    parameters: Dict[str, Any]
    output_format: Literal["json", "csv", "dataframe"]

class APIRequest(BaseModel):
    """Multi-endpoint API orchestration"""
    endpoints: List[Dict[str, str]] = Field(description="List of endpoints to call")
    authentication: Dict[str, str]
    request_order: Literal["sequential", "parallel", "conditional"]
    error_handling: Literal["stop", "continue", "retry"]
    max_retries: int = Field(default=3, ge=0, le=10)

class CodeGeneration(BaseModel):
    """Generate and validate code snippets"""
    language: Literal["python", "javascript", "sql", "bash"]
    purpose: str
    code: str = Field(description="The generated code")
    dependencies: List[str] = Field(default_factory=list)
    test_cases: List[Dict[str, Any]] = Field(default_factory=list)

    @validator('code')
    def validate_code_safety(cls, v, values):
        dangerous = ['eval(', 'exec(', '__import__', 'os.system']
        if values.get('language') == 'python':
            if any(d in v for d in dangerous):
                raise ValueError("Code contains potentially dangerous operations")
        return v

class MultiToolPlan(BaseModel):
    """Plan for multi-step tool execution"""
    goal: str
    steps: List[Dict[str, Any]] = Field(description="Ordered list of tool calls")
    dependencies: Dict[str, List[str]] = Field(description="Step dependencies")
    fallback_strategy: Optional[str] = None
    estimated_duration: float = Field(description="Seconds")

class ToolCall(BaseModel):
    """Enhanced tool selection with context"""
    reasoning: str
    confidence: float = Field(ge=0.0, le=1.0)
    tool_name: Literal["sql_engine", "data_transformer", "api_orchestrator",
                       "code_generator", "planner", "none"]
    tool_input: Optional[Union[SQLQuery, DataTransformation, APIRequest,
                               CodeGeneration, MultiToolPlan]] = None
    requires_human_approval: bool = False

class ExecutionResult(BaseModel):
    """Rich result with metadata"""
    success: bool
    data: Any
    execution_time: float
    warnings: List[str] = Field(default_factory=list)
    metadata: Dict[str, Any] = Field(default_factory=dict)
We define all the advanced Pydantic schemas that structure how our agent understands SQL queries, data pipelines, API calls, code generation, and multi-step plans. As we build these models, we give our agent strong validation, safety, and clarity in interpreting complex instructions. This becomes the backbone of our agent’s reasoning process.

def sql_engine_tool(params: SQLQuery) -> ExecutionResult:
    import time
    start = time.time()
    mock_tables = {
        "users": [
            {"id": 1, "name": "Alice", "age": 30, "country": "USA"},
            {"id": 2, "name": "Bob", "age": 25, "country": "UK"},
            {"id": 3, "name": "Charlie", "age": 35, "country": "USA"},
        ],
        "orders": [
            {"id": 1, "user_id": 1, "amount": 100, "status": "completed"},
            {"id": 2, "user_id": 1, "amount": 200, "status": "pending"},
            {"id": 3, "user_id": 2, "amount": 150, "status": "completed"},
        ]
    }
    data = mock_tables.get(params.table, [])
    if params.where_conditions:
        data = [row for row in data if all(
            row.get(k) == v for k, v in params.where_conditions.items()
        )]
    data = [{col: row.get(col) for col in params.columns} for row in data]
    warnings = []
    if params.aggregations:
        warnings.append("Aggregation simplified in mock mode")
    return ExecutionResult(
        success=True,
        data=data,
        execution_time=time.time() - start,
        warnings=warnings,
        metadata={"rows_affected": len(data), "query_type": "SELECT"}
    )

def data_transformer_tool(params: DataTransformation) -> ExecutionResult:
    import time
    start = time.time()
    operations = {
        "filter": lambda d, p: [x for x in d if x.get(p['field']) == p['value']],
        "aggregate": lambda d, p: {"count": len(d), "operation": p.get('function', 'count')},
        "normalize": lambda d, p: [{k: v / p.get('factor', 1) for k, v in x.items()} for x in d]
    }
    mock_data = [{"value": i, "category": "A" if i % 2 else "B"} for i in range(10)]
    op_func = operations.get(params.operation)
    if op_func:
        result_data = op_func(mock_data, params.parameters)
    else:
        result_data = mock_data
    return ExecutionResult(
        success=True,
        data=result_data,
        execution_time=time.time() - start,
        warnings=[],
        metadata={"operation": params.operation, "input_rows": len(mock_data)}
    )

def api_orchestrator_tool(params: APIRequest) -> ExecutionResult:
    import time
    start = time.time()
    results = []
    warnings = []
    for i, endpoint in enumerate(params.endpoints):
        if params.error_handling == "retry" and i == 1:
            warnings.append(f"Endpoint {endpoint.get('url')} failed, retrying...")
        results.append({
            "endpoint": endpoint.get('url'),
            "status": 200,
            "data": f"Mock response from {endpoint.get('url')}"
        })
    return ExecutionResult(
        success=True,
        data=results,
        execution_time=time.time() - start,
        warnings=warnings,
        metadata={"endpoints_called": len(params.endpoints), "order": params.request_order}
    )

def code_generator_tool(params: CodeGeneration) -> ExecutionResult:
    import time
    start = time.time()
    warnings = []
    if len(params.code) > 1000:
        warnings.append("Generated code is quite long, consider refactoring")
    if not params.test_cases:
        warnings.append("No test cases provided for generated code")
    return ExecutionResult(
        success=True,
        data={"code": params.code, "language": params.language, "dependencies": params.dependencies},
        execution_time=time.time() - start,
        warnings=warnings,
        metadata={"lines_of_code": len(params.code.split('\n'))}
    )

def planner_tool(params: MultiToolPlan) -> ExecutionResult:
    import time
    start = time.time()
    warnings = []
    if len(params.steps) > 10:
        warnings.append("Plan has many steps, consider breaking into sub-plans")
    for step_id, deps in params.dependencies.items():
        if step_id in deps:
            warnings.append(f"Circular dependency detected in step {step_id}")
    return ExecutionResult(
        success=True,
        data={"plan": params.steps, "estimated_time": params.estimated_duration},
        execution_time=time.time() - start,
        warnings=warnings,
        metadata={"total_steps": len(params.steps)}
    )

TOOLS = {
    "sql_engine": sql_engine_tool,
    "data_transformer": data_transformer_tool,
    "api_orchestrator": api_orchestrator_tool,
    "code_generator": code_generator_tool,
    "planner": planner_tool
}

We implement the actual tools, SQL execution, data transformation, API orchestration, code validation, and planning. As we write these tool functions, we simulate realistic workflows with controlled outputs and error handling. This allows us to test the agent’s decision-making in an environment that mirrors real-world tasks.

class AdvancedToolAgent:
    """Agent with complex reasoning, error recovery, and multi-step planning"""

    def __init__(self, model_name: str = "HuggingFaceH4/zephyr-7b-beta"):
        import torch
        print(f" Loading model: {model_name}")
        model_kwargs = {"device_map": "auto"}
        if torch.cuda.is_available():
            print(" GPU detected - using 8-bit quantization")
            from transformers import BitsAndBytesConfig
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=True,
                llm_int8_threshold=6.0
            )
            model_kwargs["quantization_config"] = quantization_config
        else:
            print(" CPU mode - using smaller model for better performance")
            model_name = "google/flan-t5-base"
            model_kwargs["torch_dtype"] = "auto"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            **model_kwargs
        )
        self.pipe = pipeline(
            "text-generation", model=self.model, tokenizer=self.tokenizer,
            max_new_tokens=768, temperature=0.7, do_sample=True
        )
        self.client = instructor.from_pipe(self.pipe)
        self.execution_history = []
        print(" Agent initialized!")

    def route_to_tool(self, user_query: str, context: Optional[str] = None) -> ToolCall:
        tool_descriptions = """
Advanced Tools:
- sql_engine: Execute complex SQL queries with joins, aggregations, filtering
- data_transformer: Multi-step data pipelines (filter→aggregate→normalize)
- api_orchestrator: Call multiple APIs with dependencies, retries, error handling
- code_generator: Generate safe, validated code with tests in multiple languages
- planner: Create multi-step execution plans with dependency management
- none: Answer directly using reasoning
"""
        prompt = f"""{tool_descriptions}

User query: {user_query}
{f'Context from previous steps: {context}' if context else ''}

Analyze the complexity and choose the appropriate tool. For multi-step tasks, use the planner."""
        return self.client(prompt, response_model=ToolCall)

    def execute_with_recovery(self, tool_call: ToolCall, max_retries: int = 2) -> ExecutionResult:
        for attempt in range(max_retries + 1):
            try:
                if tool_call.tool_name == "none":
                    return ExecutionResult(
                        success=True, data="Direct response", execution_time=0.0,
                        warnings=[], metadata={}
                    )
                tool_func = TOOLS.get(tool_call.tool_name)
                if not tool_func:
                    return ExecutionResult(
                        success=False, data=None, execution_time=0.0,
                        warnings=[f"Tool {tool_call.tool_name} not found"], metadata={}
                    )
                result = tool_func(tool_call.tool_input)
                self.execution_history.append({
                    "tool": tool_call.tool_name,
                    "success": result.success,
                    "timestamp": datetime.now().isoformat()
                })
                return result
            except Exception as e:
                if attempt < max_retries:
                    print(f" Attempt {attempt + 1} failed, retrying...")
                    continue
                return ExecutionResult(
                    success=False, data=None, execution_time=0.0,
                    warnings=[f"Failed after {max_retries + 1} attempts: {str(e)}"],
                    metadata={"error": str(e)}
                )

We construct the agent itself, loading the model, building the routing pipeline, and implementing recovery logic. As we define methods for tool selection and execution, we give the agent the ability to understand queries, choose strategies, and gracefully handle failures. Check out the FULL CODES here.

    def run(self, user_query: str, verbose: bool = True) -> Dict[str, Any]:
        if verbose:
            print(f"\n{'='*70}")
            print(f" Complex Query: {user_query}")
            print(f"{'='*70}")
        if verbose:
            print("\n Step 1: Analyzing query complexity & routing...")
        tool_call = self.route_to_tool(user_query)
        if verbose:
            print(f" → Tool: {tool_call.tool_name}")
            print(f" → Confidence: {tool_call.confidence:.2%}")
            print(f" → Reasoning: {tool_call.reasoning}")
            if tool_call.requires_human_approval:
                print(" Requires human approval!")
        if verbose:
            print("\n Step 2: Executing tool with error recovery...")
        result = self.execute_with_recovery(tool_call)
        if verbose:
            print(f" → Success: {result.success}")
            print(f" → Execution time: {result.execution_time:.3f}s")
            if result.warnings:
                print(f" → Warnings: {', '.join(result.warnings)}")
            print(f" → Data preview: {str(result.data)[:200]}...")
        if verbose and result.metadata:
            print("\n Metadata:")
            for key, value in result.metadata.items():
                print(f" • {key}: {value}")
        if verbose:
            print(f"\n{'='*70}\n")
        return {
            "query": user_query,
            "tool_used": tool_call.tool_name,
            "result": result,
            "history_length": len(self.execution_history)
        }

def main():
    agent = AdvancedToolAgent()
    hard_queries = [
        "Generate a SQL query to find all users from USA who have completed orders worth more than $150, and join with their order details",
        "Create a data pipeline that filters records where category='A', then aggregates by count, and normalizes the results by a factor of 100",
        "I need to call 3 APIs sequentially: first authenticate at /auth, then fetch user data at /users/{id}, and finally update preferences at /preferences. If any step fails, retry up to 3 times",
        "Write a Python function that validates email addresses using regex, includes error handling, and has at least 2 test cases. Make sure it doesn't use any dangerous operations",
        "Create a multi-step plan to: 1) Extract data from a database, 2) Transform it using pandas, 3) Generate a report, 4) Send via email. Show dependencies between steps"
    ]
    print("\n" + " HARD MODE: COMPLEX QUERIES ".center(70, "=") + "\n")
    for i, query in enumerate(hard_queries, 1):
        print(f"\n{'#'*70}")
        print(f"# CHALLENGE {i}/{len(hard_queries)}")
        print(f"{'#'*70}")
        try:
            agent.run(query, verbose=True)
        except Exception as e:
            print(f" Critical error: {e}\n")
    print("\n" + f" COMPLETED {len(agent.execution_history)} TOOL EXECUTIONS ".center(70, "=") + "\n")
    if agent.execution_history:  # guard against division by zero when no tool call was recorded
        print(f" Success rate: {sum(1 for h in agent.execution_history if h['success']) / len(agent.execution_history) * 100:.1f}%")

if __name__ == "__main__":
    main()

We tie everything together with a run() method and a demo main() function that executes multiple hard-mode queries. As we watch the agent analyze, route, execute, and report results, we see the full power of the architecture in action. This final step lets us experience how the system performs under complex, realistic scenarios.
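For a single ad hoc query, the same agent can also be used directly, without the hard-mode loop. The snippet below is a minimal usage sketch that assumes the classes and tools defined above are in scope; the query text is just an example.

# Minimal single-query usage of the agent defined above
agent = AdvancedToolAgent()
outcome = agent.run(
    "Summarize total order value per country from the orders table",
    verbose=False,
)
print(outcome["tool_used"])       # which tool the router selected
print(outcome["result"].success)  # whether execution (including retries) succeeded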

In conclusion, we have built a powerful agent capable of understanding intricate instructions, routing execution across multiple tools, and gracefully recovering from errors, all within a compact, offline system. As we test it on challenging queries, we watch it plan, reason, and execute with clarity and structure. We now appreciate how modular schemas, validated tool calls, and layered execution logic allow us to create agents that behave reliably in complex environments.

Check out the FULL CODES here.
The post How to Build a Fully Offline Multi-Tool Reasoning Agent with Dynamic Planning, Error Recovery, and Intelligent Function Routing appeared first on MarkTechPost.

Meta AI Releases Segment Anything Model 3 (SAM 3) for Promptable Concept Segmentation in Images and Videos

How do you reliably find, segment, and track every instance of any concept across large image and video collections using simple prompts? The Meta AI team has released Meta Segment Anything Model 3 (SAM 3), an open-source unified foundation model for promptable segmentation in images and videos that operates directly on visual concepts instead of only pixels. It detects, segments, and tracks objects from both text prompts and visual prompts such as points, boxes, and masks. Compared with SAM 2, SAM 3 can exhaustively find all instances of an open vocabulary concept, for example every 'red baseball cap' in a long video, using a single model.

From Visual Prompts to Promptable Concept Segmentation

Earlier SAM models focused on interactive segmentation. A user clicked or drew a box and the model produced a single mask. That workflow did not scale to tasks where a system must find all instances of a concept across large image or video collections. SAM 3 formalizes Promptable Concept Segmentation (PCS), which takes concept prompts and returns instance masks and stable identities for every matching object in images and videos.

Concept prompts combine short noun phrases with visual exemplars. The model supports detailed phrases such as ‘yellow school bus’ or ‘player in red’ and can also use exemplar crops as positive or negative examples. Text prompts describe the concept, while exemplar crops help disambiguate fine grained visual differences. SAM 3 can also be used as a vision tool inside multimodal large language models that generate longer referring expressions and then call SAM 3 with distilled concept prompts.
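To make the task interface concrete, here is a minimal sketch of how a PCS request and its result could be represented as plain Python data structures. This is only an illustration of the concept-prompt and instance-mask interface described above, not the actual SAM 3 API; all class and field names are hypothetical.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ExemplarPrompt:
    """A cropped region used as a positive or negative visual example (hypothetical structure)."""
    box: List[float]          # [x_min, y_min, x_max, y_max] in image coordinates
    is_positive: bool = True  # negative exemplars help exclude look-alike objects

@dataclass
class ConceptPrompt:
    """A concept prompt: a short noun phrase plus optional exemplar crops."""
    noun_phrase: str                                        # e.g., "yellow school bus"
    exemplars: List[ExemplarPrompt] = field(default_factory=list)

@dataclass
class InstanceMask:
    """One detected instance of the concept, with a stable identity for video."""
    instance_id: int   # stays constant across frames when tracking
    mask_rle: str      # run-length-encoded binary mask (placeholder encoding)
    score: float       # confidence that this instance matches the concept

@dataclass
class PCSResult:
    """All instances matching the concept in one image or video frame."""
    frame_index: int
    instances: List[InstanceMask]

# Example: the prompt "red baseball cap" with one negative exemplar crop
prompt = ConceptPrompt(
    noun_phrase="red baseball cap",
    exemplars=[ExemplarPrompt(box=[120.0, 40.0, 180.0, 95.0], is_positive=False)],
)
print(prompt)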

Source: https://ai.meta.com/blog/segment-anything-model-3/

Architecture, Presence Token and Tracking Design

The SAM 3 model has 848M parameters and consists of a detector and a tracker that share a single vision encoder. The detector is a DETR-based architecture conditioned on three inputs: text prompts, geometric prompts, and image exemplars. This separates the core image representation from the prompting interfaces and lets the same backbone serve many segmentation tasks.

A key change in SAM 3 is the presence token. This component predicts whether each candidate box or mask actually corresponds to the requested concept. It is especially important when the text prompts describe related entities, such as ‘a player in white’ and ‘a player in red’. The presence token reduces confusion between such prompts and improves open vocabulary precision. Recognition, meaning classifying a candidate as the concept, is decoupled from localization, meaning predicting the box and mask shape.
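The decoupling of recognition from localization can be pictured with a small scoring head. The following PyTorch snippet is an illustrative sketch, not SAM 3's actual implementation: a learned presence probability gates the per-query match scores, so candidate boxes are only scored highly when the concept is judged to be present in the image at all.

import torch
import torch.nn as nn

class PresenceGatedScorer(nn.Module):
    """Illustrative head: a global presence token gates per-query match scores."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.presence_head = nn.Linear(dim, 1)  # is the concept present in the image at all?
        self.match_head = nn.Linear(dim, 1)     # does this particular query/box match the concept?

    def forward(self, presence_token: torch.Tensor, query_embeddings: torch.Tensor) -> torch.Tensor:
        # presence_token: (batch, dim); query_embeddings: (batch, num_queries, dim)
        presence_prob = torch.sigmoid(self.presence_head(presence_token))           # (batch, 1)
        match_prob = torch.sigmoid(self.match_head(query_embeddings)).squeeze(-1)   # (batch, num_queries)
        # Final score: recognition (presence) decoupled from localization (per-query match)
        return presence_prob * match_prob

scorer = PresenceGatedScorer()
scores = scorer(torch.randn(2, 256), torch.randn(2, 100, 256))
print(scores.shape)  # torch.Size([2, 100])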

For video, SAM 3 reuses the transformer encoder decoder tracker from SAM 2, but connects it tightly to the new detector. The tracker propagates instance identities across frames and supports interactive refinement. The decoupled detector and tracker design minimizes task interference, scales cleanly with more data and concepts, and still exposes an interactive interface similar to earlier Segment Anything models for point based refinement.

Source: https://ai.meta.com/research/publications/sam-3-segment-anything-with-concepts/

SA-Co Dataset and Benchmark Suite

To train and evaluate Promptable Concept Segmentation (PCS), Meta introduces the SA-Co family of datasets and benchmarks. The SA-Co benchmark contains 270K unique concepts, which is more than 50 times the number of concepts in previous open vocabulary segmentation benchmarks. Every image or video is paired with noun phrases and dense instance masks for all objects that match each phrase, including negative prompts where no objects should match.

The associated data engine has automatically annotated more than 4M unique concepts, which, according to Meta, makes SA-Co the largest high-quality open vocabulary segmentation corpus. The engine combines large ontologies with automated checks and supports hard negative mining, for example phrases that are visually similar but semantically distinct. This scale is essential for learning a model that can respond robustly to diverse text prompts in real-world scenes.

Image and Video Performance

On the SA-Co image benchmarks, SAM 3 reaches between 75 percent and 80 percent of human performance measured with the cgF1 metric. Competing systems such as OWLv2, DINO-X and Gemini 2.5 lag significantly behind. For example, on SA-Co Gold box detection, SAM 3 reports cgF1 of 55.7, while OWLv2 reaches 24.5, DINO-X reaches 22.5 and Gemini 2.5 reaches 14.4. This shows that a single unified model can outperform specialized detectors on open vocabulary segmentation.

In videos, SAM 3 is evaluated on SA-V, YT-Temporal 1B, SmartGlasses, LVVIS and BURST. On SA-V test it reaches 30.3 cgF1 and 58.0 pHOTA. On YT-Temporal 1B test it reaches 50.8 cgF1 and 69.9 pHOTA. On SmartGlasses test it reaches 36.4 cgF1 and 63.6 pHOTA, while on LVVIS and BURST it reaches 36.3 mAP and 44.5 HOTA respectively. These results confirm that a single architecture can handle both image PCS and long horizon video tracking.

Source: https://ai.meta.com/research/publications/sam-3-segment-anything-with-concepts/

SAM 3 as a Data-Centric Benchmarking Opportunity for Annotation Platforms

For data-centric platforms like Encord, SAM 3 is a natural next step after their existing SAM and SAM 2 integrations for auto-labeling and video tracking, which already let customers auto-annotate more than 90 percent of images with high mask accuracy inside Encord's QA-driven workflows. Similar platforms such as CVAT, SuperAnnotate, and Picsellia are standardizing on Segment Anything style models for zero-shot labeling, model-in-the-loop annotation, and MLOps pipelines. SAM 3's promptable concept segmentation and unified image-video tracking create clear editorial and benchmarking opportunities here, for example quantifying the label-cost reductions and quality gains when Encord-like stacks move from SAM 2 to SAM 3 on dense video datasets or in multimodal settings.

Key Takeaways

SAM 3 unifies image and video segmentation into a single 848M parameter foundation model that supports text prompts, exemplars, points and boxes for Promptable Concept Segmentation.

The SA-Co data engine and benchmark introduce about 270K evaluated concepts and over 4M automatically annotated concepts, making SAM 3’s training and evaluation stack one of the largest open vocabulary segmentation resources available.

SAM 3 substantially outperforms prior open vocabulary systems, reaching around 75 to 80 percent of human cgF1 on SA-Co and more than doubling OWLv2 and DINO-X on key SA-Co Gold detection metrics.

The architecture decouples a DETR based detector from a SAM 2 style video tracker with a presence head, enabling stable instance tracking across long videos while keeping interactive SAM style refinement.

Editorial Comments

SAM 3 advances Segment Anything from Promptable Visual Segmentation to Promptable Concept Segmentation in a single 848M parameter model that unifies image and video. It leverages the SA-Co benchmark, with about 270K evaluated concepts and over 4M automatically annotated concepts, to approach 75 to 80 percent of human performance on cgF1. The decoupled DETR-based detector and SAM 2 style tracker with a presence head make SAM 3 a practical vision foundation model for agents and products. Overall, SAM 3 is now a reference point for open vocabulary segmentation at production scale.

Check out the Paper, Repo and Model Weights.
The post Meta AI Releases Segment Anything Model 3 (SAM 3) for Promptable Concept Segmentation in Images and Videos appeared first on MarkTechPost.

MSD explores applying generative AI to improve the deviation management process

This post is co-written with Hossein Salami and Jwalant Vyas from MSD. 
In the biopharmaceutical industry, deviations in the manufacturing process are rigorously addressed. Each deviation is thoroughly documented, and its various aspects and potential impacts are closely examined to help ensure drug product quality, patient safety, and compliance. For leading pharmaceutical companies, managing these deviations robustly and efficiently is crucial to maintaining high standards and minimizing disruptions.
Recently, the Digital Manufacturing Data Science team at Merck & Co., Inc., Rahway, NJ, USA (MSD) recognized an opportunity to streamline aspects of their deviation management process using emerging technologies, including vector databases and generative AI, powered by AWS services such as Amazon Bedrock and Amazon OpenSearch Service. This innovative approach aims to use the organization's past deviations as a vast, diverse, and reliable knowledge source. Such knowledge can potentially help reduce the time and resources required for researching and addressing each new deviation, and increase the efficiency of that work, by using learnings from similar cases across the manufacturing network, while maintaining the rigorous standards demanded by Good Manufacturing Practices (GMP) requirements.
Industry trends: AI in pharmaceutical manufacturing
The pharmaceutical industry has been increasingly turning to advanced technologies to enhance various aspects of their operations, from early drug discovery to manufacturing and quality control. The application of AI, particularly generative AI, in streamlining complex processes is a growing trend. Many companies are exploring how these technologies can be applied to areas that traditionally require significant human expertise and time investment, including the above-mentioned deviation management. This shift towards AI-assisted processes is not only about improving efficiency, but also about enhancing the quality and consistency of outcomes in critical areas.
Innovative solution: Generative AI for deviation management
To address some of the major challenges in deviation management, the Digital Manufacturing Data Science team at MSD devised an innovative solution using generative AI (see How can language models assist with pharmaceuticals manufacturing deviations and investigations?). The approach involves first creating a comprehensive knowledge base from past deviation reports, which can be intelligently queried to provide various insights, including helpful information for addressing new cases. In addition to routine metadata, the knowledge base includes important unstructured data such as observations, analysis processes, and conclusions, typically recorded as natural language text. The solution is designed to let different users at manufacturing sites, with different personas and roles, interact with this knowledge source. For example, users can quickly and accurately identify and access information about similar past incidents, and use that information to hypothesize about potential root causes and define resolutions for a current case. This is facilitated by a hybrid, domain-specific search mechanism implemented through Amazon OpenSearch Service. Subsequently, the information is processed by a large language model (LLM) and presented to the user based on their persona and need. This functionality not only saves time but also draws on the wealth of experience and knowledge from previous deviations.
Solution overview: Goals, risks, and opportunities
Deviation investigations have traditionally been a time-consuming, manual process that requires significant human effort and expertise. Investigation teams often spend extensive hours collecting, analyzing, and documenting information, sifting through historical records, and drawing conclusions—a workflow that is not only labor-intensive but also prone to potential human error and inconsistency. The solution aims to achieve several key goals:

Significantly reduce the time and effort required for investigation and closure of a deviation
Provide users with easy access to relevant knowledge, historical information, and data with high accuracy and flexibility based on user persona
Make sure that the information used to derive conclusions is traceable and verifiable

The team is also mindful of potential risks, such as over-reliance on AI-generated suggestions or the possibility of outdated information influencing current investigations. To mitigate these risks, the solution mostly limits the generative AI content creation to low-risk areas and incorporates human oversight and other guardrails. An automated data pipeline helps the knowledge base remain up-to-date with the most recent information and data. To protect proprietary and sensitive manufacturing information, the solution includes data encryption and access controls on different elements.
Additionally, the team sees opportunities for incorporating new elements in the architecture, particularly in the form of agents that can handle specific requests common to certain user personas such as high-level statistics and visualizations for site managers.
Technical architecture: RAG approach with AWS services
The solution architecture uses a Retrieval-Augmented Generation (RAG) approach to enhance the efficiency, relevance, and traceability of deviation investigations. This architecture integrates multiple AWS managed services to build a scalable, secure, and domain-aware AI-driven system.
At the core of the solution is a hybrid retrieval module that uses the hybrid search capabilities of Amazon OpenSearch Service, combining semantic (vector-based) and keyword (lexical) search for high-accuracy information retrieval. OpenSearch Service functions as the vector store, indexing embeddings generated from past deviation reports and related documents, enriched with domain-specific metadata such as deviation type, resolution date, impacted product lines, and root cause classification. This supports both deep semantic search and efficient filtering on structured fields.
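As an illustration of this retrieval pattern, the following is a minimal sketch of a hybrid query issued with the opensearch-py client. The domain endpoint, index name, field names, and the pre-configured search pipeline are assumptions for illustration only, and the query embedding would come from the same embedding model used at indexing time.

from opensearchpy import OpenSearch

# Hypothetical domain endpoint and index/field names
client = OpenSearch(hosts=[{"host": "my-deviation-search-domain", "port": 443}], use_ssl=True)

def hybrid_search(query_text: str, query_embedding: list, k: int = 5):
    """Combine lexical (match) and semantic (k-NN) retrieval over past deviation reports."""
    body = {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # Lexical sub-query over the free-text investigation narrative
                    {"match": {"report_text": {"query": query_text}}},
                    # Semantic sub-query over the stored report embeddings
                    {"knn": {"report_embedding": {"vector": query_embedding, "k": k}}},
                ]
            }
        },
    }
    # Metadata fields (deviation type, product line, and so on) can be added as filters on
    # the sub-queries; hybrid scoring assumes a search pipeline with a score-normalization
    # processor has already been created on the cluster (the pipeline name is illustrative).
    return client.search(
        index="deviation-reports",
        body=body,
        params={"search_pipeline": "hybrid-norm-pipeline"},
    )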
To support structured data storage and management, the system uses Amazon Relational Database Service (Amazon RDS). RDS stores normalized tabular information associated with each deviation case, such as investigation timelines, responsible personnel, and other operational metadata. It supports complex queries across these structured dimensions as well as reporting, compliance audits, and trend analysis.
A RAG pipeline orchestrates the flow between the retrieval module and a large language model (LLM) hosted in Amazon Bedrock. When a user issues a query, the system first retrieves relevant documents from OpenSearch and structured case data from RDS. These results are then passed as context to the LLM (see the sketch after the list below), which generates grounded, contextualized outputs such as:

Summarized investigation histories
Root cause patterns
Comparable past incidents
Suggested next steps or knowledge gaps
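A minimal sketch of that orchestration step is shown below. It assumes a retriever like the hybrid_search helper above, placeholder field names on the retrieved records, and an illustrative model ID; the Amazon Bedrock Converse API is used to generate the grounded response.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # placeholder Region

def answer_with_context(question: str, retrieved_reports: list,
                        model_id: str = "anthropic.claude-3-haiku-20240307-v1:0"):
    """Ground the LLM on retrieved deviation records and ask for a persona-ready summary."""
    # Hypothetical record fields: deviation_id and summary
    context_block = "\n\n".join(
        f"[{r['deviation_id']}] {r['summary']}" for r in retrieved_reports
    )
    prompt = (
        "You are assisting a deviation investigation. Using only the past cases below, "
        "summarize comparable incidents, likely root cause patterns, and suggested next steps, "
        "and cite the deviation IDs you rely on.\n\n"
        f"Past cases:\n{context_block}\n\nQuestion: {question}"
    )
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]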

High-level architecture of the solution. Domain-specific deviation data are located on Amazon RDS and OpenSearch. Text vector embeddings along with relevant metadata are located on OpenSearch to support a variety of search functionalities.

Conclusion and next steps
This blog post has explored how MSD is harnessing the power of generative AI and databases to optimize and transform its manufacturing deviation management process. By creating an accurate and multifaceted knowledge base of past events, deviations, and findings, the company aims to significantly reduce the time and effort required for each new case while maintaining the highest standards of quality and compliance.
As next steps, the company plans to conduct a comprehensive review of use cases in the pharma quality domain and build a generative AI-driven, enterprise-scale product that integrates structured and unstructured sources using methods from this innovation. Key capabilities coming out of this work include the data architecture, data modeling (including metadata curation), and the generative AI-related components. Looking ahead, we plan to use the capabilities of Amazon Bedrock Knowledge Bases, which will provide more advanced semantic search and retrieval while maintaining seamless integration within the AWS environment. If successful, this approach could not only set a new standard for deviation management at MSD, but also pave the way for more efficient, integrated, and knowledge-driven manufacturing quality processes, including complaints, audits, and more.

About the authors
Hossein Salami is a Senior Data Scientist in the Digital Manufacturing organization at MSD. With a Ph.D. in Chemical Engineering and more than 9 years of laboratory and process R&D experience, he helps apply advanced technologies to build data science and AI/ML solutions that address core business problems and applications.
Jwalant (JD) Vyas is the Digital Product Line Lead for the Investigations Digital Product Portfolio at MSD, bringing 25+ years of biopharmaceutical experience across Quality Operations, QMS, Plant Operations, Manufacturing, Supply Chain, and Pharmaceutical Product Development. He leads the digitization of Quality Operations to improve efficiency, strengthen compliance, and enhance decision-making. With deep business domain and technology expertise, he bridges technical depth with strategic leadership.
Duverney Tavares is a Senior Solutions Architect at Amazon Web Services (AWS), specializing in guiding Life Sciences companies through their digital transformation journeys. With over two decades of experience in Data Warehousing, Big Data & Analytics, and Database Management, he uses his expertise to help organizations harness the power of data to drive business growth and innovation.