Google’s Gemini 3 Pro turns sparse MoE and 1M token context into a p …

How do we move from language models that only answer prompts to systems that can reason over million token contexts, understand real world signals, and reliably act as agents on our behalf? Google just released Gemini 3 family with Gemini 3 Pro as the centerpiece that positions as a major step toward more general AI systems. The research team describes Gemini 3 as its most intelligent model so far, with state of the art reasoning, strong multimodal understanding, and improved agentic and vibe coding capabilities. Gemini 3 Pro launches in preview and is already wired into the Gemini app, AI Mode in Search, Gemini API, Google AI Studio, Vertex AI, and the new Google Antigravity agentic development platform.

Sparse MoE transformer with 1M token context

Gemini 3 Pro is a sparse mixture of experts transformer model with native multimodal support for text, images, audio and video inputs. Sparse MoE layers route each token to a small subset of experts, so the model can scale total parameter count without paying proportional compute cost per token. Inputs can span up to 1M tokens and the model can generate up to 64k output tokens, which is significant for code bases, long documents, or multi hour transcripts. The model is trained from scratch rather than as a fine tune of Gemini 2.5.

Training data covers large scale public web text, code in many languages, images, audio and video, combined with licensed data, user interaction data, and synthetic data. Post training uses multimodal instruction tuning and reinforcement learning from human and critic feedback to improve multi step reasoning, problem solving and theorem proving behaviour. The system runs on Google Tensor Processing Units TPUs, with training implemented in JAX and ML Pathways.

Reasoning benchmarks and academic style tasks

On public benchmarks, Gemini 3 Pro clearly improves over Gemini 2.5 Pro and is competitive with other frontier models such as GPT 5.1 and Claude Sonnet 4.5. On Humanity’s Last Exam, which aggregates PhD level questions across many scientific and humanities domains, Gemini 3 Pro scores 37.5 percent without tools, compared to 21.6 percent for Gemini 2.5 Pro, 26.5 percent for GPT 5.1 and 13.7 percent for Claude Sonnet 4.5. With search and code execution enabled, Gemini 3 Pro reaches 45.8 percent.

On ARC AGI 2 visual reasoning puzzles, Gemini 3 Pro scores 31.1 percent, up from 4.9 percent for Gemini 2.5 Pro, and ahead of GPT 5.1 at 17.6 percent and Claude Sonnet 4.5 at 13.6 percent. For scientific question answering on GPQA Diamond, Gemini 3 Pro reaches 91.9 percent, slightly ahead of GPT 5.1 at 88.1 percent and Claude Sonnet 4.5 at 83.4 percent. In mathematics, the model achieves 95.0 percent on AIME 2025 without tools and 100.0 percent with code execution, while also setting 23.4 percent on MathArena Apex, a challenging contest style benchmark.

https://blog.google/products/gemini/gemini-3/#learn-anything

Multimodal understanding and long context behaviour

Gemini 3 Pro is designed as a native multimodal model instead of a text model with add ons. On MMMU Pro, which measures multimodal reasoning across many university level subjects, it scores 81.0 percent versus 68.0 percent for Gemini 2.5 Pro and Claude Sonnet 4.5, and 76.0 percent for GPT 5.1. On Video MMMU, which evaluates knowledge acquisition from videos, Gemini 3 Pro reaches 87.6 percent, ahead of Gemini 2.5 Pro at 83.6 percent and other frontier models.

User interface and document understanding are also stronger. ScreenSpot Pro, a benchmark for locating elements on a screen, shows Gemini 3 Pro at 72.7 percent, compared to 11.4 percent for Gemini 2.5 Pro, 36.2 percent for Claude Sonnet 4.5 and 3.5 percent for GPT 5.1. On OmniDocBench 1.5, which reports overall edit distance for OCR and structured document understanding, Gemini 3 Pro achieves 0.115, lower than all baselines in the comparison table.

For long context, Gemini 3 Pro is evaluated on MRCR v2 with 8 needle retrieval. At 128k average context, it scores 77.0 percent, and at a 1M token pointwise setting it reaches 26.3 percent, ahead of Gemini 2.5 Pro at 16.4 percent, while competing models do not yet support that context length in the published comparison.

Coding, agents and Google Antigravity

For software developers, the main story is coding and agentic behaviour. Gemini 3 Pro tops the LMArena leaderboard with an Elo score of 1501 and achieves 1487 Elo in WebDev Arena, which evaluates web development tasks. On Terminal Bench 2.0, which tests the ability to operate a computer through a terminal via an agent, it reaches 54.2 percent, above GPT 5.1 at 47.6 percent, Claude Sonnet 4.5 at 42.8 percent and Gemini 2.5 Pro at 32.6 percent. On SWE Bench Verified, which measures single attempt code changes across GitHub issues, Gemini 3 Pro scores 76.2 percent compared to 59.6 percent for Gemini 2.5 Pro, 76.3 percent for GPT 5.1 and 77.2 percent for Claude Sonnet 4.5.

Gemini 3 Pro also performs well on τ2 bench for tool use, at 85.4 percent, and on Vending Bench 2, which evaluates long horizon planning for a simulated business, where it produces a mean net worth of 5478.16 dollars versus 573.64 dollars for Gemini 2.5 Pro and 1473.43 dollars for GPT 5.1.

These capabilities are exposed in Google Antigravity, an agent first development environment. Antigravity combines Gemini 3 Pro with the Gemini 2.5 Computer Use model for browser control and the Nano Banana image model, so agents can plan, write code, run it in the terminal or browser, and verify results inside a single workflow.

Key Takeaways

Gemini 3 Pro is a sparse mixture of experts transformer with native multimodal support and a 1M token context window, designed for large scale reasoning over long inputs.

The model shows large gains over Gemini 2.5 Pro on difficult reasoning benchmarks such as Humanity’s Last Exam, ARC AGI 2, GPQA Diamond and MathArena Apex, and is competitive with GPT 5.1 and Claude Sonnet 4.5.

Gemini 3 Pro delivers strong multimodal performance on benchmarks like MMMU Pro, Video MMMU, ScreenSpot Pro and OmniDocBench, which target university level questions, video understanding and complex document or UI comprehension.

Coding and agentic use cases are a primary focus, with high scores on SWE Bench Verified, WebDev Arena, Terminal Bench and tool use and planning benchmarks such as τ2 bench and Vending Bench 2.

Editorial Comments

Gemini 3 Pro is a clear escalation in Google’s strategy toward more AGI, combining sparse mixture of experts architecture, 1M token context, and strong performance on ARC AGI 2, GPQA Diamond, Humanity’s Last Exam, MathArena Apex, MMMU Pro, and WebDev Arena. The focus on tool use, terminal and browser control, and evaluation under the Frontier Safety Framework positions it as an API ready workhorse for agentic, production facing systems. Overall, Gemini 3 Pro is a benchmark driven, agent focused response to the next phase of large scale multimodal AI.

Check out the Technical details and Docs. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Google’s Gemini 3 Pro turns sparse MoE and 1M token context into a practical engine for multimodal agentic workloads appeared first on MarkTechPost.

Bringing tic-tac-toe to life with AWS AI services

Large language models (LLMs) now support a wide range of use cases, from content summarization to the ability to reason about complex tasks. One exciting new topic is taking generative AI to the physical world by applying it to robotics and physical hardware.
Inspired by this, we developed a game for the AWS re:Invent 2024 Builders Fair using Amazon Bedrock, Strands Agents, AWS IoT Core, AWS Lambda, and Amazon DynamoDB. Our goal was to demonstrate how LLMs can reason about game strategy, complex tasks, and control physical robots in real time.
RoboTic-Tac-Toe is an interactive game where two physical robots move around a tic-tac-toe board, with both the gameplay and robots’ movements orchestrated by LLMs. Players can control the robots using natural language commands, directing them to place their markers on the game board. In this post, we explore the architecture and prompt engineering techniques used to reason about a tic-tac-toe game and decide the next best game strategy and movement plan for the current player.
An interactive experience
RoboTic-Tac-Toe demonstrates an intuitive interaction between humans, robots, and AI. Participants can access the game portal by scanning a QR code, and choose from multiple modes:

Player vs. Player – Challenge a human opponent
Player vs. LLM – Test your skills against an AI-powered LLM
LLM vs. LLM – Watch two AI models strategize and compete autonomously

When a player chooses a target cell, the two robots, positioned beside a tic-tac-toe board, respond to commands by executing precise movements to place X or O markers. The following video shows this in action.
Solution overview
RoboTic-Tac-Toe features a seamless integration of AWS services, alleviating the need for pre-programmed sequences. Instead, AI dynamically generates descriptive instructions in real time. The following diagram describes the architecture built on AWS IoT Core, which enables communication between Raspberry Pi Controlled robots and the cloud.

The solution uses the following key services:

Amazon Bedrock LLM – Uses LLMs and prompt engineering to generate movement plans and game strategies
Strands Agents – An open-source SDK that takes a model-driven approach for building and running AI agents
Amazon SageMaker – Powers AI-driven decision-making and robot movement planning
AWS Lambda – Executes the game logic, resulting in smooth operation and real-time responsiveness
Amazon Simple Storage Service (Amazon S3) – Stores game state data and images captured during play

Hardware and software

The project’s physical setup includes a tic-tac-toe board embedded with LED indicators to highlight placements for X and O.
The two robots (modified toy models) operate through Raspberry Pi controllers equipped with infrared and RF modules.
A mounted Raspberry Pi camera enables vision-based analysis, capturing the board’s state and transmitting data for further computer vision processing. Additionally, a dedicated hardware controller acts as an IoT device that connects to AWS IoT Core, which promotes smooth gameplay interactions.

On the software side, AWS Lambda handles invoking the supervisor Strands Agent, for the core game logic and orchestration.
Computer vision capabilities, powered by OpenCV, analyze the board’s layout and power precise robot movements. Amazon Bedrock agents orchestrate tasks to generate movement plans and game strategies.

Strands Agents in action
Strands Agents automate tasks for your application users by orchestrating interactions between the foundation model (FM), data sources, software applications, and user conversations.
Supervisor Agent
The Supervisor Agent acts as an orchestrator that manages both the Move Agent and the Game Agent, coordinating and streamlining decisions across the system. This process consists of the following steps:

The agent receives high-level instructions or gameplay events (for example, “Player X moved to 2B, generate the robot’s response”) and determines which specialized agent—Move Agent or Game Agent—must be invoked.
The Supervisor AWS Lambda function serves as the central controller. When triggered, it parses the incoming request, validates the context, and then routes the request to the appropriate Strands Agent. Tracing is enabled for the entire workflow to allow for monitoring and debugging.
Depending on the request type:

If it involves updating or analyzing the game state, the Supervisor invokes the Game Agent, which retrieves the board status and generates the next AI-driven move.
If it involves physical robot navigation, the Supervisor invokes the Move Agent, which produces the movement instructions in Python code.

The Supervisor Agent consolidates the responses from the underlying agents and structures them into a unified output format. This allows for consistency whether the outcome is a robot command, a game move, or a combination of both.
The interactions, including decision paths and final outputs, are logged in an S3 bucket. This logging mechanism provides traceability across multiple agents and supports error handling by returning structured error messages when issues arise.

This module provides a governance layer over the AI-powered environment, enabling scalable orchestration across agents. By intelligently directing requests and unifying responses, the Supervisor Agent facilitates reliable execution, simplified monitoring, and enhanced user experience.
Move Agent
The Move Agent generates step-by-step Python code. This process consists of the following steps:

The agent receives a start and destination position on a grid (for example, “3A to 4B North”), determines the necessary movements, and sends commands to the appropriate robot.
The LLM Navigator AWS Lambda function generates movement instructions for robots using Strands Agents. When triggered, it receives a request containing a session ID and an input text specifying the robot’s starting position and destination. The function then invokes the Strands Agent, sending the request along with tracing enabled to allow for debugging.
The response from the agent consists of movement commands such as turning and moving forward in centimeters.
These commands are processed and logged in an S3 bucket under a CSV file. If the log file exists, new entries are appended. Otherwise, a new file is created.
The function returns a JSON response containing the generated instructions and the time taken to execute the request. If an error occurs, a structured error message is returned.

This module provides efficient and traceable navigation for robots by using AI-powered instruction generation while maintaining a robust logging mechanism for monitoring and debugging.
Game Agent
The Game Agent functions as an opponent, capable of playing against human users. To enhance accessibility, players use a mobile-friendly web portal to interact with the game, which includes an admin panel for managing AI-driven matches. The LLM player is a serverless application that combines AWS Lambda, Amazon DynamoDB, and Strands Agent to manage and automate the moves. It tracks game progress by storing move history in an Amazon DynamoDB table, allowing it to reconstruct the current board state whenever requested. The gameplay process consists of the following steps:

When a player makes a move, the supervisor Strands Agent retrieves this state function and then calls the Strands Agent function to generate the next move. The agent selection depends on the player’s marker (‘X’ or ‘O’), making sure that the correct model is used for decision-making.
The agent processes the current game board as input and returns the recommended next move through an event stream.
The entire workflow is orchestrated by the supervisor Strands Agent. This agent receives API requests, validates inputs, retrieves the board state, invokes the LLM model, and returns a structured response containing the updated game status.

This system allows for real-time, AI-driven gameplay, making it possible for players to compete against an intelligent opponent powered by LLMs.
Powering robot navigation with computer vision
In our RoboTic-Tac-Toe project, computer vision plays a crucial role in producing precise robot movements and gameplay accuracy. Let’s walk through how we implemented the solution using AWS services and advanced computer vision techniques. Our setup includes a Raspberry Pi camera mounted above the game board, continuously monitoring the robots’ positions and movements. The camera captures images that are automatically uploaded to Amazon S3, forming the foundation of our vision processing pipeline.
We use Principal Component Analysis (PCA) to accurately detect and track robot orientation and position on the game board. This technique helps reduce dimensionality while maintaining essential features for robot tracking. The orientation angle is calculated based on the principal components of the robot’s visual features.
Our OpenCV module is containerized and deployed as an Amazon SageMaker endpoint. It processes images stored in Amazon S3 to determine the following:

Precise robot positioning on the game board
Current orientation angles
Movement validation

A dedicated AWS Lambda function orchestrates the vision processing workflow. It handles the following:

SageMaker endpoint invocation
Processing of vision analysis results
Real-time position and orientation updates

This computer vision system facilitates accurate robot navigation and game state tracking, contributing to the seamless gameplay experience in RoboTic-Tac-Toe. The combination of PCA for orientation detection, OpenCV for image processing, and AWS services for deployment helps create a robust and scalable computer vision solution.

Conclusion
RoboTic-Tac-Toe showcases how AI, robotics, and cloud computing can converge to create interactive experiences. This project highlights the potential of AWS IoT, machine learning (ML), and generative AI in gaming, education, and beyond. As AI-driven robotics continue to evolve, RoboTic-Tac-Toe serves as a glimpse into the future of intelligent, interactive gaming.
Stay tuned for future enhancements, expanded gameplay modes, and even more engaging AI-powered interactions.

About the authors
Georges Hamieh is a Senior Technical Account Manager at Amazon Web Services, specialized in Data and AI. Passionate about innovation and technology, he partners with customers to accelerate their digital transformation and cloud adoption journeys. An experienced public speaker and mentor, Georges enjoys capturing life through photography and exploring new destinations on road trips with his family.
Mohamed Salah is a Senior Solutions Architect at Amazon Web Services, supporting customers across the Middle East and North Africa in building scalable and intelligent cloud solutions. He’s passionate about Generative AI, Digital Twins, and helping organizations turn innovation into impact. Outside work, Mohamed enjoys playing PlayStation, building LEGO sets, and watching movies with his family.
Saddam Hussain is a Senior Solutions Architect at Amazon Web Services, specializing in Aerospace, Generative AI, and Innovation & Transformation practice areas. Drawing from Amazon.com’s pioneering journey in AI/ML and Generative AI, he helps organizations understand proven methodologies and best practices that have scaled across millions of customers. His main focus is helping Public Sector customers across UAE to innovate on AWS, guiding them through comprehensive Cloud adoption framework (CAF) to strategically adopt cutting-edge technologies while building sustainable capabilities.
Dr. Omer Dawelbeit is a Principal Solutions Architect at AWS. He is passionate about tackling complex technology challenges and working closely with customers to design and implement scalable, high-impact solutions. Omer has over two decades of financial services, public sector and telecoms experience across startups, enterprises, and large-scale technology transformations.

HyperPod enhances ML infrastructure with security and storage

Amazon SageMaker HyperPod is a purpose-built infrastructure for optimizing foundation model training and inference at scale. SageMaker HyperPod removes the undifferentiated heavy lifting involved in building and optimizing machine learning (ML) infrastructure for training foundation models (FMs).
As AI moves towards deployment adopting to a multitude of domains and use cases, the need for security and multiple storage options is becoming more pertinent. Large enterprises want to make sure that the GPU clusters follow the organization wide policies and security rules. Two new features in SageMaker HyperPod EKS enhance this control and flexibility for production deployment of large-scale machine learning workloads. These features include support for continuous scaling, custom Amazon Machine Images, and customer managed key (CMK) integration.

Customer managed keys (CMK) support: HyperPod EKS now allows customers to encrypt primary and secondary EBS volumes attached to HyperPod instances or their custom AMI with their own encryption keys. To learn more about creating a custom AMI for your HyperPod cluster, please see our blog post and documentation.
Amazon EBS CSI support: HyperPod EKS now supports the Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver, which manages the lifecycle of Amazon EBS volumes as storage for the Kubernetes volumes that you create.

Prerequisites
In order to use these features verify you have the following prerequisites:

The AWS CLI is installed and configured with your account
You have a SageMaker HyperPod cluster with Amazon EKS orchestration. To create your HyperPod cluster, please see Creating a SageMaker HyperPod cluster with Amazon EKS orchestration
CMK support can only be used with HyperPod cluster with NodeProvisioningMode set to Continuous. EBS CSI driver support can be used on the NodeProvisioningMode settings. For more details on how to create your cluster to use continuous provisioning, please see Continuous provisioning for enhanced cluster operations on Amazon EKS.

Customer managed key support
With CMK support you control the encryption capabilities required for compliance and security governance, ultimately helping to resolve the critical business risk of unmet regulatory and organizational security requirements, such as HIPAA and FIPS compliance. CMK support allows customers to encrypt EBS volumes attached to their HyperPod instances using their own encryption keys. When creating a cluster, updating a cluster, or adding new instance groups, customers can specify a CMK for both root and secondary EBS volumes. Additionally, customers can encrypt their custom AMIs with CMK, providing comprehensive data-at-rest protection with customer-controlled keys throughout the instance lifecycle.
Here are the key points about CMK configuration:
For EBS volumes:

CMK is optional – if not specified, volumes will be encrypted with AWS managed keys
You cannot update/change the CMK for existing volumes (CMK is immutable)
Each instance group can have:

One root volume configuration with CMK
One secondary volume configuration with CMK

Root volume configurations cannot specify volume size
Secondary volume configurations must specify volume size
You can specify different CMKs for root and secondary volumes

For custom AMIs:

You can encrypt custom AMIs with CMK independently of volume encryption
Unlike volume CMK, custom AMI CMK is mutable – customers can patch clusters using AMIs encrypted with different CMKs

Important: When using customer managed keys, we strongly recommend that you use different KMS keys for each instance group in your cluster. Using the same customer managed key across multiple instance groups might lead to unintentional continued permissions even if you try to revoke a grant. For example:

If you revoke an AWS KMS grant for one instance group’s volumes, that instance group might still allow scaling and patching operations due to grants existing on other instance groups using the same key
To help prevent this issue, make sure that you assign unique KMS keys to each instance group in your cluster

Configuring CMK on HyperPod
In this section, we will demonstrate how to set up CMK for your HyperPod cluster. As a prerequisite, make sure you have the following:

Verify that the AWS IAM execution role that you’re using for your CMK-enabled instance group has the following permissions for AWS KMS added. The kms:CreateGrant permission allows HyperPod to take the following actions using permissions to your KMS key:

Scaling out your instance count (UpdateCluster operations)
Adding cluster nodes (BatchAddClusterNodes operations)
Patching software (UpdateClusterSoftware operations)

{
    “Version”: “2012-10-17”,
    “Statement”: [
        {
            “Effect”: “Allow”,
            “Action”: [
                “kms:CreateGrant”,
                “kms:DescribeKey”
            ],
            “Resource”: “*”
        }
    ]
}

Include this in your KMS key policy:

You can modify your key policy following the Change a key policy documentation. Replace variables <iam-hp-execution-role>, <region>, <account-id> , and <key-id> with your HyperPod execution role (the role that is linked to your instance group using CMKs), AWS Region your HyperPod cluster is deployed in, your account ID, and your KMS key ID, respectively.

{
    “Version”: “2012-10-17”,
    “Id”: “hyperpod-key-policy”,
    “Statement”: [
        {
            “Sid”: “Enable IAM User Permissions”,
            “Effect”: “Allow”,
            “Principal”: {
                “AWS”: “arn:aws:iam::<account-id>:root”
            },
            “Action”: “kms:*”,
            “Resource”: “*”
        },
        {
            “Effect”: “Allow”,
            “Principal”: {
                “AWS”: “arn:aws:iam::<account-id>:role/<iam-hp-execution-role>”
            },
            “Action”: “kms:CreateGrant”,
            “Resource”: “arn:aws:kms:<region>:<account-id>:key/<key-id>”,
            “Condition”: {
                “StringEquals”: {
                    “kms:ViaService”: “sagemaker.<region>.amazonaws.com”
                },
                “Bool”: {
                    “kms:GrantIsForAWSResource”: “true”
                }
            }
        },
        {
            “Effect”: “Allow”,
            “Principal”: {
                “AWS”: “arn:aws:iam::<account-id>:role/<iam-hp-execution-role>”
            },
            “Action”: “kms:DescribeKey”,
            “Resource”: “arn:aws:kms:<region>:<account-id>:key/<key-id>”,
            “Condition”: {
                “StringEquals”: {
                    “kms:ViaService”: “sagemaker.<region>.amazonaws.com”
                }
            }
        }
    ]
}

Now, let’s use the CMK.
You can specify your customer managed keys when creating or updating a cluster using the CreateCluster and UpdateCluster API operations. The InstanceStorageConfigs structure allows up to two EbsVolumeConfig configurations, in which you can configure the root Amazon EBS volume and, optionally, a secondary volume. You can use the same KMS key or a different KMS key for each volume, depending on your needs.
When you are configuring the root volume, the following requirements apply:

RootVolume must be set to True. The default value is False, which configures the secondary volume instead.
The VolumeKmsKeyId field is required and you must specify your customer managed key. This is because the root volume must be encrypted with either an AWS owned key or a customer managed key (if you don’t specify your own, then an AWS owned key is used).
You can’t specify the VolumeSizeInGB field for root volumes since HyperPod determines the size of the root volume for you.

When configuring the secondary volume, the following requirements apply:

RootVolume must be False (the default value of this field is False).
The VolumeKmsKeyId field is optional. You can use the same customer managed key you specified for the root volume, or you can use a different key.
The VolumeSizeInGB field is required, since you must specify your desired size for the secondary volume.

Example of creating cluster with CMK support:

aws sagemaker create-cluster
  –cluster-name <your-hyperpod-cluster>
  –instance-groups ‘[{
    “ExecutionRole”: “arn:aws:iam::<account-id>:role/<your-SageMaker-Execution-Role>”,
    “InstanceCount”: 2,
    “InstanceGroupName”: “<your-ig-name>”,
    “InstanceStorageConfigs”: [
            {
                “EbsVolumeConfig”: {
                    “RootVolume”: True,
                    “VolumeKmsKeyId”: “arn:aws:kms:<region>:<account-id>:key/<root-volume-key-id>”
                }
            },
            {
                “EbsVolumeConfig”: {
                    “VolumeSizeInGB”: 100,
                    “VolumeKmsKeyId”: “arn:aws:kms:<region>:<account-id>:key/<secondary-volume-key-id>”
                }
            }
    ],
    “InstanceType”: “<desired-instance-type>”
  }]’
  –vpc-config ‘{
    “SecurityGroupIds”: [“<sg-id>”],
    “Subnets”: [“<subnet-id>”]
  }’

Example of updating a cluster with CMK support:

aws sagemaker update-cluster
  –cluster-name <your-hyperpod-cluster>
  –instance-groups ‘[{
    “InstanceGroupName”: “<your-ig-name>”,
    “InstanceStorageConfigs”: [
            {
                “EbsVolumeConfig”: {
                    “RootVolume”: true,
                    “VolumeKmsKeyId”: “arn:aws:kms:<region>:<account-id>:key/<root-volume-key-id>”
                }
            },
            {
                “EbsVolumeConfig”: {
                    “VolumeSizeInGB”: 100,
                    “VolumeKmsKeyId”: “arn:aws:kms:<region>:<account-id>:key/<secondary-volume-key-id>”
                }
            }
    ]
  }]’

To use a custom AMI with CMK encryption, you would first have to build your custom AMI with your CMK. You can do this with the following tools, but note that these commands are sample snippets. Follow the linked documentation to generate the AMI.

EC2 Image Builder:

aws imagebuilder create-image-recipe
    –name “hyperpod-custom-recipe”
    –version “1.0.0”
    –parent-image “<hyperpod-base-image-id>”
    –components “componentArn=<component-arn>” 
    –block-device-mappings DeviceName=”/dev/xvda”,Ebs={VolumeSize=100,VolumeType=gp3,Encrypted=true,KmsKeyId=arn:aws:kms:us-east-1:111122223333:key/key-id,DeleteOnTermination=true}

Amazon EC2 Console:

Right-click on your customized Amazon EC2 instance and choose Create Image.
In the Encryption section, select Encrypt snapshots.
Select your KMS key from the dropdown. For example: arn:aws:kms:us-east-2:111122223333:key/<your-kms-key-id> or use the key alias: alias/<your-hyperpod-key>.

AWS CLI:

aws ec2 create-image
    –instance-id “<instance-id>”
    –name “MyCustomHyperPodAMI”
    –description “Custom HyperPod AMI”
    –block-device-mappings ‘[
        {
            “DeviceName”: “/dev/xvda”,
            “Ebs”: {
                “Encrypted”: true,
                “KmsKeyId”: “arn:aws:kms:us-east-1:111122223333:key/<key-id>”,
                “VolumeType”: “gp2”
            }
        }
    ]’

To use this encrypted custom AMI, please follow our blog or documentation on using your custom AMI on HyperPod.
Amazon EBS CSI driver support
With Amazon Elastic Block Storage (EBS) Container Storage Interface (CSI) support in HyperPod you can manage the lifecycle of Amazon EBS volumes as storage for the Kubernetes Volumes created for your EKS clusters. Supporting both ephemeral and persistent volumes, this enhancement addresses the need for dynamic storage management in large-scale AI workloads, efficiently handling the massive datasets and model artifacts for foundation model training and inference.
HyperPod now offers two flexible approaches for provisioning and mounting additional Amazon EBS volumes on nodes. The first method, which isn’t new, uses InstanceStorageConfigs for cluster-level volume provisioning when creating or updating instance groups, requiring users to set the local path to /opt/sagemaker in their Pod configuration file. Alternatively, users can implement the Amazon EBS CSI driver for dynamic Pod-level volume management, providing greater control over storage allocation.
This feature was previously supported exclusively only on Amazon EKS clusters, now it unlocks new storage capabilities for the SageMaker HyperPod too. To read more about the capabilities yourself, follow the official documentation page.
Demo of the Amazon EBS CSI driver on SageMaker HyperPod
In this section, we will demo one of the capabilities of Amazon EBS CSI, such as volume resizing.
Setup EBS CSI Driver
In the following sections we will ask you to substitute some parameters with the values unique to your demo. When we refer to <eks-cluster-name>, that’s the name of the underlying Amazon EKS cluster, not the SageMaker HyperPod cluster. Configure your kubernetes config to add a new context, so the utils will interact with your new EKS cluster. Run the following:

aws eks update-kubeconfig
        –region <region>
        –name <eks-cluster-name>

Secondly, we need to create a IAM Service Account with an appropriate policy to work with Amazon EBS CSI. The IAM Service Account is the IAM entity for Amazon EKS to interact with other AWS services. We chose eksctl to create the policy and attach the required policy in a single command, however there are other ways to do the same.

eksctl create iamserviceaccount
        –name ebs-csi-controller-sa
        –namespace kube-system
        –cluster <eks-cluster-name>
        –role-name DemoRole 
        –attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
        –approve

After the successful execution of the command, we should expect three outcomes:

IAM Service account with the name ebs-csi-controller-sa is created
IAM role named DemoRole is created with policy arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy attached
The ebs-csi-controller-sa service account consumes the DemoRole

During this demo you should see an output to the previous command, for example:

2025-08-19 12:44:17 [ℹ]  3 existing iamserviceaccount(s) (kube-system/aws-load-balancer-controller,kube-system/fsx-csi-controller-sa,kube-system/s3-csi-driver-sa) will be excluded
2025-08-19 12:44:17 [ℹ]  1 iamserviceaccount (kube-system/ebs-csi-controller-sa) was included (based on the include/exclude rules)
2025-08-19 12:44:17 [!]  serviceaccounts that exist in Kubernetes will be excluded, use –override-existing-serviceaccounts to override
2025-08-19 12:44:17 [ℹ]  1 task: {
    2 sequential sub-tasks: {
        create IAM role for serviceaccount “kube-system/ebs-csi-controller-sa”,
        create serviceaccount “kube-system/ebs-csi-controller-sa”,
    } }2025-08-19 12:44:17 [ℹ]  building iamserviceaccount stack “eksctl-sagemaker-hyperpod-eks-cluster-b94d57bb-eks-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa”
2025-08-19 12:44:17 [ℹ]  deploying stack “eksctl-sagemaker-hyperpod-eks-cluster-b94d57bb-eks-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa”
2025-08-19 12:44:17 [ℹ]  waiting for CloudFormation stack “eksctl-sagemaker-hyperpod-eks-cluster-b94d57bb-eks-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa”
2025-08-19 12:44:48 [ℹ]  waiting for CloudFormation stack “eksctl-sagemaker-hyperpod-eks-cluster-b94d57bb-eks-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa”
2025-08-19 12:44:49 [ℹ]  created serviceaccount “kube-system/ebs-csi-controller-sa”

The final step of the IAM Service Account configuration is to attach extra policies required for the interaction between Amazon EKS and SageMaker HyperPod, mentioned in the feature’s documentation. We will do this with an inline policy, created from the terminal.
The following code snippet creates a temporary file and attaches it to the newly created policy, where you need to put in three values, related to your demo process:

<region>
<account-id>
<eks-cluster-name>

cat > inline_policy.json << ‘EOF’
{
    “Version”: “2012-10-17”,
    “Statement”:
    [
        {
            “Effect”: “Allow”,
            “Action”:
            [
                “sagemaker:AttachClusterNodeVolume”,
                “sagemaker:DetachClusterNodeVolume”
            ],
            “Resource”: “arn:aws:sagemaker:*:*:cluster/*”
        },
        {
            “Effect”: “Allow”,
            “Action”:
            [
                “eks:DescribeCluster”
            ],
            “Resource”: “arn:aws:eks:<region>:<account-id>:cluster/<eks-cluster-name>”
        }
    ]
}
EOF

Once the file is configured with your parameters, apply the policy to the DemoRole created before using eksctl:

aws iam put-role-policy
        –role-name DemoRole
        –policy-name HyperPodEBS
        –policy-document file://inline_policy.json

To observe the results of the creation, we can use kubectl to inspect the service account’s state and an IAM role consumed by it:

kubectl get sa ebs-csi-controller-sa -n kube-system -o json
{
    “apiVersion”: “v1”,
    “kind”: “ServiceAccount”,
    “metadata”: {
        “annotations”: {
            “eks.amazonaws.com/role-arn”: “arn:aws:iam::<account-id>:role/DemoRole”
        },
        “creationTimestamp”: “2025-08-19T12:10:05Z”,
        “labels”: {
            “app.kubernetes.io/managed-by”: “eksctl”
        },
        “name”: “ebs-csi-controller-sa”,
        “namespace”: “kube-system”,
        “resourceVersion”: “17982”,
        “uid”: “679cc698-88dd-4934-a11f-0b8edee5277c”
    }
}

To observe the role, we can check both attached managed policies and inline policies.For the attached managed:

$ aws iam list-attached-role-policies –role-name DemoRole
{
    “AttachedPolicies”: [
        {
            “PolicyName”: “AmazonEBSCSIDriverPolicy”,
            “PolicyArn”: “arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy”
        }
    ]
}

For the inline policies:

aws iam list-role-policies —role-name DemoRole
{
    “PolicyNames”: [
        “HyperPodEBS”
    ]
}

Now, we are ready to create and install the Amazon EBS CSI add-on on the EKS cluster. For this example, use the following command:

eksctl create addon
        –cluster <eks-cluster-name>
        –name aws-ebs-csi-driver
        –version latest
        –service-account-role-arn arn:aws:iam::<account-id>:role/DemoRole 
        –force

You will see an output indicating that the creation has started, for example:

:27:47 [ℹ] Kubernetes version “1.31” in use by cluster “sagemaker-hyperpod-eks-cluster-b94d57bb-eks”
:27:48 [ℹ] IRSA is set for “aws-ebs-csi-driver” addon; will use this to configure IAM permissions
2025-08-19 13:27:48 [!] the recommended way to provide IAM permissions for “aws-ebs-csi-driver” addon is via pod identity associations; after addon creation is completed, run
:27:48 [ℹ] using provided ServiceAccountRoleARN “arn:aws:iam::000182341198:role/DemoRole”
:27:48 [ℹ] creating addon: aws-ebs-csi-driver

To track the status of add-on creation, you can use the watch utility from the terminal.
Note: If the status is stuck on CREATING for more than 5 minutes, you should debug the state of your cluster to see whether the pods are running. If the status isn’t changing, you might not have a sufficient number of instances or the instance type is too small. If you observe that many pods of the cluster are in the PENDING state that might be an indicator of one of these issues.

watch -n 5 aws eks describe-addon
        –cluster-name <eks-cluster-name>
        –addon-name aws-ebs-csi-driver
        –query ‘addon.status’
        
# wait until you see this:
“ACTIVE”

Running the volume resize demo
Now we’re ready for the demo, all the components are installed and ready to interact with each other. On your local machine, download the repository of AWS EBS CSI driver, then navigate to the folder of the resizing example.

$ git clone git@github.com:kubernetes-sigs/aws-ebs-csi-driver.git
Cloning into ‘aws-ebs-csi-driver’…
remote: Enumerating objects: 35200, done.
remote: Counting objects: 100% (146/146), done.
remote: Compressing objects: 100% (81/81), done.
remote: Total 35200 (delta 99), reused 67 (delta 61), pack-reused 35054 (from 2)
Receiving objects: 100% (35200/35200), 29.61 MiB | 14.56 MiB/s, done.
Resolving deltas: 100% (20351/20351), done.

$ cd aws-ebs-csi-driver/examples/kubernetes/resizing

Within this folder, we will utilize the provided example, which you can study yourself a bit more by reading the readme file.
Quoting the readme file, we are going to:

Deploy the provided pod on your cluster along with the StorageClass and PersistentVolumeClaim:

kubectl apply -f manifests
persistentvolumeclaim/ebs-claim created
pod/app created
storageclass.storage.k8s.io/resize-sc created

Wait for the PersistentVolumeClaim to bind and the pod to reach the Running state.

kubectl get pvc/ebs-claim pod/app
NAME                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/ebs-claim      pvc-404555ec-d4a8-4fb0-bfbb-782619b1f815   4Gi        RWO            resize-sc      <unset>                 55s

NAME      READY   STATUS    RESTARTS   AGE
pod/app   1/1       0          55s

Expand the volume size by increasing the capacity specification in the PersistentVolumeClaim using the editor, we use vim but you can use other editors. The following example is the content of the file with extra comments pointing to the places where you should change the capacity. Be attentive, as there are two places with storage volume – one is the specification, while the other is only a status. Changing the status will result in no changes.

$ KUBE_EDITOR=”vim” && kubectl edit pvc ebs-claim

  1 # Please edit the object below. Lines beginning with a ‘#’ will be ignored,
  2 # and an empty file will abort the edit. If an error occurs while saving this file will be
  3 # reopened with the relevant failures.
  4 #
  5 apiVersion: v1
  6 kind: PersistentVolumeClaim
  7 metadata:
  8   annotations:
  9     kubectl.kubernetes.io/last-applied-configuration: |
 10       {“apiVersion”:”v1″,”kind”:”PersistentVolumeClaim”,”metadata”:{“annotations”:{},”name”:”ebs-claim”,”namespace”:”default”},”spec”:{“accessMod>
 11     pv.kubernetes.io/bind-completed: “yes”
 12     pv.kubernetes.io/bound-by-controller: “yes”
 13     volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
 14     volume.kubernetes.io/storage-provisioner: ebs.csi.aws.com
 15   creationTimestamp: “2025-08-19T13:14:42Z”
 16   finalizers:
 17   – kubernetes.io/pvc-protection
 18   name: ebs-claim
 19   namespace: default
 20   resourceVersion: “45457”
 21   uid: 404555ec-d4a8-4fb0-bfbb-782619b1f815
 22 spec:
 23   accessModes:
 24   – ReadWriteOnce
 25   resources:
 26     requests:
 27       storage: 4Gi # <———– CHANGE THE VALUE HERE 
 28   storageClassName: resize-sc
 29   volumeMode: Filesystem
 30   volumeName: pvc-404555ec-d4a8-4fb0-bfbb-782619b1f815
 31 status:
 32   accessModes:
 33   – ReadWriteOnce
 34   capacity:
 35     storage: 4Gi # <————- NOT HERE. THIS IS ONLY STATUS
 36   phase: Bound

Wait a few minutes and verify that both the persistence volume and persistence volume claim have been appropriately resized. To do so, first, check the claim ebs-claim and use the VOLUME from the output to check the volume itself. In both outputs we now see the Capacity changed to 8Gi form initial 4Gi

kubectl get pvc/ebs-claim
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
ebs-claim   Bound              RWO            resize-sc      <unset>                 10m

kubectl get pv/
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM               STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
pvc-404555ec-d4a8-4fb0-bfbb-782619b1f815          RWO            Delete           Bound    default/ebs-claim   resize-sc      <unset>                          11m

Clean up the example:

kubectl delete -f manifests
persistentvolumeclaim “ebs-claim” deleted
pod “app” deleted
storageclass.storage.k8s.io “resize-sc” deleted

We are done with the demo of the feature on the resize example, congratulations! Explore other examples in the same repository, like dynamic provisioning or block volume.
Clean up
To clean up your resources to avoid incurring more charges, complete the following steps:

Delete your SageMaker HyperPod cluster.
If you created the networking stack from the SageMaker HyperPod workshop, delete the stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.

Conclusion
The new features in Amazon SageMaker HyperPod Customer Managed Key (CMK) support and Amazon EBS CSI driver support enhance system security and storage capabilities.The Amazon EBS CSI driver support within SageMaker HyperPod EKS clusters supports the use of Amazon EBS volumes for flexible and dynamic storage management options for large-scale AI workloads. In addition to other storage services already available with SageMaker HyperPod clusters, such as Amazon FSx or Amazon S3, you can build efficient and high performing AI solutions. By combining Amazon EBS volumes with Customer Managed Keys support, you can maintain compliance and security governance by controlling their own encryption keys.Together, these features make SageMaker HyperPod a more robust and enterprise-ready environment for training and deploying foundation models at scale, allowing organizations to meet both their security requirements and storage needs efficiently.
For more information, please see, Customer managed AWS KMS key encryption for SageMaker HyperPod and Using the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters.

About the authors
Mark Vinciguerra is an Associate Specialist Solutions Architect at Amazon Web Services (AWS) based in New York. He focuses on Generative AI training and inference, with the goal of helping customers architect, optimize, and scale their workloads across various AWS services. Prior to AWS, he went to Boston University and graduated with a degree in Computer Engineering. You can connect with him on LinkedIn.
Rostislav (Ross) Povelikin is a Senior Specialist Solutions Architect at AWS focusing on systems performance for distributed training and inference. Prior to this, he focused on datacenter network and software performance optimisations at NVIDIA.
Kunal Jha is a Principal Product Manager at AWS, where he focuses on building Amazon SageMaker HyperPod to enable scalable distributed training and fine-tuning of foundation models. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest. You can connect with him on LinkedIn.
Takuma Yoshitani  is a Senior Software Development Engineer at AWS, where he focuses on improving the experience of the SageMaker HyperPod service. Prior to SageMaker, he has contributed to Amazon Go / Just Walk-Out tech.
Vivek Koppuru is an engineering leader on the Amazon SageMaker HyperPod team helping provide infrastructure solutions for ML training and inference. He has years of experience in AWS and compute as an engineer, working on core services like EC2 and EKS. He is passionate about building customer-focused solutions and navigating through complex technical challenges in distributed systems with the team.
Ajay Mahendru is an engineering leader at AWS, working in the SageMaker HyperPod team. Bringing in nearly 15+ years of software development experience, Ajay has contributed to multiple AWS SageMaker Services inlcuding SageMaker Inference, Training, Processing and HyperPod. With an expertise in building distributed systems, he focuses on building reliable, customer-focused and scalable solutions across teams.
Siddharth Senger currently serves as a Senior Software Development Engineer at Amazon Web Services (AWS), specifically within the SageMaker HyperPod team. Bringing nearly a decade of software development experience, Siddharth has contributed to several across Amazon, including Retail, Amazon Rekognition, Amazon Textract and AWS SageMaker. He is passionate about building reliable, scalable, and efficient distributed systems that empower customers to accelerate large-scale machine learning and AI innovation.

Accelerating generative AI applications with a platform engineering ap …

Over the past two years, I’ve worked with many customers using generative AI to transform their organizations. Most stall at experimentation, because costs stack up and timelines extend before delivering demonstrable value. A 2023 AWS MIT Chief Data Officer (CDO) Symposium survey backs this up, reporting that while 71% of Chief Data Officers were experimenting with generative AI, only 6% had successfully deployed it in production.
Successful adopters use platform engineering concepts to avoid this trap by building reusable components to accelerate development and control costs. In this post, I will illustrate how applying platform engineering principles to generative AI unlocks faster time-to-value, cost control, and scalable innovation.
Why platform engineering?
Platform engineering isn’t a new concept. In traditional software development, teams have long invested in building functional tooling to accelerate application development. This approach not only saves time and money but also allows development teams to focus on improving application quality by isolating concerns. A dedicated platform engineering team handles the creation and enhancement of these tools, providing expanded functionality, ease of use, and continuous improvement. As shown in the following figure, not only are newer large language models launching more frequently, but their benchmark scores are also improving at twice the rate in early 2025 compared to 2024. This accelerating pace of innovation makes platform engineering especially important, enabling organizations to quickly adopt newer, more capable models, integrate the latest advancements, and continuously enhance their applications.
Additionally, a platform engineering approach achieves scalability and efficiency through reusable components and standardized frameworks, enabling rapid deployment of multiple AI models and applications. Standardized processes and tools help ensure consistency and high-quality outputs. Security, compliance, and ethical standards are enhanced with uniform implementation across the platform. Innovation accelerates because AI developers can focus on creative solutions rather than infrastructure. Cost management improves by reducing duplication of effort and resource wastage, making generative AI more affordable. A shared platform fosters collaboration, breaking down silos for more cohesive AI solutions. Finally, intuitive, user-friendly tools reduce the learning curve, enhancing developer productivity.
Anatomy of generative AI applications
A good place to start imagining what a generative AI application would look like is to start from what we already know about majority of applications out there. Pre-generative AI era applications are primarily data handlers in some shape or form, and generally include three layers: a presentation (or frontend) layer, an application logic layer, and a data layer, as shown in the following figure.

Each layer has a well-defined role—the presentation layer captures user instructions and input data, the application layer supports this instruction by either retrieving data from the data layer (in the case of READ operations) or processing the input before writing it to the data layer, the data layer receives instructions from the application layer and provides persistence to data.
A generative AI application consists of the same basic setup; however, applications don’t just deal with CRUD (CREATE, READ, UPDATE, DELETE) operations with data anymore—generative AI technology replaces the data layer with the generation layer. Data is now part of the wider middle layer, and plays a supporting function to the generation layer, as shown in the following figure.

Platform engineering blueprint for generative AI
With this mental model of a generative AI application, you can start looking at what reusable components you can build with the sound platform engineering principles in Why platform engineering? The following figure is an overview of the components described in this section.

Frontend components
All applications require a great presentation layer, and more specifically to generative AI, you need a presentation layer to cover several key functionalities. If you’re building an interactive application, you probably need session management capabilities so that the application can remember the interactions it had with the user, and over time re-use this data as context to guide future responses. Because such interactions are private, you need sufficient authentication and authorization controls to secure access at an individual basis. These capabilities can be packaged into one of many micro-frontend components that are reusable across all applications, saving time for development and adding a consistent organizational touch to the applications. Finally, interactive frontends are just one channel of interacting with your applications, other times it might make more sense to expose over RESTful or Websocket APIs so that you can embed into websites or internal messaging applications. So, by building a well-defined connectors layer, you can standardize all associated aspects (such as security, monitoring and logging, and documentation) and empower independent experimentation.
Data
To unlock the greatest business value, you need to include organizational data in your generative AI use cases by building a suitable data infrastructure to allow secure access to that data at scale. Data can be grouped either as unstructured data (stored on intranet sites, wikis, and content and knowledge management systems) and structured data (stored in transactional databases, data warehouses, and external software-as-a-service (SaaS)). Making each type of data widely available involves different treatment. For unstructured data, building up a metadata index layer makes it searchable. One way of doing so is to use vectorization, which uses embedding models to convert unstructured data into vector representations and stores them in vector databases. With vector search capabilities, you can build knowledge bases for different organizational domains—such as HR, Finance, and Marketing. These vector databases are progressively evolved to improve search and retrieval accuracy and relevancy with newer technology, chunking strategy and embedding models.
For structured data, while it’s possible for LLMs to query a database by writing their own SQL queries and doing so over preconfigured JDBC or ODBC connections, it’s more scalable and secure to build dedicated interfaces meant for generative AI use. These can be well-defined data APIs designed to handle larger queries using read-replicas, which help insulate primary transactional systems from surges in read requests originating from generative AI applications. While RESTful APIs are an good choice because of their low complexity and speed to deploy, you could also explore GraphQL based APIs, which are more powerful, particularly in querying several datastores at once through a common interface. GraphQL does this using different data resolvers to interface with different databases, even when those databases operate on different underlying technologies (SQL or NoSQL). Generative AI applications can remember the same GraphQL API endpoint and API calls but get access to more data sources as more resolvers are added. On AWS, you can implement both RESTful and GraphQL APIs using Amazon API Gateway and Amazon AppSync respectively.
As increasing amounts of data become available to generative AI applications, setting up strong data governance becomes necessary to track, monitor and secure access to the data. You should apply fine-grained permissions at the data level to makes sure that each generative AI application can only access the data that it (or its users) are allowed to. To implement this at scale, you can use AWS Lake Formation to define and enforce granular access controls on data stored in Amazon Simple Storage Service (Amazon S3) without needing to manage individual AWS Identity and Access Management (IAM) policies manually. It supports table- and column-level permissions, integrates with AWS CloudTrail for auditing, and enables centralized, fine-grained governance across AI workloads sharing the same data lake.
Controls
You can build a unified output control layer that applies across all generative AI applications built in your organization. By doing this, you can apply a consistent set of quality and security policies across all outputs regardless of the language model used. Output controls can be categorized into two main sets. The first set, safety controls, focuses on making sure that responses are non-toxic (toxicity), avoids sensitive topics or keywords (filtering), and limits the exposure of personally identifiable information (PII) (redaction). The second set, quality controls, helps ensure the accuracy of responses, including aspects such as faithfulness, correctness, and relevancy to the original prompt. To uniformly enforce these controls across all generative AI applications, you can implement a standardized enforcement layer. This layer should include a fine-tuned language model trained to sanitize outputs and evaluate responses before they’re made available to users.
Observability
Observability is crucial in maintaining the health and performance of generative AI applications. It involves monitoring, logging, and evaluating model behaviour, user interactions, and system performance to ensure generative AI applications run smoothly and issues are detected promptly. Monitoring includes feedback mechanisms to capture user interactions and record response times, making sure that the system meets performance expectations. Capacity monitoring makes sure that the system scales appropriately under varying loads. Logging involves capturing detailed interaction logs that help in diagnosing issues and understanding user behavior. Evaluation and testing through benchmarking and adversarial testing help assess the robustness and accuracy of the AI models. By implementing comprehensive observability practices, you can maintain high standards of performance and reliability across all generative AI applications. AWS observability services including Amazon CloudWatch, AWS X-Ray, and Amazon OpenSearch Service provide comprehensive monitoring, logging, and analysis capabilities.
Orchestration
As generative AI applications become more sophisticated, they often move beyond single-prompt interactions to workflows that coordinate multiple steps and services. This is where orchestration becomes essential. Complex tasks might involve classical AI components such as optical character recognition (OCR), prompt decomposition, or using specialized language models for sub-tasks. To manage these workflows, AWS Step Functions provides serverless, event-driven orchestration that sequences tasks, handles retries, and maintains state—forming the backbone of AI e logic. A key part of this is prompt management—the ability to track, version, and persist prompt templates, sub-prompts, and intermediate results across executions. Amazon DynamoDB supports this by offering scalable, low-latency storage that enables real-time access to prompt metadata and agent state, providing consistent and traceable workflow behavior.
Reusable logic or API calls can be embedded using AWS Lambda, allowing flexible function execution within chains. As applications adopt agentic workflows, where LLMs function as modular agents with defined roles, Step Functions coordinates agent interactions while DynamoDB serves as persistent context memory.
Together, these components support structured chaining, reliable prompt management, and scalable agentic workflows, enabling modular, resilient, and intelligent orchestration for complex generative AI systems.
Large language models
Large language models are deployed in the generation layer of the application. We have a variety of models to choose from that vary in performance and cost, and these fall into categories of pretrained models, fine-tuned models, and custom models. Each type serves distinct purposes and offers unique advantages depending on the specific requirements of the application.
Pretrained models are the foundation of many generative AI applications. These models are trained on vast amounts of diverse data and can generate coherent and contextually relevant text based on the input prompt. Pretrained models are ideal for general-purpose tasks where extensive domain-specific customization isn’t required. Examples of pretrained models available on Amazon Bedrock include Anthropic’s Claude models and Meta’s Llama models. Orgnaizaitons can use AWS services such as Amazon Comprehend and Amazon Polly to use these pretrained models for tasks such as natural language understanding and text-to-speech conversion. These models provide a strong baseline and can be quickly deployed to perform a wide range of functions, saving time and resources.
While pretrained models are highly versatile, fine-tuned models offer greater specificity and accuracy for particular tasks. Fine-tuning involves taking a pretrained model and further training it on a smaller, domain-specific dataset. This process allows the model to adapt to the nuances and intricacies of specific industries or applications. For instance, an LLM can be fine-tuned to understand medical terminology for healthcare applications or legal jargon for legal solutions. Amazon SageMaker provides end-to-end capabilities for building, training, and deploying machine learning models at scale, which organizations can use to efficiently fine-tune pretrained models for domain-specific precision.
Custom models are built from the ground up to meet highly specialized requirements. These models are trained exclusively on a curated dataset that represents the specific needs and context of the application. Custom models are ideal for scenarios where existing pretrained or fine-tuned models don’t suffice because of the unique nature of the data or the complexity of the tasks. Developing custom models requires significant expertise and resources, but they offer unparalleled accuracy and relevance. AWS provides extensive tools and frameworks through SageMaker that data scientists and machine learning engineers can use to build, train, and deploy custom models tailored to their exact specifications.
Conclusion
The relentless development of ever more capable LLMs, coupled with the rise of specialized models outperforming generalists for specific tasks, underscores the need for a flexible platform engineering approach. Such an approach simplifies the evaluation, integration, and operationalization of new models, enabling organizations to continuously enhance their generative AI applications. Crucially, it facilitates the orchestration of multi-model workflows, stringing together outputs from different specialized models to maximize overall capability. By embracing this platform-centric strategy, companies can future-proof their generative AI initiatives, rapidly realizing innovations while maintaining scalability, consistency, and responsible practices.To further explore the implementation of platform engineering in generative AI applications, consider the following AWS resources:

Best practices to build generative AI applications on AWS: This blog post delves into various approaches for developing generative AI applications, including prompt engineering, Retrieval-Augmented Generation (RAG), and model customization.
Achieve operational excellence with well-architected generative AI solutions using Amazon Bedrock This article discusses strategies for deploying generative AI at scale while maintaining operational excellence, emphasizing the importance of a well-architected approach.
Choosing a generative AI service: This AWS documentation guide helps you select the most suitable AWS generative AI services and tools based on organizational needs.
Generative AI Application Builder on AWS: This solution speeds up your AI development by incorporating your business data, comparing the performance of LLMs, running multi-step tasks through AI agents, quickly building extensible applications, and deploying them with enterprise-grade architecture.

About the authors
Thong Seng Foo is a Principal Solutions Architect at Amazon Web Services based in Singapore, specializing in public sector digital transformation and large-scale AI platform design. He advises governments across Asia-Pacific on building secure cloud foundations, digital public infrastructure, and national AI capabilities.
Kamlesh Bhatt is a Senior ProServe Architect at AWS Professional Services based in Singapore. He brings a decade of cloud and data expertise, with a strong focus on artificial intelligence, machine learning and generative Al. Specializing in building machine learning platforms and generative Al products, he helps organisations leverage the power of cloud computing and advanced Al technologies.

Focal Loss vs Binary Cross-Entropy: A Practical Guide for Imbalanced C …

Binary cross-entropy (BCE) is the default loss function for binary classification—but it breaks down badly on imbalanced datasets. The reason is subtle but important: BCE weighs mistakes from both classes equally, even when one class is extremely rare. 

Imagine two predictions: a minority-class sample with true label 1 predicted at 0.3, and a majority-class sample with true label 0 predicted at 0.7. Both produce the same BCE value: −log(0.3). But should these two errors be treated equally? In an imbalanced dataset, definitely not—the mistake on the minority sample is far more costly. 

This is exactly where Focal Loss comes in. It reduces the contribution of easy, confident predictions and amplifies the impact of difficult, minority-class examples. As a result, the model focuses less on the overwhelmingly easy majority class and more on the patterns that actually matter. Check out the FULL CODES here.

In this tutorial, we demonstrate this effect by training two identical neural networks on a dataset with a 99:1 imbalance ratio—one using BCE and the other using Focal Loss—and comparing their behavior, decision regions, and confusion matrices. Check out the FULL CODES here.

Installing the dependencies

Copy CodeCopiedUse a different Browserpip install numpy pandas matplotlib scikit-learn torch

Creating an Imbalanced Dataset

We create a synthetic binary classification dataset with a 99:1 imbalance with 6000 samples using make_classification. This ensures that almost all samples belong to the majority class, making it an ideal setup to demonstrate why BCE struggles and how Focal Loss helps. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserimport numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim

# Generate imbalanced dataset
X, y = make_classification(
n_samples=6000,
n_features=2,
n_redundant=0,
n_clusters_per_class=1,
weights=[0.99, 0.01],
class_sep=1.5,
random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)

X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)

Creating the Neural Network

We define a simple neural network with two hidden layers to keep the experiment lightweight and focused on the loss functions. This small architecture is sufficient to learn the decision boundary in our 2D dataset while clearly highlighting the differences between BCE and Focal Loss. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass SimpleNN(nn.Module):
def __init__(self):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(2, 16),
nn.ReLU(),
nn.Linear(16, 8),
nn.ReLU(),
nn.Linear(8, 1),
nn.Sigmoid()
)
def forward(self, x):
return self.layers(x)

Focal Loss Implementation

This class implements the Focal Loss function, which modifies binary cross-entropy by down-weighting easy examples and focusing the training on hard, misclassified samples. The gamma term controls how aggressively easy samples are suppressed, while alpha assigns higher weight to the minority class. Together, they help the model learn better on imbalanced datasets. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass FocalLoss(nn.Module):
def __init__(self, alpha=0.25, gamma=2):
super().__init__()
self.alpha = alpha
self.gamma = gamma

def forward(self, preds, targets):
eps = 1e-7
preds = torch.clamp(preds, eps, 1 – eps)

pt = torch.where(targets == 1, preds, 1 – preds)
loss = -self.alpha * (1 – pt) ** self.gamma * torch.log(pt)
return loss.mean()

Training the Model

We define a simple training loop that optimizes the model using the chosen loss function and evaluates accuracy on the test set. We then train two identical neural networks — one with standard BCE loss and the other with Focal Loss — allowing us to directly compare how each loss function performs on the same imbalanced dataset. The printed accuracies highlight the performance gap between BCE and Focal Loss.

Although BCE shows a very high accuracy (98%), this is misleading because the dataset is heavily imbalanced — predicting almost everything as the majority class still yields high accuracy. Focal Loss, on the other hand, improves minority-class detection, which is why its slightly higher accuracy (99%) is far more meaningful in this context. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef train(model, loss_fn, lr=0.01, epochs=30):
opt = optim.Adam(model.parameters(), lr=lr)

for _ in range(epochs):
preds = model(X_train)
loss = loss_fn(preds, y_train)
opt.zero_grad()
loss.backward()
opt.step()

with torch.no_grad():
test_preds = model(X_test)
test_acc = ((test_preds > 0.5).float() == y_test).float().mean().item()
return test_acc, test_preds.squeeze().detach().numpy()

# Models
model_bce = SimpleNN()
model_focal = SimpleNN()

acc_bce, preds_bce = train(model_bce, nn.BCELoss())
acc_focal, preds_focal = train(model_focal, FocalLoss(alpha=0.25, gamma=2))

print(“Test Accuracy (BCE):”, acc_bce)
print(“Test Accuracy (Focal Loss):”, acc_focal)

Plotting the Decision Boundary

The BCE model produces an almost flat decision boundary that predicts only the majority class, completely ignoring the minority samples. This happens because, in an imbalanced dataset, BCE is dominated by the majority-class examples and learns to classify nearly everything as that class. In contrast, the Focal Loss model shows a much more refined and meaningful decision boundary, successfully identifying more minority-class regions and capturing patterns BCE fails to learn. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef plot_decision_boundary(model, title):
# Create a grid
x_min, x_max = X[:,0].min()-1, X[:,0].max()+1
y_min, y_max = X[:,1].min()-1, X[:,1].max()+1
xx, yy = np.meshgrid(
np.linspace(x_min, x_max, 300),
np.linspace(y_min, y_max, 300)
)
grid = torch.tensor(np.c_[xx.ravel(), yy.ravel()], dtype=torch.float32)
with torch.no_grad():
Z = model(grid).reshape(xx.shape)

# Plot
plt.contourf(xx, yy, Z, levels=[0,0.5,1], alpha=0.4)
plt.scatter(X[:,0], X[:,1], c=y, cmap=’coolwarm’, s=10)
plt.title(title)
plt.show()

plot_decision_boundary(model_bce, “Decision Boundary — BCE Loss”)
plot_decision_boundary(model_focal, “Decision Boundary — Focal Loss”)

Plotting the Confusion Matrix

In the BCE model’s confusion matrix, the network correctly identifies only 1 minority-class sample, while misclassifying 27 of them as majority class. This shows that BCE collapses toward predicting almost everything as the majority class due to the imbalance. In contrast, the Focal Loss model correctly predicts 14 minority samples and reduces misclassifications from 27 down to 14. This demonstrates how Focal Loss places more emphasis on hard, minority-class examples, enabling the model to learn a decision boundary that actually captures the rare class instead of ignoring it. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserfrom sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_conf_matrix(y_true, y_pred, title):
cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=”Blues”, values_format=’d’)
plt.title(title)
plt.show()

# Convert torch tensors to numpy
y_test_np = y_test.numpy().astype(int)

preds_bce_label = (preds_bce > 0.5).astype(int)
preds_focal_label = (preds_focal > 0.5).astype(int)

plot_conf_matrix(y_test_np, preds_bce_label, “Confusion Matrix — BCE Loss”)
plot_conf_matrix(y_test_np, preds_focal_label, “Confusion Matrix — Focal Loss”)

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Focal Loss vs Binary Cross-Entropy: A Practical Guide for Imbalanced Classification appeared first on MarkTechPost.

Google DeepMind’s WeatherNext 2 Uses Functional Generative Networks …

Google DeepMind Research have introduced WeatherNext 2, an AI based medium range global weather forecasting system that now powers upgraded forecasts in Google Search, Gemini, Pixel Weather and Google Maps Platform’s Weather API, with Google Maps integration coming next. It combines a new Functional Generative Network, or FGN, architecture with a large ensemble to deliver probabilistic forecasts that are faster, more accurate and higher resolution than the previous WeatherNext system, and it is exposed as data products in Earth Engine, BigQuery and as an early access model on Vertex AI.

https://arxiv.org/pdf/2506.10772

From deterministic grids to functional ensembles

At the core of WeatherNext 2 is the FGN model. Instead of predicting a single deterministic future field, the model directly samples from the joint distribution over 15 day global weather trajectories. Each state 𝑋ₜ includes 6 atmospheric variables at 13 pressure levels and 6 surface variables on a 0.25 degree latitude longitude grid, with a 6 hour timestep. The model learns to approximate 𝑝(𝑋ₜ ∣ 𝑋ₜ₋₂:𝑡₋₁) and is run autoregressively from two initial analysis frames to generate ensemble trajectories.

Architecturally, each FGN instance follows a similar layout to the GenCast denoiser. A graph neural network encoder and decoder map between the regular grid and a latent representation defined on a spherical, 6 times refined icosahedral mesh. A graph transformer operates on the mesh nodes. The production FGN used for WeatherNext 2 is larger than GenCast, with about 180 million parameters per model seed, latent dimension 768 and 24 transformer layers, compared with 57 million parameters, latent 512 and 16 layers for GenCast. FGN also runs at a 6 hour timestep, where GenCast used 12 hour steps.

https://arxiv.org/pdf/2506.10772

Modeling epistemic and aleatoric uncertainty in function space

FGN separates epistemic and aleatoric uncertainty in a way that is practical for large scale forecasting. Epistemic uncertainty, which comes from limited data and imperfect learning, is handled by a deep ensemble of 4 independently initialized and trained models. Each model seed has the architecture described above, and the system generates an equal number of ensemble members from each seed when producing forecasts.

Aleatoric uncertainty, which represents inherent variability in the atmosphere and unresolved processes, is handled through functional perturbations. At each forecast step, the model samples a 32 dimensional Gaussian noise vector 𝜖ₜ and feeds it through parameter shared conditional normalization layers inside the network. This effectively samples a new set of weights 𝜃ₜ for that forward pass. Different 𝜖ₜ values give different but dynamically coherent forecasts for the same initial condition, so ensemble members look like distinct plausible weather outcomes, not independent noise at each grid point.

Training on marginals with CRPS, learning joint structure

A key design choice is that FGN is trained only on per location, per variable marginals, not on explicit multivariate targets. The model uses the Continuous Ranked Probability Score (CRPS) as the training loss, computed with a fair estimator on ensemble samples at each grid point and averaged over variables, levels and time. CRPS encourages sharp, well calibrated predictive distributions for each scalar quantity. During later training stages the authors introduce short autoregressive rollouts, up to 8 steps, and back-propagate through the rollout, which improves long range stability but is not strictly required for good joint behavior.

Despite using only marginal supervision, the low dimensional noise and shared functional perturbations force the model to learn realistic joint structure. With a single 32 dimensional noise vector influencing an entire global field, the easiest way to reduce CRPS everywhere is to encode physically consistent spatial and cross variable correlations along that manifold, rather than independent fluctuations. Experiments confirm that the resulting ensemble captures realistic regional aggregates and derived quantities.

Measured gains over GenCast and traditional baselines

On marginal metrics, WeatherNext 2’s FGN ensemble clearly improves over GenCast. FGN achieves better CRPS in 99.9% of cases with statistically significant gains, with an average improvement of about 6.5% and maximum gains near 18% for some variables at shorter lead times. Ensemble mean root mean squared error also improves while maintaining good spread skill relationships, indicating that ensemble spread is consistent with forecast error out to 15 days.

https://arxiv.org/pdf/2506.10772

To test joint structure, the research team evaluate CRPS after pooling over spatial windows at different scales and over derived quantities such as 10 meter wind speed and the difference in geopotential height between 300 hPa and 500 hPa. FGN improves both average pooled and max pooled CRPS relative to GenCast, showing that it better models region level aggregates and multivariate relationships, not only point wise values.

Tropical cyclone tracking is a particularly important use case. Using an external tracker, the research team compute ensemble mean track errors. FGN achieves position errors that correspond to roughly one extra day of useful predictive skill compared with GenCast. Even when constrained to a 12 hour timestep version, FGN still outperforms GenCast beyond 2 day lead times. Relative Economic Value analysis on track probability fields also favors FGN over GenCast across a range of cost loss ratios, which is crucial for decision makers planning evacuations and asset protection.

Key Takeaways

Functional Generative Network core: WeatherNext 2 is built on the Functional Generative Network, a graph transformer ensemble that predicts full 15 day global trajectories on a 0.25° grid with a 6 hour timestep, modeling 6 atmospheric variables at 13 pressure levels plus 6 surface variables.

Explicit modeling of epistemic and aleatoric uncertainty: The system combines 4 independently trained FGN seeds for epistemic uncertainty with a shared 32 dimensional noise input that perturbs network normalization layers for aleatoric uncertainty, so each sample is a dynamically coherent alternative forecast, not point wise noise.

Trained on marginals, improves joint structure: FGN is trained only on per location marginals using fair CRPS, yet still improves joint spatial and cross variable structure over the previous diffusion based WeatherNext Gen model, including lower pooled CRPS on region level aggregated fields and derived variables such as 10 meter wind speed and geopotential thickness.

Consistent accuracy gains over GenCast and WeatherNext Gen: WeatherNext 2 achieves better CRPS than the earlier GenCast based WeatherNext model on 99.9% of variable, level and lead time combinations, with average CRPS improvements around 6.5 percent, improved ensemble mean RMSE and better relative economic value for extreme event thresholds and tropical cyclone tracks.

Check out the Full Paper, Technical Details and Project Page. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Google DeepMind’s WeatherNext 2 Uses Functional Generative Networks For 8x Faster Probabilistic Weather Forecasts appeared first on MarkTechPost.

Meta AI Introduces DreamGym: A Textual Experience Synthesizer For Rein …

Reinforcement learning RL for large language model LLM agents looks attractive on paper, but in practice it breaks on cost, infrastructure and reward noise. Training an agent that clicks through web pages or completes multi step tool use can easily need tens of thousands of real interactions, each slow, brittle and hard to reset. Meta’s new framework DreamGym reframes that bottleneck as a modeling problem. Instead of running RL directly in environments such as WebShop, ALFWorld and WebArena Lite, it learns a reasoning based experience model that simulates them entirely in text.

https://arxiv.org/pdf/2511.03773

Why Real Environment RL for Agents Does Not Scale?

Current RL pipelines for agents face four coupled problems. Real rollouts are costly, task diversity is limited, reward signals are unstable and the infrastructure stack is complex. Web environments change often, rewards depend on fragile scrapers and many actions are irreversible. Reset mechanisms and episode control are also hard to implement, so long horizon tasks become noisy and sample inefficient.

Benchmarks split into two groups. WebShop and ALFWorld are RL ready but expensive, since they still need about 80 thousand real transitions to reach strong baselines with PPO or GRPO. WebArena Lite is not RL ready at all, because resets and automatic reward checks are unreliable, so online RL in the real environment is effectively infeasible.

DreamGym as a Reasoning Based Simulator

DreamGym is built around three components, a reasoning based experience model, an experience replay buffer and an adaptive curriculum task generator. Together they define a synthetic Markov decision process where the environment lives as text.

The reasoning based experience model Mexp operates in an abstract textual state space. States are compact descriptions of what matters for the task, for example cleaned page elements instead of raw HTML. On each step, the agent provides the current state, the action, the task instruction and the interaction history. The system retrieves the top k similar past transitions from the replay buffer, then uses chain of thought reasoning to produce a reasoning trace, a next state and a reward.

Conceptually, you can view Mexp as an LLM world model for web and tool tasks, but defined purely over text. It is trained with supervised fine tuning on offline trajectories, with a joint objective that learns to generate both the reasoning trace and the next state conditioned on that trace. This forces the model to encode causal structure, not just local text statistics.

https://arxiv.org/pdf/2511.03773

Replay Buffer as Grounding Memory

The experience replay buffer is initialized with offline real environment data from WebShop, ALFWorld and WebArena Lite. As DreamGym trains policies in the synthetic environment, it writes new trajectories back into that buffer. Each prediction step in Mexp uses an encoder to retrieve a small set of similar transitions from this memory and conditions on them when generating reasoning and next states.

This retrieval acts as grounding. It keeps synthetic transitions close to the empirical data distribution and reduces hallucinations in long rollouts. The research team showed that removing history or retrieval degrades consistency, informativeness and factuality of the generated states when judged by an external evaluator, and it also lowers downstream success rates on WebShop and WebArena Lite.

Curriculum from Reward Entropy

The curriculum task generator uses the same backbone as the experience model. It selects seed tasks whose outcomes under the current policy have high reward variance, which corresponds to intermediate difficulty tasks that the agent sometimes solves and sometimes fails. For each such task, the model generates variations that preserve action types but change constraints, targets or context.

The selection heuristic is based on reward entropy computed over batches of rollouts for each task. Tasks with non zero variance and balanced success and failure are preferred. Ablations show that turning off this adaptive curriculum causes both WebShop and WebArena Lite performance to drop by around 6 percentage points and leads to early plateaus as the replay buffer saturates with easy, low entropy trajectories.

https://arxiv.org/pdf/2511.03773

RL Inside DreamGym and Theoretical Guarantees

Inside DreamGym, the policy uses standard RL algorithms. The research team evaluates Proximal Policy Optimization and Group Relative Policy Optimization. Rollouts alternate between the policy choosing actions and the experience model synthesizing next states and rewards. From the point of view of the RL code, this is just another environment interface.

The research team also derive a trust region style improvement bound that links policy performance in the synthetic MDP and in the real environment. The bound contains error terms that depend on the reward prediction error and the divergence between real and synthetic transition distributions. As those errors shrink, improvement in DreamGym implies improvement in the underlying real task.

Experimental Results on WebShop, ALFWorld and WebArena Lite

DreamGym is tested with Llama-based and Qwen-based agents across WebShop, ALFWorld and WebArena Lite. Results fall into three regimes.

First, in RL ready but costly environments WebShop and ALFWorld, agents trained with PPO or GRPO inside DreamGym, using only synthetic transitions, match the performance of PPO and GRPO baselines that use about 80 thousand real environment interactions. This shows that reasoning based experience synthesis can provide enough signal for stable policy improvement.

Second, in not RL ready environments such as WebArena Lite, DreamGym enables RL training that would otherwise be impractical. The framework achieves more than 30 percent improvement in success rate over all baselines, including supervised fine tuning and direct behavior cloning.

Third, in sim to real transfer, the DreamGym-S2R configuration first trains a policy entirely in the synthetic environment and then fine tunes it with a small number of real rollouts. This setting yields more than 40 percent additional gain compared with training from scratch in the real environment, while using less than 10 percent of the real data and cutting total training cost to roughly between one third and one fifth of the baselines.

https://arxiv.org/pdf/2511.03773

Key Takeaways

DreamGym replaces fragile real environment rollouts with a reasoning based experience model that operates in an abstract textual state space, predicting next state and reward from history, task and retrieved similar transitions.

The framework combines 3 components, a reasoning experience model, an experience replay buffer seeded with real trajectories, and a curriculum task generator that selects and varies tasks using a reward entropy heuristic, which together stabilize and diversify RL training.

In WebShop and ALFWorld, which are RL ready but expensive, agents trained with PPO or GRPO entirely inside DreamGym using synthetic interactions match the performance of PPO and GRPO baselines that use about 80,000 real environment transitions.

In WebArena Lite, which is not RL ready, DreamGym enables online RL and achieves more than 30 percent higher success rate than all non RL baselines including supervised fine tuning and behavior cloning.

In the sim to real configuration, policies pretrained in DreamGym and then fine tuned with a small number of real rollouts achieve more than 40 percent additional improvement while using less than 10 percent of the real interaction budget and reducing total training cost to around one third to one fifth of standard RL.

Editorial Comments

DreamGym is an important step toward practical reinforcement learning for LLM agents because it reframes the environment as a reasoning based experience model, grounded by an experience replay buffer and a reward entropy driven curriculum, rather than as a fragile browser stack. The reported gains on WebArena Lite, WebShop and ALFWorld with PPO and GRPO suggest that synthetic experience plus Sim to Real adaptation can become a standard pattern for agent training at scale. Overall, DreamGym makes the experience model, not the policy, the main lever for scaling RL agents.

Check out the Full Paper. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Meta AI Introduces DreamGym: A Textual Experience Synthesizer For Reinforcement learning RL Agents appeared first on MarkTechPost.

Your complete guide to Amazon Quick Suite at AWS re:Invent 2025

What if you could answer complex business questions in minutes instead of weeks, automate workflows without writing code, and empower every employee with enterprise AI—all while maintaining security and governance? That’s the power of Amazon Quick Suite, and at AWS re:Invent 2025, we are showcasing how organizations are making it a reality. Launched in October 2025, Quick Suite is a new agentic teammate that quickly answers your questions at work and turns those insights into actions for you.
This December in Las Vegas, Quick Suite takes center stage with an impressive lineup of sessions designed to help you reimagine how work gets done. These sessions include breakthrough customer stories and hands-on workshops on how to harness the power of AI agents, research, automation and unified BI.
This year, re:Invent will be held in Las Vegas, Nevada, from December 1 to December 5, 2025, and this guide will help you navigate our comprehensive session catalog and plan your week. The sessions cater to business and technology leaders, product and engineering teams, and data and analytics teams interested in incorporating agentic AI capabilities across their teams and organization.
Explore the session catalog and learn more. Register today to reserve a seat for our sessions!
Keynote sessions
KEY001 – Opening Keynote with AWS CEO Matt Garman
Tuesday, Dec 2 | 8:00 AM – 10:30 AM PST | Venetian | Level 2 | Venetian Ballroom F

Join AWS CEO Matt Garman to hear how AWS is innovating across every aspect of the world’s leading cloud. He explores how we are reinventing foundational building blocks as well as developing brand new experiences, all to empower customers and partners with what they need to build a better future.
KEY002 – The Future of Agentic AI is Here with Swami Sivasubramanian, Vice President of Agentic AI
Wednesday, Dec 3 | 8:30 AM – 10:30 AM PST | Venetian | Level 2 | Venetian Ballroom F

Join Dr. Swami Sivasubramanian, Vice President of Agentic AI, to learn how Agentic AI is poised to transform the way we live and work. In this keynote, you will hear about the tools and services you can use to build, deploy, and run secure, reliable, and scalable agents on AWS. We will also dive deep into the engineering innovations that power your agentic systems and give you a glimpse of the future.
Innovation talk
INV203: The agent-enabled workplace: Transforming businesses with AI
Monday, Dec 1 | 12:00 PM – 1:00 PM PST | Venetian | Level 5 | Palazzo Ballroom B
Discover how organizations are transforming their businesses by truly making AI part of the team. Learn three key ways companies are putting AI to work today: revolutionizing business processes, reinventing the way individuals work and teams collaborate, and transforming customer experiences. We also explore how the future workplace will evolve as AI becomes an integral team member. Through real customer examples, see how users can work with an agentic teammate like Amazon Quick Suite to get the right answers to every question across all their data and transform answers into actions, and how Amazon Connect is creating customer experiences that make every interaction personal, effortless, and memorable. You will also learn how Amazon uses these technologies in our own business. Gain practical insights to deliver real business value with AI while maintaining enterprise-grade security and trust. Join us to learn how AWS is helping organizations transform their business with effective AI collaboration.
Exclusive Executive Event
Amazon Quick Suite: Driving business growth and productivity with Data & AI
Wednesday, December 3 | 12:00 PM – 5:00 PM | Renaissance Las Vegas
Don’t miss this intimate executive event featuring customer panels, global partner insights and live Quick Suite demonstrations. Designed exclusively for C-level executives and senior decision-makers, this event offers strategic roundtables, one-on-one consultations with product leaders, and networking opportunities you won’t find anywhere else at re:Invent. Space is limited to ensure meaningful engagement. Register now to secure your spot – confirmed registrations only.
Breakout sessions
BIZ202: Reimagine work with Amazon Quick Suite
Monday, Dec 1 | 10:00 AM – 11:00 AM PST | Venetian | Level 3 | Lido 3106
Amazon Quick Suite is an agentic teammate for business users that quickly answers their questions at work and turns those insights into actions. Join this session to hear compelling customer stories and discover how organizations are transforming workplace productivity with AI agents for automation, research, and business intelligence in a unified experience. Learn more about how Quick Suite reduces application and context switching, breaks down data silos, delivers comprehensive insights, and accelerates decision-making and taking action—all while maintaining enterprise-grade security.
BIZ203: Amazon’s journey deploying Quick Suite across thousands of users
Wednesday, Dec 3 | 1:30 PM – 2:30 PM PST | MGM | Level 3 | Chairman’s 364
Go behind the scenes of Amazon’s internal Quick Suite deployment across multiple organizations and thousands of employees. This session covers the challenges of implementing enterprise AI at scale, including data integration complexities, orchestration layer design, and overcoming organizational silos. Learn from Amazon teams about deployment strategies, change management approaches, security considerations, and lessons learned from rolling out Quick Suite across diverse business units. Discover practical frameworks for enterprise-wide AI adoption and hear real stories of transformation challenges and solutions that organizations can apply.
BIZ223: Research agents in action: From complex business challenge to trusted insights
Wednesday, Dec 3 | 1:00 PM – 2:00 PM PST | Wynn | Convention Promenade | Latour 2
What if your most challenging research tasks could be completed in minutes instead of weeks? That’s the power of Amazon Quick Research. Join us, along with Principal Financial Group, to see how Quick Research breaks down complex topics, pulling from your organization’s internal knowledge, web data, and premium third-party datasets to deliver comprehensive, source-verified insights. Explore diverse use cases—from market intelligence to risk assessments—and learn about the journey Principal took towards smarter research and decision-making.
BIZ208: Enhance SaaS Applications with Quick Suite Agentic Capabilities
Thursday, Dec 4 | 4:00 PM – 5:00 PM PST | MGM | Level 3 | Chairman’s 360
Learn how Amazon Quick Suite agentic AI capabilities increase customer engagement and application value by 50%. Hear from a customer speaker who uses ISV application integrated with conversational AI and agentic AI capabilities while maintaining multi-tenant security and performance. Explore embedding patterns, API integration strategies, and agents and actions communication for SaaS applications. Discover implementation approaches that add intelligent workplace productivity features without disrupting existing user workflows or application architectures.

Sessions
Date and Venue

BIZ228: Reimagine business intelligence with Amazon Quick Sight
Monday, Dec 11:30 PM – 2:30 PM PST Wynn | Convention Promenade | Lafite 7 | Content Hub | Orange Theater

BIZ331: Build Robust Data Foundations to power Enterprise AI and BI
Monday, Dec 110:00 AM – 11:00 AM PST Wynn | Upper Convention Promenade | Bollinger

BIZ224: Automate any business process using Amazon Quick Suite
Thursday, Dec 411:00 AM – 12:00 PM PST Wynn | Convention Promenade | Lafite 7 | Content Hub | Pink Theater

BIZ207: Democratize access to insights with Amazon Quick Suite
Tuesday, Dec 211:30 AM – 12:30 PM PST Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Turquoise Theater

BIZ227: Generate new revenue streams with Amazon Quick Sight embedded
Thursday, Dec 41:00 PM – 2:00 PM PST MGM | Level 1 | Grand 122

BIZ225: Deploy Quick Suite at scale with confidence and control
Monday, Dec 14:30 PM – 5:30 PM PST Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Orange Theater

Chalk talks
BIZ323: Design AI-powered BI architectures for modern enterprises with Amazon Quick Suite
Monday, Dec 1 | 11:30 AM – 12:30 PM PST | Wynn | Convention Promenade | Montrachet 1
AI transforms how organizations collect, analyze, and derive insights from data in business intelligence environments. Join this chalk talk to explore the technical details of architectural frameworks and methodologies for developing next-generation BI systems with Amazon Quick Sight, the BI capability of Amazon Quick Suite. Dive deep into how machine learning, natural language processing, and automated analytics integration can revolutionize traditional BI architectures. Discuss implementation challenges including data quality requirements and enterprise readiness considerations for AI-powered BI solutions. Share experiences and learn best practices for maximizing business value and operational efficiency in your AI-powered BI initiatives using Quick Sight.
BIZ319: Beyond chatbots: Discover conversational AI in Amazon Quick Suite
Monday, Dec 1 | 3:00 PM – 4:00 PM PST | MGM | Level 3 | Premier 320
Join our interactive chalk talk to explore conversational AI capabilities in Quick Suite. Discover how to use natural language queries to get answers and visualizations from all your data—including metrics from databases and data warehouses, documents, emails, and knowledge bases. We will diagram advanced chat workflows, exploring knowledge gathering, context management, and agent integrations. Learn to handle complex scenarios like multi-turn conversations and context switching. Together, we will tackle real-world challenges in designing efficient flows and implementing productivity tools, as well as discover strategies for scaling AI conversations while maintaining quality standards. Bring your questions to this collaborative and interactive session.

Sessions
Date and Venue

BIZ327: Bridge data silos to unlock complete insights with Amazon Quick Suite
Tuesday, Dec 22:30 PM – 3:30 PM PST Mandalay Bay | Level 3 South | South Seas C

BIZ326: Agentic workflow architectures with Amazon Quick Flows
Wednesday, Dec 31:00 PM – 2:00 PM PST Wynn | Convention Promenade | Montrachet 1

BIZ405: Building agentic research solutions you can trust with Amazon Quick Research
Wednesday, Dec 32:30 PM – 3:30 PM PST Wynn | Convention Promenade | Lafite 1

BIZ325: Build multi-tenant ISV applications with Quick Suite and Quick Index
Tuesday, Dec 211:30 AM – 12:30 PM PST Wynn | Convention Promenade | Montrachet 1

BIZ329: Design patterns for embedded and agentic analytics with Quick Suite
Monday, Dec 15:30 PM – 6:30 PM PST Wynn | Convention Promenade | Montrachet 1

BIZ328: Implement enterprise governance for Amazon Quick Suite
Thursday, Dec 42:00 PM – 3:00 PM PST Wynn | Convention Promenade | Montrachet 1

BIZ406: Operationalize Amazon Quick Suite deployments at scale
Thursday, Dec 411:00 AM – 12:00 PM PST Mandalay Bay | Level 3 South | South Seas C

Workshops
BIZ402: Use agents to transform complex business processes with Amazon Quick Automate
Thursday, Dec 4 | 3:30 PM – 5:30 PM PST | Caesars Forum | Level 1 | Academy 413
Transform your manual document workflows into agentic automations in this hands-on workshop using Amazon Quick Automate, a capability of Amazon Quick Suite. We will transform a manual claims processing use case into an intelligent, adaptive automation. In this hands-on workshop, build end-to-end automations that combine document extraction, data validation, and business rules processing by using specialized AI agents. Learn how Quick Automate can implement smart exception handling while maintaining human oversight for critical decisions. This workshop is ideal for organizations modernizing document-intensive operations. All attendees must bring a laptop to participate.
BIZ306: Create Agentic AI Chat Experiences with Quick Suite
Monday, Dec 1 | 8:30 AM – 10:30 AM PST | Wynn | Upper Convention Promenade | Cristal 3
Wednesday, Dec 3 | 8:30 AM – 10.30 AM PST | Wynn | Mouton 2
Build comprehensive conversational AI solutions using chat agents and spaces in Amazon Quick Suite. Practice implementing multi-turn conversations that provide contextual, intelligent responses. Customize your chat agent’s behavior through simple steps that support enterprise readiness. Learn to create flows that implement repetitive tasks into an agentic workflow. Dive into deep research capabilities, knowledge integration, and user experience optimization of Quick Suite for enterprise deployment.

Sessions
Schedule

BIZ204: Experience AI-powered BI with Amazon Quick Suite
Tuesday, Dec 2 3:00 PM – 5:00 PM PST Wynn | Upper Convention Promenade | Crystal 1 Wednesday, Dec 3 8.30 AM – 10:30 AM PST Caesars Forum | Alliance 308

BIZ322: Customize your Application with Amazon Quick Suite APIs
Thursday, Dec 4 12:00 PM – 2:00 PM PST Wynn | Upper Convention Promenade | Cristal 1

BIZ315: Configure security and governance controls for Amazon Quick Suite
Wednesday, Dec 3 1:00 PM – 3:00 PM PST Venetian | Level 3 | Lido 3001A

Builder session
BIZ401: Build agentic automations for business processes with Amazon Quick Automate
Wednesday, Dec 3 | 10:00 AM – 11:00 AM PST | Wynn | Convention Promenade | Latour 7
In this session, learn how to build an enterprise-grade automation using Amazon Quick Automate, a capability of Amazon Quick Suite. Through a financial services example, explore how specialized AI agents work together to handle complex interactions across webpages and business applications. You will create a production-ready automation featuring custom agents that leverage knowledge and tools to transform a merchant onboarding process. Using Quick Automate’s chat-based authoring and visual studio, you will configure a workflow with multiple agents, integrate with multiple tools, test and debug the workflow, and then deploy it using robust enterprise controls. Walk away knowing how to develop agentic automations for real-world use cases in under an hour.
Schedule

Register today to reserve a seat!
Resources

Learn more: AWS re:Invent 2025
AWS re:Invent 2025 catalog—Register to book your seat!
Know more about Amazon Quick Suite
Explore the Amazon Quick Suite Community

About the authors
Pelak Desai is a Product Marketing Manager for Amazon Quick Suite. She comes with over 12 years of experience in marketing and business.
Srikanth Baheti is a Senior Manager for Amazon Quick Sight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences & eResearch Technology Inc, where he was responsible for designing and developing high traffic web applications and highly scalable and maintainable data pipelines for reporting platforms using AWS services and serverless computing.

Accelerate enterprise solutions with agentic AI-powered consulting: In …

AWS Professional Services set out to help organizations accelerate their cloud adoption with expert guidance and proven methodologies. Today, we’re at a pivotal moment in consulting. Just as cloud computing transformed how enterprises build technology, agentic AI is transforming how consulting services deliver value. We believe in a future where intelligent agents work alongside expert consultants to compress development timelines, elevate solution quality, and enable organizations to achieve their digital transformation goals faster. Making this vision real requires a fundamental reimagining of the traditional consulting model. Drawing on our experience delivering enterprise solutions at scale, I’m excited to announce AWS Professional Services now offers specialized AI agents including the AWS Professional Services Delivery Agent. This represents a transformation to the consulting experience that embeds intelligent agents throughout the consulting life cycle to deliver better value for customers.
An agent-first consulting approach
The AWS Professional Services (AWS ProServe) new approach to agentic AI fundamentally changes what’s possible with consulting. By combining our deep expertise with specialized AI agents, we’re delivering enterprise solutions faster while maintaining the rigorous quality and security standards our customers expect. Agents empower our consultants to focus on what matters most—understanding their customer’s unique business challenges, providing strategic guidance, and driving meaningful outcomes, while agents handle implementation details with consistency and speed.
We have already started transforming customer engagements through agents, demonstrating tangible impact across industries. Whether you’re building next-generation AI applications, migrating critical workloads to the cloud, or modernizing existing systems, these agents compress timelines from months to weeks—or weeks to days—without compromising on quality.
A comprehensive agent system across the consulting cycle

Traditional consulting models struggle to balance speed, quality, and cost. A system of specialized agents embodying AWS institutional knowledge and proven methodologies help to solve this challenge.
AI agents that accelerate every stage: At the heart of the agent system is the AWS Professional Services Delivery Agent, an AI-powered technical expert that serves as your primary interface for technical engagements. The Delivery Agent analyzes your requirements, builds AI applications directly, and orchestrates specialized work by delegating migration and modernization tasks to purpose-built agents such as the custom agent built on AWS Transform, an AWS agentic AI service for enterprise migration and modernization workloads. Before delivery even begins, a sales agent streamlines proposal generation and statement of work creation, compressing what traditionally takes weeks into hours. Throughout every engagement, embedded capabilities ensure solutions meet enterprise-grade security and compliance standards.
From requirements to deployment in record time: Consider a typical generative AI application development project. Traditionally, building a customer service agent to help representatives quickly access policy information requires 6-8 weeks with a full consulting team gathering requirements, designing architecture, developing code, and deploying the solution. The Delivery Agent ingests your requirements—whether detailed documentation, architecture diagrams, or even meeting notes—and within hours produces comprehensive design specifications and implementation plans aligned with AWS best practices. The agent then generates code, automates testing, and prepares deployment packages while your AWS ProServe consultant provides strategic oversight and ensures alignment with your business context.
Migration and modernization at scale: For migration projects, incorporating agents demonstrates even more dramatic acceleration. Imagine a healthcare provider migrating 500+ applications to AWS—traditionally a 12+ month undertaking requiring extensive discovery and planning. We launched AWS Transform in May to help customers accelerate their cloud transformation journeys. Building on AWS Transform and leveraging its composable capability, we have built a custom agent tailored to how AWS ProServe delivers projects. This agent incorporates a knowledge base of learnings from thousands of migrations AWS ProServe has completed and automation capabilities to accelerate project delivery. The Delivery Agent analyzes the statement of work and project artifacts and engages the custom agent for migration, which handles wave planning, dependency mapping, workload scheduling, and runbook generation automatically. Your AWS ProServe consultant maintains strategic oversight while agents compress the timeline to just a few months, all while maintaining rigorous security and compliance standards.
Built on enterprise-grade AI infrastructure: The agent system leverages the same technologies we offer customers, including Amazon Bedrock AgentCore, AWS Transform, and advanced development tools like Kiro and Amazon Q Developer CLI. This helps ensure that every engagement benefits from industry-leading security through isolated computing environments, comprehensive observability for full transparency, and the scalability to handle engagements of any size.
Human expertise meets AI acceleration
What truly differentiates AWS ProServe agents is how it combines the value of human expertise with the speed and consistency of AI agents. AWS ProServe consultants remain integral to every engagement including understanding your business context, providing strategic guidance, making critical decisions, and building lasting relationships. The agents amplify their impact by handling implementation details, code generation, testing, and deployment with proven AWS methodologies embedded directly into their operations.
This human-AI collaboration delivers customer value through:

Unprecedented speed: Reduce project timelines achieving in days what traditionally required months
Consistent excellence: Every solution incorporates AWS best practices, architectural patterns, and the Well-Architected Framework
Lower total costs: Streamlined delivery and accelerated time-to-value translate directly to better ROI

Unlike general-purpose AI tools, the agents embody AWS specialized knowledge, including decades of experience informed by thousands of prior engagements, and proven methodologies. It draws from the vast AWS institutional knowledge base and has been specifically designed for enterprise-grade solution delivery. Further backed by AWS ProServe’s consulting expertise to ensure every solution meets your unique business requirements.
Making business transformation real with agents
Organizations across industries are already experiencing results by partnering with AWS ProServe agents, from rapid AI application development to accelerated cloud migrations. The National Football League (NFL) faced a challenge familiar to many organizations, building agents to serve millions of fantasy football fans while maintaining both speed and reliability. Working with the AWS Professional Services team, they used the delivery agent and were able to deploy a production quality prototype that seamlessly integrate NextGen Stats, Player News, weather data, and both proprietary and public NFL information to generate personalized fantasy football recommendation in just a few days.
“Building an AI agent that serves thousands of fantasy football fans requires both speed and reliability. The AWS Professional Services Delivery Agent helped us achieve both – we went from zero to production in 8 weeks while maintaining the quality standards NFL fans expect. The framework automated routine development tasks, freeing our team to focus on performance optimization and delivering unique insights powered by NFL’s proprietary data,” says Mike Band, Senior Manager, Research & Analytics, Next Gen Stats, NFL.
The transformation extends beyond customer outcomes to how AWS ProServe delivers consulting services. “Our goal with AWS Transform has always been to enable better customer outcomes through transformative new approaches to migration,” says Asa Kalavade, Vice President of AWS Transform. “AWS Professional Services’ custom agent, built on AWS Transform’s composable foundation, demonstrates this vision perfectly. It delivers customized workflows tailored to how AWS ProServe works directly in customer accounts, with goal-based, interactive agents that personalize each migration. Whether orchestrating large VMware migrations or handling dynamic wave planning for enterprises migrating thousands of VMs, these agents adapt to each customer’s unique context. This is the future of migration—faster, more personalized, and delivering outcomes that traditional approaches simply couldn’t achieve.”
This represents the future of professional services: AI-augmented consulting that delivers results without sacrificing the strategic guidance and partnership that complex enterprise initiatives require.
Reimagining the future of consulting with agentic AI
This new agentic-powered consulting approach is a demonstration of what becomes possible when you apply cutting-edge AI technologies to transform your own operations. While many organizations talk about what AI might do someday, AWS ProServe shows what AI can deliver for enterprises today. Customers can experience the new agent-powered consulting model by engaging with AWS ProServe and AWS Professional Services Partners today. Contact your AWS account team or visit the AWS Professional Services webpage to discover how AWS can accelerate your digital transformation.

About the author
Francessca Vasquez is the Vice President of Professional Services and Agentic AI for Amazon Web Services (AWS). She leads AWS’s global consulting services, overseeing customer engagements across public sector, commercial, and partner businesses worldwide. Francessca drives co-innovation and delivery of emerging technologies including Generative AI, Quantum Computing, and Application Modernization. Her team connects AWS AI and ML experts with customers globally to design and launch cutting-edge generative AI solutions. As Executive Sponsor for the AWS Global CIO Council and AWS Partner Collective, she strengthens strategic partnerships that help organizations accelerate their digital transformation and unlock the full potential of cloud and AI technologies.

Amazon Bedrock AgentCore and Claude: Transforming business with agenti …

The enterprise AI conversation has fundamentally shifted. We’re no longer asking “Can AI understand language?” but rather “Can AI autonomously execute complex business processes that drive real value?” According to McKinsey research, agentic AI has the potential to generate $450 billion to $650 billion in additional annual revenue by 2030, representing a 5 to 10 percent revenue increase across industries.
The window for competitive advantage is narrowing. While your competitors experiment with AI pilots, the organizations that move agentic AI into production are capturing measurable gains today. Yet here’s the paradox we keep seeing: enterprises build impressive prototypes that never scale. The gap isn’t in model capabilities, but rather in the operational infrastructure required to deploy agents that can work autonomously for hours, integrate securely with enterprise systems, and maintain reliability at scale. The figure below outlines the various challenges that organizations may face taking their agents to production.

But some organizations have already crossed this divide. They’re running AI agents in production right now, handling real business processes, serving thousands of customers, and delivering results that seemed impossible just months ago. Let’s start with what they’ve achieved.
What’s possible today: Production results from leading organizations
Cox Automotive and Druva are both putting Amazon Bedrock AgentCore and Claude to work across their organizations.
Cox Automotive: Accelerating enterprise-scale agentic AI deployment
As the world’s largest automotive services and technology company, Cox Automotive has a wide breadth of products and services that touch almost all aspects of the automotive industry and a vehicle’s lifecycle. Agentic AI holds the promise to connect solutions and help consumers, dealers, automakers, and other automotive stakeholders to help execute workflows in more automated, scalable, and even personalized ways. AI agents can fundamentally transform every touchpoint in automotive, from how consumers search and purchase vehicles to how dealers manage service operations and inventory. This is happening in production right now at Cox Automotive. Cox Automotive has shifted from “Data-First, AI-Enabled” to “AI-First, Data Differentiated.” Cox Automotive is using Anthropic’s Claude model and Amazon Bedrock AgentCore as one of their critical capabilities for agentic AI solution deployment at scale with 17 major proofs of concept deployed in production and seven industry-transformational solutions currently in development.

“At Cox Automotive, we’re transforming our customer experience with generative and agentic AI. We are working with all frontier model providers but have anchored on Claude for its strong performance across three critical metrics: latency, cost, and accuracy. Amazon Bedrock AgentCore is one of the strategic tools we’re using to build AI agents that can deploy at scale, ranging from virtual assistants that improve our omnichannel dealer experience to an agentic marketplace that streamlines vehicle discovery and buying. AgentCore’s key capabilities – runtime for secured deployments, observability for monitoring, identity for authentication, and enterprise grade primitives are enabling our teams to develop and test these agents efficiently as we scale AI across the enterprise.” – Marianne Johnson, EVP & Chief Product Officer, Cox Automotive

Druva: Up to 63% autonomous resolution with up to 58% faster response times
Druva’s customers faced an escalating challenge in cybersecurity: staying ahead of evolving data anomalies across complex infrastructure. Manual threat investigation meant navigating multiple dashboards, logs, and alerts. In security, missing threat signals can lead to catastrophic consequences—but the volume of potential signals makes comprehensive manual review impossible.
Consider the scale: over 7,500 customers, each with their own infrastructure patterns, threat landscapes, and security requirements. The challenge was building an AI solution that could operate reliably and securely at this scale.
Druva partnered with the AWS Generative AI Innovation Center to build DruAI, a multi-agent system powered by Claude on Amazon Bedrock AgentCore. The system uses multiple AI agents that work together to automatically choose the right tools from hundreds of options, handling telemetry analysis, threat investigation, and remediation. AgentCore Runtime provides a more secure, isolated execution environment with automated scaling, allowing Druva’s team to focus on delivering customer value rather than building and maintaining complex security infrastructure.
The impact: Over 3,000 customers and 10,000 users now deploy DruAI, resulting in up to 58% faster time-to-resolution and solving up to 63% of customer issues without human intervention. In cybersecurity, speed is the difference between contained threats and business-impacting breaches.

“Our customers at Druva needed to transform their manual threat investigation processes, which involved navigating multiple dashboards, logs, and alerts. Using AgentCore’s Runtime, we rapidly deployed DruAI, our suite of AI capabilities for customers, with complete session isolation and automated scaling – enabling us to focus on delivering value to customers rather than building and maintaining complex security infrastructure. Our system handles telemetry analysis, threat investigation and remediation, and is already being used by over 3,000 customers and 10,000 users. DruAI delivers 58% faster time-to-resolution, solving 63% of customer issues without human intervention.” – David Gildea, VP of Product, AI, Druva

These results raise an obvious question: How did organizations achieve production deployments that deliver measurable business value? The answer lies in combining two critical elements that work better together than either could alone.
Why Amazon Bedrock AgentCore and Claude by Anthropic

Agentic AI in production requires two things: frontier AI capabilities that can handle complex, autonomous workflows, and enterprise-grade infrastructure that provides the security, reliability, and operational foundation those agents need to run in production. Amazon Bedrock AgentCore and Claude provide this combination. AgentCore has multiple fully-managed services that can be used together or independently as part of Amazon Bedrock AgentCore: Runtime, Memory, Identity, Gateway, Code Interpreter, Browser Tool, and Observability.
Agent intelligence and logic: Focus on what matters
When enterprises build agentic AI, engineering teams usually spend months building infrastructure like session management, credential vaults, tool orchestration, observability frameworks, and scaling logic. By the time they’re ready to focus on the actual agent logic and business value, they’re exhausted and the use case may have evolved.Amazon Bedrock AgentCore is a comprehensive agentic platform to build, deploy and operate highly capable agents at scale. It’s model-agnostic, which means it handles the infrastructure and operational challenges so your developers can concentrate on what differentiates your business: the agent’s logic and the specific tasks it needs to perform. Claude’s high performance and contextual understanding are maximized by this approach.

AgentCore works with frameworks your team already knows like Strands Agents, CrewAI, LangGraph, LlamaIndex. You can also use it with any foundation model, whether hosted on Amazon Bedrock or elsewhere. This removes the traditional tradeoff between open source flexibility and enterprise-grade reliability.
Enterprise-grade security and reliability built in
Although optimized for agentic AI workflows, Claude alone doesn’t provide the production infrastructure that complex agents require. That’s where Amazon Bedrock AgentCore comes in. AgentCore provides complete session isolation to make sure each execution is fully contained, secure credential vaults help protect sensitive tokens, and identity-aware authorization controls exactly what agents can access. Agents can work autonomously for up to eight hours with automatic scaling, delivering the reliability that business processes demand.
Enhanced agent capabilities
AgentCore provides built-in tools that extend what Claude-powered agents can accomplish. Code Interpreter offers secure code execution for data processing and analysis, while Browser enables agents to interact with web applications, navigate pages, extract data, and execute transactions.
But the real multiplier is AgentCore Gateway: it transforms your existing REST APIs and AWS Lambda functions into agent-ready tools with semantic routing. Your agents can interact with your existing business systems, databases, and services without rebuilding everything for AI. The gateway handles dual-sided security and intelligent tool selection, so as you scale to hundreds or thousands of tools, agents can still find and use the right ones.

Together, these elements create something neither could achieve alone: AI agents with frontier intelligence, enterprise-grade reliability, and the operational foundation to deliver business value in production—not in six months after you build infrastructure, but now. The previous figure shows the benefits of AgentCore Gateway.
The technology behind these results
Let’s explore the technology foundation that makes these results possible, without getting lost in implementation details.
Infrastructure that scales production workloads
Amazon Bedrock AgentCore is purpose-built infrastructure for production agentic AI. Think of it as the operational foundation that transforms capable AI models into usable business systems. Rather than spending months on undifferentiated heavy lifting or building production ready agents from scratch, it’s available as a managed agentic platform.

The AgentCore Runtime and AgentCore Identity services provide more secure, serverless execution where agents work autonomously for up to eight hours with complete session isolation. Identity management integrates with your existing providers—Okta, Microsoft Entra, or Amazon Cognito—handling OAuth, token management, and comprehensive audit trails that can help align with the most stringent compliance requirements, including those trusted by AWS GovCloud (US) customers. The Gateway transforms REST APIs and Lambda functions into agent-compatible tools with intelligent semantic routing, while AgentCore Memory is straightforward for developers to use to build context-aware agents by minimizing complex memory infrastructure, so that agents can maintain context across conversations and build knowledge bases over time.
Observability delivers complete visibility through CloudWatch with OpenTelemetry compatibility for systems like Dynatrace, Datadog, Arize Phoenix, LangSmith, and Langfuse. You can track what agents are doing, monitor performance, identify errors, and maintain the operational visibility that production systems demand. AgentCore services support VPC, AWS PrivateLink, CloudFormation, and resource tagging for enhanced enterprise security.
Claude’s intelligence that handles complex, long-running tasks
While infrastructure enables deployment, model capabilities determine what agents can accomplish. Claude Sonnet 4.5 is Anthropic’s best performing model for agentic AI use cases, with capabilities specifically designed for autonomous, long-running workflows.
Claude Sonnet 4.5 can work independently for extended periods while maintaining clarity and focus. The model makes steady progress on tasks rather than attempting everything simultaneously, providing fact-based updates that accurately reflect accomplishments. This capability is critical for complex workflows that require sustained attention and incremental progress over hours.
The model tracks token usage throughout conversations and maintains awareness of its working context. This helps prevent remature task abandonment and enables more effective execution on long-running operations. Combined with memory capabilities that enable storage and retrieval of information outside the immediate context window, agents can maintain state across sessions and build knowledge bases over time.
Built with Anthropic’s Constitutional AI method, Claude is designed to be helpful, harmless, and honest. Extensive safety training has substantially reduced concerning behaviors including sycophancy, deception, and power-seeking. This alignment foundation is particularly important for enterprise deployments where agent reliability and appropriate behavior are non-negotiable requirements. When agents operate autonomously for hours, trust is fundamental.
Claude Sonnet 4.5 achieves state-of-the-art performance on coding and reasoning tasks, with enhanced planning and system design capabilities. The model excels at autonomous tasks that span hours or days while maintaining consistent performance. Beyond coding, Claude demonstrates advanced reasoning capabilities for financial analysis, research workflows, and cybersecurity applications which enable sophisticated agent applications across multiple enterprise use cases.
Strategic implications for enterprise leaders
The decisions you make about agentic AI infrastructure are about establishing the foundation for your multi-year AI roadmap. Take these into consideration:
System choice as competitive positioning
Your competitors are evaluating the same opportunities. The organizations that establish production agentic AI first can capture advantages that compound over time: operational efficiencies that can reduce costs while improving service, capabilities that were previously impossible becoming standard practice, and the organizational learning that comes from real-world deployment.
AI is transforming your industry. Will you be leading that transformation or reacting to it?
Velocity of innovation: Automatic capability improvements
Claude Sonnet 4.5 was released just seven weeks after Claude Opus 4.1. That velocity of model improvement is now the baseline expectation. The system you choose determines whether you benefit from these advances automatically or face migration projects every time capabilities improve.
Organizations building on Amazon Bedrock gain access to new model capabilities as they become available without having to re-engineer, spin up migration projects, and without technical debt. Your agents become more capable over time, and your team stays focused on business value rather than system maintenance.
The expanding capabilities of AgentCore follow similar trajectories. Recent additions include enhanced Agent-to-Agent (A2A) protocol support for multi-agent coordination, expanded observability integrations, and new tools like Browser and Code Interpreter. These capabilities become available to your agents as they launch, future-proofing your investments while maintaining backward compatibility.
The multi-agent future: Coordination and specialization
As individual agents prove value in your organization, the next frontier involves coordinated multi-agent systems where specialized agents collaborate on complex business challenges. Amazon Bedrock supports multi-agent collaboration through the A2A protocol, enabling sophisticated patterns:
Specialized agent teams where you deploy focused agents, each excelling at specific domains like financial analysis, code review, customer interaction, security monitoring, working together under intelligent orchestration.
Supervisor agents that break down complex workflows into manageable sub-tasks, delegate to appropriate specialist agents, and synthesize results into coherent outcomes.
Organizations like Druva are already running multi-agent systems in production, and the architectural patterns are becoming established. The infrastructure foundation you choose will determine how smoothly you can evolve to these sophisticated deployments tomorrow.
Risk mitigation: Security, governance, and compliance
Enterprise deployments require security and governance built into the foundation. AgentCore provides complete audit trails for compliance, fine-grained authorization that scales with your agent environment, and session isolation that help contain potential issues. Constitutional AI in Claude Sonnet 4.5 helps provide an additional reliability layer: when agents operate autonomously, you need confidence they’ll behave appropriately and align with your instructions.
Evaluating agentic AI for your enterprise
If you’re a technical leader or architect exploring agentic AI for your organization, here’s a practical framework for evaluation and adoption.
Start with high-value use cases
The most successful early deployments share common characteristics. Look for workflows that are:

Repetitive yet require judgment: Tasks your team does regularly that follow patterns but need decision-making, not just automation
Multi-system integration opportunities: Processes that involve pulling data from multiple sources, making decisions, and taking actions across different systems
24/7 availability benefits: Workflows where autonomous operation outside business hours provides real value
Clear, measurable success metrics: Use cases where you can quantify impact—time saved, accuracy improved, costs reduced, capacity increased

What are the equivalent opportunities in your business?
Move from evaluation to production decisively
The evaluation process should be measured in weeks, not months:
Week 1-2: Review case studies and assess relevance to your context. Identify 1-2 pilot workflows with defined success criteria. Reach out to your AWS account team to discuss using Claude with Amazon Bedrock AgentCore for help assessing technical fit and business value potential.
Week 3-4: Prototype with production infrastructure from day one. Leverage AgentCore so you’re not building throwaway infrastructure. Your learnings and code can transfer directly to production.
Week 5-8: Run your pilot and measure against your success criteria. With production infrastructure already in place, this is about validating business value, not rebuilding for scale.
Week 9+: Scale based on proven results. The AgentCore infrastructure scales automatically, so moving from pilot to production is about expanding scope, not re-engineering foundations.
This timeline is achievable because you’re not building infrastructure from scratch. Your AWS account team can connect you with resources, technical guidance, and examples from organizations like Cox Automotive and Druva who’ve already walked this path.
Conclusion: The agentic enterprise is being built today
Agentic AI represents a fundamental shift in how enterprises put AI to work, moving from tools that assist to systems that act autonomously. The technical requirements for production deployment are substantial, but the combination of Amazon Bedrock AgentCore and Claude Sonnet 4.5 makes this transformation accessible.
The infrastructure exists. Organizations are already running agents in production with measurable business impact. The question for enterprise leaders is no longer “Is agentic AI ready?” but rather “How quickly can we capture this advantage?”
Organizations that master agentic AI are improving operational efficiency and reimagining what’s possible in their industries. The agentic enterprise of the future is being built now by teams that combine the right model capabilities with the right operational infrastructure.
Ready to explore what’s possible for your organization? Reach out to your AWS account team to get started with Claude in Amazon Bedrock AgentCore. They can help you assess use cases, design your pilot, and accelerate your path to production agentic AI.
The foundation is ready. The models are proven. The path forward is clear.

About the authors
Jawhny Cooke is a Senior Anthropic Specialist Solutions Architect for Generative AI at AWS. He specializes in integrating and deploying Anthropic models on AWS infrastructure. He partners with customers and AI providers to implement production-grade generative AI solutions through Amazon Bedrock, offering expert guidance on architecture design and system implementation to maximize the potential of these advanced models.
Brad Abrams is Head of Product for the Claude Developer Platform at Anthropic, where he leads API product development and works on building tools that help developers create powerful AI agents. Prior to Anthropic, Brad spent significant time at Google, where he was recognized as one of the most influential technologists in the voice assistant landscape. He also held roles at Microsoft, bringing deep expertise in developer tools and platform ecosystems. Brad holds a Bachelor of Science in Computer Science from North Carolina State University. Throughout his career, he has focused on developer experience, distributed systems, and software product management. Based in Palo Alto, he continues to drive innovation at the intersection of AI capabilities and developer tooling.

Google DeepMind Introduces SIMA 2, A Gemini Powered Generalist Agent F …

Google DeepMind has released SIMA 2 to test how far generalist embodied agents can go inside complex 3D game worlds. SIMA’s (Scalable Instructable Multiworld Agent) new version upgrades the original instruction follower into a Gemini driven system that reasons about goals, explains its plans, and improves from self play in many different environments.

From SIMA 1 to SIMA 2

The first SIMA, released in 2024, learned more than 600 language following skills such as ‘turn left’, ‘climb the ladder’, and ‘open the map’. It controlled commercial games only from rendered pixels and a virtual keyboard and mouse, without any access to game internals. On complex tasks, DeepMind reported a SIMA 1 success rate of about 31 percent, while human players reached about 71 percent on the same benchmark.

SIMA 2 keeps the same embodied interface but replaces the core policy with a Gemini model. According to a TechCrunch article that the system uses Gemini 2.5 Flash Lite as the reasoning engine. This changes SIMA from a direct mapping between pixels and actions into an agent that forms an internal plan, reasons in language, and then executes the necessary action sequence in the game. DeepMind describes this as moving from an instruction follower to an interactive gaming companion that collaborates with the player.

https://deepmind.google/blog/sima-2-an-agent-that-plays-reasons-and-learns-with-you-in-virtual-3d-worlds/

Architecture, Gemini in the control loop

The SIMA 2 architecture integrates Gemini as the agent core. The model receives visual observations and user instructions, infers a high level goal, and produces actions that are sent through the virtual keyboard and mouse interface. Training uses a mix of human demonstration videos with language labels and labels generated by Gemini itself. This supervision lets the agent align its internal reasoning with both human intent and model generated descriptions of behavior.

Because of this training scheme, SIMA 2 can explain what it intends to do and list the steps it will take. In practice, this means the agent can answer questions about its current objective, justify its decisions, and expose an interpretable chain of thought about the environment.

Generalization and performance

The task completion plot shows SIMA 1 at about 31% and SIMA 2 at 62% that value on the main evaluation suite, with humans around the 70% range. Integrating Gemini doubles the performance of the original agent on complex tasks. The important point is not the exact number, it is the shape, the new agent closes most of the measured gap between SIMA 1 and human players on long, language specified missions in the training games.

On held out games such as ASKA and MineDojo, which are never seen during training, the DeepMind team show a similar pattern. SIMA 2 has much higher task completion than SIMA 1 in these environments, which indicates a real gain in zero shot generalization rather than overfitting to a fixed game set. The agent also transfers abstract concepts, for example it can reuse an understanding of ‘mining’ in one title when it is asked to ‘harvest’ in another.

Multimodal instructions

SIMA 2 extends the instruction channel beyond plain text. The DeepMind demonstrations show the agent following spoken commands, reacting to sketches drawn on the screen, and executing tasks from prompts that use only emojis. In one example, the user asks SIMA 2 to go to ‘the house that is the color of a ripe tomato’. The Gemini core reasons that ripe tomatoes are red, then selects and walks to the red house.

Gemini also enables instruction following in multiple natural languages and supports mixed prompts where language and visual cues are combined. For physical AI, robotics devs, this is a concrete multimodal stack, a shared representation links text, audio, images, and in game actions, and the agent uses this representation to ground abstract symbols in concrete control sequences.

Self improvement at scale

One of the main research contributions in SIMA 2 is the explicit self improvement loop. After an initial phase that uses human gameplay as a baseline, the team moves the agent into new games and lets it learn only from its own experience. A separate Gemini model generates new tasks for the agent in each world, and a reward model scores each attempt.

These trajectories are stored in a bank of self generated data. Later generations of SIMA 2 use this data during training, which allows the agent to succeed on tasks where earlier generations failed, without any fresh human demonstrations. This is a concrete example of a multitask, model in the loop data engine, where a language model specifies goals and gives feedback, and the agent converts that feedback into new competent policies.

Genie 3 worlds

To push generalization further, DeepMind combines SIMA 2 with Genie 3, a world model that generates interactive 3D environments from a single image or text prompt. In these virtual worlds, the agent has to orient itself, parse instructions, and act toward goals even though the geometry and assets differ from all training games.

The reported behavior is that SIMA 2 can navigate these Genie 3 scenes, identify objects such as benches and trees, and perform requested actions in a coherent way. This is important for researchers, it shows that a single agent can operate across commercial titles and generated environments, using the same reasoning core and control interface.

Key Takeaways

Gemini centered architecture: SIMA 2 integrates Gemini, reported as Gemini 2.5 Flash Lite, as the core reasoning and planning module, wrapped by a visuomotor control stack that acts from pixels through a virtual keyboard and mouse across many commercial games.

Measured performance jump over SIMA 1: On DeepMind’s main task suite, SIMA 2 roughly doubles SIMA 1’s 31 percent task completion rate and approaches human level performance in training games, while also delivering significantly higher success rates on held out environments such as ASKA and MineDojo.

Multimodal, compositional instruction following: The agent can follow long, compositional instructions and supports multimodal prompts, including speech, sketches, and emojis, by grounding language and symbols in a shared representation over visual observations and in game actions.

Self improvement via model generated tasks and rewards: SIMA 2 uses a Gemini based teacher to generate tasks and a learned reward model to score trajectories, building a growing experience bank that allows later generations of the agent to outperform earlier ones without additional human demonstrations.

Stress testing with Genie 3 and implications for robotics: Coupling SIMA 2 with Genie 3, which synthesizes interactive 3D environments from images or text, shows that the agent can transfer skills to newly generated worlds, supporting DeepMind’s claim that this stack is a concrete step toward general purpose embodied agents and, eventually, more capable real world robots.

Editorial Comments

SIMA 2 is a meaningful systems milestone rather than a simple benchmark win. By embedding a trimmed Gemini 2.5 Flash lite model at the core, DeepMind team demonstrates a practical recipe that joins multimodal perception, language based planning, and a Gemini orchestrated self improving loop, validated both in commercial games and Genie 3 generated environments. Overall, SIMA 2 shows how an embodied Gemini stack can act as a realistic precursor for general purpose robotic agents.

Check out the Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Google DeepMind Introduces SIMA 2, A Gemini Powered Generalist Agent For Complex 3D Virtual Worlds appeared first on MarkTechPost.

AI Interview Series #2: Explain Some of the Common Model Context Proto …

In this part of the Interview Series, we’ll look at some of the common security vulnerabilities in the Model Context Protocol (MCP) — a framework designed to let LLMs safely interact with external tools and data sources. While MCP brings structure and transparency to how models access context, it also introduces new security risks if not properly managed. In this article, we’ll explore three key threats — MCP Tool Poisoning, Rug Pulls, and Tool Hijacking Attacks

Tool Poisoning

A Tool Poisoning Attack happens when an attacker inserts hidden malicious instructions inside an MCP tool’s metadata or description.

Users only see a clean, simplified tool description in the UI.

LLMs, however, see the full tool definition — including hidden prompts, backdoor commands, or manipulated instructions.

This mismatch allows attackers to silently influence the AI into harmful or unauthorized actions.

Tool Hijacking

A Tool Hijacking Attack happens when you connect multiple MCP servers to the same client, and one of them is malicious. The malicious server injects hidden instructions inside its own tool descriptions that try to redirect, override, or manipulate the behavior of tools provided by a trusted server.

In this case, Server B pretends to offer a harmless add() tool, but its hidden instructions try to hijack the email_sender tool exposed by Server A.

MCP Rug Pulls

An MCP Rug Pull happens when a server changes its tool definitions after the user has already approved them. It’s similar to installing a trusted app that later updates itself into malware — the client believes the tool is safe, but its behavior has silently changed behind the scenes.

Because users rarely re-review tool specs, this attack is extremely hard to detect.

AI Interview Series #1: Explain Some LLM Text Generation Strategies Used in LLMs

The post AI Interview Series #2: Explain Some of the Common Model Context Protocol (MCP) Security Vulnerabilities appeared first on MarkTechPost.

Comparing the Top 4 Agentic AI Browsers in 2025: Atlas vs Copilot Mode …

Agentic AI browsers are moving the model from ‘answering about the web’ to operating on the web. In 2025, four AI browsers define this space: OpenAI’s ChatGPT Atlas, Microsoft Edge with Copilot Mode, The Browser Company’s Dia, and Perplexity’s Comet. Each makes different design choices around autonomy, memory, and privacy. This article compares their architectures, capabilities, and risk profiles so various type of users can decide which browser aligns with their workflows.

What are Agentic Browsers?

Agentic browsers are not just ‘chat over a page’. They expose the browser’s DOM (Document Object Model), tab graph, and history to an AI model and allow it to:

Read and reason over multiple tabs

Maintain task context across time

Take actions such as navigating, filling forms, and completing workflows

OpenAI ChatGPT Atlas, Microsoft Edge Copilot Mode, The Browser Company’s Dia, and Perplexity’s Comet all do this, but with different tradeoffs in autonomy, memory, and security.

High-level comparison

Atlas is the most fully agentic: deep ChatGPT integration, rich browser control, strong but complex memory and privacy story.

Copilot Mode is an incremental but significant extension to Edge: unified Copilot, cross-tab reasoning, early ‘Actions’ for automation, still conservative compared with Atlas and Comet.

Dia is an AI-first browser built on Chromium, optimized for reading, writing, and structured workflows with privacy-first defaults and intentionally limited autonomy.

Comet is a highly agentic personal assistant browser with deep workflow automation, a local-data narrative, and currently the most aggressive legal and security risk profile.

The rest of the article unpacks these differences in a more technical way.

1. ChatGPT Atlas (OpenAI): AI-native browser with full agent mode

1.1 Architecture

Atlas is a dedicated AI browser built around ChatGPT rather than a standard Chromium shell with an extension. It runs on Chromium but wraps it in OpenAI’s OWL process architecture, which separates the rendering engine from the Atlas application and agent layer.

Key characteristics:

macOS only at launch, with Windows, iOS, and Android ‘coming soon’.

ChatGPT is exposed everywhere: omnibox, main panel, and a ChatGPT sidebar that can see the current page and tabs.

This gives Atlas a first-class API into:

Current tab DOM and visible content

Tab list and navigation history

User queries and previous conversation state

1.2 Agent mode: real browser control

Agent Mode is the key differentiator. For Plus / Pro / Business users, Atlas can execute multi-step workflows:

Open and close tabs, follow links, and switch sites

Fill out forms and online applications

Book reservations such as hotels and restaurants

Compare products across multiple sites and return structured summaries

Constraints:

Agent mode cannot access local files or the OS, and cannot download or execute local programs. It is sandboxed inside the browser.

Actions require explicit user consent; Atlas surfaces prompts like ‘Should I start clicking and filling these forms’ before executing workflows.

1.3 Memory and privacy

Atlas introduces browser memories:

It stores filtered summaries of visited pages and inferred user intent, not full page captures. Summaries are retained for about 30 days, enabling queries like ‘reopen the reports I read yesterday’ or ‘continue the Athens itinerary plan’.

Memories are opt-in and can be viewed, edited, or deleted. Memory can be disabled globally or on specific sites, and Atlas supports incognito.

OpenAI also added parental controls that let guardians disable both browser memories and agent mode for child accounts.

Critical points:

Atlas still needs to transmit page snippets and metadata to OpenAI’s servers for summarization, which means sensitive content can be exposed if protections fail.

Security researchers have already demonstrated prompt-injection attacks that exploit Atlas’s omnibox and agent context, confirming that highly agentic browsing increases the attack surface.

1.4 Pricing and fit

Atlas is free to install for ChatGPT users on macOS.

Agent Mode is only available on paid ChatGPT tiers (Plus, Pro, Business, Enterprise).

Fit:

Best for users who want maximum in-browser automation and are comfortable with cloud-centric data handling and a still-evolving security posture.

2. Copilot Mode in Microsoft Edge: tab-reasoning with controlled autonomy

2.1 Architecture

Copilot Mode is Microsoft’s AI layer inside Edge, not a separate browser. It exposes:

A unified Copilot box on new tabs for chat, search, and navigation

Deep integration with Edge context (open tabs, history, and some browser settings) when users opt in.

Microsoft also ties Copilot Mode into:

Journeys: topic-centric clusters over browsing history, which Copilot can summarize and re-open.

Copilot Actions: an early agentic layer capable of actions like clearing cache, unsubscribing from mailing lists, and booking reservations in preview.

2.2 Agentic behavior

Compared with Atlas:

Copilot Mode can reason across multiple tabs, summarize and compare them, and assist with structured tasks like trip planning or multi-site research.

Actions Preview extends this into partially agentic flows, such as booking a restaurant or filling forms, but current evaluations show inconsistent reliability and occasional ‘hallucinated’ completions of tasks that were not successfully executed.

Crucially, Copilot Mode remains more constrained than Atlas or Comet:

It does not expose an openly programmable DOM-level agent with free cursor control

Action templates are narrower and guarded, particularly for email and account-sensitive operations

2.3 Data, privacy, and enterprise posture

Edge with Copilot Mode is clearly aimed at enterprise adoption:

Copilot access to tab and history data is explicitly permissioned; users can disable history-based personalization, Copilot context, and Copilot Mode entirely.

Microsoft integrates Prompt Shields and Azure AI safety layers to mitigate prompt injection and jailbreak attempts.

Fit:

Appropriate where organizations want AI-assisted browsing and cross-tab reasoning while keeping automation scoped and more auditable than a fully agentic browser.

3. Dia (The Browser Company): AI-first, Chromium-based, privacy-forward

3.1 Architecture and UX

Dia is The Browser Company’s AI-centric successor to Arc, built on Chromium and currently available on macOS only.

Core design choices:

The canonical interaction is ‘chat with your tabs‘: Dia’s assistant can read open tabs, referenced tabs, and selections, and answer questions or transform content in place.

Dia includes a Skills system, where users define reusable prompt ‘scripts’ and workflows for tasks like note-taking or research templates.

Dia’s UX is optimized for:

Reading and understanding long-form content

Writing and editing in-page

Learning workflows (tutoring, flashcards, argument comparison)

3.2 Memory and ‘local-first’ privacy

Dia’s main differentiation is its privacy posture:

Browsing history, chats, bookmarks, and saved content are stored locally and encrypted, with data sent to servers only when required to answer a specific query.

The Memory feature stores summaries and learned preferences, but users can disable memory entirely in settings or control what contexts are shared.

The net effect is an AI browser that tries to behave more like a local knowledge layer with scoped cloud calls rather than a continuous telemetry stream.

3.3 Agentic scope and constraints

Dia is intentionally less agentic than Atlas or Comet:

The assistant can read and summarize pages, transform text, generate content, and run Skills over the current tab set.

Current public builds do not expose a general DOM automation agent capable of open-ended clicking and form submission across arbitrary sites.

In practice, Dia behaves as a high-context copilot rather than a fully autonomous web operator. This is aligned with the company’s positioning and with Atlassian’s stated intent after acquiring The Browser Company, which emphasizes individual knowledge worker workflows over transactional automation.

3.4 Pricing and availability

Dia now ships to all Mac users, no invite required, as of October 2025.

Free tier: Core AI chat, Skills, and Memory, with usage limits.

Dia Pro at $20/month unlocks effectively unlimited AI chat usage within terms of use.

Fit:

Strong for educational and writing-heavy workflows, for users who want AI-augmented browsing without handing an agent broad control over the web session.

4. Comet (Perplexity): highly agentic assistant browser with heavy risk surface

4.1 Architecture and capabilities

Comet is Perplexity’s AI browser built on Chromium, positioned as a personal AI assistant and ‘thinking partner‘ rather than a simple search UI.

The Comet Assistant can:

Summarize and explore any page

Execute multi-step workflows for research, coding, meeting prep, and e-commerce

Manage email and calendar via integrated connectors

Handle complex tasks like comparing products, reading reviews, and moving all the way to checkout.

Recent updates extend the agent to work longer and across larger jobs, emphasizing persistent, agentic behavior over many tabs and time periods.

4.2 Data model and privacy claims

Perplexity’s Comet Privacy Notice and product pages claim:

Browsing data, cookies, and saved credentials are stored locally on the device by default.

Users can delete browsing data and stored credentials from Comet settings, and manage cookie behavior.

Integration with 1Password keeps vaults end-to-end encrypted and opaque to Perplexity.

So the official architecture is a hybrid: local browser state with selective context uploads to Comet’s servers and Perplexity’s search models.

However, multiple independent reviews argue that despite these controls, the combination of: Deep integration with third-party services (Gmail, calendar, financial accounts) and high agent autonomy over those services produces a large effective privacy risk envelope, especially for corporate data.

4.3 Security incidents and legal pressure

Comet currently has the most visible security and legal issues among the four:

Indirect prompt-injection / ‘CometJacking‘: LayerX and other researchers showed that malicious URLs and embedded prompts could hijack Comet’s assistant, exfiltrating data from connected services and even performing fraudulent actions.

Although Perplexity has patched specific vulnerabilities, security audits from Brave, Guardio, and others still recommend extreme caution for sensitive workloads.

Amazon lawsuit: Amazon is suing Perplexity over Comet’s ‘agentic shopping’ behavior, alleging that automated shopping sessions accessed customer accounts and impersonated human browsing, violating platform rules and harming personalization systems.

4.4 Pricing and availability

As of October–November 2025, Comet is free to download globally; earlier Max-only and Pro-only restrictions have been removed.

Perplexity monetizes via Pro / Max subscriptions for higher model tiers and via Comet Plus (~$5 / month), which grants access to curated news and publisher content and is bundled into Pro / Max.

Fit:

Very strong for users who want maximum automation across research, communications, and purchases, and who are comfortable operating at the bleeding edge of the security and platform-policy risk curve.

Comparison Table

DimensionChatGPT Atlas (OpenAI)Edge + Copilot Mode (Microsoft)Dia (The Browser Company)Comet (Perplexity)Engine / platformChromium-based; Atlas shell with OWL architecture; macOS now, Windows / mobile planned Edge (Chromium) on Windows and macOS with optional Copilot Mode Chromium-based AI browser; macOS only, GA, no invite; Windows not yet announced Chromium-based browser with integrated Perplexity search and assistant; desktop global, mobile rolling outAgentic autonomyHigh: Agent Mode can click, navigate, fill forms, book reservations, and chain multi-step workflows inside the browser Medium: cross-tab reasoning and Actions; can perform some transactional steps but with limited scope and reliabilityLow–medium: chat, Skills, and memory over tabs; no general agent that freely manipulates arbitrary sites; autonomy intentionally constrained High: Comet Assistant executes long-running workflows across browsing, email, calendar, and e-commerce, including end-to-end shopping and planning flows Memory / personalizationBrowser memories retain summarized context for ~30 days; persistent task context across sessions, opt-in and user-controllableJourneys over history, context sharing for Copilot is opt-in; personalization tied to Microsoft account and privacy controls Local encrypted storage of history, chats, bookmarks; Dia Memory for personalization with ability to limit shared contextLocal-first browsing data plus cloud-side models; settings allow deleting local data and tuning collectionBest-fit use casesComplex research, automation-heavy workflows, and agent experiments where strong autonomy outweighs riskEveryday browsing with AI summaries and research assistance in Microsoft-centric environmentsLearning, writing, and planning where privacy and structured Skills are more important than full automationPower users who want a personal operator for browsing, communication, and shopping, and who will actively manage security and policy risk

Which browser to choose in 2025?

Pick Atlas when you want to explore the frontier of in-browser agents. It offers the richest action surface and memory model, at the cost of greater complexity in safety and compliance design.

Pick Edge + Copilot Mode when you need incremental AI assistance in a browser that already fits Microsoft-centric enterprise governance, and you prefer scoped agents over unconstrained ones.

Pick Dia when your primary workload is reading, learning, and writing, and you want strong local-first guarantees and explicit control over what information the model sees, with minimal automation.

Pick Comet only if you explicitly want a high-autonomy personal operator in your browser and are willing to track security advisories and platform policies closely.

References:

OpenAI – Introducing ChatGPT Atlashttps://openai.com/index/introducing-chatgpt-atlas/

OpenAI – How we built OWL, the new architecture behind our browserhttps://openai.com/index/building-chatgpt-atlas/

Microsoft – AI browser innovation with Copilot Mode in Edgehttps://www.microsoft.com/en-us/microsoft-copilot/for-individuals/do-more-with-ai/ai-for-daily-life/ai-browser-innovation-with-copilot-in-edge

Microsoft – Copilot Mode | Microsoft Edgehttps://www.microsoft.com/en-us/edge/copilot-mode

Dia Browser – Official sitehttps://www.diabrowser.com/

Dia Browser – Skills Galleryhttps://www.diabrowser.com/skills

9to5Mac – Dia, The Browser Company’s AI-powered browser, is now generally available on macOShttps://9to5mac.com/2025/10/08/dia-the-browser-companys-ai-powered-browser-is-now-generally-available-on-macos/

Perplexity – Comet Browser: a Personal AI Assistanthttps://www.perplexity.ai/comet/

1Password – Secure credentials on Comet with 1Passwordhttps://1password.com/partners/perplexity

Reuters – Amazon sues Perplexity over “agentic” shopping toolhttps://www.reuters.com/business/retail-consumer/perplexity-receives-legal-threat-amazon-over-agentic-ai-shopping-tool-2025-11-04/

The post Comparing the Top 4 Agentic AI Browsers in 2025: Atlas vs Copilot Mode vs Dia vs Comet appeared first on MarkTechPost.

Cerebras Releases MiniMax-M2-REAP-162B-A10B: A Memory Efficient Versio …

Cerebras has released MiniMax-M2-REAP-162B-A10B, a compressed Sparse Mixture-of-Experts (SMoE) Causal Language Model derived from MiniMax-M2, using the new Router weighted Expert Activation Pruning (REAP) method. The model keeps the behavior of the original 230B total, 10B active MiniMax M2, while pruning experts and reducing memory for deployment focused workloads such as coding agents and tool calling.

Architecture and core specifications

MiniMax-M2-REAP-162B-A10B has these key properties:

Base model: MiniMax-M2

Compression method: REAP, Router weighted Expert Activation Pruning

Total parameters: 162B

Active parameters per token: 10B

Layers: 62 transformer blocks

Attention heads per layer: 48

Experts: 180 experts, obtained by pruning a 256 expert configuration

Activated experts per token: 8

Context length: 196,608 tokens

License: modified MIT, derived from MiniMaxAI MiniMax M2

The SMoE design means that the model stores 162B parameters, but each token only routes through a small set of experts, so the effective compute cost per token is similar to a 10B dense model. MiniMax M2 itself is positioned as an MoE model built for coding and agentic workflows, with 230B total parameters and 10B active, which this checkpoint inherits.

How REAP compresses MiniMax-M2?

MiniMax-M2-REAP-162B-A10B is created by applying REAP uniformly across all MoE blocks of MiniMax M2, at a 30 percent expert pruning rate.

The REAP method defines a saliency score for each expert that combines:

Router gate values: How often and how strongly the router selects that expert

Expert activation norms: The magnitude of the expert output when active

Experts that contribute minimally to the layer output, under this combined criterion, are removed. The remaining experts keep their original weights and the router keeps separate gates for each of them. This is one shot compression, there is no extra fine tuning after pruning in the method definition.

A core theoretical result in the REAP’s research paper is that expert merging with summed gates causes functional subspace collapse. When experts are merged, the router loses its independent, input dependent control over those experts, so a single merged expert must approximate an input dependent mixture that was originally expressed through multiple experts. The research team proves that, whenever the router policy depends on the input and the experts are not identical, this introduces irreducible error. In contrast, pruning removes some experts but preserves independent control of the survivors, so the error scales with the gate weight of the removed experts.

Across a set of SMoE models in the 20B to 1T parameter range, REAP consistently outperforms expert merging and other pruning criteria on generative benchmarks such as code generation, mathematical reasoning and tool calling, especially at 50 percent compression.

Accuracy under 30 percent expert pruning

The MiniMax-M2-REAP-162B-A10B model gets compared on three checkpoints on standard coding, reasoning and agentic benchmarks:

MiniMax-M2 (230B, base model)

MiniMax-M2-REAP-172B-A10B, 25 percent pruning

MiniMax-M2-REAP-162B-A10B, 30 percent pruning

https://huggingface.co/cerebras/MiniMax-M2-REAP-162B-A10B

On coding benchmarks such as HumanEval, HumanEval Plus, MBPP and MBPP Plus, the 162B REAP model stays very close to the base model. HumanEval sits around 90% range, and MBPP stays in the 80% range, with the 172B and 162B models essentially tracking the original MiniMax-M2 within a few points.

On reasoning benchmarks such as AIME 25 and MATH 500, there are small shifts between the three models, but there is no collapse at 30 percent pruning and the 162B checkpoint remains competitive with the base model.

On tool calling and agentic evaluation, represented by τ2 bench in a telecom setting, the 162B REAP model again matches the base model within small variance. The model card explicitly states that this checkpoint keeps almost identical performance while being about 30 percent lighter in parameter count.

These results line up with the broader REAP study, which reports near lossless compression for code generation and tool calling on several large SMoE architectures when pruning experts using the REAP criterion.

Deployment, memory usage and observed throughput

Cerebras provides a direct vLLM serve example and positions MiniMax-M2-REAP-162B-A10B as a drop in model for the existing MiniMax M2 integration.

Copy CodeCopiedUse a different Browservllm serve cerebras/MiniMax-M2-REAP-162B-A10B
–tensor-parallel-size 8
–tool-call-parser minimax_m2
–reasoning-parser minimax_m2_append_think
–trust-remote-code
–enable_expert_parallel
–enable-auto-tool-choice

If the run hits memory limits, the card recommends lowering –max-num-seqs, for example to 64, to keep batch size in check on a given GPU.

Key Takeaways

SMoE architecture with efficient compute: MiniMax-M2-REAP-162B-A10B is a Sparse Mixture of Experts model with 162B total parameters and 10B active parameters per token, so the compute cost per token is close to a 10B dense model while keeping frontier scale capacity.

REAP expert pruning keeps behavior of MiniMax-M2: The model is produced by applying REAP Router weighted Expert Activation Pruning to MiniMax-M2 at roughly 30 percent expert pruning, pruning experts based on router gate values and expert activation norms while leaving surviving experts and router structure intact.

Near lossless accuracy at 30 percent compression: On coding benchmarks such as HumanEval and MBPP, and on reasoning benchmarks such as AIME25 and MATH 500, the 162B REAP variant tracks the 230B MiniMax-M2 and a 172B REAP variant within a few points, showing near lossless compression for code, reasoning and tool use.

Pruning outperforms expert merging for generative SMoE: The REAP study shows that pruning experts using a saliency criterion avoids the functional subspace collapse seen with expert merging in generative tasks, and performs better across large SMoE models in the 22B to about 1T parameter range.

Comparison Table

Image source: Marktechpost.com

Editorial Comments

Cerebras’ release of MiniMax-M2-REAP-162B-A10B is a strong signal that Router weighted Expert Activation Pruning is ready for real workloads, not just as a research curiosity. The checkpoint shows that a 30 percent expert pruning schedule can keep MiniMax-M2 230B-A10B behavior almost intact while cutting memory and preserving long context coding, reasoning and tool calling performance, which is exactly what SMoE researchers need for practical deployment. Overall, Cerebras is quietly turning expert pruning into production infrastructure for frontier class SMoE models.

Check out the Model Weights. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Cerebras Releases MiniMax-M2-REAP-162B-A10B: A Memory Efficient Version of MiniMax-M2 for Long Context Coding Agents appeared first on MarkTechPost.

MBZUAI Researchers Introduce PAN: A General World Model For Interactab …

Most text to video models generate a single clip from a prompt and then stop. They do not keep an internal world state that persists as actions arrive over time. PAN, a new model from MBZUAI’s Institute of Foundation Models, is designed to fill that gap by acting as a general world model that predicts future world states as video, conditioned on history and natural language actions.

https://arxiv.org/pdf/2511.09057

From video generator to interactive world simulator

PAN is defined as a general, interactable, long horizon world model. It maintains an internal latent state that represents the current world, then updates that state when it receives a natural language action such as ‘turn left and speed up’ or ‘move the robot arm to the red block.’ The model then decodes the updated state into a short video segment that shows the consequence of that action. This cycle repeats, so the same world state evolves across many steps.

This design allows PAN to support open domain, action conditioned simulation. It can roll out counterfactual futures for different action sequences. An external agent can query PAN as a simulator, compare predicted futures, and choose actions based on those predictions.

GLP architecture, separating what happens from how it looks

The base of PAN is the Generative Latent Prediction, GLP, architecture. GLP separates world dynamics from visual rendering. First, a vision encoder maps images or video frames into a latent world state. Second, an autoregressive latent dynamics backbone based on a large language model predicts the next latent state, conditioned on history and the current action. Third, a video diffusion decoder reconstructs the corresponding video segment from that latent state.

In PAN, the vision encoder and backbone are built on Qwen2.5-VL-7B-Instruct. The vision tower tokenizes frames into patches and produces structured embeddings. The language backbone runs over a history of world states and actions, plus learned query tokens, and outputs the latent representation of the next world state. These latents live in the shared multimodal space of the VLM, which helps ground the dynamics in both text and vision.

The video diffusion decoder is adapted from Wan2.1-T2V-14B, a diffusion transformer for high fidelity video generation. The research team trains this decoder with a flow matching objective, using one thousand denoising steps and a Rectified Flow formulation. The decoder conditions on both the predicted latent world state and the current natural language action, with a dedicated cross attention stream for the world state and another for the action text.

https://arxiv.org/pdf/2511.09057

Causal Swin DPM and sliding window diffusion

Naively chaining single shot video models by conditioning only on the last frame leads to local discontinuities and rapid quality degradation over long rollouts. PAN addresses this with Causal Swin DPM, which augments the Shift Window Denoising Process Model with chunk wise causal attention.

The decoder operates on a sliding temporal window that holds two chunks of video frames at different noise levels. During denoising, one chunk moves from high noise to clean frames and then leaves the window. A new noisy chunk enters at the other end. Chunk wise causal attention ensures that the later chunk can only attend to the earlier one, not to unseen future actions. This keeps transitions between chunks smooth and reduces error accumulation over long horizons.

PAN also adds controlled noise to the conditioning frame, rather than using a perfectly sharp frame. This suppresses incidental pixel details that do not matter for dynamics and encourages the model to focus on stable structure such as objects and layout.

https://arxiv.org/pdf/2511.09057

Training stack and data construction

PAN is trained in two stages. In the first stage, the research team adapts Wan2.1 T2V 14B into the Causal Swin DPM architecture. They train the decoder in BFloat16 with AdamW, a cosine schedule, gradient clipping, FlashAttention3 and FlexAttention kernels, and a hybrid sharded data parallel scheme across 960 NVIDIA H200 GPUs.

In the second stage, they integrate the frozen Qwen2.5 VL 7B Instruct backbone with the video diffusion decoder under the GLP objective. The vision language model remains frozen. The model learns query embeddings and the decoder so that predicted latents and reconstructed videos stay consistent. This joint training also uses sequence parallelism and Ulysses style attention sharding to handle long context sequences. Early stopping ends training after 1 epoch once validation converges, even though the schedule allows 5 epochs.

Training data comes from widely used publicly accessible video sources that cover everyday activities, human object interactions, natural environments, and multi agent scenarios. Long form videos are segmented into coherent clips using shot boundary detection. A filtering pipeline removes static or overly dynamic clips, low aesthetic quality, heavy text overlays, and screen recordings using rule based metrics, pretrained detectors, and a custom VLM filter. The research team then re-captions clips with dense, temporally grounded descriptions that emphasize motion and causal events.

Benchmarks, action fidelity, long horizon stability, planning

The research team evaluates the model along three axes, action simulation fidelity, long horizon forecast, and simulative reasoning and planning, against both open source and commercial video generators and world models. Baselines include WAN 2.1 and 2.2, Cosmos 1 and 2, V JEPA 2, and commercial systems such as KLING, MiniMax Hailuo, and Gen 3.

For action simulation fidelity, a VLM based judge scores how well the model executes language specified actions while maintaining a stable background. PAN reaches 70.3% accuracy on agent simulation and 47% on environment simulation, for an overall score of 58.6%. It achieves the highest fidelity among open source models and surpasses most commercial baselines.

For long horizon forecast, the research team measures Transition Smoothness and Simulation Consistency. Transition Smoothness uses optical flow acceleration to quantify how smooth motion is across action boundaries. Simulation Consistency uses metrics inspired by WorldScore to monitor degradation over extended sequences. PAN scores 53.6% on Transition Smoothness and 64.1% on Simulation Consistency and exceeds all baselines, including KLING and MiniMax, on these metrics.

For simulative reasoning and planning, PAN is used as an internal simulator inside an OpenAI-o3 based agent loop. In step wise simulation, PAN achieves 56.1% accuracy, the best among open source world models.

https://arxiv.org/pdf/2511.09057

Key Takwaways

PAN implements the Generative Latent Prediction architecture, combining a Qwen2.5-VL-7B based latent dynamics backbone with a Wan2.1-T2V-14B based video diffusion decoder, to unify latent world reasoning and realistic video generation.

The Causal Swin DPM mechanism introduces a sliding window, chunk wise causal denoising process that conditions on partially noised past chunks, which stabilizes long horizon video rollouts and reduces temporal drift compared to naive last frame conditioning.

PAN is trained in two stages, first adapting the Wan2.1 decoder to Causal Swin DPM on 960 NVIDIA H200 GPUs with a flow matching objective, then jointly training the GLP stack with a frozen Qwen2.5-VL backbone and learned query embeddings plus decoder.

The training corpus consists of large scale video action pairs from diverse domains, processed with segmentation, filtering, and dense temporal recaptioning, enabling PAN to learn action conditioned, long range dynamics instead of isolated short clips.

PAN achieves state of the art open source results on action simulation fidelity, long horizon forecasting, and simulative planning, with reported scores such as 70.3% agent simulation, 47% environment simulation, 53.6% transition smoothness, and 64.1% simulation consistency, while remaining competitive with leading commercial systems.

Comparison Table

DimensionPANCosmos video2world WFMWan2.1 T2V 14BV JEPA 2OrganizationMBZUAI Institute of Foundation ModelsNVIDIA ResearchWan AI and Open LaboratoryMeta AIPrimary roleGeneral world model for interactive, long horizon world simulation with natural language actionsWorld foundation model platform for Physical AI with video to world generation for control and navigationHigh quality text to video and image to video generator for general content creation and editingSelf supervised video model for understanding, prediction and planning tasksWorld model framingExplicit GLP world model, latent state, action, and next observation defined, focuses on simulative reasoning and planningDescribed as world foundation model that generates future video worlds from past video and control prompt, aimed at Physical AI, robotics, driving, navigationFramed as video generation model, not primarily as world model, no persistent internal world state described in docsJoint embedding predictive architecture for video, focuses on latent prediction rather than explicit generative supervision in observation spaceCore architectureGLP stack, vision encoder from Qwen2.5 VL 7B, LLM based latent dynamics backbone, video diffusion decoder with Causal Swin DPMFamily of diffusion based and autoregressive world models, with video2world generation, plus diffusion decoder and prompt upsampler based on a language modelSpatio temporal variational autoencoder and diffusion transformer T2V model at 14 billion parameters, supports multiple generative tasks and resolutionsJEPA style encoder plus predictor architecture that matches latent representations of consecutive video observationsBackbone and latent spaceMultimodal latent space from Qwen2.5 VL 7B, used both for encoding observations and for autoregressive latent prediction under actionsToken based video2world model with text prompt conditioning and optional diffusion decoder for refinement, latent space details depend on model variantLatent space from VAE plus diffusion transformer, driven mainly by text or image prompts, no explicit agent action sequence interfaceLatent space built from self supervised video encoder with predictive loss in representation space, not generative reconstruction lossAction or control inputNatural language actions in dialogue format, applied at every simulation step, model predicts next latent state and decodes video conditioned on action and historyControl input as text prompt and optionally camera pose for navigation and downstream tasks such as humanoid control and autonomous drivingText prompts and image inputs for content control, no explicit multi step agent action interface described as world model controlDoes not focus on natural language actions, used more as visual representation and predictor module inside larger agents or plannersLong horizon designCausal Swin DPM sliding window diffusion, chunk wise causal attention, conditioning on slightly noised last frame to reduce drift and maintain stable long horizon rolloutsVideo2world model generates future video given past window and prompt, supports navigation and long sequences but the paper does not describe a Causal Swin DPM style mechanismCan generate several seconds at 480 P and 720 P, focuses on visual quality and motion, long horizon stability is evaluated through Wan Bench but without explicit world state mechanismLong temporal reasoning comes from predictive latent modeling and self supervised training, not from generative video rollouts with explicit diffusion windowsTraining data focusLarge scale video action pairs across diverse physical and embodied domains, with segmentation, filtering and dense temporal recaptioning for action conditioned dynamicsMix of proprietary and public Internet videos focused on Physical AI categories such as driving, manipulation, human activity, navigation and nature dynamics, with a dedicated curation pipelineLarge open domain video and image corpora for general visual generation, with Wan Bench evaluation prompts, not targeted specifically at agent environment rolloutsLarge scale unlabelled video data for self supervised representation learning and prediction, details in V JEPA 2 paper

Editorial Comments

PAN is an important step because it operationalizes Generative Latent Prediction with production scale components such as Qwen2.5-VL-7B and Wan2.1-T2V-14B, then validates this stack on well defined benchmarks for action simulation, long horizon forecasting, and simulative planning. The training and evaluation pipeline is clearly documented by the research team, the metrics are reproducible, and the model is released within a transparent world modeling framework rather than as an opaque video demo. Overall, PAN shows how a vision language backbone plus diffusion video decoder can function as a practical world model instead of a pure generative toy.

Check out the Paper, Technical details and Project. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post MBZUAI Researchers Introduce PAN: A General World Model For Interactable Long Horizon Simulation appeared first on MarkTechPost.