Meta AI Introduces DreamGym: A Textual Experience Synthesizer For Rein …

Reinforcement learning RL for large language model LLM agents looks attractive on paper, but in practice it breaks on cost, infrastructure and reward noise. Training an agent that clicks through web pages or completes multi step tool use can easily need tens of thousands of real interactions, each slow, brittle and hard to reset. Meta’s new framework DreamGym reframes that bottleneck as a modeling problem. Instead of running RL directly in environments such as WebShop, ALFWorld and WebArena Lite, it learns a reasoning based experience model that simulates them entirely in text.

https://arxiv.org/pdf/2511.03773

Why Real Environment RL for Agents Does Not Scale?

Current RL pipelines for agents face four coupled problems. Real rollouts are costly, task diversity is limited, reward signals are unstable and the infrastructure stack is complex. Web environments change often, rewards depend on fragile scrapers and many actions are irreversible. Reset mechanisms and episode control are also hard to implement, so long horizon tasks become noisy and sample inefficient.

Benchmarks split into two groups. WebShop and ALFWorld are RL ready but expensive, since they still need about 80 thousand real transitions to reach strong baselines with PPO or GRPO. WebArena Lite is not RL ready at all, because resets and automatic reward checks are unreliable, so online RL in the real environment is effectively infeasible.

DreamGym as a Reasoning Based Simulator

DreamGym is built around three components, a reasoning based experience model, an experience replay buffer and an adaptive curriculum task generator. Together they define a synthetic Markov decision process where the environment lives as text.

The reasoning based experience model Mexp operates in an abstract textual state space. States are compact descriptions of what matters for the task, for example cleaned page elements instead of raw HTML. On each step, the agent provides the current state, the action, the task instruction and the interaction history. The system retrieves the top k similar past transitions from the replay buffer, then uses chain of thought reasoning to produce a reasoning trace, a next state and a reward.

Conceptually, you can view Mexp as an LLM world model for web and tool tasks, but defined purely over text. It is trained with supervised fine tuning on offline trajectories, with a joint objective that learns to generate both the reasoning trace and the next state conditioned on that trace. This forces the model to encode causal structure, not just local text statistics.

https://arxiv.org/pdf/2511.03773

Replay Buffer as Grounding Memory

The experience replay buffer is initialized with offline real environment data from WebShop, ALFWorld and WebArena Lite. As DreamGym trains policies in the synthetic environment, it writes new trajectories back into that buffer. Each prediction step in Mexp uses an encoder to retrieve a small set of similar transitions from this memory and conditions on them when generating reasoning and next states.

This retrieval acts as grounding. It keeps synthetic transitions close to the empirical data distribution and reduces hallucinations in long rollouts. The research team showed that removing history or retrieval degrades consistency, informativeness and factuality of the generated states when judged by an external evaluator, and it also lowers downstream success rates on WebShop and WebArena Lite.

Curriculum from Reward Entropy

The curriculum task generator uses the same backbone as the experience model. It selects seed tasks whose outcomes under the current policy have high reward variance, which corresponds to intermediate difficulty tasks that the agent sometimes solves and sometimes fails. For each such task, the model generates variations that preserve action types but change constraints, targets or context.

The selection heuristic is based on reward entropy computed over batches of rollouts for each task. Tasks with non zero variance and balanced success and failure are preferred. Ablations show that turning off this adaptive curriculum causes both WebShop and WebArena Lite performance to drop by around 6 percentage points and leads to early plateaus as the replay buffer saturates with easy, low entropy trajectories.

https://arxiv.org/pdf/2511.03773

RL Inside DreamGym and Theoretical Guarantees

Inside DreamGym, the policy uses standard RL algorithms. The research team evaluates Proximal Policy Optimization and Group Relative Policy Optimization. Rollouts alternate between the policy choosing actions and the experience model synthesizing next states and rewards. From the point of view of the RL code, this is just another environment interface.

The research team also derive a trust region style improvement bound that links policy performance in the synthetic MDP and in the real environment. The bound contains error terms that depend on the reward prediction error and the divergence between real and synthetic transition distributions. As those errors shrink, improvement in DreamGym implies improvement in the underlying real task.

Experimental Results on WebShop, ALFWorld and WebArena Lite

DreamGym is tested with Llama-based and Qwen-based agents across WebShop, ALFWorld and WebArena Lite. Results fall into three regimes.

First, in RL ready but costly environments WebShop and ALFWorld, agents trained with PPO or GRPO inside DreamGym, using only synthetic transitions, match the performance of PPO and GRPO baselines that use about 80 thousand real environment interactions. This shows that reasoning based experience synthesis can provide enough signal for stable policy improvement.

Second, in not RL ready environments such as WebArena Lite, DreamGym enables RL training that would otherwise be impractical. The framework achieves more than 30 percent improvement in success rate over all baselines, including supervised fine tuning and direct behavior cloning.

Third, in sim to real transfer, the DreamGym-S2R configuration first trains a policy entirely in the synthetic environment and then fine tunes it with a small number of real rollouts. This setting yields more than 40 percent additional gain compared with training from scratch in the real environment, while using less than 10 percent of the real data and cutting total training cost to roughly between one third and one fifth of the baselines.

https://arxiv.org/pdf/2511.03773

Key Takeaways

DreamGym replaces fragile real environment rollouts with a reasoning based experience model that operates in an abstract textual state space, predicting next state and reward from history, task and retrieved similar transitions.

The framework combines 3 components, a reasoning experience model, an experience replay buffer seeded with real trajectories, and a curriculum task generator that selects and varies tasks using a reward entropy heuristic, which together stabilize and diversify RL training.

In WebShop and ALFWorld, which are RL ready but expensive, agents trained with PPO or GRPO entirely inside DreamGym using synthetic interactions match the performance of PPO and GRPO baselines that use about 80,000 real environment transitions.

In WebArena Lite, which is not RL ready, DreamGym enables online RL and achieves more than 30 percent higher success rate than all non RL baselines including supervised fine tuning and behavior cloning.

In the sim to real configuration, policies pretrained in DreamGym and then fine tuned with a small number of real rollouts achieve more than 40 percent additional improvement while using less than 10 percent of the real interaction budget and reducing total training cost to around one third to one fifth of standard RL.

Editorial Comments

DreamGym is an important step toward practical reinforcement learning for LLM agents because it reframes the environment as a reasoning based experience model, grounded by an experience replay buffer and a reward entropy driven curriculum, rather than as a fragile browser stack. The reported gains on WebArena Lite, WebShop and ALFWorld with PPO and GRPO suggest that synthetic experience plus Sim to Real adaptation can become a standard pattern for agent training at scale. Overall, DreamGym makes the experience model, not the policy, the main lever for scaling RL agents.

Check out the Full Paper. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Meta AI Introduces DreamGym: A Textual Experience Synthesizer For Reinforcement learning RL Agents appeared first on MarkTechPost.

Your complete guide to Amazon Quick Suite at AWS re:Invent 2025

What if you could answer complex business questions in minutes instead of weeks, automate workflows without writing code, and empower every employee with enterprise AI—all while maintaining security and governance? That’s the power of Amazon Quick Suite, and at AWS re:Invent 2025, we are showcasing how organizations are making it a reality. Launched in October 2025, Quick Suite is a new agentic teammate that quickly answers your questions at work and turns those insights into actions for you.
This December in Las Vegas, Quick Suite takes center stage with an impressive lineup of sessions designed to help you reimagine how work gets done. These sessions include breakthrough customer stories and hands-on workshops on how to harness the power of AI agents, research, automation and unified BI.
This year, re:Invent will be held in Las Vegas, Nevada, from December 1 to December 5, 2025, and this guide will help you navigate our comprehensive session catalog and plan your week. The sessions cater to business and technology leaders, product and engineering teams, and data and analytics teams interested in incorporating agentic AI capabilities across their teams and organization.
Explore the session catalog and learn more. Register today to reserve a seat for our sessions!
Keynote sessions
KEY001 – Opening Keynote with AWS CEO Matt Garman
Tuesday, Dec 2 | 8:00 AM – 10:30 AM PST | Venetian | Level 2 | Venetian Ballroom F

Join AWS CEO Matt Garman to hear how AWS is innovating across every aspect of the world’s leading cloud. He explores how we are reinventing foundational building blocks as well as developing brand new experiences, all to empower customers and partners with what they need to build a better future.
KEY002 – The Future of Agentic AI is Here with Swami Sivasubramanian, Vice President of Agentic AI
Wednesday, Dec 3 | 8:30 AM – 10:30 AM PST | Venetian | Level 2 | Venetian Ballroom F

Join Dr. Swami Sivasubramanian, Vice President of Agentic AI, to learn how Agentic AI is poised to transform the way we live and work. In this keynote, you will hear about the tools and services you can use to build, deploy, and run secure, reliable, and scalable agents on AWS. We will also dive deep into the engineering innovations that power your agentic systems and give you a glimpse of the future.
Innovation talk
INV203: The agent-enabled workplace: Transforming businesses with AI
Monday, Dec 1 | 12:00 PM – 1:00 PM PST | Venetian | Level 5 | Palazzo Ballroom B
Discover how organizations are transforming their businesses by truly making AI part of the team. Learn three key ways companies are putting AI to work today: revolutionizing business processes, reinventing the way individuals work and teams collaborate, and transforming customer experiences. We also explore how the future workplace will evolve as AI becomes an integral team member. Through real customer examples, see how users can work with an agentic teammate like Amazon Quick Suite to get the right answers to every question across all their data and transform answers into actions, and how Amazon Connect is creating customer experiences that make every interaction personal, effortless, and memorable. You will also learn how Amazon uses these technologies in our own business. Gain practical insights to deliver real business value with AI while maintaining enterprise-grade security and trust. Join us to learn how AWS is helping organizations transform their business with effective AI collaboration.
Exclusive Executive Event
Amazon Quick Suite: Driving business growth and productivity with Data & AI
Wednesday, December 3 | 12:00 PM – 5:00 PM | Renaissance Las Vegas
Don’t miss this intimate executive event featuring customer panels, global partner insights and live Quick Suite demonstrations. Designed exclusively for C-level executives and senior decision-makers, this event offers strategic roundtables, one-on-one consultations with product leaders, and networking opportunities you won’t find anywhere else at re:Invent. Space is limited to ensure meaningful engagement. Register now to secure your spot – confirmed registrations only.
Breakout sessions
BIZ202: Reimagine work with Amazon Quick Suite
Monday, Dec 1 | 10:00 AM – 11:00 AM PST | Venetian | Level 3 | Lido 3106
Amazon Quick Suite is an agentic teammate for business users that quickly answers their questions at work and turns those insights into actions. Join this session to hear compelling customer stories and discover how organizations are transforming workplace productivity with AI agents for automation, research, and business intelligence in a unified experience. Learn more about how Quick Suite reduces application and context switching, breaks down data silos, delivers comprehensive insights, and accelerates decision-making and taking action—all while maintaining enterprise-grade security.
BIZ203: Amazon’s journey deploying Quick Suite across thousands of users
Wednesday, Dec 3 | 1:30 PM – 2:30 PM PST | MGM | Level 3 | Chairman’s 364
Go behind the scenes of Amazon’s internal Quick Suite deployment across multiple organizations and thousands of employees. This session covers the challenges of implementing enterprise AI at scale, including data integration complexities, orchestration layer design, and overcoming organizational silos. Learn from Amazon teams about deployment strategies, change management approaches, security considerations, and lessons learned from rolling out Quick Suite across diverse business units. Discover practical frameworks for enterprise-wide AI adoption and hear real stories of transformation challenges and solutions that organizations can apply.
BIZ223: Research agents in action: From complex business challenge to trusted insights
Wednesday, Dec 3 | 1:00 PM – 2:00 PM PST | Wynn | Convention Promenade | Latour 2
What if your most challenging research tasks could be completed in minutes instead of weeks? That’s the power of Amazon Quick Research. Join us, along with Principal Financial Group, to see how Quick Research breaks down complex topics, pulling from your organization’s internal knowledge, web data, and premium third-party datasets to deliver comprehensive, source-verified insights. Explore diverse use cases—from market intelligence to risk assessments—and learn about the journey Principal took towards smarter research and decision-making.
BIZ208: Enhance SaaS Applications with Quick Suite Agentic Capabilities
Thursday, Dec 4 | 4:00 PM – 5:00 PM PST | MGM | Level 3 | Chairman’s 360
Learn how Amazon Quick Suite agentic AI capabilities increase customer engagement and application value by 50%. Hear from a customer speaker who uses ISV application integrated with conversational AI and agentic AI capabilities while maintaining multi-tenant security and performance. Explore embedding patterns, API integration strategies, and agents and actions communication for SaaS applications. Discover implementation approaches that add intelligent workplace productivity features without disrupting existing user workflows or application architectures.

Sessions
Date and Venue

BIZ228: Reimagine business intelligence with Amazon Quick Sight
Monday, Dec 11:30 PM – 2:30 PM PST Wynn | Convention Promenade | Lafite 7 | Content Hub | Orange Theater

BIZ331: Build Robust Data Foundations to power Enterprise AI and BI
Monday, Dec 110:00 AM – 11:00 AM PST Wynn | Upper Convention Promenade | Bollinger

BIZ224: Automate any business process using Amazon Quick Suite
Thursday, Dec 411:00 AM – 12:00 PM PST Wynn | Convention Promenade | Lafite 7 | Content Hub | Pink Theater

BIZ207: Democratize access to insights with Amazon Quick Suite
Tuesday, Dec 211:30 AM – 12:30 PM PST Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Turquoise Theater

BIZ227: Generate new revenue streams with Amazon Quick Sight embedded
Thursday, Dec 41:00 PM – 2:00 PM PST MGM | Level 1 | Grand 122

BIZ225: Deploy Quick Suite at scale with confidence and control
Monday, Dec 14:30 PM – 5:30 PM PST Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Orange Theater

Chalk talks
BIZ323: Design AI-powered BI architectures for modern enterprises with Amazon Quick Suite
Monday, Dec 1 | 11:30 AM – 12:30 PM PST | Wynn | Convention Promenade | Montrachet 1
AI transforms how organizations collect, analyze, and derive insights from data in business intelligence environments. Join this chalk talk to explore the technical details of architectural frameworks and methodologies for developing next-generation BI systems with Amazon Quick Sight, the BI capability of Amazon Quick Suite. Dive deep into how machine learning, natural language processing, and automated analytics integration can revolutionize traditional BI architectures. Discuss implementation challenges including data quality requirements and enterprise readiness considerations for AI-powered BI solutions. Share experiences and learn best practices for maximizing business value and operational efficiency in your AI-powered BI initiatives using Quick Sight.
BIZ319: Beyond chatbots: Discover conversational AI in Amazon Quick Suite
Monday, Dec 1 | 3:00 PM – 4:00 PM PST | MGM | Level 3 | Premier 320
Join our interactive chalk talk to explore conversational AI capabilities in Quick Suite. Discover how to use natural language queries to get answers and visualizations from all your data—including metrics from databases and data warehouses, documents, emails, and knowledge bases. We will diagram advanced chat workflows, exploring knowledge gathering, context management, and agent integrations. Learn to handle complex scenarios like multi-turn conversations and context switching. Together, we will tackle real-world challenges in designing efficient flows and implementing productivity tools, as well as discover strategies for scaling AI conversations while maintaining quality standards. Bring your questions to this collaborative and interactive session.

Sessions
Date and Venue

BIZ327: Bridge data silos to unlock complete insights with Amazon Quick Suite
Tuesday, Dec 22:30 PM – 3:30 PM PST Mandalay Bay | Level 3 South | South Seas C

BIZ326: Agentic workflow architectures with Amazon Quick Flows
Wednesday, Dec 31:00 PM – 2:00 PM PST Wynn | Convention Promenade | Montrachet 1

BIZ405: Building agentic research solutions you can trust with Amazon Quick Research
Wednesday, Dec 32:30 PM – 3:30 PM PST Wynn | Convention Promenade | Lafite 1

BIZ325: Build multi-tenant ISV applications with Quick Suite and Quick Index
Tuesday, Dec 211:30 AM – 12:30 PM PST Wynn | Convention Promenade | Montrachet 1

BIZ329: Design patterns for embedded and agentic analytics with Quick Suite
Monday, Dec 15:30 PM – 6:30 PM PST Wynn | Convention Promenade | Montrachet 1

BIZ328: Implement enterprise governance for Amazon Quick Suite
Thursday, Dec 42:00 PM – 3:00 PM PST Wynn | Convention Promenade | Montrachet 1

BIZ406: Operationalize Amazon Quick Suite deployments at scale
Thursday, Dec 411:00 AM – 12:00 PM PST Mandalay Bay | Level 3 South | South Seas C

Workshops
BIZ402: Use agents to transform complex business processes with Amazon Quick Automate
Thursday, Dec 4 | 3:30 PM – 5:30 PM PST | Caesars Forum | Level 1 | Academy 413
Transform your manual document workflows into agentic automations in this hands-on workshop using Amazon Quick Automate, a capability of Amazon Quick Suite. We will transform a manual claims processing use case into an intelligent, adaptive automation. In this hands-on workshop, build end-to-end automations that combine document extraction, data validation, and business rules processing by using specialized AI agents. Learn how Quick Automate can implement smart exception handling while maintaining human oversight for critical decisions. This workshop is ideal for organizations modernizing document-intensive operations. All attendees must bring a laptop to participate.
BIZ306: Create Agentic AI Chat Experiences with Quick Suite
Monday, Dec 1 | 8:30 AM – 10:30 AM PST | Wynn | Upper Convention Promenade | Cristal 3
Wednesday, Dec 3 | 8:30 AM – 10.30 AM PST | Wynn | Mouton 2
Build comprehensive conversational AI solutions using chat agents and spaces in Amazon Quick Suite. Practice implementing multi-turn conversations that provide contextual, intelligent responses. Customize your chat agent’s behavior through simple steps that support enterprise readiness. Learn to create flows that implement repetitive tasks into an agentic workflow. Dive into deep research capabilities, knowledge integration, and user experience optimization of Quick Suite for enterprise deployment.

Sessions
Schedule

BIZ204: Experience AI-powered BI with Amazon Quick Suite
Tuesday, Dec 2 3:00 PM – 5:00 PM PST Wynn | Upper Convention Promenade | Crystal 1 Wednesday, Dec 3 8.30 AM – 10:30 AM PST Caesars Forum | Alliance 308

BIZ322: Customize your Application with Amazon Quick Suite APIs
Thursday, Dec 4 12:00 PM – 2:00 PM PST Wynn | Upper Convention Promenade | Cristal 1

BIZ315: Configure security and governance controls for Amazon Quick Suite
Wednesday, Dec 3 1:00 PM – 3:00 PM PST Venetian | Level 3 | Lido 3001A

Builder session
BIZ401: Build agentic automations for business processes with Amazon Quick Automate
Wednesday, Dec 3 | 10:00 AM – 11:00 AM PST | Wynn | Convention Promenade | Latour 7
In this session, learn how to build an enterprise-grade automation using Amazon Quick Automate, a capability of Amazon Quick Suite. Through a financial services example, explore how specialized AI agents work together to handle complex interactions across webpages and business applications. You will create a production-ready automation featuring custom agents that leverage knowledge and tools to transform a merchant onboarding process. Using Quick Automate’s chat-based authoring and visual studio, you will configure a workflow with multiple agents, integrate with multiple tools, test and debug the workflow, and then deploy it using robust enterprise controls. Walk away knowing how to develop agentic automations for real-world use cases in under an hour.
Schedule

Register today to reserve a seat!
Resources

Learn more: AWS re:Invent 2025
AWS re:Invent 2025 catalog—Register to book your seat!
Know more about Amazon Quick Suite
Explore the Amazon Quick Suite Community

About the authors
Pelak Desai is a Product Marketing Manager for Amazon Quick Suite. She comes with over 12 years of experience in marketing and business.
Srikanth Baheti is a Senior Manager for Amazon Quick Sight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences & eResearch Technology Inc, where he was responsible for designing and developing high traffic web applications and highly scalable and maintainable data pipelines for reporting platforms using AWS services and serverless computing.

Accelerate enterprise solutions with agentic AI-powered consulting: In …

AWS Professional Services set out to help organizations accelerate their cloud adoption with expert guidance and proven methodologies. Today, we’re at a pivotal moment in consulting. Just as cloud computing transformed how enterprises build technology, agentic AI is transforming how consulting services deliver value. We believe in a future where intelligent agents work alongside expert consultants to compress development timelines, elevate solution quality, and enable organizations to achieve their digital transformation goals faster. Making this vision real requires a fundamental reimagining of the traditional consulting model. Drawing on our experience delivering enterprise solutions at scale, I’m excited to announce AWS Professional Services now offers specialized AI agents including the AWS Professional Services Delivery Agent. This represents a transformation to the consulting experience that embeds intelligent agents throughout the consulting life cycle to deliver better value for customers.
An agent-first consulting approach
The AWS Professional Services (AWS ProServe) new approach to agentic AI fundamentally changes what’s possible with consulting. By combining our deep expertise with specialized AI agents, we’re delivering enterprise solutions faster while maintaining the rigorous quality and security standards our customers expect. Agents empower our consultants to focus on what matters most—understanding their customer’s unique business challenges, providing strategic guidance, and driving meaningful outcomes, while agents handle implementation details with consistency and speed.
We have already started transforming customer engagements through agents, demonstrating tangible impact across industries. Whether you’re building next-generation AI applications, migrating critical workloads to the cloud, or modernizing existing systems, these agents compress timelines from months to weeks—or weeks to days—without compromising on quality.
A comprehensive agent system across the consulting cycle

Traditional consulting models struggle to balance speed, quality, and cost. A system of specialized agents embodying AWS institutional knowledge and proven methodologies help to solve this challenge.
AI agents that accelerate every stage: At the heart of the agent system is the AWS Professional Services Delivery Agent, an AI-powered technical expert that serves as your primary interface for technical engagements. The Delivery Agent analyzes your requirements, builds AI applications directly, and orchestrates specialized work by delegating migration and modernization tasks to purpose-built agents such as the custom agent built on AWS Transform, an AWS agentic AI service for enterprise migration and modernization workloads. Before delivery even begins, a sales agent streamlines proposal generation and statement of work creation, compressing what traditionally takes weeks into hours. Throughout every engagement, embedded capabilities ensure solutions meet enterprise-grade security and compliance standards.
From requirements to deployment in record time: Consider a typical generative AI application development project. Traditionally, building a customer service agent to help representatives quickly access policy information requires 6-8 weeks with a full consulting team gathering requirements, designing architecture, developing code, and deploying the solution. The Delivery Agent ingests your requirements—whether detailed documentation, architecture diagrams, or even meeting notes—and within hours produces comprehensive design specifications and implementation plans aligned with AWS best practices. The agent then generates code, automates testing, and prepares deployment packages while your AWS ProServe consultant provides strategic oversight and ensures alignment with your business context.
Migration and modernization at scale: For migration projects, incorporating agents demonstrates even more dramatic acceleration. Imagine a healthcare provider migrating 500+ applications to AWS—traditionally a 12+ month undertaking requiring extensive discovery and planning. We launched AWS Transform in May to help customers accelerate their cloud transformation journeys. Building on AWS Transform and leveraging its composable capability, we have built a custom agent tailored to how AWS ProServe delivers projects. This agent incorporates a knowledge base of learnings from thousands of migrations AWS ProServe has completed and automation capabilities to accelerate project delivery. The Delivery Agent analyzes the statement of work and project artifacts and engages the custom agent for migration, which handles wave planning, dependency mapping, workload scheduling, and runbook generation automatically. Your AWS ProServe consultant maintains strategic oversight while agents compress the timeline to just a few months, all while maintaining rigorous security and compliance standards.
Built on enterprise-grade AI infrastructure: The agent system leverages the same technologies we offer customers, including Amazon Bedrock AgentCore, AWS Transform, and advanced development tools like Kiro and Amazon Q Developer CLI. This helps ensure that every engagement benefits from industry-leading security through isolated computing environments, comprehensive observability for full transparency, and the scalability to handle engagements of any size.
Human expertise meets AI acceleration
What truly differentiates AWS ProServe agents is how it combines the value of human expertise with the speed and consistency of AI agents. AWS ProServe consultants remain integral to every engagement including understanding your business context, providing strategic guidance, making critical decisions, and building lasting relationships. The agents amplify their impact by handling implementation details, code generation, testing, and deployment with proven AWS methodologies embedded directly into their operations.
This human-AI collaboration delivers customer value through:

Unprecedented speed: Reduce project timelines achieving in days what traditionally required months
Consistent excellence: Every solution incorporates AWS best practices, architectural patterns, and the Well-Architected Framework
Lower total costs: Streamlined delivery and accelerated time-to-value translate directly to better ROI

Unlike general-purpose AI tools, the agents embody AWS specialized knowledge, including decades of experience informed by thousands of prior engagements, and proven methodologies. It draws from the vast AWS institutional knowledge base and has been specifically designed for enterprise-grade solution delivery. Further backed by AWS ProServe’s consulting expertise to ensure every solution meets your unique business requirements.
Making business transformation real with agents
Organizations across industries are already experiencing results by partnering with AWS ProServe agents, from rapid AI application development to accelerated cloud migrations. The National Football League (NFL) faced a challenge familiar to many organizations, building agents to serve millions of fantasy football fans while maintaining both speed and reliability. Working with the AWS Professional Services team, they used the delivery agent and were able to deploy a production quality prototype that seamlessly integrate NextGen Stats, Player News, weather data, and both proprietary and public NFL information to generate personalized fantasy football recommendation in just a few days.
“Building an AI agent that serves thousands of fantasy football fans requires both speed and reliability. The AWS Professional Services Delivery Agent helped us achieve both – we went from zero to production in 8 weeks while maintaining the quality standards NFL fans expect. The framework automated routine development tasks, freeing our team to focus on performance optimization and delivering unique insights powered by NFL’s proprietary data,” says Mike Band, Senior Manager, Research & Analytics, Next Gen Stats, NFL.
The transformation extends beyond customer outcomes to how AWS ProServe delivers consulting services. “Our goal with AWS Transform has always been to enable better customer outcomes through transformative new approaches to migration,” says Asa Kalavade, Vice President of AWS Transform. “AWS Professional Services’ custom agent, built on AWS Transform’s composable foundation, demonstrates this vision perfectly. It delivers customized workflows tailored to how AWS ProServe works directly in customer accounts, with goal-based, interactive agents that personalize each migration. Whether orchestrating large VMware migrations or handling dynamic wave planning for enterprises migrating thousands of VMs, these agents adapt to each customer’s unique context. This is the future of migration—faster, more personalized, and delivering outcomes that traditional approaches simply couldn’t achieve.”
This represents the future of professional services: AI-augmented consulting that delivers results without sacrificing the strategic guidance and partnership that complex enterprise initiatives require.
Reimagining the future of consulting with agentic AI
This new agentic-powered consulting approach is a demonstration of what becomes possible when you apply cutting-edge AI technologies to transform your own operations. While many organizations talk about what AI might do someday, AWS ProServe shows what AI can deliver for enterprises today. Customers can experience the new agent-powered consulting model by engaging with AWS ProServe and AWS Professional Services Partners today. Contact your AWS account team or visit the AWS Professional Services webpage to discover how AWS can accelerate your digital transformation.

About the author
Francessca Vasquez is the Vice President of Professional Services and Agentic AI for Amazon Web Services (AWS). She leads AWS’s global consulting services, overseeing customer engagements across public sector, commercial, and partner businesses worldwide. Francessca drives co-innovation and delivery of emerging technologies including Generative AI, Quantum Computing, and Application Modernization. Her team connects AWS AI and ML experts with customers globally to design and launch cutting-edge generative AI solutions. As Executive Sponsor for the AWS Global CIO Council and AWS Partner Collective, she strengthens strategic partnerships that help organizations accelerate their digital transformation and unlock the full potential of cloud and AI technologies.

Amazon Bedrock AgentCore and Claude: Transforming business with agenti …

The enterprise AI conversation has fundamentally shifted. We’re no longer asking “Can AI understand language?” but rather “Can AI autonomously execute complex business processes that drive real value?” According to McKinsey research, agentic AI has the potential to generate $450 billion to $650 billion in additional annual revenue by 2030, representing a 5 to 10 percent revenue increase across industries.
The window for competitive advantage is narrowing. While your competitors experiment with AI pilots, the organizations that move agentic AI into production are capturing measurable gains today. Yet here’s the paradox we keep seeing: enterprises build impressive prototypes that never scale. The gap isn’t in model capabilities, but rather in the operational infrastructure required to deploy agents that can work autonomously for hours, integrate securely with enterprise systems, and maintain reliability at scale. The figure below outlines the various challenges that organizations may face taking their agents to production.

But some organizations have already crossed this divide. They’re running AI agents in production right now, handling real business processes, serving thousands of customers, and delivering results that seemed impossible just months ago. Let’s start with what they’ve achieved.
What’s possible today: Production results from leading organizations
Cox Automotive and Druva are both putting Amazon Bedrock AgentCore and Claude to work across their organizations.
Cox Automotive: Accelerating enterprise-scale agentic AI deployment
As the world’s largest automotive services and technology company, Cox Automotive has a wide breadth of products and services that touch almost all aspects of the automotive industry and a vehicle’s lifecycle. Agentic AI holds the promise to connect solutions and help consumers, dealers, automakers, and other automotive stakeholders to help execute workflows in more automated, scalable, and even personalized ways. AI agents can fundamentally transform every touchpoint in automotive, from how consumers search and purchase vehicles to how dealers manage service operations and inventory. This is happening in production right now at Cox Automotive. Cox Automotive has shifted from “Data-First, AI-Enabled” to “AI-First, Data Differentiated.” Cox Automotive is using Anthropic’s Claude model and Amazon Bedrock AgentCore as one of their critical capabilities for agentic AI solution deployment at scale with 17 major proofs of concept deployed in production and seven industry-transformational solutions currently in development.

“At Cox Automotive, we’re transforming our customer experience with generative and agentic AI. We are working with all frontier model providers but have anchored on Claude for its strong performance across three critical metrics: latency, cost, and accuracy. Amazon Bedrock AgentCore is one of the strategic tools we’re using to build AI agents that can deploy at scale, ranging from virtual assistants that improve our omnichannel dealer experience to an agentic marketplace that streamlines vehicle discovery and buying. AgentCore’s key capabilities – runtime for secured deployments, observability for monitoring, identity for authentication, and enterprise grade primitives are enabling our teams to develop and test these agents efficiently as we scale AI across the enterprise.” – Marianne Johnson, EVP & Chief Product Officer, Cox Automotive

Druva: Up to 63% autonomous resolution with up to 58% faster response times
Druva’s customers faced an escalating challenge in cybersecurity: staying ahead of evolving data anomalies across complex infrastructure. Manual threat investigation meant navigating multiple dashboards, logs, and alerts. In security, missing threat signals can lead to catastrophic consequences—but the volume of potential signals makes comprehensive manual review impossible.
Consider the scale: over 7,500 customers, each with their own infrastructure patterns, threat landscapes, and security requirements. The challenge was building an AI solution that could operate reliably and securely at this scale.
Druva partnered with the AWS Generative AI Innovation Center to build DruAI, a multi-agent system powered by Claude on Amazon Bedrock AgentCore. The system uses multiple AI agents that work together to automatically choose the right tools from hundreds of options, handling telemetry analysis, threat investigation, and remediation. AgentCore Runtime provides a more secure, isolated execution environment with automated scaling, allowing Druva’s team to focus on delivering customer value rather than building and maintaining complex security infrastructure.
The impact: Over 3,000 customers and 10,000 users now deploy DruAI, resulting in up to 58% faster time-to-resolution and solving up to 63% of customer issues without human intervention. In cybersecurity, speed is the difference between contained threats and business-impacting breaches.

“Our customers at Druva needed to transform their manual threat investigation processes, which involved navigating multiple dashboards, logs, and alerts. Using AgentCore’s Runtime, we rapidly deployed DruAI, our suite of AI capabilities for customers, with complete session isolation and automated scaling – enabling us to focus on delivering value to customers rather than building and maintaining complex security infrastructure. Our system handles telemetry analysis, threat investigation and remediation, and is already being used by over 3,000 customers and 10,000 users. DruAI delivers 58% faster time-to-resolution, solving 63% of customer issues without human intervention.” – David Gildea, VP of Product, AI, Druva

These results raise an obvious question: How did organizations achieve production deployments that deliver measurable business value? The answer lies in combining two critical elements that work better together than either could alone.
Why Amazon Bedrock AgentCore and Claude by Anthropic

Agentic AI in production requires two things: frontier AI capabilities that can handle complex, autonomous workflows, and enterprise-grade infrastructure that provides the security, reliability, and operational foundation those agents need to run in production. Amazon Bedrock AgentCore and Claude provide this combination. AgentCore has multiple fully-managed services that can be used together or independently as part of Amazon Bedrock AgentCore: Runtime, Memory, Identity, Gateway, Code Interpreter, Browser Tool, and Observability.
Agent intelligence and logic: Focus on what matters
When enterprises build agentic AI, engineering teams usually spend months building infrastructure like session management, credential vaults, tool orchestration, observability frameworks, and scaling logic. By the time they’re ready to focus on the actual agent logic and business value, they’re exhausted and the use case may have evolved.Amazon Bedrock AgentCore is a comprehensive agentic platform to build, deploy and operate highly capable agents at scale. It’s model-agnostic, which means it handles the infrastructure and operational challenges so your developers can concentrate on what differentiates your business: the agent’s logic and the specific tasks it needs to perform. Claude’s high performance and contextual understanding are maximized by this approach.

AgentCore works with frameworks your team already knows like Strands Agents, CrewAI, LangGraph, LlamaIndex. You can also use it with any foundation model, whether hosted on Amazon Bedrock or elsewhere. This removes the traditional tradeoff between open source flexibility and enterprise-grade reliability.
Enterprise-grade security and reliability built in
Although optimized for agentic AI workflows, Claude alone doesn’t provide the production infrastructure that complex agents require. That’s where Amazon Bedrock AgentCore comes in. AgentCore provides complete session isolation to make sure each execution is fully contained, secure credential vaults help protect sensitive tokens, and identity-aware authorization controls exactly what agents can access. Agents can work autonomously for up to eight hours with automatic scaling, delivering the reliability that business processes demand.
Enhanced agent capabilities
AgentCore provides built-in tools that extend what Claude-powered agents can accomplish. Code Interpreter offers secure code execution for data processing and analysis, while Browser enables agents to interact with web applications, navigate pages, extract data, and execute transactions.
But the real multiplier is AgentCore Gateway: it transforms your existing REST APIs and AWS Lambda functions into agent-ready tools with semantic routing. Your agents can interact with your existing business systems, databases, and services without rebuilding everything for AI. The gateway handles dual-sided security and intelligent tool selection, so as you scale to hundreds or thousands of tools, agents can still find and use the right ones.

Together, these elements create something neither could achieve alone: AI agents with frontier intelligence, enterprise-grade reliability, and the operational foundation to deliver business value in production—not in six months after you build infrastructure, but now. The previous figure shows the benefits of AgentCore Gateway.
The technology behind these results
Let’s explore the technology foundation that makes these results possible, without getting lost in implementation details.
Infrastructure that scales production workloads
Amazon Bedrock AgentCore is purpose-built infrastructure for production agentic AI. Think of it as the operational foundation that transforms capable AI models into usable business systems. Rather than spending months on undifferentiated heavy lifting or building production ready agents from scratch, it’s available as a managed agentic platform.

The AgentCore Runtime and AgentCore Identity services provide more secure, serverless execution where agents work autonomously for up to eight hours with complete session isolation. Identity management integrates with your existing providers—Okta, Microsoft Entra, or Amazon Cognito—handling OAuth, token management, and comprehensive audit trails that can help align with the most stringent compliance requirements, including those trusted by AWS GovCloud (US) customers. The Gateway transforms REST APIs and Lambda functions into agent-compatible tools with intelligent semantic routing, while AgentCore Memory is straightforward for developers to use to build context-aware agents by minimizing complex memory infrastructure, so that agents can maintain context across conversations and build knowledge bases over time.
Observability delivers complete visibility through CloudWatch with OpenTelemetry compatibility for systems like Dynatrace, Datadog, Arize Phoenix, LangSmith, and Langfuse. You can track what agents are doing, monitor performance, identify errors, and maintain the operational visibility that production systems demand. AgentCore services support VPC, AWS PrivateLink, CloudFormation, and resource tagging for enhanced enterprise security.
Claude’s intelligence that handles complex, long-running tasks
While infrastructure enables deployment, model capabilities determine what agents can accomplish. Claude Sonnet 4.5 is Anthropic’s best performing model for agentic AI use cases, with capabilities specifically designed for autonomous, long-running workflows.
Claude Sonnet 4.5 can work independently for extended periods while maintaining clarity and focus. The model makes steady progress on tasks rather than attempting everything simultaneously, providing fact-based updates that accurately reflect accomplishments. This capability is critical for complex workflows that require sustained attention and incremental progress over hours.
The model tracks token usage throughout conversations and maintains awareness of its working context. This helps prevent remature task abandonment and enables more effective execution on long-running operations. Combined with memory capabilities that enable storage and retrieval of information outside the immediate context window, agents can maintain state across sessions and build knowledge bases over time.
Built with Anthropic’s Constitutional AI method, Claude is designed to be helpful, harmless, and honest. Extensive safety training has substantially reduced concerning behaviors including sycophancy, deception, and power-seeking. This alignment foundation is particularly important for enterprise deployments where agent reliability and appropriate behavior are non-negotiable requirements. When agents operate autonomously for hours, trust is fundamental.
Claude Sonnet 4.5 achieves state-of-the-art performance on coding and reasoning tasks, with enhanced planning and system design capabilities. The model excels at autonomous tasks that span hours or days while maintaining consistent performance. Beyond coding, Claude demonstrates advanced reasoning capabilities for financial analysis, research workflows, and cybersecurity applications which enable sophisticated agent applications across multiple enterprise use cases.
Strategic implications for enterprise leaders
The decisions you make about agentic AI infrastructure are about establishing the foundation for your multi-year AI roadmap. Take these into consideration:
System choice as competitive positioning
Your competitors are evaluating the same opportunities. The organizations that establish production agentic AI first can capture advantages that compound over time: operational efficiencies that can reduce costs while improving service, capabilities that were previously impossible becoming standard practice, and the organizational learning that comes from real-world deployment.
AI is transforming your industry. Will you be leading that transformation or reacting to it?
Velocity of innovation: Automatic capability improvements
Claude Sonnet 4.5 was released just seven weeks after Claude Opus 4.1. That velocity of model improvement is now the baseline expectation. The system you choose determines whether you benefit from these advances automatically or face migration projects every time capabilities improve.
Organizations building on Amazon Bedrock gain access to new model capabilities as they become available without having to re-engineer, spin up migration projects, and without technical debt. Your agents become more capable over time, and your team stays focused on business value rather than system maintenance.
The expanding capabilities of AgentCore follow similar trajectories. Recent additions include enhanced Agent-to-Agent (A2A) protocol support for multi-agent coordination, expanded observability integrations, and new tools like Browser and Code Interpreter. These capabilities become available to your agents as they launch, future-proofing your investments while maintaining backward compatibility.
The multi-agent future: Coordination and specialization
As individual agents prove value in your organization, the next frontier involves coordinated multi-agent systems where specialized agents collaborate on complex business challenges. Amazon Bedrock supports multi-agent collaboration through the A2A protocol, enabling sophisticated patterns:
Specialized agent teams where you deploy focused agents, each excelling at specific domains like financial analysis, code review, customer interaction, security monitoring, working together under intelligent orchestration.
Supervisor agents that break down complex workflows into manageable sub-tasks, delegate to appropriate specialist agents, and synthesize results into coherent outcomes.
Organizations like Druva are already running multi-agent systems in production, and the architectural patterns are becoming established. The infrastructure foundation you choose will determine how smoothly you can evolve to these sophisticated deployments tomorrow.
Risk mitigation: Security, governance, and compliance
Enterprise deployments require security and governance built into the foundation. AgentCore provides complete audit trails for compliance, fine-grained authorization that scales with your agent environment, and session isolation that help contain potential issues. Constitutional AI in Claude Sonnet 4.5 helps provide an additional reliability layer: when agents operate autonomously, you need confidence they’ll behave appropriately and align with your instructions.
Evaluating agentic AI for your enterprise
If you’re a technical leader or architect exploring agentic AI for your organization, here’s a practical framework for evaluation and adoption.
Start with high-value use cases
The most successful early deployments share common characteristics. Look for workflows that are:

Repetitive yet require judgment: Tasks your team does regularly that follow patterns but need decision-making, not just automation
Multi-system integration opportunities: Processes that involve pulling data from multiple sources, making decisions, and taking actions across different systems
24/7 availability benefits: Workflows where autonomous operation outside business hours provides real value
Clear, measurable success metrics: Use cases where you can quantify impact—time saved, accuracy improved, costs reduced, capacity increased

What are the equivalent opportunities in your business?
Move from evaluation to production decisively
The evaluation process should be measured in weeks, not months:
Week 1-2: Review case studies and assess relevance to your context. Identify 1-2 pilot workflows with defined success criteria. Reach out to your AWS account team to discuss using Claude with Amazon Bedrock AgentCore for help assessing technical fit and business value potential.
Week 3-4: Prototype with production infrastructure from day one. Leverage AgentCore so you’re not building throwaway infrastructure. Your learnings and code can transfer directly to production.
Week 5-8: Run your pilot and measure against your success criteria. With production infrastructure already in place, this is about validating business value, not rebuilding for scale.
Week 9+: Scale based on proven results. The AgentCore infrastructure scales automatically, so moving from pilot to production is about expanding scope, not re-engineering foundations.
This timeline is achievable because you’re not building infrastructure from scratch. Your AWS account team can connect you with resources, technical guidance, and examples from organizations like Cox Automotive and Druva who’ve already walked this path.
Conclusion: The agentic enterprise is being built today
Agentic AI represents a fundamental shift in how enterprises put AI to work, moving from tools that assist to systems that act autonomously. The technical requirements for production deployment are substantial, but the combination of Amazon Bedrock AgentCore and Claude Sonnet 4.5 makes this transformation accessible.
The infrastructure exists. Organizations are already running agents in production with measurable business impact. The question for enterprise leaders is no longer “Is agentic AI ready?” but rather “How quickly can we capture this advantage?”
Organizations that master agentic AI are improving operational efficiency and reimagining what’s possible in their industries. The agentic enterprise of the future is being built now by teams that combine the right model capabilities with the right operational infrastructure.
Ready to explore what’s possible for your organization? Reach out to your AWS account team to get started with Claude in Amazon Bedrock AgentCore. They can help you assess use cases, design your pilot, and accelerate your path to production agentic AI.
The foundation is ready. The models are proven. The path forward is clear.

About the authors
Jawhny Cooke is a Senior Anthropic Specialist Solutions Architect for Generative AI at AWS. He specializes in integrating and deploying Anthropic models on AWS infrastructure. He partners with customers and AI providers to implement production-grade generative AI solutions through Amazon Bedrock, offering expert guidance on architecture design and system implementation to maximize the potential of these advanced models.
Brad Abrams is Head of Product for the Claude Developer Platform at Anthropic, where he leads API product development and works on building tools that help developers create powerful AI agents. Prior to Anthropic, Brad spent significant time at Google, where he was recognized as one of the most influential technologists in the voice assistant landscape. He also held roles at Microsoft, bringing deep expertise in developer tools and platform ecosystems. Brad holds a Bachelor of Science in Computer Science from North Carolina State University. Throughout his career, he has focused on developer experience, distributed systems, and software product management. Based in Palo Alto, he continues to drive innovation at the intersection of AI capabilities and developer tooling.

Google DeepMind Introduces SIMA 2, A Gemini Powered Generalist Agent F …

Google DeepMind has released SIMA 2 to test how far generalist embodied agents can go inside complex 3D game worlds. SIMA’s (Scalable Instructable Multiworld Agent) new version upgrades the original instruction follower into a Gemini driven system that reasons about goals, explains its plans, and improves from self play in many different environments.

From SIMA 1 to SIMA 2

The first SIMA, released in 2024, learned more than 600 language following skills such as ‘turn left’, ‘climb the ladder’, and ‘open the map’. It controlled commercial games only from rendered pixels and a virtual keyboard and mouse, without any access to game internals. On complex tasks, DeepMind reported a SIMA 1 success rate of about 31 percent, while human players reached about 71 percent on the same benchmark.

SIMA 2 keeps the same embodied interface but replaces the core policy with a Gemini model. According to a TechCrunch article that the system uses Gemini 2.5 Flash Lite as the reasoning engine. This changes SIMA from a direct mapping between pixels and actions into an agent that forms an internal plan, reasons in language, and then executes the necessary action sequence in the game. DeepMind describes this as moving from an instruction follower to an interactive gaming companion that collaborates with the player.

https://deepmind.google/blog/sima-2-an-agent-that-plays-reasons-and-learns-with-you-in-virtual-3d-worlds/

Architecture, Gemini in the control loop

The SIMA 2 architecture integrates Gemini as the agent core. The model receives visual observations and user instructions, infers a high level goal, and produces actions that are sent through the virtual keyboard and mouse interface. Training uses a mix of human demonstration videos with language labels and labels generated by Gemini itself. This supervision lets the agent align its internal reasoning with both human intent and model generated descriptions of behavior.

Because of this training scheme, SIMA 2 can explain what it intends to do and list the steps it will take. In practice, this means the agent can answer questions about its current objective, justify its decisions, and expose an interpretable chain of thought about the environment.

Generalization and performance

The task completion plot shows SIMA 1 at about 31% and SIMA 2 at 62% that value on the main evaluation suite, with humans around the 70% range. Integrating Gemini doubles the performance of the original agent on complex tasks. The important point is not the exact number, it is the shape, the new agent closes most of the measured gap between SIMA 1 and human players on long, language specified missions in the training games.

On held out games such as ASKA and MineDojo, which are never seen during training, the DeepMind team show a similar pattern. SIMA 2 has much higher task completion than SIMA 1 in these environments, which indicates a real gain in zero shot generalization rather than overfitting to a fixed game set. The agent also transfers abstract concepts, for example it can reuse an understanding of ‘mining’ in one title when it is asked to ‘harvest’ in another.

Multimodal instructions

SIMA 2 extends the instruction channel beyond plain text. The DeepMind demonstrations show the agent following spoken commands, reacting to sketches drawn on the screen, and executing tasks from prompts that use only emojis. In one example, the user asks SIMA 2 to go to ‘the house that is the color of a ripe tomato’. The Gemini core reasons that ripe tomatoes are red, then selects and walks to the red house.

Gemini also enables instruction following in multiple natural languages and supports mixed prompts where language and visual cues are combined. For physical AI, robotics devs, this is a concrete multimodal stack, a shared representation links text, audio, images, and in game actions, and the agent uses this representation to ground abstract symbols in concrete control sequences.

Self improvement at scale

One of the main research contributions in SIMA 2 is the explicit self improvement loop. After an initial phase that uses human gameplay as a baseline, the team moves the agent into new games and lets it learn only from its own experience. A separate Gemini model generates new tasks for the agent in each world, and a reward model scores each attempt.

These trajectories are stored in a bank of self generated data. Later generations of SIMA 2 use this data during training, which allows the agent to succeed on tasks where earlier generations failed, without any fresh human demonstrations. This is a concrete example of a multitask, model in the loop data engine, where a language model specifies goals and gives feedback, and the agent converts that feedback into new competent policies.

Genie 3 worlds

To push generalization further, DeepMind combines SIMA 2 with Genie 3, a world model that generates interactive 3D environments from a single image or text prompt. In these virtual worlds, the agent has to orient itself, parse instructions, and act toward goals even though the geometry and assets differ from all training games.

The reported behavior is that SIMA 2 can navigate these Genie 3 scenes, identify objects such as benches and trees, and perform requested actions in a coherent way. This is important for researchers, it shows that a single agent can operate across commercial titles and generated environments, using the same reasoning core and control interface.

Key Takeaways

Gemini centered architecture: SIMA 2 integrates Gemini, reported as Gemini 2.5 Flash Lite, as the core reasoning and planning module, wrapped by a visuomotor control stack that acts from pixels through a virtual keyboard and mouse across many commercial games.

Measured performance jump over SIMA 1: On DeepMind’s main task suite, SIMA 2 roughly doubles SIMA 1’s 31 percent task completion rate and approaches human level performance in training games, while also delivering significantly higher success rates on held out environments such as ASKA and MineDojo.

Multimodal, compositional instruction following: The agent can follow long, compositional instructions and supports multimodal prompts, including speech, sketches, and emojis, by grounding language and symbols in a shared representation over visual observations and in game actions.

Self improvement via model generated tasks and rewards: SIMA 2 uses a Gemini based teacher to generate tasks and a learned reward model to score trajectories, building a growing experience bank that allows later generations of the agent to outperform earlier ones without additional human demonstrations.

Stress testing with Genie 3 and implications for robotics: Coupling SIMA 2 with Genie 3, which synthesizes interactive 3D environments from images or text, shows that the agent can transfer skills to newly generated worlds, supporting DeepMind’s claim that this stack is a concrete step toward general purpose embodied agents and, eventually, more capable real world robots.

Editorial Comments

SIMA 2 is a meaningful systems milestone rather than a simple benchmark win. By embedding a trimmed Gemini 2.5 Flash lite model at the core, DeepMind team demonstrates a practical recipe that joins multimodal perception, language based planning, and a Gemini orchestrated self improving loop, validated both in commercial games and Genie 3 generated environments. Overall, SIMA 2 shows how an embodied Gemini stack can act as a realistic precursor for general purpose robotic agents.

Check out the Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Google DeepMind Introduces SIMA 2, A Gemini Powered Generalist Agent For Complex 3D Virtual Worlds appeared first on MarkTechPost.

AI Interview Series #2: Explain Some of the Common Model Context Proto …

In this part of the Interview Series, we’ll look at some of the common security vulnerabilities in the Model Context Protocol (MCP) — a framework designed to let LLMs safely interact with external tools and data sources. While MCP brings structure and transparency to how models access context, it also introduces new security risks if not properly managed. In this article, we’ll explore three key threats — MCP Tool Poisoning, Rug Pulls, and Tool Hijacking Attacks

Tool Poisoning

A Tool Poisoning Attack happens when an attacker inserts hidden malicious instructions inside an MCP tool’s metadata or description.

Users only see a clean, simplified tool description in the UI.

LLMs, however, see the full tool definition — including hidden prompts, backdoor commands, or manipulated instructions.

This mismatch allows attackers to silently influence the AI into harmful or unauthorized actions.

Tool Hijacking

A Tool Hijacking Attack happens when you connect multiple MCP servers to the same client, and one of them is malicious. The malicious server injects hidden instructions inside its own tool descriptions that try to redirect, override, or manipulate the behavior of tools provided by a trusted server.

In this case, Server B pretends to offer a harmless add() tool, but its hidden instructions try to hijack the email_sender tool exposed by Server A.

MCP Rug Pulls

An MCP Rug Pull happens when a server changes its tool definitions after the user has already approved them. It’s similar to installing a trusted app that later updates itself into malware — the client believes the tool is safe, but its behavior has silently changed behind the scenes.

Because users rarely re-review tool specs, this attack is extremely hard to detect.

AI Interview Series #1: Explain Some LLM Text Generation Strategies Used in LLMs

The post AI Interview Series #2: Explain Some of the Common Model Context Protocol (MCP) Security Vulnerabilities appeared first on MarkTechPost.

Comparing the Top 4 Agentic AI Browsers in 2025: Atlas vs Copilot Mode …

Agentic AI browsers are moving the model from ‘answering about the web’ to operating on the web. In 2025, four AI browsers define this space: OpenAI’s ChatGPT Atlas, Microsoft Edge with Copilot Mode, The Browser Company’s Dia, and Perplexity’s Comet. Each makes different design choices around autonomy, memory, and privacy. This article compares their architectures, capabilities, and risk profiles so various type of users can decide which browser aligns with their workflows.

What are Agentic Browsers?

Agentic browsers are not just ‘chat over a page’. They expose the browser’s DOM (Document Object Model), tab graph, and history to an AI model and allow it to:

Read and reason over multiple tabs

Maintain task context across time

Take actions such as navigating, filling forms, and completing workflows

OpenAI ChatGPT Atlas, Microsoft Edge Copilot Mode, The Browser Company’s Dia, and Perplexity’s Comet all do this, but with different tradeoffs in autonomy, memory, and security.

High-level comparison

Atlas is the most fully agentic: deep ChatGPT integration, rich browser control, strong but complex memory and privacy story.

Copilot Mode is an incremental but significant extension to Edge: unified Copilot, cross-tab reasoning, early ‘Actions’ for automation, still conservative compared with Atlas and Comet.

Dia is an AI-first browser built on Chromium, optimized for reading, writing, and structured workflows with privacy-first defaults and intentionally limited autonomy.

Comet is a highly agentic personal assistant browser with deep workflow automation, a local-data narrative, and currently the most aggressive legal and security risk profile.

The rest of the article unpacks these differences in a more technical way.

1. ChatGPT Atlas (OpenAI): AI-native browser with full agent mode

1.1 Architecture

Atlas is a dedicated AI browser built around ChatGPT rather than a standard Chromium shell with an extension. It runs on Chromium but wraps it in OpenAI’s OWL process architecture, which separates the rendering engine from the Atlas application and agent layer.

Key characteristics:

macOS only at launch, with Windows, iOS, and Android ‘coming soon’.

ChatGPT is exposed everywhere: omnibox, main panel, and a ChatGPT sidebar that can see the current page and tabs.

This gives Atlas a first-class API into:

Current tab DOM and visible content

Tab list and navigation history

User queries and previous conversation state

1.2 Agent mode: real browser control

Agent Mode is the key differentiator. For Plus / Pro / Business users, Atlas can execute multi-step workflows:

Open and close tabs, follow links, and switch sites

Fill out forms and online applications

Book reservations such as hotels and restaurants

Compare products across multiple sites and return structured summaries

Constraints:

Agent mode cannot access local files or the OS, and cannot download or execute local programs. It is sandboxed inside the browser.

Actions require explicit user consent; Atlas surfaces prompts like ‘Should I start clicking and filling these forms’ before executing workflows.

1.3 Memory and privacy

Atlas introduces browser memories:

It stores filtered summaries of visited pages and inferred user intent, not full page captures. Summaries are retained for about 30 days, enabling queries like ‘reopen the reports I read yesterday’ or ‘continue the Athens itinerary plan’.

Memories are opt-in and can be viewed, edited, or deleted. Memory can be disabled globally or on specific sites, and Atlas supports incognito.

OpenAI also added parental controls that let guardians disable both browser memories and agent mode for child accounts.

Critical points:

Atlas still needs to transmit page snippets and metadata to OpenAI’s servers for summarization, which means sensitive content can be exposed if protections fail.

Security researchers have already demonstrated prompt-injection attacks that exploit Atlas’s omnibox and agent context, confirming that highly agentic browsing increases the attack surface.

1.4 Pricing and fit

Atlas is free to install for ChatGPT users on macOS.

Agent Mode is only available on paid ChatGPT tiers (Plus, Pro, Business, Enterprise).

Fit:

Best for users who want maximum in-browser automation and are comfortable with cloud-centric data handling and a still-evolving security posture.

2. Copilot Mode in Microsoft Edge: tab-reasoning with controlled autonomy

2.1 Architecture

Copilot Mode is Microsoft’s AI layer inside Edge, not a separate browser. It exposes:

A unified Copilot box on new tabs for chat, search, and navigation

Deep integration with Edge context (open tabs, history, and some browser settings) when users opt in.

Microsoft also ties Copilot Mode into:

Journeys: topic-centric clusters over browsing history, which Copilot can summarize and re-open.

Copilot Actions: an early agentic layer capable of actions like clearing cache, unsubscribing from mailing lists, and booking reservations in preview.

2.2 Agentic behavior

Compared with Atlas:

Copilot Mode can reason across multiple tabs, summarize and compare them, and assist with structured tasks like trip planning or multi-site research.

Actions Preview extends this into partially agentic flows, such as booking a restaurant or filling forms, but current evaluations show inconsistent reliability and occasional ‘hallucinated’ completions of tasks that were not successfully executed.

Crucially, Copilot Mode remains more constrained than Atlas or Comet:

It does not expose an openly programmable DOM-level agent with free cursor control

Action templates are narrower and guarded, particularly for email and account-sensitive operations

2.3 Data, privacy, and enterprise posture

Edge with Copilot Mode is clearly aimed at enterprise adoption:

Copilot access to tab and history data is explicitly permissioned; users can disable history-based personalization, Copilot context, and Copilot Mode entirely.

Microsoft integrates Prompt Shields and Azure AI safety layers to mitigate prompt injection and jailbreak attempts.

Fit:

Appropriate where organizations want AI-assisted browsing and cross-tab reasoning while keeping automation scoped and more auditable than a fully agentic browser.

3. Dia (The Browser Company): AI-first, Chromium-based, privacy-forward

3.1 Architecture and UX

Dia is The Browser Company’s AI-centric successor to Arc, built on Chromium and currently available on macOS only.

Core design choices:

The canonical interaction is ‘chat with your tabs‘: Dia’s assistant can read open tabs, referenced tabs, and selections, and answer questions or transform content in place.

Dia includes a Skills system, where users define reusable prompt ‘scripts’ and workflows for tasks like note-taking or research templates.

Dia’s UX is optimized for:

Reading and understanding long-form content

Writing and editing in-page

Learning workflows (tutoring, flashcards, argument comparison)

3.2 Memory and ‘local-first’ privacy

Dia’s main differentiation is its privacy posture:

Browsing history, chats, bookmarks, and saved content are stored locally and encrypted, with data sent to servers only when required to answer a specific query.

The Memory feature stores summaries and learned preferences, but users can disable memory entirely in settings or control what contexts are shared.

The net effect is an AI browser that tries to behave more like a local knowledge layer with scoped cloud calls rather than a continuous telemetry stream.

3.3 Agentic scope and constraints

Dia is intentionally less agentic than Atlas or Comet:

The assistant can read and summarize pages, transform text, generate content, and run Skills over the current tab set.

Current public builds do not expose a general DOM automation agent capable of open-ended clicking and form submission across arbitrary sites.

In practice, Dia behaves as a high-context copilot rather than a fully autonomous web operator. This is aligned with the company’s positioning and with Atlassian’s stated intent after acquiring The Browser Company, which emphasizes individual knowledge worker workflows over transactional automation.

3.4 Pricing and availability

Dia now ships to all Mac users, no invite required, as of October 2025.

Free tier: Core AI chat, Skills, and Memory, with usage limits.

Dia Pro at $20/month unlocks effectively unlimited AI chat usage within terms of use.

Fit:

Strong for educational and writing-heavy workflows, for users who want AI-augmented browsing without handing an agent broad control over the web session.

4. Comet (Perplexity): highly agentic assistant browser with heavy risk surface

4.1 Architecture and capabilities

Comet is Perplexity’s AI browser built on Chromium, positioned as a personal AI assistant and ‘thinking partner‘ rather than a simple search UI.

The Comet Assistant can:

Summarize and explore any page

Execute multi-step workflows for research, coding, meeting prep, and e-commerce

Manage email and calendar via integrated connectors

Handle complex tasks like comparing products, reading reviews, and moving all the way to checkout.

Recent updates extend the agent to work longer and across larger jobs, emphasizing persistent, agentic behavior over many tabs and time periods.

4.2 Data model and privacy claims

Perplexity’s Comet Privacy Notice and product pages claim:

Browsing data, cookies, and saved credentials are stored locally on the device by default.

Users can delete browsing data and stored credentials from Comet settings, and manage cookie behavior.

Integration with 1Password keeps vaults end-to-end encrypted and opaque to Perplexity.

So the official architecture is a hybrid: local browser state with selective context uploads to Comet’s servers and Perplexity’s search models.

However, multiple independent reviews argue that despite these controls, the combination of: Deep integration with third-party services (Gmail, calendar, financial accounts) and high agent autonomy over those services produces a large effective privacy risk envelope, especially for corporate data.

4.3 Security incidents and legal pressure

Comet currently has the most visible security and legal issues among the four:

Indirect prompt-injection / ‘CometJacking‘: LayerX and other researchers showed that malicious URLs and embedded prompts could hijack Comet’s assistant, exfiltrating data from connected services and even performing fraudulent actions.

Although Perplexity has patched specific vulnerabilities, security audits from Brave, Guardio, and others still recommend extreme caution for sensitive workloads.

Amazon lawsuit: Amazon is suing Perplexity over Comet’s ‘agentic shopping’ behavior, alleging that automated shopping sessions accessed customer accounts and impersonated human browsing, violating platform rules and harming personalization systems.

4.4 Pricing and availability

As of October–November 2025, Comet is free to download globally; earlier Max-only and Pro-only restrictions have been removed.

Perplexity monetizes via Pro / Max subscriptions for higher model tiers and via Comet Plus (~$5 / month), which grants access to curated news and publisher content and is bundled into Pro / Max.

Fit:

Very strong for users who want maximum automation across research, communications, and purchases, and who are comfortable operating at the bleeding edge of the security and platform-policy risk curve.

Comparison Table

DimensionChatGPT Atlas (OpenAI)Edge + Copilot Mode (Microsoft)Dia (The Browser Company)Comet (Perplexity)Engine / platformChromium-based; Atlas shell with OWL architecture; macOS now, Windows / mobile planned Edge (Chromium) on Windows and macOS with optional Copilot Mode Chromium-based AI browser; macOS only, GA, no invite; Windows not yet announced Chromium-based browser with integrated Perplexity search and assistant; desktop global, mobile rolling outAgentic autonomyHigh: Agent Mode can click, navigate, fill forms, book reservations, and chain multi-step workflows inside the browser Medium: cross-tab reasoning and Actions; can perform some transactional steps but with limited scope and reliabilityLow–medium: chat, Skills, and memory over tabs; no general agent that freely manipulates arbitrary sites; autonomy intentionally constrained High: Comet Assistant executes long-running workflows across browsing, email, calendar, and e-commerce, including end-to-end shopping and planning flows Memory / personalizationBrowser memories retain summarized context for ~30 days; persistent task context across sessions, opt-in and user-controllableJourneys over history, context sharing for Copilot is opt-in; personalization tied to Microsoft account and privacy controls Local encrypted storage of history, chats, bookmarks; Dia Memory for personalization with ability to limit shared contextLocal-first browsing data plus cloud-side models; settings allow deleting local data and tuning collectionBest-fit use casesComplex research, automation-heavy workflows, and agent experiments where strong autonomy outweighs riskEveryday browsing with AI summaries and research assistance in Microsoft-centric environmentsLearning, writing, and planning where privacy and structured Skills are more important than full automationPower users who want a personal operator for browsing, communication, and shopping, and who will actively manage security and policy risk

Which browser to choose in 2025?

Pick Atlas when you want to explore the frontier of in-browser agents. It offers the richest action surface and memory model, at the cost of greater complexity in safety and compliance design.

Pick Edge + Copilot Mode when you need incremental AI assistance in a browser that already fits Microsoft-centric enterprise governance, and you prefer scoped agents over unconstrained ones.

Pick Dia when your primary workload is reading, learning, and writing, and you want strong local-first guarantees and explicit control over what information the model sees, with minimal automation.

Pick Comet only if you explicitly want a high-autonomy personal operator in your browser and are willing to track security advisories and platform policies closely.

References:

OpenAI – Introducing ChatGPT Atlashttps://openai.com/index/introducing-chatgpt-atlas/

OpenAI – How we built OWL, the new architecture behind our browserhttps://openai.com/index/building-chatgpt-atlas/

Microsoft – AI browser innovation with Copilot Mode in Edgehttps://www.microsoft.com/en-us/microsoft-copilot/for-individuals/do-more-with-ai/ai-for-daily-life/ai-browser-innovation-with-copilot-in-edge

Microsoft – Copilot Mode | Microsoft Edgehttps://www.microsoft.com/en-us/edge/copilot-mode

Dia Browser – Official sitehttps://www.diabrowser.com/

Dia Browser – Skills Galleryhttps://www.diabrowser.com/skills

9to5Mac – Dia, The Browser Company’s AI-powered browser, is now generally available on macOShttps://9to5mac.com/2025/10/08/dia-the-browser-companys-ai-powered-browser-is-now-generally-available-on-macos/

Perplexity – Comet Browser: a Personal AI Assistanthttps://www.perplexity.ai/comet/

1Password – Secure credentials on Comet with 1Passwordhttps://1password.com/partners/perplexity

Reuters – Amazon sues Perplexity over “agentic” shopping toolhttps://www.reuters.com/business/retail-consumer/perplexity-receives-legal-threat-amazon-over-agentic-ai-shopping-tool-2025-11-04/

The post Comparing the Top 4 Agentic AI Browsers in 2025: Atlas vs Copilot Mode vs Dia vs Comet appeared first on MarkTechPost.

Cerebras Releases MiniMax-M2-REAP-162B-A10B: A Memory Efficient Versio …

Cerebras has released MiniMax-M2-REAP-162B-A10B, a compressed Sparse Mixture-of-Experts (SMoE) Causal Language Model derived from MiniMax-M2, using the new Router weighted Expert Activation Pruning (REAP) method. The model keeps the behavior of the original 230B total, 10B active MiniMax M2, while pruning experts and reducing memory for deployment focused workloads such as coding agents and tool calling.

Architecture and core specifications

MiniMax-M2-REAP-162B-A10B has these key properties:

Base model: MiniMax-M2

Compression method: REAP, Router weighted Expert Activation Pruning

Total parameters: 162B

Active parameters per token: 10B

Layers: 62 transformer blocks

Attention heads per layer: 48

Experts: 180 experts, obtained by pruning a 256 expert configuration

Activated experts per token: 8

Context length: 196,608 tokens

License: modified MIT, derived from MiniMaxAI MiniMax M2

The SMoE design means that the model stores 162B parameters, but each token only routes through a small set of experts, so the effective compute cost per token is similar to a 10B dense model. MiniMax M2 itself is positioned as an MoE model built for coding and agentic workflows, with 230B total parameters and 10B active, which this checkpoint inherits.

How REAP compresses MiniMax-M2?

MiniMax-M2-REAP-162B-A10B is created by applying REAP uniformly across all MoE blocks of MiniMax M2, at a 30 percent expert pruning rate.

The REAP method defines a saliency score for each expert that combines:

Router gate values: How often and how strongly the router selects that expert

Expert activation norms: The magnitude of the expert output when active

Experts that contribute minimally to the layer output, under this combined criterion, are removed. The remaining experts keep their original weights and the router keeps separate gates for each of them. This is one shot compression, there is no extra fine tuning after pruning in the method definition.

A core theoretical result in the REAP’s research paper is that expert merging with summed gates causes functional subspace collapse. When experts are merged, the router loses its independent, input dependent control over those experts, so a single merged expert must approximate an input dependent mixture that was originally expressed through multiple experts. The research team proves that, whenever the router policy depends on the input and the experts are not identical, this introduces irreducible error. In contrast, pruning removes some experts but preserves independent control of the survivors, so the error scales with the gate weight of the removed experts.

Across a set of SMoE models in the 20B to 1T parameter range, REAP consistently outperforms expert merging and other pruning criteria on generative benchmarks such as code generation, mathematical reasoning and tool calling, especially at 50 percent compression.

Accuracy under 30 percent expert pruning

The MiniMax-M2-REAP-162B-A10B model gets compared on three checkpoints on standard coding, reasoning and agentic benchmarks:

MiniMax-M2 (230B, base model)

MiniMax-M2-REAP-172B-A10B, 25 percent pruning

MiniMax-M2-REAP-162B-A10B, 30 percent pruning

https://huggingface.co/cerebras/MiniMax-M2-REAP-162B-A10B

On coding benchmarks such as HumanEval, HumanEval Plus, MBPP and MBPP Plus, the 162B REAP model stays very close to the base model. HumanEval sits around 90% range, and MBPP stays in the 80% range, with the 172B and 162B models essentially tracking the original MiniMax-M2 within a few points.

On reasoning benchmarks such as AIME 25 and MATH 500, there are small shifts between the three models, but there is no collapse at 30 percent pruning and the 162B checkpoint remains competitive with the base model.

On tool calling and agentic evaluation, represented by τ2 bench in a telecom setting, the 162B REAP model again matches the base model within small variance. The model card explicitly states that this checkpoint keeps almost identical performance while being about 30 percent lighter in parameter count.

These results line up with the broader REAP study, which reports near lossless compression for code generation and tool calling on several large SMoE architectures when pruning experts using the REAP criterion.

Deployment, memory usage and observed throughput

Cerebras provides a direct vLLM serve example and positions MiniMax-M2-REAP-162B-A10B as a drop in model for the existing MiniMax M2 integration.

Copy CodeCopiedUse a different Browservllm serve cerebras/MiniMax-M2-REAP-162B-A10B
–tensor-parallel-size 8
–tool-call-parser minimax_m2
–reasoning-parser minimax_m2_append_think
–trust-remote-code
–enable_expert_parallel
–enable-auto-tool-choice

If the run hits memory limits, the card recommends lowering –max-num-seqs, for example to 64, to keep batch size in check on a given GPU.

Key Takeaways

SMoE architecture with efficient compute: MiniMax-M2-REAP-162B-A10B is a Sparse Mixture of Experts model with 162B total parameters and 10B active parameters per token, so the compute cost per token is close to a 10B dense model while keeping frontier scale capacity.

REAP expert pruning keeps behavior of MiniMax-M2: The model is produced by applying REAP Router weighted Expert Activation Pruning to MiniMax-M2 at roughly 30 percent expert pruning, pruning experts based on router gate values and expert activation norms while leaving surviving experts and router structure intact.

Near lossless accuracy at 30 percent compression: On coding benchmarks such as HumanEval and MBPP, and on reasoning benchmarks such as AIME25 and MATH 500, the 162B REAP variant tracks the 230B MiniMax-M2 and a 172B REAP variant within a few points, showing near lossless compression for code, reasoning and tool use.

Pruning outperforms expert merging for generative SMoE: The REAP study shows that pruning experts using a saliency criterion avoids the functional subspace collapse seen with expert merging in generative tasks, and performs better across large SMoE models in the 22B to about 1T parameter range.

Comparison Table

Image source: Marktechpost.com

Editorial Comments

Cerebras’ release of MiniMax-M2-REAP-162B-A10B is a strong signal that Router weighted Expert Activation Pruning is ready for real workloads, not just as a research curiosity. The checkpoint shows that a 30 percent expert pruning schedule can keep MiniMax-M2 230B-A10B behavior almost intact while cutting memory and preserving long context coding, reasoning and tool calling performance, which is exactly what SMoE researchers need for practical deployment. Overall, Cerebras is quietly turning expert pruning into production infrastructure for frontier class SMoE models.

Check out the Model Weights. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Cerebras Releases MiniMax-M2-REAP-162B-A10B: A Memory Efficient Version of MiniMax-M2 for Long Context Coding Agents appeared first on MarkTechPost.

How to Design a Fully Interactive, Reactive, and Dynamic Terminal-Base …

In this tutorial, we build an advanced interactive dashboard using Textual, and we explore how terminal-first UI frameworks can feel as expressive and dynamic as modern web dashboards. As we write and run each snippet, we actively construct the interface piece by piece, widgets, layouts, reactive state, and event flows, so we can see how Textual behaves like a live UI engine right inside Google Colab. By the end, we notice how naturally we can blend tables, trees, forms, and progress indicators into a cohesive application that feels fast, clean, and responsive. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser!pip install textual textual-web nest-asyncio

from textual.app import App, ComposeResult
from textual.containers import Container, Horizontal, Vertical
from textual.widgets import (
Header, Footer, Button, DataTable, Static, Input,
Label, ProgressBar, Tree, Select
)
from textual.reactive import reactive
from textual import on
from datetime import datetime
import random

class StatsCard(Static):
value = reactive(0)

def __init__(self, title: str, *args, **kwargs):
super().__init__(*args, **kwargs)
self.title = title

def compose(self) -> ComposeResult:
yield Label(self.title)
yield Label(str(self.value), id=”stat-value”)

def watch_value(self, new_value: int) -> None:
if self.is_mounted:
try:
self.query_one(“#stat-value”, Label).update(str(new_value))
except Exception:
pass

We set up the environment and import all the necessary components to build our Textual application. As we define the StatsCard widget, we establish a reusable component that reacts to changes in value and updates itself automatically. We begin to see how Textual’s reactive system lets us create dynamic UI elements with minimal effort. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass DataDashboard(App):
CSS = “””
Screen { background: $surface; }
#main-container { height: 100%; padding: 1; }
#stats-row { height: auto; margin-bottom: 1; }
StatsCard { border: solid $primary; height: 5; padding: 1; margin-right: 1; width: 1fr; }
#stat-value { text-style: bold; color: $accent; content-align: center middle; }
#control-panel { height: 12; border: solid $secondary; padding: 1; margin-bottom: 1; }
#data-section { height: 1fr; }
#left-panel { width: 30; border: solid $secondary; padding: 1; margin-right: 1; }
DataTable { height: 100%; border: solid $primary; }
Input { margin: 1 0; }
Button { margin: 1 1 1 0; }
ProgressBar { margin: 1 0; }
“””

BINDINGS = [
(“d”, “toggle_dark”, “Toggle Dark Mode”),
(“q”, “quit”, “Quit”),
(“a”, “add_row”, “Add Row”),
(“c”, “clear_table”, “Clear Table”),
]

total_rows = reactive(0)
total_sales = reactive(0)
avg_rating = reactive(0.0)

We define the DataDashboard class and configure global styles, key bindings, and reactive attributes. We decide how the app should look and behave right from the top, giving us full control over themes and interactivity. This structure helps us create a polished dashboard without writing any HTML or JS. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser def compose(self) -> ComposeResult:
yield Header(show_clock=True)

with Container(id=”main-container”):
with Horizontal(id=”stats-row”):
yield StatsCard(“Total Rows”, id=”card-rows”)
yield StatsCard(“Total Sales”, id=”card-sales”)
yield StatsCard(“Avg Rating”, id=”card-rating”)

with Vertical(id=”control-panel”):
yield Input(placeholder=”Product Name”, id=”input-name”)
yield Select(
[(“Electronics”, “electronics”),
(“Books”, “books”),
(“Clothing”, “clothing”)],
prompt=”Select Category”,
id=”select-category”
)
with Horizontal():
yield Button(“Add Row”, variant=”primary”, id=”btn-add”)
yield Button(“Clear Table”, variant=”warning”, id=”btn-clear”)
yield Button(“Generate Data”, variant=”success”, id=”btn-generate”)
yield ProgressBar(total=100, id=”progress”)

with Horizontal(id=”data-section”):
with Container(id=”left-panel”):
yield Label(“Navigation”)
tree = Tree(“Dashboard”)
tree.root.expand()
products = tree.root.add(“Products”, expand=True)
products.add_leaf(“Electronics”)
products.add_leaf(“Books”)
products.add_leaf(“Clothing”)
tree.root.add_leaf(“Reports”)
tree.root.add_leaf(“Settings”)
yield tree

yield DataTable(id=”data-table”)

yield Footer()

We compose the entire UI layout, arranging containers, cards, form inputs, buttons, a navigation tree, and a data table. As we structure these components, we watch the interface take shape exactly the way we envision it. This snippet lets us design the visual skeleton of the dashboard in a clean, declarative manner. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser def on_mount(self) -> None:
table = self.query_one(DataTable)
table.add_columns(“ID”, “Product”, “Category”, “Price”, “Sales”, “Rating”)
table.cursor_type = “row”
self.generate_sample_data(5)
self.set_interval(0.1, self.update_progress)

def generate_sample_data(self, count: int = 5) -> None:
table = self.query_one(DataTable)
categories = [“Electronics”, “Books”, “Clothing”]
products = {
“Electronics”: [“Laptop”, “Phone”, “Tablet”, “Headphones”],
“Books”: [“Novel”, “Textbook”, “Magazine”, “Comic”],
“Clothing”: [“Shirt”, “Pants”, “Jacket”, “Shoes”]
}

for _ in range(count):
category = random.choice(categories)
product = random.choice(products[category])
row_id = self.total_rows + 1
price = round(random.uniform(10, 500), 2)
sales = random.randint(1, 100)
rating = round(random.uniform(1, 5), 1)

table.add_row(
str(row_id),
product,
category,
f”${price}”,
str(sales),
str(rating)
)

self.total_rows += 1
self.total_sales += sales

self.update_stats()

def update_stats(self) -> None:
self.query_one(“#card-rows”, StatsCard).value = self.total_rows
self.query_one(“#card-sales”, StatsCard).value = self.total_sales

if self.total_rows > 0:
table = self.query_one(DataTable)
total_rating = sum(float(row[5]) for row in table.rows)
self.avg_rating = round(total_rating / self.total_rows, 2)
self.query_one(“#card-rating”, StatsCard).value = self.avg_rating

def update_progress(self) -> None:
progress = self.query_one(ProgressBar)
progress.advance(1)
if progress.progress >= 100:
progress.progress = 0

We implement all the logic for generating data, computing statistics, animating progress, and updating cards. We see how quickly we can bind backend logic to frontend components using Textual’s reactive model. This step makes the dashboard feel alive as numbers update instantly and progress bars animate smoothly. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser @on(Button.Pressed, “#btn-add”)
def handle_add_button(self) -> None:
name_input = self.query_one(“#input-name”, Input)
category = self.query_one(“#select-category”, Select).value

if name_input.value and category:
table = self.query_one(DataTable)
row_id = self.total_rows + 1
price = round(random.uniform(10, 500), 2)
sales = random.randint(1, 100)
rating = round(random.uniform(1, 5), 1)

table.add_row(
str(row_id),
name_input.value,
str(category),
f”${price}”,
str(sales),
str(rating)
)

self.total_rows += 1
self.total_sales += sales
self.update_stats()
name_input.value = “”

@on(Button.Pressed, “#btn-clear”)
def handle_clear_button(self) -> None:
table = self.query_one(DataTable)
table.clear()
self.total_rows = 0
self.total_sales = 0
self.avg_rating = 0
self.update_stats()

@on(Button.Pressed, “#btn-generate”)
def handle_generate_button(self) -> None:
self.generate_sample_data(10)

def action_toggle_dark(self) -> None:
self.dark = not self.dark

def action_add_row(self) -> None:
self.handle_add_button()

def action_clear_table(self) -> None:
self.handle_clear_button()

if __name__ == “__main__”:
import nest_asyncio
nest_asyncio.apply()
app = DataDashboard()
app.run()

We connect UI events to backend actions using button handlers, keyboard shortcuts, and app-level functions. As we run the app, we interact with a fully functional dashboard that responds instantly to every click and command. This snippet completes the application and demonstrates how easily Textual enables us to build dynamic, state-driven UIs.

In conclusion, we see the whole dashboard come together in a fully functional, interactive form that runs directly from a notebook environment. We experience firsthand how Textual lets us design terminal UIs with the structure and feel of web apps, while staying entirely in Python. This tutorial leaves us confident that we can extend this foundation, even adding charts, API feeds, and multi-page navigation, as we continue to experiment with Textual’s modern reactive UI capabilities.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How to Design a Fully Interactive, Reactive, and Dynamic Terminal-Based Data Dashboard Using Textual? appeared first on MarkTechPost.

MBZUAI Researchers Introduce PAN: A General World Model For Interactab …

Most text to video models generate a single clip from a prompt and then stop. They do not keep an internal world state that persists as actions arrive over time. PAN, a new model from MBZUAI’s Institute of Foundation Models, is designed to fill that gap by acting as a general world model that predicts future world states as video, conditioned on history and natural language actions.

https://arxiv.org/pdf/2511.09057

From video generator to interactive world simulator

PAN is defined as a general, interactable, long horizon world model. It maintains an internal latent state that represents the current world, then updates that state when it receives a natural language action such as ‘turn left and speed up’ or ‘move the robot arm to the red block.’ The model then decodes the updated state into a short video segment that shows the consequence of that action. This cycle repeats, so the same world state evolves across many steps.

This design allows PAN to support open domain, action conditioned simulation. It can roll out counterfactual futures for different action sequences. An external agent can query PAN as a simulator, compare predicted futures, and choose actions based on those predictions.

GLP architecture, separating what happens from how it looks

The base of PAN is the Generative Latent Prediction, GLP, architecture. GLP separates world dynamics from visual rendering. First, a vision encoder maps images or video frames into a latent world state. Second, an autoregressive latent dynamics backbone based on a large language model predicts the next latent state, conditioned on history and the current action. Third, a video diffusion decoder reconstructs the corresponding video segment from that latent state.

In PAN, the vision encoder and backbone are built on Qwen2.5-VL-7B-Instruct. The vision tower tokenizes frames into patches and produces structured embeddings. The language backbone runs over a history of world states and actions, plus learned query tokens, and outputs the latent representation of the next world state. These latents live in the shared multimodal space of the VLM, which helps ground the dynamics in both text and vision.

The video diffusion decoder is adapted from Wan2.1-T2V-14B, a diffusion transformer for high fidelity video generation. The research team trains this decoder with a flow matching objective, using one thousand denoising steps and a Rectified Flow formulation. The decoder conditions on both the predicted latent world state and the current natural language action, with a dedicated cross attention stream for the world state and another for the action text.

https://arxiv.org/pdf/2511.09057

Causal Swin DPM and sliding window diffusion

Naively chaining single shot video models by conditioning only on the last frame leads to local discontinuities and rapid quality degradation over long rollouts. PAN addresses this with Causal Swin DPM, which augments the Shift Window Denoising Process Model with chunk wise causal attention.

The decoder operates on a sliding temporal window that holds two chunks of video frames at different noise levels. During denoising, one chunk moves from high noise to clean frames and then leaves the window. A new noisy chunk enters at the other end. Chunk wise causal attention ensures that the later chunk can only attend to the earlier one, not to unseen future actions. This keeps transitions between chunks smooth and reduces error accumulation over long horizons.

PAN also adds controlled noise to the conditioning frame, rather than using a perfectly sharp frame. This suppresses incidental pixel details that do not matter for dynamics and encourages the model to focus on stable structure such as objects and layout.

https://arxiv.org/pdf/2511.09057

Training stack and data construction

PAN is trained in two stages. In the first stage, the research team adapts Wan2.1 T2V 14B into the Causal Swin DPM architecture. They train the decoder in BFloat16 with AdamW, a cosine schedule, gradient clipping, FlashAttention3 and FlexAttention kernels, and a hybrid sharded data parallel scheme across 960 NVIDIA H200 GPUs.

In the second stage, they integrate the frozen Qwen2.5 VL 7B Instruct backbone with the video diffusion decoder under the GLP objective. The vision language model remains frozen. The model learns query embeddings and the decoder so that predicted latents and reconstructed videos stay consistent. This joint training also uses sequence parallelism and Ulysses style attention sharding to handle long context sequences. Early stopping ends training after 1 epoch once validation converges, even though the schedule allows 5 epochs.

Training data comes from widely used publicly accessible video sources that cover everyday activities, human object interactions, natural environments, and multi agent scenarios. Long form videos are segmented into coherent clips using shot boundary detection. A filtering pipeline removes static or overly dynamic clips, low aesthetic quality, heavy text overlays, and screen recordings using rule based metrics, pretrained detectors, and a custom VLM filter. The research team then re-captions clips with dense, temporally grounded descriptions that emphasize motion and causal events.

Benchmarks, action fidelity, long horizon stability, planning

The research team evaluates the model along three axes, action simulation fidelity, long horizon forecast, and simulative reasoning and planning, against both open source and commercial video generators and world models. Baselines include WAN 2.1 and 2.2, Cosmos 1 and 2, V JEPA 2, and commercial systems such as KLING, MiniMax Hailuo, and Gen 3.

For action simulation fidelity, a VLM based judge scores how well the model executes language specified actions while maintaining a stable background. PAN reaches 70.3% accuracy on agent simulation and 47% on environment simulation, for an overall score of 58.6%. It achieves the highest fidelity among open source models and surpasses most commercial baselines.

For long horizon forecast, the research team measures Transition Smoothness and Simulation Consistency. Transition Smoothness uses optical flow acceleration to quantify how smooth motion is across action boundaries. Simulation Consistency uses metrics inspired by WorldScore to monitor degradation over extended sequences. PAN scores 53.6% on Transition Smoothness and 64.1% on Simulation Consistency and exceeds all baselines, including KLING and MiniMax, on these metrics.

For simulative reasoning and planning, PAN is used as an internal simulator inside an OpenAI-o3 based agent loop. In step wise simulation, PAN achieves 56.1% accuracy, the best among open source world models.

https://arxiv.org/pdf/2511.09057

Key Takwaways

PAN implements the Generative Latent Prediction architecture, combining a Qwen2.5-VL-7B based latent dynamics backbone with a Wan2.1-T2V-14B based video diffusion decoder, to unify latent world reasoning and realistic video generation.

The Causal Swin DPM mechanism introduces a sliding window, chunk wise causal denoising process that conditions on partially noised past chunks, which stabilizes long horizon video rollouts and reduces temporal drift compared to naive last frame conditioning.

PAN is trained in two stages, first adapting the Wan2.1 decoder to Causal Swin DPM on 960 NVIDIA H200 GPUs with a flow matching objective, then jointly training the GLP stack with a frozen Qwen2.5-VL backbone and learned query embeddings plus decoder.

The training corpus consists of large scale video action pairs from diverse domains, processed with segmentation, filtering, and dense temporal recaptioning, enabling PAN to learn action conditioned, long range dynamics instead of isolated short clips.

PAN achieves state of the art open source results on action simulation fidelity, long horizon forecasting, and simulative planning, with reported scores such as 70.3% agent simulation, 47% environment simulation, 53.6% transition smoothness, and 64.1% simulation consistency, while remaining competitive with leading commercial systems.

Comparison Table

DimensionPANCosmos video2world WFMWan2.1 T2V 14BV JEPA 2OrganizationMBZUAI Institute of Foundation ModelsNVIDIA ResearchWan AI and Open LaboratoryMeta AIPrimary roleGeneral world model for interactive, long horizon world simulation with natural language actionsWorld foundation model platform for Physical AI with video to world generation for control and navigationHigh quality text to video and image to video generator for general content creation and editingSelf supervised video model for understanding, prediction and planning tasksWorld model framingExplicit GLP world model, latent state, action, and next observation defined, focuses on simulative reasoning and planningDescribed as world foundation model that generates future video worlds from past video and control prompt, aimed at Physical AI, robotics, driving, navigationFramed as video generation model, not primarily as world model, no persistent internal world state described in docsJoint embedding predictive architecture for video, focuses on latent prediction rather than explicit generative supervision in observation spaceCore architectureGLP stack, vision encoder from Qwen2.5 VL 7B, LLM based latent dynamics backbone, video diffusion decoder with Causal Swin DPMFamily of diffusion based and autoregressive world models, with video2world generation, plus diffusion decoder and prompt upsampler based on a language modelSpatio temporal variational autoencoder and diffusion transformer T2V model at 14 billion parameters, supports multiple generative tasks and resolutionsJEPA style encoder plus predictor architecture that matches latent representations of consecutive video observationsBackbone and latent spaceMultimodal latent space from Qwen2.5 VL 7B, used both for encoding observations and for autoregressive latent prediction under actionsToken based video2world model with text prompt conditioning and optional diffusion decoder for refinement, latent space details depend on model variantLatent space from VAE plus diffusion transformer, driven mainly by text or image prompts, no explicit agent action sequence interfaceLatent space built from self supervised video encoder with predictive loss in representation space, not generative reconstruction lossAction or control inputNatural language actions in dialogue format, applied at every simulation step, model predicts next latent state and decodes video conditioned on action and historyControl input as text prompt and optionally camera pose for navigation and downstream tasks such as humanoid control and autonomous drivingText prompts and image inputs for content control, no explicit multi step agent action interface described as world model controlDoes not focus on natural language actions, used more as visual representation and predictor module inside larger agents or plannersLong horizon designCausal Swin DPM sliding window diffusion, chunk wise causal attention, conditioning on slightly noised last frame to reduce drift and maintain stable long horizon rolloutsVideo2world model generates future video given past window and prompt, supports navigation and long sequences but the paper does not describe a Causal Swin DPM style mechanismCan generate several seconds at 480 P and 720 P, focuses on visual quality and motion, long horizon stability is evaluated through Wan Bench but without explicit world state mechanismLong temporal reasoning comes from predictive latent modeling and self supervised training, not from generative video rollouts with explicit diffusion windowsTraining data focusLarge scale video action pairs across diverse physical and embodied domains, with segmentation, filtering and dense temporal recaptioning for action conditioned dynamicsMix of proprietary and public Internet videos focused on Physical AI categories such as driving, manipulation, human activity, navigation and nature dynamics, with a dedicated curation pipelineLarge open domain video and image corpora for general visual generation, with Wan Bench evaluation prompts, not targeted specifically at agent environment rolloutsLarge scale unlabelled video data for self supervised representation learning and prediction, details in V JEPA 2 paper

Editorial Comments

PAN is an important step because it operationalizes Generative Latent Prediction with production scale components such as Qwen2.5-VL-7B and Wan2.1-T2V-14B, then validates this stack on well defined benchmarks for action simulation, long horizon forecasting, and simulative planning. The training and evaluation pipeline is clearly documented by the research team, the metrics are reproducible, and the model is released within a transparent world modeling framework rather than as an opaque video demo. Overall, PAN shows how a vision language backbone plus diffusion video decoder can function as a practical world model instead of a pure generative toy.

Check out the Paper, Technical details and Project. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post MBZUAI Researchers Introduce PAN: A General World Model For Interactable Long Horizon Simulation appeared first on MarkTechPost.

OpenAI Researchers Train Weight Sparse Transformers to Expose Interpre …

If neural networks are now making decisions everywhere from code editors to safety systems, how can we actually see the specific circuits inside that drive each behavior? OpenAI has introduced a new mechanistic interpretability research study that trains language models to use sparse internal wiring, so that model behavior can be explained using small, explicit circuits.

https://cdn.openai.com/pdf/41df8f28-d4ef-43e9-aed2-823f9393e470/circuit-sparsity-paper.pdf

Training transformers to be weight sparse

Most transformer language models are dense. Each neuron reads from and writes to many residual channels, and features are often in superposition. This makes circuit level analysis difficult. Previous OpenAI work tried to learn sparse feature bases on top of dense models using sparse autoencoders. The new research work instead changes the base model so that the transformer itself is weight sparse.

The OpenAI team trains decoder only transformers with an architecture similar to GPT 2. After each optimizer step with AdamW optimizer, they enforce a fixed sparsity level on every weight matrix and bias, including token embeddings. Only the largest magnitude entries in each matrix are kept. The rest are set to zero. Over training, an annealing schedule gradually drives the fraction of non zero parameters down until the model reaches a target sparsity.

In the most extreme setting, roughly 1 in 1000 weights is non zero. Activations are also somewhat sparse. Around 1 in 4 activations are non zero at a typical node location. The effective connectivity graph is therefore very thin even when the model width is large. This encourages disentangled features that map cleanly onto the residual channels the circuit uses.

https://cdn.openai.com/pdf/41df8f28-d4ef-43e9-aed2-823f9393e470/circuit-sparsity-paper.pdf

Measuring interpretability through task specific pruning

To quantify whether these models are easier to understand, OpenAI team does not rely on qualitative examples alone. The research team define a suite of simple algorithmic tasks based on Python next token prediction. One example, single_double_quote, requires the model to close a Python string with the right quote character. Another example, set_or_string, requires the model to choose between .add and += based on whether a variable was initialized as a set or a string.

For each task, they search for the smallest subnetwork, called a circuit, that can still perform the task up to a fixed loss threshold. The pruning is node based. A node is an MLP neuron at a specific layer, an attention head, or a residual stream channel at a specific layer. When a node is pruned, its activation is replaced by its mean over the pretraining distribution. This is mean ablation.

The search uses continuous mask parameters for each node and a Heaviside style gate, optimized with a straight through estimator like surrogate gradient. The complexity of a circuit is measured as the count of active edges between retained nodes. The main interpretability metric is the geometric mean of edge counts across all tasks.

Example circuits in sparse transformers

On the single_double_quote task, the sparse models yield a compact and fully interpretable circuit. In an early MLP layer, one neuron behaves as a quote detector that activates on both single and double quotes. A second neuron behaves as a quote type classifier that distinguishes the two quote types. Later, an attention head uses these signals to attend back to the opening quote position and copy its type to the closing position.

In circuit graph terms, the mechanism uses 5 residual channels, 2 MLP neurons in layer 0, and 1 attention head in a later layer with a single relevant query key channel and a single value channel. If the rest of the model is ablated, this subgraph still solves the task. If these few edges are removed, the model fails on the task. The circuit is therefore both sufficient and necessary in the operational sense defined by the paper.

https://cdn.openai.com/pdf/41df8f28-d4ef-43e9-aed2-823f9393e470/circuit-sparsity-paper.pdf

For more complex behaviors, such as type tracking of a variable named current inside a function body, the recovered circuits are larger and only partially understood. The research team show an example where one attention operation writes the variable name into the token set() at the definition, and another attention operation later copies the type information from that token back into a later use of current. This still yields a relatively small circuit graph.

Key Takeaways

Weight-sparse transformers by design: OpenAI trains GPT-2 style decoder only transformers so that almost all weights are zero, around 1 in 1000 weights is non zero, enforcing sparsity across all weights and biases including token embeddings, which yields thin connectivity graphs that are structurally easier to analyze.

Interpretability is measured as minimal circuit size: The work defines a benchmark of simple Python next token tasks and, for each task, searches for the smallest subnetwork, in terms of active edges between nodes, that still reaches a fixed loss, using node level pruning with mean ablation and a straight through estimator style mask optimization.

Concrete, fully reverse engineered circuits emerge: On tasks such as predicting matching quote characters, the sparse model yields a compact circuit with a few residual channels, 2 key MLP neurons and 1 attention head that the authors can fully reverse engineer and verify as both sufficient and necessary for the behavior.

Sparsity delivers much smaller circuits at fixed capability: At matched pre-training loss levels, weight sparse models require circuits that are roughly 16 times smaller than those recovered from dense baselines, defining a capability interpretability frontier where increased sparsity improves interpretability while slightly reducing raw capability.

Editorial Comments

OpenAI’s work on weight sparse transformers is a pragmatic step toward making mechanistic interpretability operational. By enforcing sparsity directly in the base model, the paper turns abstract discussions of circuits into concrete graphs with measurable edge counts, clear necessity and sufficiency tests, and reproducible benchmarks on Python next token tasks. The models are small and inefficient, but the methodology is relevant for future safety audits and debugging workflows. This research treats interpretability as a first class design constraint rather than an after the fact diagnostic.

Check out the Paper, GitHub Repo and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post OpenAI Researchers Train Weight Sparse Transformers to Expose Interpretable Circuits appeared first on MarkTechPost.

How to Design an Advanced Multi-Agent Reasoning System with spaCy Feat …

In this tutorial, we build an advanced Agentic AI system using spaCy, designed to allow multiple intelligent agents to reason, collaborate, reflect, and learn from experience. We work through the entire pipeline step by step, observing how each agent processes tasks using planning, memory, communication, and semantic reasoning. By the end, we see how the system evolves into a dynamic multi-agent architecture capable of extracting entities, interpreting context, forming reasoning chains, and constructing knowledge graphs, all while continuously improving through reflection and episodic learning. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser!pip install spacy networkx matplotlib -q

import spacy
from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass, field
from collections import defaultdict, deque
from enum import Enum
import json
import hashlib
from datetime import datetime

class MessageType(Enum):
REQUEST = “request”
RESPONSE = “response”
BROADCAST = “broadcast”
QUERY = “query”

@dataclass
class Message:
sender: str
receiver: str
msg_type: MessageType
content: Dict[str, Any]
timestamp: float = field(default_factory=lambda: datetime.now().timestamp())
priority: int = 1
def get_id(self) -> str:
return hashlib.md5(f”{self.sender}{self.timestamp}”.encode()).hexdigest()[:8]

@dataclass
class AgentTask:
task_id: str
task_type: str
data: Any
priority: int = 1
dependencies: List[str] = field(default_factory=list)
metadata: Dict = field(default_factory=dict)

@dataclass
class Observation:
state: str
action: str
result: Any
confidence: float
timestamp: float = field(default_factory=lambda: datetime.now().timestamp())

class WorkingMemory:
def __init__(self, capacity: int = 10):
self.capacity = capacity
self.items = deque(maxlen=capacity)
self.attention_scores = {}
def add(self, key: str, value: Any, attention: float = 1.0):
self.items.append((key, value))
self.attention_scores[key] = attention
def recall(self, n: int = 5) -> List[Tuple[str, Any]]:
sorted_items = sorted(self.items, key=lambda x: self.attention_scores.get(x[0], 0), reverse=True)
return sorted_items[:n]
def get(self, key: str) -> Optional[Any]:
for k, v in self.items:
if k == key:
return v
return None

class EpisodicMemory:
def __init__(self):
self.episodes = []
self.success_patterns = defaultdict(int)
def store(self, observation: Observation):
self.episodes.append(observation)
if observation.confidence > 0.7:
pattern = f”{observation.state}→{observation.action}”
self.success_patterns[pattern] += 1
def query_similar(self, state: str, top_k: int = 3) -> List[Observation]:
scored = [(obs, self._similarity(state, obs.state)) for obs in self.episodes[-50:]]
scored.sort(key=lambda x: x[1], reverse=True)
return [obs for obs, _ in scored[:top_k]]
def _similarity(self, state1: str, state2: str) -> float:
words1, words2 = set(state1.split()), set(state2.split())
if not words1 or not words2:
return 0.0
return len(words1 & words2) / len(words1 | words2)

We establish all the core structures required for our agentic system. We import key libraries, define message and task formats, and build both working and episodic memory modules. As we define these foundations, we lay the groundwork for reasoning, storage, and communication. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass ReflectionModule:
def __init__(self):
self.performance_log = []
def reflect(self, task_type: str, confidence: float, result: Any) -> Dict[str, Any]:
self.performance_log.append({‘task’: task_type, ‘confidence’: confidence, ‘timestamp’: datetime.now().timestamp()})
recent = [p for p in self.performance_log if p[‘task’] == task_type][-5:]
avg_conf = sum(p[‘confidence’] for p in recent) / len(recent) if recent else 0.5
insights = {
‘performance_trend’: ‘improving’ if confidence > avg_conf else ‘declining’,
‘avg_confidence’: avg_conf,
‘recommendation’: self._get_recommendation(confidence, avg_conf)
}
return insights
def _get_recommendation(self, current: float, average: float) -> str:
if current < 0.4:
return “Request assistance from specialized agent”
elif current < average:
return “Review similar past cases for patterns”
else:
return “Continue with current approach”

class AdvancedAgent:
def __init__(self, name: str, specialty: str, nlp):
self.name = name
self.specialty = specialty
self.nlp = nlp
self.working_memory = WorkingMemory()
self.episodic_memory = EpisodicMemory()
self.reflector = ReflectionModule()
self.message_queue = deque()
self.collaboration_graph = defaultdict(int)
def plan(self, task: AgentTask) -> List[str]:
similar = self.episodic_memory.query_similar(str(task.data))
if similar and similar[0].confidence > 0.7:
return [similar[0].action]
return self._default_plan(task)
def _default_plan(self, task: AgentTask) -> List[str]:
return [‘analyze’, ‘extract’, ‘validate’]
def send_message(self, receiver: str, msg_type: MessageType, content: Dict):
msg = Message(self.name, receiver, msg_type, content)
self.message_queue.append(msg)
return msg
def receive_message(self, message: Message):
self.message_queue.append(message)
self.collaboration_graph[message.sender] += 1
def process(self, task: AgentTask) -> Dict[str, Any]:
raise NotImplementedError

class CognitiveEntityAgent(AdvancedAgent):
def process(self, task: AgentTask) -> Dict[str, Any]:
doc = self.nlp(task.data)
entities = defaultdict(list)
entity_contexts = []
for ent in doc.ents:
context_start = max(0, ent.start – 5)
context_end = min(len(doc), ent.end + 5)
context = doc[context_start:context_end].text
entities[ent.label_].append(ent.text)
entity_contexts.append({‘entity’: ent.text, ‘type’: ent.label_, ‘context’: context, ‘position’: (ent.start_char, ent.end_char)})
for ent_type, ents in entities.items():
attention = len(ents) / len(doc.ents) if doc.ents else 0
self.working_memory.add(f”entities_{ent_type}”, ents, attention)
confidence = min(len(entities) / 4, 1.0) if entities else 0.3
obs = Observation(state=f”entity_extraction_{len(doc)}tokens”, action=”extract_with_context”, result=len(entity_contexts), confidence=confidence)
self.episodic_memory.store(obs)
reflection = self.reflector.reflect(‘entity_extraction’, confidence, entities)
return {‘entities’: dict(entities), ‘contexts’: entity_contexts, ‘confidence’: confidence, ‘reflection’: reflection, ‘next_actions’: [‘semantic_analysis’, ‘knowledge_graph’] if confidence > 0.5 else []}

We construct the reflection engine and the base agent class, which provides every agent with reasoning, planning, and memory capabilities. We then implement the Cognitive Entity Agent, which processes text to extract entities with context and stores meaningful observations. As we run this part, we watch the agent learn from experience while dynamically adjusting its strategy. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass SemanticReasoningAgent(AdvancedAgent):
def process(self, task: AgentTask) -> Dict[str, Any]:
doc = self.nlp(task.data)
reasoning_chains = []
for sent in doc.sents:
chain = self._extract_reasoning_chain(sent)
if chain:
reasoning_chains.append(chain)
entity_memory = self.working_memory.recall(3)
semantic_clusters = self._cluster_by_semantics(doc)
confidence = min(len(reasoning_chains) / 3, 1.0) if reasoning_chains else 0.4
obs = Observation(state=f”semantic_analysis_{len(list(doc.sents))}sents”, action=”reason_and_cluster”, result=len(reasoning_chains), confidence=confidence)
self.episodic_memory.store(obs)
return {‘reasoning_chains’: reasoning_chains, ‘semantic_clusters’: semantic_clusters, ‘memory_context’: entity_memory, ‘confidence’: confidence, ‘next_actions’: [‘knowledge_integration’]}
def _extract_reasoning_chain(self, sent) -> Optional[Dict]:
subj, verb, obj = None, None, None
for token in sent:
if token.dep_ == ‘nsubj’:
subj = token
elif token.pos_ == ‘VERB’:
verb = token
elif token.dep_ in [‘dobj’, ‘attr’, ‘pobj’]:
obj = token
if subj and verb and obj:
return {‘subject’: subj.text, ‘predicate’: verb.lemma_, ‘object’: obj.text, ‘confidence’: 0.8}
return None
def _cluster_by_semantics(self, doc) -> List[Dict]:
clusters = []
nouns = [token for token in doc if token.pos_ in [‘NOUN’, ‘PROPN’]]
visited = set()
for noun in nouns:
if noun.i in visited:
continue
cluster = [noun.text]
visited.add(noun.i)
for other in nouns:
if other.i != noun.i and other.i not in visited:
if noun.similarity(other) > 0.5:
cluster.append(other.text)
visited.add(other.i)
if len(cluster) > 1:
clusters.append({‘concepts’: cluster, ‘size’: len(cluster)})
return clusters

We design the Semantic Reasoning Agent, which analyzes sentence structures, forms reasoning chains, and groups concepts based on semantic similarity. We integrate working memory to enrich the understanding the agent builds. As we execute this, we see how the system moves from surface-level extraction to deeper inference. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass KnowledgeGraphAgent(AdvancedAgent):
def process(self, task: AgentTask) -> Dict[str, Any]:
doc = self.nlp(task.data)
graph = {‘nodes’: set(), ‘edges’: []}
for sent in doc.sents:
entities = list(sent.ents)
if len(entities) >= 2:
for ent in entities:
graph[‘nodes’].add((ent.text, ent.label_))
root = sent.root
if root.pos_ == ‘VERB’:
for i in range(len(entities) – 1):
graph[‘edges’].append({‘from’: entities[i].text, ‘relation’: root.lemma_, ‘to’: entities[i+1].text, ‘sentence’: sent.text[:100]})
graph[‘nodes’] = list(graph[‘nodes’])
confidence = min(len(graph[‘edges’]) / 5, 1.0) if graph[‘edges’] else 0.3
obs = Observation(state=f”knowledge_graph_{len(graph[‘nodes’])}nodes”, action=”construct_graph”, result=len(graph[‘edges’]), confidence=confidence)
self.episodic_memory.store(obs)
return {‘graph’: graph, ‘node_count’: len(graph[‘nodes’]), ‘edge_count’: len(graph[‘edges’]), ‘confidence’: confidence, ‘next_actions’: []}

class MetaController:
def __init__(self):
self.nlp = spacy.load(‘en_core_web_sm’)
self.agents = {
‘cognitive_entity’: CognitiveEntityAgent(‘CognitiveEntity’, ‘entity_analysis’, self.nlp),
‘semantic_reasoning’: SemanticReasoningAgent(‘SemanticReasoner’, ‘reasoning’, self.nlp),
‘knowledge_graph’: KnowledgeGraphAgent(‘KnowledgeBuilder’, ‘graph_construction’, self.nlp)
}
self.task_history = []
self.global_memory = WorkingMemory(capacity=20)
def execute_with_planning(self, text: str) -> Dict[str, Any]:
initial_task = AgentTask(task_id=”task_001″, task_type=”cognitive_entity”, data=text, metadata={‘source’: ‘user_input’})
results = {}
task_queue = [initial_task]
iterations = 0
max_iterations = 10
while task_queue and iterations < max_iterations:
task = task_queue.pop(0)
agent = self.agents.get(task.task_type)
if not agent or task.task_type in results:
continue
result = agent.process(task)
results[task.task_type] = result
self.global_memory.add(task.task_type, result, result[‘confidence’])
for next_action in result.get(‘next_actions’, []):
if next_action in self.agents and next_action not in results:
next_task = AgentTask(task_id=f”task_{iterations+1:03d}”, task_type=next_action, data=text, dependencies=[task.task_id])
task_queue.append(next_task)
iterations += 1
self.task_history.append({‘results’: results, ‘iterations’: iterations, ‘timestamp’: datetime.now().isoformat()})
return results
def generate_insights(self, results: Dict[str, Any]) -> str:
report = “=” * 70 + “n”
report += ” ADVANCED AGENTIC AI SYSTEM – ANALYSIS REPORTn”
report += “=” * 70 + “nn”
for agent_type, result in results.items():
agent = self.agents[agent_type]
report += f” {agent.name}n”
report += f” Specialty: {agent.specialty}n”
report += f” Confidence: {result[‘confidence’]:.2%}n”
if ‘reflection’ in result:
report += f” Performance: {result[‘reflection’].get(‘performance_trend’, ‘N/A’)}n”
report += ” Key Findings:n”
report += json.dumps({k: v for k, v in result.items() if k not in [‘reflection’, ‘next_actions’]}, indent=6) + “nn”
report += ” System-Level Insights:n”
report += f” Total iterations: {len(self.task_history)}n”
report += f” Active agents: {len(results)}n”
report += f” Global memory size: {len(self.global_memory.items)}n”
return report

We implement the Knowledge Graph Agent, enabling the system to connect entities through relations extracted from text. We then build the Meta-Controller, which coordinates all agents, manages planning, and handles multi-step execution. As we use this component, we watch the system behave like a true multi-agent pipeline with dynamic flow control. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserif __name__ == “__main__”:
sample_text = “””
Artificial intelligence researchers at OpenAI and DeepMind are developing
advanced language models. Sam Altman leads OpenAI in San Francisco, while
Demis Hassabis heads DeepMind in London. These organizations collaborate
with universities like MIT and Stanford. Their research focuses on machine
learning, neural networks, and reinforcement learning. The breakthrough
came when transformers revolutionized natural language processing in 2017.
“””
controller = MetaController()
results = controller.execute_with_planning(sample_text)
print(controller.generate_insights(results))
print(“Advanced multi-agent analysis complete with reflection and learning!”)

We run the entire agentic system end-to-end on a sample text. We execute planning, call each agent in sequence, and generate a comprehensive analysis report. As we reach this stage, we see the full power of the multi-agent architecture working together in real time.

In conclusion, we developed a comprehensive multi-agent reasoning framework that operates on real-world text using spaCy, integrating planning, learning, and memory into a cohesive workflow. We observe how each agent contributes a unique layer of understanding, and we see the Meta-Controller orchestrate them to generate rich, interpretable insights. Lastly, we recognize the flexibility and extensibility of this agentic design, and we feel confident that we can now adapt it to more complex tasks, larger datasets, or even integrate language models to further enhance the system’s intelligence.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How to Design an Advanced Multi-Agent Reasoning System with spaCy Featuring Planning, Reflection, Memory, and Knowledge Graphs appeared first on MarkTechPost.

Comparing the Top 6 Agent-Native Rails for the Agentic Internet: MCP, …

As AI agents move from single-app copilots to autonomous systems that browse, transact, and coordinate with each other, a new infrastructure layer is emerging underneath them. This article compares six key “agent-native rails” — MCP, A2A, AP2, ACP, x402, and Kite — focusing on how they standardize tool access, inter-agent communication, payment authorization, and settlement, and what that means for engineers designing secure, commerce-capable agentic systems.

The agent stack is around six trending agentic ‘rails’:

MCP – standard interface for tools and data.

A2A – transport and lifecycle for agent-to-agent calls.

AP2 – trust and mandates for agent-initiated payments.

ACP – interaction model for agentic checkout and commerce flows.

x402 – HTTP-native, on-chain payment protocol for APIs and agents.

Kite – L1 + state channels for high-frequency agent payments and policy-enforced autonomy.

They are complementary, not competing: MCP and A2A wire agents to context and each other, AP2/ACP encode commercial intent, and x402/Kite handle settlement.

The 6 rails at a glance

RailLayerPrimary roleTransport / substrateMCP (Model Context Protocol)Tools & dataStandard interface to tools, data sources, promptsJSON-RPC over stdio / process, HTTP / SSEA2A (Agent2Agent)Agent meshDiscovery and task lifecycle between agentsJSON-RPC 2.0 over HTTPS, optional SSE streamsAP2 (Agent Payments Protocol)Payment control planeVerifiable mandates and roles for agent paymentsProtocol-agnostic over existing rails, including blockchains like SuiACP (Agentic Commerce Protocol)Commerce flowsShared language for catalog, offers, checkout stateProtocol spec + HTTP APIs, open standard co-developed by OpenAI and Stripex402Settlement railInternet-native, per-request payments for APIs and agentsHTTP 402 with on-chain stablecoins such as USDCKiteL1 + state channelsAgent-centric chain with identity and streaming micropaymentsL1 chain + off-chain state-channel rails for agents

The rest of the article unpacks each rail along four axes:

Capabilities

Security posture

Ecosystem traction

OS / runtime integration trajectory

1. MCP: tool and context rail

Capabilities

The Model Context Protocol is an open protocol for connecting LLM applications to external tools and data. It defines a client–server architecture:

MCP clients (agents, IDEs, chat UIs) connect to

MCP servers that expose tools, resources, and prompts via a standardized JSON-RPC schema.

Tools are strongly typed (name + JSON schema for parameters and results) and can wrap arbitrary systems: HTTP APIs, databases, file operations, internal services, etc.

The same protocol works across transports (stdio for local processes, HTTP/SSE for remote servers), which is why multiple runtimes can consume the same MCP servers.

Security posture

MCP is deliberately agnostic about identity and payments. Security is inherited from the host:

Servers can run locally or remotely and may have full access to files, networks, and cloud APIs.

The main risks are classic: arbitrary code execution in tools, prompt injection, over-privileged credentials, and exfiltration of sensitive data.

Security guidance from Red Hat and others focuses on:

Least-privilege credentials per MCP server.

Sandboxing tools where possible.

Strong review and signing of server configurations.

Logging and audit for tool calls.

MCP itself does not give you access control semantics like ‘this agent can call this tool only under policy P’; those are layered on by hosts and IAM systems.

Ecosystem traction

MCP moved from Anthropic-only to ecosystem standard quickly:

Anthropic launched MCP and open-sourced the spec and TypeScript schemas.

OpenAI added full MCP client support in ChatGPT Developer Mode and the platform ‘Connectors’ system.

Microsoft integrated MCP into VS Code, Visual Studio, GitHub Copilot, and Copilot for Azure, including an “Azure MCP server.”

LangChain and LangGraph ship langchain-mcp-adapters for treating MCP tools as first-class LangChain tools.

Cloudflare runs a catalog of managed remote MCP servers and exposes them via its Agents SDK.

MCP is now effectively the ‘USB-C port’ for agent tools across IDEs, browsers, cloud agents, and edge runtimes

2. A2A: agent-to-agent protocol

Capabilities

The Agent2Agent (A2A) protocol is an open standard for inter-agent communication and task handoff. The spec defines:

A2A client – initiates tasks on behalf of a user or system.

A2A server (remote agent) – exposes a JSON-RPC endpoint that executes tasks.

Agent cards – JSON metadata at well-known paths (for example, /.well-known/agent-card.json) describing capabilities, endpoint, and auth.

Transport is standardized:

JSON-RPC 2.0 over HTTPS for requests and responses.

Optional SSE streams for long-running or streaming tasks.

This gives agents a common ‘RPC fabric’ independent of vendor or framework.

Security posture

At the protocol layer, A2A leans on common web primitives:

HTTPS with standard auth (API keys, OAuth-like tokens, mTLS) negotiated based on agent cards.

JSON-RPC 2.0 message format; parser correctness is a concern, since bugs in JSON-RPC handling become a security vector.

Red Hat and other analyses highlight:

Keep JSON-RPC libraries patched.

Protect against replay and downgrade attacks at the HTTP / TLS layer.

Treat agent-to-agent traffic like service-mesh traffic: identity, authz, and rate-limiting matter.

The protocol does not itself decide which agents should talk; that is a policy question for the platform.

Ecosystem traction

Google introduced A2A and is driving it as an interoperability layer for agents across enterprise platforms.

The A2A open-source org maintains the reference spec and implementation.

Amazon Bedrock AgentCore Runtime now supports A2A as a first-class protocol, with documented contract requirements.

Third-party frameworks (for example, CopilotKit) are adopting A2A for cross-agent and app-agent communication.

3. AP2: payment control layer

Capabilities

Agent Payments Protocol (AP2) is Google’s open standard for agent-initiated payments. Its core problem statement: when an AI agent pays, how do we know it had permission, the payment matches user intent, and someone is clearly accountable?

AP2 introduces:

Mandates – cryptographically signed digital contracts that encode who can pay, under which limits, for what kinds of transactions.

Role separation – payer agents, merchants, issuers, networks, and wallets each have explicit protocol roles.

Rail-agnostic design – AP2 can authorize payments over cards, bank transfers, or programmable blockchains such as Sui.

The protocol is designed to compose with A2A and MCP: A2A handles the messaging, MCP connects to tools, AP2 governs the payment semantics.

Security posture

Security is the main reason AP2 exists:

Mandates are signed using modern public-key cryptography and can be independently verified.

The protocol explicitly targets authorization, authenticity, and accountability: did the agent have permission, does the action match user intent, and who is liable if something goes wrong.

Ecosystem traction

AP2 is still early but already has meaningful backing:

Google announced AP2 with more than 60 organizations across ecommerce, payments, banking, and crypto as collaborators or early supporters.

Cohorts include networks like Mastercard and American Express, wallets and PSPs such as PayPal, and crypto players including Coinbase.

4. ACP: commerce interaction model

Capabilities

The Agentic Commerce Protocol (ACP), co-developed by OpenAI and Stripe, is the interaction model underlying ChatGPT Instant Checkout. It gives agents and merchants a shared language for:

Product discovery (catalog and offers).

Configuration (variants, shipping options).

Checkout state (selected item, price, shipping, terms).

Fulfillment and post-purchase status.

ACP is designed to:

Work across processors and business types without forcing backend rewrites.

Keep merchants as the merchant of record for fulfillment, returns, and support, even when the interaction starts in an agent.

Security posture

In ACP deployments:

Payments are handled by processors such as Stripe; ACP itself focuses on the structure of the commerce interaction, not on cryptography.

OpenAI’s Instant Checkout uses limited-scope payment credentials and explicit confirmation steps in the ChatGPT UI, which makes agent-initiated purchases visible to the user.

ACP does not replace anti-fraud, KYC, or PCI responsibilities; those remain with the PSPs and merchants.

Ecosystem traction

OpenAI and Stripe have open-sourced ACP and are actively recruiting merchants and platforms.

Instant Checkout is live for Etsy sellers, with Shopify merchants and additional regions coming next, and multiple press reports highlight ACP as the underlying protocol.

Salesforce has announced ACP-based integrations for its Agentforce Commerce stack.

ACP is essentially becoming the agent-side ‘checkout API‘ for multiple commerce ecosystems.

5. x402: HTTP-native settlement

Capabilities

x402 is Coinbase’s open payment protocol for AI agents and APIs. It revives HTTP status code 402 Payment Required as the trigger for machine-initiated, per-request payments.

Key properties:

Instant, automatic stablecoin payments over HTTP, primarily using USDC on chains like Base.

Clients (agents, apps) can pay for API calls, content, or services without accounts or sessions, by programmatically responding to 402 challenges.

Designed for both human and machine consumers, but the machine-to-machine case is explicitly emphasized.

Security posture

Settlement is on-chain, so the usual blockchain guarantees (and risks) apply: immutability, transparent balances, but exposure to contract bugs and key theft.

Coinbase runs the compliant infrastructure (KYT, sanctions screening, etc.) behind its managed offering.

There are no chargebacks; dispute handling must be layered at ACP/AP2 or application level.

Ecosystem traction

Coinbase and Cloudflare announced the x402 Foundation to push x402 as an open standard for internet payments, targeting both agents and human-facing APIs.

Cloudflare integrated x402 into its Agents SDK and MCP integration, so Workers and agents can offer paywalled endpoints and call x402 servers with a single wrapper.

6. Kite: agent-native L1 and state channels

Capabilities

Kite is an AI-oriented L1 chain and payment rail designed for agentic commerce. It states:

State-channel based micropayments– agents open off-chain channels and stream tiny payments with instant finality, settling periodically on-chain.

Agent-centric identity and constraints– cryptographic identity is used to bind agents and users, with protocol-level spend constraints and policy enforcement.

PoAI-oriented design– the chain is explicitly tuned for the AI-agent economy, not generic DeFi.

Security posture

Kite inherits L1 security concerns (consensus safety, smart-contract correctness) plus state-channel specifics:

Off-chain channels must be protected against fraud (for example, outdated state publication) and key compromise.

Policy constraints are enforced at protocol level; if implemented correctly, this can significantly reduce the chance of runaway spending by agents.

Because the design is agent-specific, there is less ‘legacy baggage’ than in generalized DeFi chains, but also less battle-tested code.

Ecosystem traction

PayPal Ventures and others have publicly backed Kite as part of the agentic commerce stack.

Crypto and infra publications describe it as a complementary rail to x402, optimized for streaming, high-frequency interactions between agents.

The ecosystem is still young compared to mainstream L1s, but it is clearly positioned as an ‘AI-payments L1,’ not a general-purpose chain.

How the rails compose in real systems

A realistic agentic workflow will touch several of these rails:

Tooling and data

An IDE agent, OS agent, or backend agent connects to internal APIs, file systems, and monitoring systems via MCP servers.

Multi-agent orchestration

The primary agent delegates specialized tasks (for example, cost optimization, legal review, marketing ops) to other agents via A2A.

Commerce flow

For purchasing, the agent enters an ACP flow with a merchant: fetch catalog, configure a product, receive a priced offer, confirm checkout state.

Payment authorization

The user has previously granted an AP2 mandate to a wallet-backed payment agent, specifying limits and scope. The commerce or orchestration agent requests payment via that AP2-capable payment agent.

Settlement

Depending on the scenario, the payment agent may:

Use traditional rails (card, bank) under AP2, or

Use x402 for per-call on-chain payments to an API, or

Use Kite state channels for streaming micro-transactions between agents.

This composition preserves separation of concerns:

MCP & A2A: who talks to whom, and about what.

AP2 & ACP: how intent, consent, and liability for commerce are encoded.

x402 & Kite: how value is actually moved at low latency.

References:

Model Context Protocol – official sitehttps://modelcontextprotocol.io/

Anthropic: “Introducing the Model Context Protocol”https://www.anthropic.com/news/model-context-protocol

Claude Docs: “Model Context Protocol (MCP)”https://docs.claude.com/en/docs/mcp

OpenAI Docs: “Connectors and MCP servers”https://platform.openai.com/docs/guides/tools-connectors-mcp

OpenAI Docs: “MCP Server Documentation”https://platform.openai.com/docs/mcp

LangChain MCP Adapters – GitHubhttps://github.com/langchain-ai/langchain-mcp-adapters

LangChain Docs: “Model Context Protocol (MCP)”https://docs.langchain.com/oss/python/langchain/mcp

npm package: @langchain/mcp-adaptershttps://www.npmjs.com/package/%40langchain/mcp-adapters

Azure AI Foundry: “Create an MCP Server with Azure AI Agent Service”https://devblogs.microsoft.com/foundry/integrating-azure-ai-agents-mcp/

Azure AI Foundry Docs: “Connect to Model Context Protocol servers (preview)”https://learn.microsoft.com/en-us/azure/ai-foundry/agents/how-to/tools/model-context-protocol

Azure AI Foundry MCP Server – May 2025 updatehttps://devblogs.microsoft.com/foundry/azure-ai-foundry-mcp-server-may-2025/

Windows AI Foundry (MCP integration in Windows)https://developer.microsoft.com/en-us/windows/ai/

The Verge: “Windows is getting support for the ‘USB-C of AI apps’”https://www.theverge.com/news/669298/microsoft-windows-ai-foundry-mcp-support

Agent2Agent (A2A) Protocol – official specificationhttps://a2a-protocol.org/latest/specification/

Google Developers Blog: “Announcing the Agent2Agent Protocol (A2A)”https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/

IBM Think: “What is A2A protocol (Agent2Agent)?”https://www.ibm.com/think/topics/agent2agent-protocol

Amazon Bedrock: “Deploy A2A servers in AgentCore Runtime”https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-a2a.html

Amazon Bedrock: “A2A protocol contract”https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-a2a-protocol-contract.html

AWS News: “Amazon Bedrock AgentCore is now generally available”https://aws.amazon.com/about-aws/whats-new/2025/10/amazon-bedrock-agentcore-available/

Google Cloud Blog: “Announcing Agent Payments Protocol (AP2)”https://cloud.google.com/blog/products/ai-machine-learning/announcing-agents-to-payments-ap2-protocol

AP2 overview / technical details (Google / partner materials)https://cloud.google.com/blog/products/ai-machine-learning/announcing-agents-to-payments-ap2-protocol

Coinbase x402 + AP2 launch with Googlehttps://www.coinbase.com/developer-platform/discover/launches/google_x402

Omni (Swedish) coverage: “Google teamar upp med betaljättar – vill låta AI-agenter shoppa åt dig”https://omni.se/a/RzkWqO

OpenAI: “Buy it in ChatGPT: Instant Checkout and the Agentic Commerce Protocol”https://openai.com/index/buy-it-in-chatgpt/

OpenAI Developer Docs: “Agentic Commerce Protocol – Get started”https://developers.openai.com/commerce/guides/get-started/

Stripe Newsroom: “Stripe powers Instant Checkout in ChatGPT and releases the Agentic Commerce Protocol”https://stripe.com/newsroom/news/stripe-openai-instant-checkout

TechRadar Pro: “You can now buy things through ChatGPT with a single click”https://www.techradar.com/pro/you-can-now-buy-things-through-chatgpt-with-a-single-click-if-youre-one-of-the-lucky-ones

Reuters: “OpenAI partners with Etsy, Shopify on ChatGPT payment checkout”https://www.reuters.com/world/americas/openai-partners-with-etsy-shopify-chatgpt-checkout-2025-09-29/

Salesforce Press Release: “Salesforce Announces Support for Agentic Commerce Protocol with Stripe and OpenAI”https://www.salesforce.com/news/press-releases/2025/10/14/stripe-openai-agentic-commerce-protocol-announcement/

Salesforce Investor News: “Salesforce and OpenAI Partner Across Enterprise Work and Commerce”https://investor.salesforce.com/news/news-details/2025/Salesforce-and-OpenAI-Partner-Across-Enterprise-Work-and-Commerce/default.aspx

Salesforce: Agentforce Commercehttps://www.salesforce.com/commerce/

Coinbase Developer Platform: “x402: The internet-native payment protocol”https://www.coinbase.com/developer-platform/products/x402

Base Docs: “Building Autonomous Payment Agents with x402”https://docs.base.org/base-app/agents/x402-agents

Cloudflare Agents Docs: “x402 · Cloudflare Agents docs”https://developers.cloudflare.com/agents/x402/

Cloudflare Blog: “Launching the x402 Foundation with Coinbase, and support for x402 transactions”https://blog.cloudflare.com/x402/

Cloudflare x402 tag pagehttps://blog.cloudflare.com/tag/x402/

Zuplo Blog: “Autonomous API & MCP Server Payments with x402”https://zuplo.com/blog/mcp-api-payments-with-x402

Kite whitepaper: “Building Trustless Payment Infrastructure for Agentic AI”https://gokite.ai/kite-whitepaper

Kite: “Whitepaper”https://gokite.ai/whitepaper

Kite Docs: “Introduction & Mission”https://docs.gokite.ai/get-started-why-kite/introduction-and-mission

PayPal Newsroom: “Kite Raises $18M in Series A Funding To Enforce Trust in the Agentic Web”https://newsroom.paypal-corp.com/2025-09-02-Kite-Raises-18M-in-Series-A-Funding-To-Enforce-Trust-in-the-Agentic-Web

PayPal Ventures: “The state of agentic commerce and why we invested in Kite AI”https://paypal.vc/news/news-details/2025/The-state-of-agentic-commerce-and-why-we-invested-in-Kite-AI-2025-LroAXfplpA/default.aspx

Binance Research: “Kite enables an agentic internet…”https://www.binance.com/en-KZ/research/projects/kite

Phemex Academy: “What Is Kite (KITE)? Guide to the AI Agent Economy”https://phemex.com/academy/what-is-kite-ai-agent-economy

Finextra: “PayPal leads funding round in agentic AI firm Kite”https://www.finextra.com/newsarticle/46535/paypal-leads-funding-round-in-agentic-ai-firm-kite

Plug and Play Tech Center: “How Kite is Building the Infrastructure for the Agentic Internet”https://www.plugandplaytechcenter.com/venture-capital/investment-announcements/kite-investment

PYMNTS: “PayPal Ventures-Backed Kite Nets $18M for Agentic AI”https://www.pymnts.com/news/investment-tracker/2025/paypal-backed-kite-raises-18-million-for-agentic-web/

GlobeNewswire: “Kite announces investment from Coinbase Ventures…”https://www.globenewswire.com/news-release/2025/10/27/3174837/0/en/Kite-announces-investment-from-Coinbase-Ventures-to-Advance-Agentic-Payments-with-the-x402-Protocol.html

Keycard – official sitehttps://www.keycard.ai/

Keycard: product page (alternate URL)https://www.keycard.sh/

Help Net Security: “Keycard emerges from stealth with identity and access platform for AI agents”https://www.helpnetsecurity.com/2025/10/22/keycard-ai-agents-identity-access-platform/

GlobeNewswire: “Keycard Launches to Solve the AI Agent Identity and Access Problem…”https://www.globenewswire.com/news-release/2025/10/21/3170297/0/en/Keycard-Launches-to-Solve-the-AI-Agent-Identity-and-Access-Problem-With-38-Million-in-Funding-From-Andreessen-Horowitz-Boldstart-Ventures-and-Acrew-Capital.html

The post Comparing the Top 6 Agent-Native Rails for the Agentic Internet: MCP, A2A, AP2, ACP, x402, and Kite appeared first on MarkTechPost.

Build a biomedical research agent with Biomni tools and Amazon Bedrock …

This post is co-authored with the Biomni group from Stanford.
Biomedical researchers spend approximately 90% of their time manually processing massive volumes of scattered information. This is evidenced by Genentech’s challenge of processing 38 million biomedical publications in PubMed, public repositories like the Human Protein Atlas, and their internal repository of hundreds of millions of cells across hundreds of diseases. There is a rapid proliferation of specialized databases and analytical tools across different modalities including genomics, proteomics, and pathology. Researchers must stay current with the large landscape of tools, leaving less time for the hypothesis-driven work that drives breakthrough discoveries.
AI agents powered by foundation models offer a promising solution by autonomously planning, executing, and adapting complex research tasks. Stanford researchers built Biomni that exemplifies this potential. Biomni is a general-purpose biomedical AI agent that integrates 150 specialized tools, 105 software packages, and 59 databases to execute sophisticated analyses such as gene prioritization, drug repurposing, and rare disease diagnosis.
However, deploying such agents in production requires robust infrastructure capable of handling computationally intensive workflows and multiple concurrent users while maintaining security and performance standards. Amazon Bedrock AgentCore is a set of comprehensive services to deploy and operate highly capable agents using any framework or model, with enterprise-grade security and scalability.
In this post, we show you how to implement a research agent using AgentCore with access to over 30 specialized biomedical database tools from Biomni, thereby accelerating scientific discovery while maintaining enterprise-grade security and production scale. The code for this solution is available in the open-source toolkit repository of starter agents for life sciences on Amazon Web Services (AWS). The step by step instruction helps you deploy your own tools and infrastructure, along with AgentCore components, and examples.
Prototype-to-production complexity gap
Moving from a local biomedical research prototype to a production system accessible by multiple research teams requires addressing complex infrastructure challenges.
Agent deployment with enterprise security
Enterprise security challenges include OAuth-based authentication, secure tool sharing through scalable gateways, comprehensive observability for research audit trails, and automatic scaling to handle concurrent research workloads. Many promising prototypes fail to reach production because of the complexity of implementing these enterprise-grade requirements while maintaining the specialized domain expertise needed for accurate biomedical analysis.
Session-aware research context management
Biomedical research workflows often span multiple conversations and require persistent memory of previous analyses, experimental parameters, and research preferences across extended research sessions. Research agents must maintain contextual awareness of ongoing projects, remember specific protein targets, experimental conditions, and analytical preferences. All that must be done while facilitating proper session isolation between different researchers and research projects in a multi-tenant production environment.
Scalable tool gateway
Implementing a reusable tool gateway that can handle concurrent requests from research agent, proper authentication, and consistent performance becomes critical at scale. The gateway must enable agents to discover and use tools through secure endpoints, help agents find the right tools through contextual search capabilities, and manage both inbound authentication (verifying agent identity) and outbound authentication (connecting to external biomedical databases) in a unified service. Without this architecture, research teams face authentication complexity and reliability issues that prevent effective scaling.
Solution overview
We use Strands Agents, an open source agent framework, to build a research agent with local tool implementation for PubMed biomedical literature search. We extended the agent’s capabilities by integrating Biomni database tools, providing access to over 30 specialized biomedical databases.
The overall architecture is shown in the following diagram.

The AgentCore Gateway service centralizes Biomni database tools as more secure, reusable endpoints with semantic search capabilities. AgentCore Memory service maintains contextual awareness across research sessions using specialized strategies for research context. Security is handled by AgentCore Identity service, which manages authentication for both users and tool access control. Deployment is streamlined with the AgentCore Runtime service, providing scalable, managed deployment with session isolation. Finally, the AgentCore Observability service enables comprehensive monitoring and auditing of research workflows that are critical for scientific reproducibility.
Step 1 – Creating tools such as the Biomni database tools using AgentCore Gateway
In real-world use cases, we need to connect agents to different data sources. Each agent might duplicate the same tools, leading to extensive code, inconsistent behavior, and maintenance nightmares. AgentCore Gateway service streamlines this process by centralizing tools into reusable, secure endpoints that agents can access. Combined with the AgentCore Identity service for authentication, AgentCore Gateway creates an enterprise-grade tool sharing infrastructure. To give more context to the agent with reusable tools, we provided access to over 30 specialized public database APIs through the Biomni tools registered on the gateway. The gateway exposes Biomni’s database tools through the Model Context Protocol (MCP), allowing the research agent to discover and invoke these tools alongside local tools like PubMed. It handles authentication, rate limiting, and error handling, providing a seamless research experience.

def create_gateway(gateway_name: str, api_spec: list) -> dict:
# JWT authentication with Cognito
auth_config = {
“customJWTAuthorizer”: {
“allowedClients”: [
get_ssm_parameter(“/app/researchapp/agentcore/machine_client_id”)
],
“discoveryUrl”:
get_ssm_parameter(“/app/researchapp/agentcore/cognito_discovery_url”),
}
}

# Enable semantic search for BioImm tools
search_config = {“hcp”: {“searchType”: “SEMANTIC”}}

# Create the gateway
gateway = bedrock_agent_client.create_gateway(
name=gateway_name,
collectionexecution_role_arn,
protocolType=”MCP”,
authorizerType=”CUSTOM_JWT”,
authorizerConfiguration=auth_config,
protocolConfiguration=search_config,
description=”My App Template AgentCore Gateway”,
)

We use an
AWS Lambda function to host the Biomni integration code. The Lambda function is automatically configured as an MCP target in the AgentCore Gateway. The Lambda function exposes its available tools through the API specification (
api_spec.json).
# Gateway Target Configuration
lambda_target_config = {
“mcp”: {
“lambda”: {
“lambdaArn”: get_ssm_parameter(“/app/researchapp/agentcore/lambda_arn”),
“toolSchema”: {“inlinePayload”: api_spec},
}
}
}

# Create the target
create_target_response = gateway_client.create_gateway_target(
gatewayIdentifier=gateway_id,
name=”LambdaUsingSDK”,
description=”Lambda Target using SDK”,
targetConfiguration=lambda_target_config,
credentialProviderConfigurations=[{
“credentialProviderType”: “GATEWAY_IAM_ROLE”
}],
)
The full list of Biomni database tools included on the gateway are listed in the following table:

Group
Tool
Description

Protein and structure databases
UniProt
Query the UniProt REST API for comprehensive protein sequence and functional information

AlphaFold
Query the AlphaFold Database API for AI-predicted protein structure predictions

InterPro
Query the InterPro REST API for protein domains, families, and functional sites

PDB (Protein Data Bank)
Query the RCSB PDB database for experimentally determined protein structures

STRING
Query the STRING protein interaction database for protein-protein interaction networks

EMDB (Electron Microscopy Data Bank)
Query for 3D macromolecular structures determined by electron microscopy

Genomics and variants
ClinVar
Query NCBI’s ClinVar database for clinically relevant genetic variants and their interpretations

dbSNP
Query the NCBI dbSNP database for single nucleotide polymorphisms and genetic variations

gnomAD
Query gnomAD for population-scale genetic variant frequencies and annotations

Ensembl
Query the Ensembl REST API for genome annotations, gene information, and comparative genomics

UCSC Genome Browser
Query the UCSC Genome Browser API for genomic data and annotations

Expression and omics
GEO (Gene Expression Omnibus)
Query NCBI’s GEO for RNA-seq, microarray, and other gene expression datasets

PRIDE
Query the PRIDE database for proteomics identifications and mass spectrometry data

Reactome
Query the Reactome database for biological pathways and molecular interactions

Clinical and drug data
cBioPortal
Query the cBioPortal REST API for cancer genomics data and clinical information

ClinicalTrials.gov
Query ClinicalTrials.gov API for information about clinical studies and trials

OpenFDA
Query the OpenFDA API for FDA drug, device, and food safety data

GtoPdb (Guide to PHARMACOLOGY)
Query the Guide to PHARMACOLOGY database for drug targets and pharmacological data

Disease and phenotype
OpenTargets
Query the OpenTargets Platform API for disease-target associations and drug discovery data

Monarch Initiative
Query the Monarch Initiative API for phenotype and disease information across species

GWAS Catalog
Query the GWAS Catalog API for genome-wide association study results

RegulomeDB
Query the RegulomeDB database for regulatory variant annotations and functional predictions

Specialized databases
JASPAR
Query the JASPAR REST API for transcription factor binding site profiles and motifs

WoRMS (World Register of Marine Species)
Query the WoRMS REST API for marine species taxonomic information

Paleobiology Database (PBDB)
Query the PBDB API for fossil occurrence and taxonomic data

MPD (Mouse Phenome Database)
Query the Mouse Phenome Database for mouse strain phenotype data

Synapse
Query Synapse REST API for biomedical datasets and collaborative research data

The following are examples of how individual tools get triggered through the MCP from our test suite:

# Protein and Structure Analysis
“Use uniprot tool to find information about human insulin protein”
# → Triggers uniprot MCP tool with protein query parameters
“Use alphafold tool for structure predictions for uniprot_id P01308”
# → Triggers alphafold MCP tool for 3D structure prediction
“Use pdb tool to find protein structures for insulin”
# → Triggers pdb MCP tool for crystallographic structures
# Genetic Variation Analysis
“Use clinvar tool to find pathogenic variants in BRCA1 gene”
# → Triggers clinvar MCP tool with gene variant parameters
“Use gnomad tool to find population frequencies for BRCA2 variants”
# → Triggers gnomad MCP tool for population genetics data

As the tool collection grows, the agent can use built-in semantic search capabilities to discover and select tools based on the task context. This improves agent performance and reducing development complexity at scale. For example, the user asks, “tell me about HER2 variant rs1136201.” Instead of listing all 30 or more tools from the gateway back to the agent, semantic search returns ‘n’ most relevant tools. For example, Ensembl, Gwas catalog, ClinVar, and Dbsnp to the agent. The agent now uses a smaller subset of tools as input to the model to return a more efficient and faster response.
The following graphic illustrates using AgentCore Gateway for tool search.

You can now test your deployed AgentCore gateway using the following test scripts and compare how semantic search narrows down the list of relevant tools based on the search query.
uv run tests/test_gateway.py –prompt “What tools are available?”
uv run tests/test_gateway.py –prompt “Find information about human insulin protein” –use-search
Step 2- Strands research agent with a local tool
The following code snippet shows model initialization, implementing the PubMed local tool that’s declared using the Strands @tool decorator. We’ve implemented the PubMed tool in research_tools.py that calls PubMed APIs to enable biomedical literature search capabilities within the agent’s execution context.

PubMed Tool Creation

from agent.agent_config.tools.PubMed import PubMed

@tool(
name=”Query_pubmed”,
description=(
“Query PubMed for relevant biomedical literature based on the user’s query. ”
“This tool searches PubMed abstracts and returns relevant studies with ”
“titles, links, and summaries.”
),
)
def query_pubmed(query: str) -> str:
“””
Query PubMed for relevant biomedical literature based on the user’s query.

This tool searches PubMed abstracts and returns relevant studies with
titles, links, and summaries.

Args:
query: The search query for PubMed literature

Returns:
str: Formatted results from PubMed search
“””
pubmed = PubMed()

print(f”nPubMed Query: {query}n”)
result = pubmed.run(query)
print(f”nPubMed Results: {result}n”)

return result

Create the Strands research agent with the local tool and Claude Sonnet 4 Interleaved Thinking.

class ResearchAgent:
def __init__(
self,
bearer_token: str,
memory_hook: MemoryHook = None,
session_manager: AgentCoreMemorySessionManager = None,
bedrock_model_id: str = “us.anthropic.claude-sonnet-4-20250514-v1.0”,
#bedrock_model_id: str = “openai.gpt-oss-120b-1.0”, # Alternative
system_prompt: str = None,
tools: List[callable] = None,
):

self.model_id = bedrock_model_id
# For Anthropic Sonnet 4 interleaved thinking
self.model = BedrockModel(
model_id=self.model_id,
additional_request_fields={
“anthropic_beta”: [“interleaved-thinking-2025-05-14”],
“thinking”: {“type”: “enabled”, “budget_tokens”: 8000},
},
)

self.system_prompt = (
system_prompt
if system_prompt
else “””
You are a **Comprehensive Biomedical Research Agent** specialized in conducting
systematic literature reviews and multi-database analyses to answer complex biomedical research
questions. Your primary mission is to synthesize evidence from both published literature
(PubMed) and real-time database queries to provide comprehensive, evidence-based insights for
pharmaceutical research, drug discovery, and clinical decision-making.

Your core capabilities include literature analysis and extracting data from 30+ specialized
biomedical databases** through the Bioimm gateway, enabling comprehensive data analysis. The
database tool categories include genomics and genetics, protein structure and function, pathways
and system biology, clinical and pharmacological data, expression and omics data and other
specialized databases.
“””
)

In addition, we implemented citations that use a structured system prompt to enforce numbered in-text citations [1], [2], [3] with standardized reference formats for both academic literature and database queries, marking sure every data source is properly attributed. This allows researchers to quickly access and reference the scientific literature that supports their biomedical research queries and findings.

“””
<citation_requirements>
– ALWAYS use numbered in-text citations [1], [2], [3], etc. when referencing any data source
– Provide a numbered “References” section at the end with full source details
– For academic literature: format as “1. Author et al. Title. Journal. Year. ID: [PMID/DOI], available at: [URL]”
– For database sources: format as “1. Database Name (Tool: tool_name), Query: [query_description], Retrieved: [current_date]”
– Use numbered in-text citations throughout your response to support all claims and data points
– Each tool query and each literature source must be cited with its own unique reference number
– When tools return academic papers, cite them using the academic format with full bibliographic details
– Structure: Format each reference on a separate line with proper numbering – NO bullet points
– Present the References section as a clean numbered list, not a confusing paragraph
– Maintain sequential numbering across all reference types in a single “References” section
</citation_requirements>
“””

You can now test your agent locally:
uv run tests/test_agent_locally.py –prompt “Find information about human insulin protein”
uv run tests/test_agent_locally.py –prompt “Find information about human insulin protein” –use-search
Step 3 – Add Persistent Memory for contextual research assistance
The research agent implements the AgentCore Memory service with three strategies: semantic for factual research context, user_preference for research methodologies, and summary for session continuity. The AgentCore Memory session manager is integrated with Strands session management and retrieves relevant context before queries and save interactions after responses. This enables the agent to remember research preferences, ongoing projects, and domain expertise across sessions without manual context re-establishment.
# Test memory functionality with research conversations
python tests/test_memory.py load-conversation<br />python tests/test_memory.py load-prompt “My preferred response format is detailed explanations”
Step 4 – Deploy with AgentCore Runtime
To deploy our agent, we use AgentCore Runtime to configure and launch the research agent as a managed service. The deployment process configures the runtime with the agent’s main entrypoint (agent/main.py), assigns an IAM execution role for AWS service access, and supports both OAuth and IAM authentication modes. After deployment, the runtime becomes a scalable, serverless agent that can be invoked using API calls. The agent automatically handles session management, memory persistence, and tool orchestration while providing secure access to the Biomni gateway and local research tools.
agentcore configure –entrypoint agent/main.py -er arn:aws:iam::&lt;Account-Id&gt;:role/&lt;Role&gt; –name researchapp&lt;AgentName&gt;
For more information about deploying with AgentCore Runtime, see Get started with AgentCore Runtime in the Amazon Bedrock AgentCore Developer Guide.
Agents in action 
The following are three representative research scenarios that showcase the agent’s capabilities across different domains: drug mechanism analysis, genetic variant investigation, and pathway exploration. For each query, the agent autonomously determines which combination of tools to use, formulates appropriate sub-queries, analyzes the returned data, and synthesizes a comprehensive research report with proper citations. The accompanying demo video shows the complete agent workflow, including tools selection, reasoning, and response generation.

Conduct a comprehensive analysis of trastuzumab (Herceptin) mechanism of action and resistance mechanisms you’ll need:

HER2 protein structure and binding sites
Downstream signaling pathways affected
Known resistance mechanisms from clinical data
Current clinical trials investigating combination therapies
Biomarkers for treatment response predictionQuery relevant databases to provide a comprehensive research report.

Analyze the clinical significance of BRCA1 variants in breast cancer risk and treatment response. Investigate:

Population frequencies of pathogenic BRCA1 variants
Clinical significance and pathogenicity classifications
Associated cancer risks and penetrance estimates
Treatment implications (PARP inhibitors, platinum agents)
Current clinical trials for BRCA1-positive patients Use multiple databases to provide comprehensive evidence

The following video is a demonstration of a biomedical research agent:

Scalability and observability
One of the most critical challenges in deploying sophisticated AI agents is making sure they scale reliably while maintaining comprehensive visibility into their operations. Biomedical research workflows are inherently unpredictable—a single genomic analysis might process thousands of files, while a literature review could span millions of publications. Traditional infrastructure struggles with these dynamic workloads, particularly when handling sensitive research data that requires strict isolation between different research projects.In this deployment, we use Amazon Bedrock AgentCore Observability to visualize each step in the agent workflow. You can use this service to inspect an agent’s execution path, audit intermediate outputs, and debug performance bottlenecks and failures. For biomedical research, this level of transparency is not just helpful—it’s essential for regulatory compliance and scientific reproducibility.
Sessions, traces, and spans form a three-tiered hierarchical relationship in the observability framework. A session contains multiple traces, with each trace representing a discrete interaction within the broader context of the session. Each trace contains multiple spans that capture fine-grained operations. The following screenshot shoes the usage of one agent: Number of sessions, token usage, and error rate in production

The following screenshot shows the agents in production and their usage (number of Sessions, number of invocations)

The built-in dashboards show performance bottlenecks and identify why certain interactions might fail, enabling continuous improvement and reducing the mean time to detect (MTTD) and mean time to repair (MTTR). For biomedical applications where failed analyses can delay critical research timelines, this rapid issue resolution capability makes sure that research momentum is maintained.
Future direction
While this implementation focuses on only a subset of tools, the AgentCore Gateway architecture is designed for extensibility. Research teams can seamlessly add new tools without requiring code changes by using the MCP protocol. Newly registered tools are automatically discoverable by agents allowing your research infrastructure to evolve alongside the rapidly changing tool sets.
For computational analysis that requires code execution, the AgentCore Code Interpreter service can be integrated into the research workflow. With AgentCore Code Interpreter the research agent can retrieve data and execute Python-based analysis using domain-specific libraries like BioPython, scikit-learn, or custom genomics packages.
Future extensions could support multiple research agents to collaborate on complex projects, with specialized agents for literature review, experimental design, data analysis, and result interpretation working together through multi-agent collaboration. Organizations can also develop specialized research agents tailored to specific therapeutic areas, disease domains, or research methodologies that share the same enterprise infrastructure and tool gateway.
Looking ahead with Biomni
“Biomni today is already useful for academic research and open exploration. But to enable real discovery—like advancing drug development—we need to move beyond prototypes and make the system enterprise-ready. Embedding Biomni into the workflows of biotech and pharma is essential to turn research potential into tangible impact.
That’s why we are excited to integrate the open-source environment with Amazon Bedrock AgentCore, bridging the gap from research to production. Looking ahead, we’re also excited about extending these capabilities with the Biomni A1 agent architecture and the Biomni-R0 model, which will unlock even more sophisticated biomedical reasoning and analysis. At the same time, Biomni will remain a thriving open-source environment, where researchers and industry teams alike can contribute tools, share workflows, and push the frontier of biomedical AI together with AgentCore.”
Conclusion
This implementation demonstrates how organizations can use Amazon Bedrock AgentCore to transform biomedical research prototypes into production-ready systems. By integrating Biomni’s comprehensive collection of over 150 specialized tools through the AgentCore Gateway service, we illustrate how teams can create enterprise-grade tool sharing infrastructure that scales across multiple research domains.The combination of Biomni’s biomedical tools with the enterprise infrastructure of Bedrock AgentCore organizations can build research agents that maintain scientific rigor while meeting production requirements for security, scalability, and observability. Biomni’s diverse tool collection—spanning genomics, proteomics, and clinical databases—exemplifies how specialized research capabilities can be centralized and shared across research teams through a secure gateway architecture.
To begin building your own biomedical research agent with Biomni tools, explore the implementation by visiting our GitHub repository for the complete code and documentation. You can follow the step-by-step implementation guide to set up your research agent with local tools, gateway integration, and Bedorck AgentCore deployment. As your needs evolve, you can extend the system with your organization’s proprietary databases and analytical tools. We encourage you to join the growing environment of life sciences AI agents and tools by sharing your extensions and improvements.

About the authors
Hasan Poonawala is a Senior AI/ML Solutions Architect at AWS, working with Healthcare and Life Sciences customers. Hasan helps design, deploy and scale Generative AI and Machine learning applications on AWS. He has over 15 years of combined work experience in machine learning, software development and data science on the cloud. In his spare time, Hasan loves to explore nature and spend time with friends and family.
Pierre de Malliard is a Senior AI/ML Solutions Architect at Amazon Web Services and supports customers in the Healthcare and Life Sciences Industry. He is currently based in New York City.
Necibe Ahat is a Senior AI/ML Specialist Solutions Architect at AWS, working with Healthcare and Life Sciences customers. Necibe helps customers to advance their generative AI and machine learning journey. She has a background in computer science with 15 years of industry experience helping customers ideate, design, build and deploy solutions at scale. She is a passionate inclusion and diversity advocate.
Kexin Huang is a final-year PhD student in Computer Science at Stanford University, advised by Prof. Jure Leskovec. His research applies AI to enable interpretable and deployable biomedical discoveries, addressing core challenges in multi-modal modeling, uncertainty, and reasoning. His work has appeared in Nature Medicine, Nature Biotechnology, Nature Chemical Biology, Nature Biomedical Engineering and top ML venues (NeurIPS, ICML, ICLR), earning six best paper awards. His research has been highlighted by Forbes, WIRED, and MIT Technology Review, and he has contributed to AI research at Genentech, GSK, Pfizer, IQVIA, Flatiron Health, Dana-Farber, and Rockefeller University.

Make your web apps hands-free with Amazon Nova Sonic

Graphical user interfaces have carried the torch for decades, but today’s users increasingly expect to talk to their applications. Amazon Nova Sonic is a state-of-the-art foundation model from Amazon Bedrock, that helps enable this shift by providing natural, low-latency, bidirectional speech conversations over a simple streaming API. Users can collaborate with the applications through voice and embedded intelligence rather than merely operating them.
In this post we show how we added a true voice-first experience to a reference application—the Smart Todo App—turning routine task management into a fluid, hands-free conversation.
Rethinking user interaction through collaborative AI voice agents
Important usability enhancements are often deprioritized—not because they aren’t valuable, but because they’re difficult to implement within traditional mouse-and-keyboard interfaces. Features like intelligent batch actions, personalized workflows, or voice-guided assistance are frequently debated but deferred due to UI complexity. This is about voice as an additional, general-purpose interaction mode—not a replacement for device-specific controls or an accessibility-only solution. Voice enables new interaction patterns, it also benefits users of assistive technologies, such as screen readers, by offering an additional, inclusive way to interact with the application.
Amazon Nova Sonic goes far beyond one-shot voice commands. The model can plan multistep workflows, call backend tools, and keep context across turns so that your application can collaborate with the users.
The following table shows voice interactions from different application domains, like task management, CRM, and help desk.

Voice interaction (example phrase)
Intent / goal
System action / behavior
Confirmation / UX

Mark all my tasks as complete.
Bulk-complete tasks
Find user’s open tasks → mark complete → archive if configured
All 12 open tasks are marked complete.

Create a plan for preparing the Q3 budget: break it into steps, assign owners, and set deadlines.
Create multistep workflow
Generate plan → create tasks → assign owners → set deadlines → surface review options
Plan created with 6 tasks. Notify owners?

Find enterprise leads in APAC with ARR over $1M and draft personalized outreach.
Build targeted prospect list and draft outreach
Query CRM → assemble filtered list → draft personalized messages for review
Drafted 24 personalized outreach messages. Review and send?

Prioritize all P1 tickets opened in the last 24 hours and assign them to on-call.
Triage and assign
Filter tickets → set priority → assign to on-call → log changes
12 P1 tickets prioritized and assigned to the on-call team.

Amazon Nova Sonic understands the intent, invokes the required APIs, and confirms the results—no forms required. This helps to create an environment where productivity is multiplied, and context becomes the interface. It’s not about replacing traditional UI, it’s about unlocking new capabilities through voice.
The sample application at a glance
With the Smart Todo reference application, users can create to-do lists and manage notes within those lists. The application offers a focused yet flexible interface for task tracking and note organization. With the addition of voice, the application becomes a hands-free experience that unlocks more natural and productive interactions. In Smart Todo App, users can say:

“Add a note to follow up on the project charter.”
“Archive all completed tasks.”

Behind each command are focused actions—like creating a new note, organizing content, or updating task status—executed through speech in a way that feels natural and efficient.
How Amazon Nova Sonic bidirectional APIs work
Amazon Nova Sonic implements a real-time, bidirectional streaming architecture. After a session is initiated with InvokeModelWithBidirectionalStream, audio input and model responses flow simultaneously over an open stream:

Session Start – Client sends a sessionStart event with model configuration (for example, temperature and topP).
Prompt and Content Start – Client sends structured events indicating whether upcoming data is audio, text, or tool input.
Audio Streaming – Microphone audio is streamed as base64-encoded audio input events.
Model Responses – As the model processes input, it streams the following responses asynchronously:

Automatic speech recognition (ASR) results
Tool use invocations
Text responses
Audio output for playback

Session Close – Conversations are explicitly closed by sending contentEnd, promptEnd, and sessionEnd events.

Nova Sonic Architecture Diagram

You can use this event-driven approach to interrupt the assistant (barge-in), enable multi-turn conversations, and support real-time adaptability.
Solution architecture
For this solution, we use a serverless application architecture pattern, where the UI is a React single page application. The React single page application is integrated with backend web APIs running on server-side containers. The Smart Todo App is deployed using a scalable and security-aware AWS architecture that’s designed to support real-time voice interactions. The following image provides an architecture overview of AWS services working together to support bidirectional streaming needs of a voice enabled application.

Key AWS services include:

Amazon Bedrock – Powers real-time, bidirectional speech interactions through the Amazon Nova Sonic foundation model.
Amazon CloudFront – A content delivery network (CDN) that distributes the application globally with low latency. It routes /(root) traffic to the React application hosted on an Amazon S3 bucket and /api and /novasonic traffic to the Application Load Balancer.
AWS Fargate for Amazon Amazon Elastic Container Service (Amazon ECS) – Runs the backend containerized services for WebSocket handling and REST APIs capable of supporting long lived bidirectional streams.
Application Load Balancer (ALB) – Forwards web traffic /api (HTTPS REST API calls) to backend ECS services, handling Smart Todo App APIs, and /novasonic (WebSocket connections) to ECS services managing real-time voice streaming with Amazon Nova Sonic.
Amazon Virtual Private Cloud (Amazon VPC) – Provides network isolation and security for backend services. The Public Subnets host the Application Load Balancer (ALB) and Private Subnets host ECS Fargate tasks running WebSocket and REST APIs.
NAT Gateway allows Amazon ECS tasks in private subnets to more securely connect to the internet for operations like Cognito JWT token verification endpoints.
Amazon Simple Storage Service (Amazon S3) –Hosts React frontend for user interactions
AWS WAF – Helps protect the Application Load Balancer (ALB) from malicious traffic and enforces security rules at the application layer.
Amazon Cognito – Manages authentication and issues tokens.
Amazon DynamoDB – Stores application data such as to-do lists and notes.

The following image illustrates how the user requests are served with support for low-latency bidirectional streaming.

Request Workflow

Deploying the solution
To evaluate this solution, we provided sample code of a Smart Todo App available at GitHub repository.
Smart Todo App consists of multiple independent Node.js projects, including a CDK infrastructure project, a React frontend application, and backend API services. The deployment workflow makes sure that the components are correctly built and integrated with AWS services like Amazon Cognito, Amazon DynamoDB, and Amazon Bedrock.
Prerequisites

AWS account with appropriate permissions that facilitate security best practices, including least-privilege permissions.
Docker Engine installed locally and running to build container image locally.
AWS CLI configured with AWS admin credentials.
Node.js >= 20.x and npm installed.
Amazon Nova Sonic enabled in Amazon Bedrock. For more information, see Add or remove access to Amazon Bedrock foundation models.

Deployment steps

Clone the following repository:

git clone https://github.com/aws-samples/sample-amazon-q-developer-vibe-coded-projects.git
cd NovaSonicVoiceAssistant

For first-time deployment, use the following automated script:

npm run deploy:first-time

This script will:

Install the dependencies using npm (node package manager)
Build the components and container image using locally installed docker engine
Deploy the infrastructure using CDK (CDK BootStrap ==> CDK Synth ==> CDK Deploy)
Update environment variables with Amazon Cognito settings
Rebuild the UI with updated environment variables
Deploy the final infrastructure (CDK Deploy)

Verifying deployment
After deployment is successful, complete the following steps:

Access the Amazon CloudFront URL provided in the CDK outputs. Note: The URL shown in the image is for reference only, every deployment will get a unique URL.

Successful deployment screen shot

Create a new user by signing up using the Create Account section.

Create User and Log in

Test the voice functionality to verify the integration with Amazon Nova Sonic. The following image illustrates a conversation between the signed-in user and the Amazon Bedrock agent. The AI agent is able to invoke existing APIs, and the UI is updated in real time to reflect agent’s actions.

Granting Microphone access to the application

Voice interaction in Smart Todo App

Clean up
You can remove the stacks with the following command.

# move to the infra folder, assuming you are in the project’s root folder
cd infra
# Removes the AWS stack
npm run destroy

Next steps
Voice isn’t just an accessibility add-on—it’s becoming the primary interface for complex workflows. Turns out talking is faster than selecting—especially when your app talks back.
Try these resources to get started.

Sample Code repo – A working Amazon Nova Sonic integration you can run locally. See how real-time voice interactions, intent handling, and multistep flows are implemented end to end.
Amazon Nova Sonic hands-on workshop – A guided lab that walks you through deploying Amazon Nova Sonic in your AWS account and testing voice-native features.
Amazon Nova Sonic docs – Provides API reference, streaming examples, and best practices to help you design and deploy voice-driven workflows.
Contact your AWS account team to learn more about how AI-driven solutions can transform your operations.

About the authors
Manu Mishra is a Senior Solutions Architect at AWS, specializing in artificial intelligence, data and analytics, and security. His expertise spans strategic oversight and hands-on technical leadership, where he reviews and guides the work of both internal and external customers. Manu collaborates with AWS customers to shape technical strategies that drive impactful business outcomes, providing alignment between technology and organizational goals.
AK Soni is a Senior Technical Account Manager with AWS Enterprise Support, where he empowers enterprise customers to achieve their business goals by offering proactive guidance on implementing innovative cloud and AI/ML-based solutions aligned with industry best practices. With over 19 years of experience in enterprise application architecture and development, he uses his expertise in generative AI technologies to enhance business operations and overcome existing technological limitations.
Raj Bagwe is a Senior Solutions Architect at Amazon Web Services, based in San Francisco, California. With over 6 years at AWS, he helps customers navigate complex technological challenges and specializes in Cloud Architecture, Security and Migrations. In his spare time, he coaches a robotics team and plays volleyball. He can be reached at X handle @rajesh_bagwe.