From Transformers to Associative Memory, How Titans and MIRAS Rethink Long Context Modeling

What comes after Transformers? Google Research is proposing a new way to give sequence models usable long term memory with Titans and MIRAS, while keeping training parallel and inference close to linear.

Titans is a concrete architecture that adds a deep neural memory to a Transformer style backbone. MIRAS is a general framework that views most modern sequence models as instances of online optimization over an associative memory.

Why Titans and MIRAS?

Standard Transformers use attention over a key value cache. This gives strong in context learning, but cost grows quadratically with context length, so practical context is limited even with FlashAttention and other kernel tricks.

Efficient linear recurrent neural networks and state space models such as Mamba-2 compress the history into a fixed size state, so cost is linear in sequence length. However, this compression loses information in very long sequences, which hurts tasks such as genomic modeling and extreme long context retrieval.

Titans and MIRAS combine these ideas. Attention acts as a precise short term memory on the current window. A separate neural module provides long term memory, learns at test time, and is trained so that its dynamics are parallelizable on accelerators.

https://research.google/blog/titans-miras-helping-ai-have-long-term-memory/

Titans, a neural long term memory that learns at test time

The Titans research paper introduces a neural long term memory module that is itself a deep multi layer perceptron rather than a vector or matrix state. Attention is interpreted as short term memory, since it only sees a limited window, while the neural memory acts as persistent long term memory.

For each token, Titans defines an associative memory loss

ℓ(Mₜ₋₁; kₜ, vₜ) = ‖Mₜ₋₁(kₜ) − vₜ‖²

where Mₜ₋₁ is the current memory, kₜ is the key and vₜ is the value. The gradient of this loss with respect to the memory parameters is the “surprise metric”. Large gradients correspond to surprising tokens that should be stored, small gradients correspond to expected tokens that can be mostly ignored.

The memory parameters are updated at test time by gradient descent with momentum and weight decay, which together act as a retention gate and forgetting mechanism. To keep this online optimization efficient, the research paper shows how to compute these updates with batched matrix multiplications over sequence chunks, which preserves parallel training across long sequences.
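The update rule can be made concrete with a small sketch. The snippet below is a hedged illustration, not the authors' implementation: the memory is reduced to a single linear map W for brevity (Titans uses a deep MLP), and the learning rate, momentum and decay values are arbitrary.

import numpy as np

def titans_memory_step(W, m, k, v, lr=0.1, momentum=0.9, decay=0.01):
    # Associative memory loss: || W @ k - v ||^2. Its gradient is the "surprise metric".
    err = W @ k - v
    grad = 2.0 * np.outer(err, k)      # large for surprising tokens, small for expected ones
    m = momentum * m + grad            # momentum carries surprise across nearby tokens
    W = (1.0 - decay) * W - lr * m     # weight decay acts as the retention / forgetting gate
    return W, m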

Architecturally, Titans uses three memory branches in the backbone, most notably in the Titans MAC (memory as context) variant:

a core branch that performs standard in context learning with attention

a contextual memory branch that learns from the recent sequence

a persistent memory branch with fixed weights that encodes pretraining knowledge

The long term memory compresses past tokens into a summary, which is then passed as extra context into attention. Attention can choose when to read that summary.

Experimental results for Titans

On language modeling and commonsense reasoning benchmarks such as C4, WikiText and HellaSwag, Titans architectures outperform state of the art linear recurrent baselines such as Mamba-2 and Gated DeltaNet, as well as Transformer++ models of comparable size. The Google research team attributes this to the higher expressive power of deep memory and its ability to maintain performance as context length grows. Deep neural memories with the same parameter budget but higher depth give consistently lower perplexity.

For extreme long context recall, the research team uses the BABILong benchmark, where facts are distributed across very long documents. Titans outperforms all baselines, including very large models such as GPT-4, while using many fewer parameters, and scales to context windows beyond 2,000,000 tokens.

The research team reports that Titans keeps efficient parallel training and fast linear inference. Neural memory alone is slightly slower than the fastest linear recurrent models, but hybrid Titans layers with Sliding Window Attention remain competitive on throughput while improving accuracy.

https://arxiv.org/pdf/2504.13173

MIRAS, a unified framework for sequence models as associative memory

The MIRAS research paper, “It’s All Connected: A Journey Through Test Time Memorization, Attentional Bias, Retention, and Online Optimization,” generalizes this view. It observes that modern sequence models can be seen as associative memories that map keys to values while balancing learning and forgetting.

MIRAS defines any sequence model through four design choices:

Memory structure, for example a vector, linear map, or MLP

Attentional bias, the internal loss that defines which similarities the memory cares about

Retention gate, the regularizer that keeps the memory close to its past state

Memory algorithm, the online optimization rule, often gradient descent with momentum

Using this lens, MIRAS recovers several families:

Hebbian style linear recurrent models and RetNet as dot product based associative memories

Delta rule models such as DeltaNet and Gated DeltaNet as MSE based memories with value replacement and specific retention gates

Titans LMM as a nonlinear MSE based memory with local and global retention optimized by gradient descent with momentum

Crucially, MIRAS then moves beyond the usual MSE or dot product objectives. The research team constructs new attentional biases based on Lₚ norms, robust Huber loss and robust optimization, and new retention gates based on divergences over probability simplices, elastic net regularization and Bregman divergence.
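To make these design choices tangible, here is a hedged sketch of one such combination, a linear memory with a Huber attentional bias and a simple retention gate that pulls the memory toward its previous state. It illustrates the structure of the framework and is not the Moneta, Yaad or Memora implementation.

import numpy as np

def huber_grad(err, delta=1.0):
    # Gradient of the Huber loss with respect to the residual
    return np.where(np.abs(err) <= delta, err, delta * np.sign(err))

def miras_step(W, W_prev, k, v, lr=0.1, retention=0.05):
    err = W @ k - v                          # memory structure: a linear map, for brevity
    grad = np.outer(huber_grad(err), k)      # attentional bias: robust Huber loss
    grad += retention * (W - W_prev)         # retention gate: stay close to the past state
    return W - lr * grad                     # memory algorithm: plain gradient descent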

From this design space, the research team instantiates three attention free models:

Moneta uses a 2 layer MLP memory with Lₚ attentional bias and a hybrid retention gate based on generalized norms

Yaad uses the same MLP memory with Huber loss attentional bias and a forget gate related to Titans

Memora uses regression loss as attentional bias and a KL divergence based retention gate over a probability simplex style memory.

These MIRAS variants replace attention blocks in a Llama style backbone, use depthwise separable convolutions in the Miras layer, and can be combined with Sliding Window Attention in hybrid models. Training remains parallel by chunking sequences and computing gradients with respect to the memory state from the previous chunk.

In research experiments, Moneta, Yaad and Memora match or surpass strong linear recurrent models and Transformer++ on language modeling, commonsense reasoning and recall intensive tasks, while maintaining linear time inference.

Key Takeaways

Titans introduces a deep neural long term memory that learns at test time, using gradient descent on an L2 associative memory loss so the model selectively stores only surprising tokens while keeping updates parallelizable on accelerators.

Titans combines attention with neural memory for long context, using branches like core, contextual memory and persistent memory so attention handles short range precision and the neural module maintains information over sequences beyond 2,000,000 tokens.

Titans outperforms strong linear RNNs and Transformer++ baselines, including Mamba-2 and Gated DeltaNet, on language modeling and commonsense reasoning benchmarks at comparable parameter scales, while staying competitive on throughput.

On extreme long context recall benchmarks such as BABILong, Titans achieves higher accuracy than all baselines, including larger attention models such as GPT-4, while using fewer parameters and still enabling efficient training and inference.

MIRAS provides a unifying framework for sequence models as associative memories, defining them by memory structure, attentional bias, retention gate and optimization rule, and yields new attention free architectures such as Moneta, Yaad and Memora that match or surpass linear RNNs and Transformer++ on long context and reasoning tasks.


How AWS delivers generative AI to the public sector in weeks, not years

When critical services depend on quick action, from the safety of vulnerable children to environmental protection, you need working AI solutions in weeks, not years. Amazon recently announced an investment of up to $50 billion in expanded AI and supercomputing infrastructure for US government agencies, demonstrating both the urgency and commitment from Amazon Web Services (AWS) to accelerating public sector innovation. The AWS Generative AI Innovation Center is already making this happen, consistently delivering production-ready solutions for government organizations.
What makes this time different
The convergence of three factors makes this technology moment different:

Mission urgency – Public sector organizations currently face the challenge of managing both growing workloads in mission-critical areas, such as veterans’ benefits claims and bridge safety inspections, and workforce and budget limitations.
Technology readiness – Production-ready AI solutions can now be deployed securely and at scale, with unprecedented compute capacity being built specifically for US government requirements.
Proven success models – Early adopters have demonstrated that rapid AI implementation is possible in government settings, creating blueprints for others to follow.

Drawing from over a thousand implementations, the Generative AI Innovation Center combines AWS infrastructure and security conformance to help you transform mission delivery.

Accelerating real-world innovation
Public sector organizations working to improve mission speed and effectiveness can collaborate with the Innovation Center to develop targeted solutions. These three case studies show this approach in action.
AI systems that support critical care to protect vulnerable children
When protecting a child’s welfare, having key information surface at exactly the right moment is crucial. Systems must work reliably, every time.
This was the challenge the Miracle Foundation faced when managing foster care caseloads globally. In the span of weeks, the Innovation Center worked alongside caseworkers to build a production AI assistant that analyzes case files, flags urgent situations, and recommends evidence-based interventions tailored to each child’s unique circumstances.
“When a caseworker misses an urgent signal in a child’s file, it can have life-changing consequences,” explains Innovation Center strategist Brittany Roush. “We were building a system that needed to surface critical information at exactly the right moment.”
The solution aims to help caseworkers make faster, more informed decisions for vulnerable children around the world. It also includes built-in enterprise-grade security, designed for scalability and delivered with comprehensive knowledge transfer so the Miracle Foundation team can fully manage and evolve their system.
It’s important to start with actual users on day one. The Miracle Foundation team interfaced directly with caseworkers to understand workflows before writing a single line of code. This user-first approach removed months of work to gather requirements and iterate through revisions.
Innovation at institutional scale
The University of Texas at Austin (UT Austin) approached the Innovation Center about personalized academic support for 52,000 students. The team delivered UT Sage, a production AI tutoring service designed by learning scientists and trained by faculty, which is now in open beta across the UT Austin campus. Unlike generic AI tools, UT Sage provides custom, course-specific support while maintaining academic integrity standards. “It’s like having a knowledgeable teaching assistant available whenever you need help,” one student reported during testing.
“The UT Sage project empowers our faculty to create personalized learning tools, designed to motivate student engagement,” said Julie Schell, Assistant Vice Provost and Director of the Office of Academic Technology. “With the potential to deploy across hundreds of courses, we are aiming to enhance learning outcomes and reduce the time and effort required to design student-centered, high-quality course materials.”
Build flexible foundations, not point solutions. The team architected UT Sage as a service that faculty could adapt to specific courses. This extensible design enabled institutional scale from day one, avoiding the trap of a successful pilot that never scales, which can plague technology projects.
Transforming government speed with the EPA
The U.S. Environmental Protection Agency partnered with the Innovation Center to transform document processing workflows that used to take weeks or months. The team, in partnership with the EPA, delivered two breakthrough solutions that demonstrate both the team’s velocity and increasing technical complexity:

Chemical risk assessment acceleration – An intelligent document processing system that evaluates research studies against predetermined scientific criteria. What once required hours of manual review by EPA scientists now takes minutes. The system achieved an 85% reduction in processing time while maintaining 85% accuracy. Processing 250 documents costs the team $40 through the system, compared to requiring 500 hours of scientist time manually.
Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA) application reviews – Automated creation of data evaluation records (DERs) from health and safety studies for pesticide applications under FIFRA. This process traditionally took EPA reviewers 4 months of manual work. The AI solution now generates these critical regulatory documents in seconds, achieving a 99% cost reduction while potentially accelerating approval timelines for safe pesticide products.

Both solutions incorporate rigorous human-in-the-loop review processes to maintain scientific integrity and regulatory compliance alignment. EPA scientists oversee AI-generated assessments, but they can now focus their expertise on analysis and decision-making rather than manual data processing.
“We’re not replacing scientific judgment,” explained an EPA team member. “We’re eliminating the tedious work so our scientists can spend more time on what matters most—protecting public health and the environment.”
The EPA cases demonstrate that AI augmentation can deliver both speed and trust. The team designed review workflows into the architecture to improve trust, making the systems immediately acceptable to scientific staff and leadership.
Strategies to increase the pace of innovation
Experts at the Innovation Center have developed several strategies to help organizations excel with generative AI. To facilitate building your own production systems and increase the pace of innovation, follow these best practices:

Build on day 1, not week 6 – Traditional projects spend months on requirements and architecture. The Innovation Center starts building immediately, using extensive libraries of reusable, secure infrastructure-as-code (IaC) components. They also use tools such as Kiro, an AI integrated development environment (IDE) that efficiently converts developer prompts into detailed specifications and working code. This approach has been refined with each engagement, meaning the team is building increasingly complex use cases faster than ever before. Access to the expanded government AI infrastructure of AWS can further accelerate this development process, so you can tackle increasingly sophisticated use cases.
Get the right people on your team – Each engagement brings together scientists, architects, security experts, and domain specialists who understand public sector missions. This cross-functional composition minimizes the typical back-and-forth that often complicates requirement gathering and refinement. Everyone who’s needed to make decisions is already in the discussion, collaboratively working toward a common goal.
Knowledge transfer happens throughout, not at the end – Don’t wait to think about technology hand-offs. Advancing a project to the next team without prior coordination is rarely an effective strategy. The deep collaboration between stakeholders working alongside Innovation Center experts happens throughout development. Knowledge transfer occurs naturally in daily collaboration, with formal documentation being handed off at the end. The Innovation Center team then continues to support in an advisory capacity until the solution goes into production.
Harness the secure and reliable infrastructure and services of AWS – For public sector organizations, moving fast can’t mean compromising on security or compliance. Every solution is architected on secure AWS infrastructure with the ability to meet even stringent Federal Risk and Authorization Management Program (FedRAMP) High requirements. The Innovation Center follows a secure-by-design approach where compliance alignment is woven into the entire development lifecycle. By making compliance alignment concurrent, not sequential, the team demonstrates that security and speed aren’t trade-offs. The upcoming expansion of the AWS government cloud infrastructure further strengthens these security and compliance capabilities, providing you with one of the most comprehensive and secure AI computing environments.

Next steps in public sector AI
Every case study in this post started with a specific, pressing challenge. Each example achieved institutional scale by delivering value quickly, not by waiting for the perfect moment. Start with one persistent operational need, deliver results in weeks, then expand. With the AWS investment of up to $50 billion in purpose-built government AI infrastructure, these transformations can now happen at even greater scale and speed. Each successful engagement creates a blueprint for the next, continuously expanding what’s possible for public sector AI.
Learn more about the AWS Generative AI Innovation Center and how they’re helping public sector organizations turn AI potential into production reality.

About the authors
Kate Zimmerman serves as the Generative AI Innovation Center Geo Leader for Worldwide Public Sector at AWS. Kate leads a team of generative AI strategists and scientists, architecting innovative solutions for public sector organizations globally. Her role combines strategic leadership with hands-on technical expertise, and she works directly with Director, VP, and C-level executives to drive GenAI adoption and deliver mission-critical outcomes. With 13+ years of experience spanning commercial cloud, defense, national security, and aerospace, Kate brings a unique perspective to driving transformative AI/ML solutions. Previously, as Chief Scientist & VP of Data and Analytics at HawkEye 360, she led 50+ developers, engineers, and scientists to launch the company’s first production AI/ML capabilities. Her tenure at AWS included leadership roles as Senior Manager & Principal Architect of the ML Solutions Lab, where she accelerated AI/ML adoption among national security customers, and Senior Solutions Architect supporting the National Reconnaissance Office. Kate also served in the USAF on active duty for 5 years developing advance satellite systems and continues to serve as a reservist supporting strategic AI/ML initiatives with the USAF 804th Test Group.
Sri Elaprolu serves as Director of the AWS Generative AI Innovation Center, where he leverages nearly three decades of technology leadership experience to drive artificial intelligence and machine learning innovation. In this role, he leads a global team of machine learning scientists and engineers who develop and deploy advanced generative and agentic AI solutions for enterprise and government organizations facing complex business challenges. Throughout his nearly 13-year tenure at AWS, Sri has held progressively senior positions, including leadership of ML science teams that partnered with high-profile organizations such as the NFL, Cerner, and NASA. These collaborations enabled AWS customers to harness AI and ML technologies for transformative business and operational outcomes. Prior to joining AWS, he spent 14 years at Northrop Grumman, where he successfully managed product development and software engineering teams. Sri holds a Master’s degree in Engineering Science and an MBA with a concentration in general management, providing him with both the technical depth and business acumen essential for his current leadership role.

S&P Global Data integration expands Amazon Quick Research capabilities

Today, we are pleased to announce a new integration between Amazon Quick Research and S&P Global. This integration brings both S&P Global Energy news, research, and insights and S&P Global Market Intelligence data to Quick Research customers in one deep research agent.
The S&P Global integration extends the capabilities of Quick Research so that business professionals can analyze multiple data sources—including global energy news and premium financial intelligence—in one workspace, eliminating the need to switch between platforms and transforming weeks of research into minutes of focused insight generation. Quick Suite connects information across internal repositories, popular applications, AWS services, and through Model Context Protocol (MCP) integrations, to over 1,000 apps. This agentic AI application is reshaping how work gets done by transforming how teams find insights, conduct deep research, automate tasks, visualize data, and take actions across apps.
In this post, we explore S&P Global’s data sets and the solution architecture of the integration with Quick Research.
Solution overview
S&P Global has pioneered two MCP server implementations on AWS so organizations can easily integrate trusted financial services and energy content into AI-powered workflows while maintaining the quality, security, and reliability that business leaders demand.

“Our collaboration with AWS expands how S&P Global delivers trusted intelligence through the next generation of agentic AI experiences. By working alongside leading AI companies, our goal is to ensure customers can access our trusted data and insights wherever their workflows take place.” 
– Bhavesh Dayalji, Chief AI Officer of S&P Global and CEO of Kensho.

S&P Global Energy: Comprehensive commodity and energy intelligence
The S&P Global Energy integration, now available in Amazon Quick Research, utilizes an AI Ready Data MCP server to deliver comprehensive access to commodity and energy market intelligence spanning Oil, Gas, Power, Metals, Clean Energy, Agriculture, and Shipping sectors across global markets. Built on S&P Global’s reputation as a trusted market authority, the MCP server uses hundreds of thousands of expert-created documents including analyses, commentaries, and news articles reflecting decades of industry expertise.
The solution provides a unique multi-horizon perspective, offering intelligence from daily market updates to one-year outlooks and extending to 20+ year scenario analyses. With data refreshing every 30 minutes, business leaders gain near real-time access to commodity and energy intelligence, dramatically accelerating decision velocity when exploring regulatory challenges, investment opportunities, or environmental implications.
S&P Global Market Intelligence: Trusted financial intelligence
The S&P Global Market Intelligence integration, now available in Amazon Quick Research, uses the Kensho LLM-ready API MCP server developed by Kensho, S&P Global’s AI innovation hub. This MCP server makes trusted financial data accessible through natural language queries, integrating seamlessly with Amazon Quick Research. Financial professionals can access S&P Capital IQ Financials, earnings call transcripts, company information, transactions and more, simply by asking questions.
The Kensho solution addresses a critical challenge in financial services: making vast repositories of financial data immediately accessible without requiring complex query languages or technical expertise. Engineering, product, and business teams can save significant time and resources by transforming what once required hours of data extraction into conversational queries that return precise, trusted information in seconds.
Solution architecture
S&P Global’s MCP server architecture is shown in the following diagram. When using one of the S&P integrations, traffic flows from Quick Research through an Amazon API Gateway to an Application Load Balancer, with the MCP services hosted on Amazon Elastic Kubernetes Service (Amazon EKS). The MCP server uses data hosted in Amazon S3 and Amazon Relational Database Service (Amazon RDS) for PostgreSQL for structured data, and Amazon OpenSearch Service for vector storage. This architecture delivers enterprise-ready MCP servers with defense-in-depth security, automated scaling, and comprehensive observability.

MCP is an open standard that supports seamless communication between AI agents and external data sources, tools, and services. MCP operates on a client-server architecture where MCP servers handle tool calls, typically consisting of multiple API calls, and expose business logic implementations as callable functions. This enables AI agents to discover capabilities dynamically, negotiate features, and share context securely, all critical requirements for enterprise-grade applications.
S&P Global’s solution has the following key building blocks:

Automated data pipeline with Amazon Bedrock: At the heart of the solution is a Retrieval Augmented Generation (RAG) data ingestion pipeline using Amazon Bedrock. This pipeline transforms raw market data into AI Ready Data. Documents from S&P Global’s proprietary repositories undergo preprocessing, chunking, and enrichment before being converted into vector embeddings using an Amazon Bedrock hosted Cohere Embed model (see the sketch after this list). The ingestion pipeline runs on a scheduled basis, refreshing the OpenSearch vector store every 30 minutes for near real-time access to the energy data.
Vector and semantic search: Amazon OpenSearch serves as the vector database, storing embeddings generated by Bedrock and enabling semantic search capabilities across S&P Global’s energy data. The OpenSearch vector store is optimized for high-dimensional vector operations, supporting rapid similarity searches that power the MCP servers’ ability to retrieve contextually relevant information in response to natural language queries.
Resilience and scale: This solution uses Amazon EKS to host all MCP server solutions with two production clusters enabling traffic splitting and failover capabilities. This dual-cluster approach provides continuous availability even during unexpected failures. Both the Cluster Autoscaler and Horizontal Pod Autoscaler enable dynamic scaling based on demand. The MCP servers are built with the FastMCP framework, providing high-performance HTTP endpoints that comply with the Streamable HTTP Transport specification required by the MCP protocol.
Security: Security is built into every layer of the solution. API Gateway serves as the endpoint for MCP server access. S&P Global’s enterprise identity provider is used for OAuth authentication. API Gateway is further secured with AWS WAF for advanced threat detection. AWS IAM roles and policies enforce least privilege principles, so that each component has only the permissions it requires. AWS Secrets Manager securely stores credentials for accessing resources and AWS services. Security groups and VPC configurations provide network isolation, while TLS 1.2+ with certificates from AWS Certificate Manager keeps all data in transit encrypted. Together, these layers provide defense-in-depth security controls.
Observability: Amazon CloudWatch provides centralized logging, metrics collection, and real-time monitoring of the entire pipeline from data ingestion through MCP server responses. AWS CloudTrail captures detailed API activity logs and audit trails, essential for compliance in regulated industries.
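As a rough illustration of the ingestion step in the first building block above, the following hedged sketch embeds a document chunk with a Cohere Embed model on Amazon Bedrock and indexes the vector into OpenSearch. The model ID, endpoint and index name are placeholders, since the post does not publish S&P Global’s actual configuration.

import json
import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime")
search = OpenSearch(hosts=[{"host": "<opensearch-endpoint>", "port": 443}], use_ssl=True)

def embed_and_index(doc_id, chunk_text):
    # Generate an embedding with a Bedrock-hosted Cohere Embed model (model ID assumed)
    resp = bedrock.invoke_model(
        modelId="cohere.embed-english-v3",
        body=json.dumps({"texts": [chunk_text], "input_type": "search_document"}),
    )
    embedding = json.loads(resp["body"].read())["embeddings"][0]
    # Store the chunk and its vector in a hypothetical index
    search.index(index="energy-insights", id=doc_id,
                 body={"text": chunk_text, "embedding": embedding})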

Conclusion
Together, these MCP servers built on AWS and integrated into Amazon Quick Research demonstrate S&P Global’s vision for the future of financial services and energy intelligence: maintaining the trust, accuracy, and depth that business leaders require while embracing the transformative potential of AI to make that intelligence more accessible, actionable, and integrated into modern workflows.
Ready to get started? Please refer to Quick Research Third Party Data for more details.

About the authors
Jon Einkauf is a Product leader at AWS based in Seattle, where he focuses on building AI-powered tools that help businesses synthesize information and accelerate research. With over a decade of experience at Amazon spanning digital health, cloud computing, and AI products, he has led cross-functional teams in product management, engineering, and design to deliver innovative solutions for customers worldwide.
Prasanth Ponnoth is an AWS Solutions Architect supporting global financial services customers, with more than 20 years of industry and technology experience in cloud migration, modernization, and building distributed systems at scale. His areas of interest are machine learning, containers and Kubernetes, and open-source technologies. At AWS, he is part of the machine learning technical field community and focuses on Amazon Bedrock, Amazon SageMaker AI, and Amazon Bedrock AgentCore services.
Brandon Pominville is a Senior Solutions Architect at AWS based in New York, where he works with global financial services customers to build secure, scalable data and AI platforms in the cloud. With over 20 years of experience across financial services, enterprise data platforms, and cloud computing, he specializes in translating business requirements into technical solutions. Outside of work, Brandon enjoys spending time with his family outdoors or on a cruise ship, and playing volleyball.

Streamline AI agent tool interactions: Connect API Gateway to AgentCore Gateway

AgentCore Gateway now supports API Gateway

As organizations explore the possibilities of agentic applications, they continue to navigate challenges of using enterprise data as context in invocation requests to large language models (LLMs) in a manner that is secure and aligned with enterprise policies. To help standardize and secure those interactions, many organizations are using the Model Context Protocol (MCP) specification, which defines how agentic applications can securely connect to data sources and tools.
While MCP has been advantageous for net new use cases, organizations also navigate challenges with bringing their existing API estate into the agentic era. MCP can certainly wrap existing APIs, but it requires additional work: translating requests from MCP to RESTful APIs, making sure security is maintained through the entire request flow, and applying the standard observability required for production deployments.
Amazon Bedrock AgentCore Gateway now supports Amazon API Gateway as a target, translating MCP requests to AgentCore Gateway into RESTful requests to API Gateway. You can now expose both new and existing API endpoints to agentic applications using MCP, with built-in security and observability. This post covers these new capabilities and shows how to implement them.
What’s new: API Gateway support in AgentCore Gateway
AgentCore Gateway now supports API Gateway targets in addition to existing target types (Lambda functions, OpenAPI schemas, Smithy models, and MCP servers).

Our customers have successfully built extensive API ecosystems using API Gateway, connecting backends across numerous applications. As enterprises advance toward next-generation agentic applications, the natural evolution is to expose these existing APIs and backend tools to AI-powered systems, enabling seamless integration between established infrastructure and modern intelligent agents.
This integration between AgentCore Gateway and API Gateway simplifies the connection between API Gateway and AgentCore Gateway. It allows you to directly target API Gateway, so that you don’t need to export API Gateway APIs as an OpenAPI 3 specification and then add it to AgentCore Gateway as an OpenAPI target.
With this integration, a new API_GATEWAY target type will be added to AgentCore Gateway, eliminating the manual export/import process. REST API owners can add their API as an AgentCore Gateway target with a few console interactions or a single CLI command to expose their existing REST API as MCP tools using AgentCore Gateway. API consumers can then connect AI agents with these REST APIs through the Model Context Protocol (MCP) and power their workflows with AI integration. Your agentic applications can now connect to your new or existing API Gateway API. This integration between AgentCore Gateway and API Gateway supports IAM and API key authorization.

Both AgentCore Gateway and API Gateway have integrations with Amazon CloudWatch Logs, AWS CloudTrail, and AWS X-Ray for observability. Agent developers using this new capability between AgentCore Gateway and API Gateway can use these observability tools.
Walkthrough
This post shows you how to set up an existing REST API with API Gateway as a target for AgentCore Gateway. With this integration you can use your existing REST APIs as a tool for your agentic applications exposed using AgentCore Gateway.
Prerequisites
For this example, you need the following:

An AWS account with an existing REST API in API Gateway.
An Identity and Access Management (IAM) role or user with enough permissions to create an AgentCore Gateway and set up an API Gateway target.

You can create gateways and add targets in multiple ways:

AWS Management Console
AWS SDK for Python (Boto3)
AWS Command Line Interface (AWS CLI)
AgentCore starter toolkit for fast and straightforward set up

This post uses Boto3 for setting up the integration between AgentCore Gateway and API Gateway. For an interactive walkthrough, you can use the Jupyter Notebook sample on GitHub.
Set up prerequisites for inbound and outbound authorization.
Inbound authorization authenticates incoming user requests. Outbound authorization helps AgentCore Gateway to securely connect to gateway targets, such as an API Gateway, on behalf of the authenticated user.

For API Gateway as a target, AgentCore Gateway supports the following types of outbound authorization:

No authorization (not recommended) – Some target types provide you the option to bypass outbound authorization. We do not recommend this less secure option.
IAM-based outbound authorization – Use the gateway service role to authorize access to the gateway target with AWS Signature Version 4 (Sig V4).
API key – Use an API key, set up through AgentCore Identity, to authorize access to the API Gateway target. API keys created in API Gateway and mapped to API Gateway usage plans help you monitor and control API usage. Please refer to this documentation for more details.

Create an IAM role with the trust policy from the documentation.

For Outbound Authorization with IAM-based authorization, the policy should include execute-api:Invoke permission. Sample inline policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "execute-api:Invoke"
      ],
      "Resource": "arn:aws:execute-api:{AWS_Region}:{AWS_Account_ID}:api-id/stage/METHOD_HTTP_VERB/resource-path",
      "Effect": "Allow"
    }
  ]
}

For API key authorization, you can create an API key (see the API Gateway documentation) and associate it with your API Gateway usage plan. Then create an API key credential provider with AgentCore Identity.
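As a hedged sketch of that step, the call below uses the AgentCore Identity CreateApiKeyCredentialProvider operation through the bedrock-agentcore-control client. The provider name and key value are placeholders, and you should confirm the exact parameter names against the current AgentCore Identity documentation.

import boto3

identity_client = boto3.client('bedrock-agentcore-control')

# Register the API Gateway API key with AgentCore Identity (name and key are placeholders)
provider = identity_client.create_api_key_credential_provider(
    name='api-gateway-key-provider',
    apiKey='<your_api_gateway_api_key>'
)
print(provider)  # the returned provider ARN is used later as <provider-arn>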

Once done, update the policy as described in the AgentCore Gateway documentation.
Create an AgentCore Gateway
When using the AgentCore starter toolkit, you can create a gateway with a default authorization configuration using Amazon Cognito for JWT-based inbound authorization.

import boto3

gateway_client = boto3.client('bedrock-agentcore-control')

auth_config = {
    "customJWTAuthorizer": {
        "allowedClients": ['<cognito_client_id>'],  # Client MUST match the ClientId configured in Cognito. Example: 7rfbikfsm51j2fpaggacgng84g
        "discoveryUrl": '<cognito_oauth_discovery_url>'
    }
}

create_response = gateway_client.create_gateway(
    name='sample-ac-gateway',
    roleArn='<IAM_Role_ARN>',  # The IAM role must have permissions to create/list/get/delete Gateway
    protocolType='MCP',
    protocolConfiguration={
        'mcp': {
            'supportedVersions': ['2025-03-26'],
            'searchType': 'SEMANTIC'
        }
    },
    authorizerType='CUSTOM_JWT',
    authorizerConfiguration=auth_config,
    description='AgentCore Gateway with API Gateway target'
)
print(create_response)

# Retrieve the gateway ID and URL used for gateway target creation
gatewayID = create_response["gatewayId"]
gatewayURL = create_response["gatewayUrl"]
print(gatewayID)

This returns the gateway ID (gatewayId) that you will need to create the gateway target.
Create an AgentCore Gateway target

Create a target configuration
To create an API gateway target, you need to specify the following as the part of target configuration:

toolFilters: Use this to determine which resources on the REST API will be exposed as tools on the gateway. Filters also support wildcards in the filterPath.
toolOverrides (optional): Use this to allow users to override tool names and descriptions. You must specify explicit paths and methods.
restApiId: Use this to pass the API Gateway REST API ID.

Below are a few examples of target configurations:
Example 1
This exposes GET & POST /pets, GET /pets/{petId} to the gateway and overrides their tool names and descriptions.

{
  "mcp": {
    "apiGateway": {
      "restApiId": "<api-id>",
      "stage": "<stage>",
      "apiGatewayToolConfiguration": {
        "toolFilters": [
          {
            "filterPath": "/pets",
            "methods": ["GET", "POST"]
          },
          {
            "filterPath": "/pets/{petId}",
            "methods": ["GET"]
          }
        ],
        "toolOverrides": [
          {
            "name": "ListPets",
            "path": "/pets",
            "method": "GET",
            "description": "Retrieves all the available Pets."
          },
          {
            "name": "AddPet",
            "path": "/pets",
            "method": "POST",
            "description": "Add a new pet to the available Pets."
          },
          {
            "name": "GetPetById",
            "path": "/pets/{petId}",
            "method": "GET",
            "description": "Retrieve a specific pet by its ID"
          }
        ]
      }
    }
  }
}

Example 2
This will expose GET /pets as well as GET /pets/{petId} and anything else under /pets. Since toolOverrides is not specified, the tools use the resource descriptions from API Gateway.

{
  "mcp": {
    "apiGateway": {
      "restApiId": "<api-id>",
      "stage": "<stage>",
      "apiGatewayToolConfiguration": {
        "toolFilters": [
          {
            "filterPath": "/pets/*",
            "methods": ["GET"]
          }
        ]
      }
    }
  }
}

Credential provider configuration
When creating a target, you also need to specify the target’s outbound authorization using a credential provider configuration. As discussed above, there are three types of credential providers:
GATEWAY_IAM_ROLE
This uses the ROLE_ARN you specified when creating the gateway. Define the credential provider configuration as follows:

[
  {
    "credentialProviderType": "GATEWAY_IAM_ROLE"
  }
]

API_KEY
This requires the creation of an API key credential provider with AgentCore Identity.

[
  {
    "credentialProviderType": "API_KEY",
    "credentialProvider": {
      "apiKeyCredentialProvider": {
        "providerArn": "<provider-arn>",
        "credentialParameterName": "x-api-key",  // optional
        "credentialPrefix": "abc",  // optional, prefix added to the API key when sending it to the target endpoint
        "credentialLocation": "HEADER"  // optional, specifies where in the request the API key should be placed
      }
    }
  }
]

NO_AUTH
NO_AUTH can be configured by not specifying a credential provider configuration while creating the AgentCore Gateway target. This is not recommended.
Create an AgentCore Gateway target
Now configure your REST API as a gateway target:

import boto3

gateway_client = boto3.client('bedrock-agentcore-control')

create_gateway_target_response = gateway_client.create_gateway_target(
    name='api-gateway-target',
    gatewayIdentifier='<gateway_ID>',
    targetConfiguration=<your_target_configuration>,  # the target configuration from the previous section
    credentialProviderConfigurations=[<your_credential_config>]
)
print(create_gateway_target_response)

gateway_target_id = create_gateway_target_response['targetId']

Test the gateway with the Strands Agents framework
Test the gateway with the Strands Agents framework to list and call the available tools from the MCP server. You can also use other MCP-compatible agents built with different agentic frameworks.

# Import paths below are assumptions based on the Strands Agents SDK and the MCP Python SDK
from mcp.client.streamable_http import streamablehttp_client
from strands import Agent
from strands.tools.mcp import MCPClient

def create_streamable_http_transport():
    # gatewayURL comes from create_gateway above; bearer_token is the inbound auth JWT from your Cognito client
    return streamablehttp_client(
        gatewayURL, headers={"Authorization": f"Bearer {bearer_token}"}
    )

client = MCPClient(create_streamable_http_transport)

with client:
    # Call listTools
    tools = client.list_tools_sync()
    # Create an Agent with the model and tools
    agent = Agent(model=yourModel, tools=tools)  # you can replace with any model you like
    # Invoke the agent with the sample prompt. This only invokes MCP listTools and retrieves the list of tools the LLM has access to. It does not actually call any tool.
    agent("Hi, can you list all tools available to you")
    # Tool calling
    agent("List all the available pets")
    agent("Tell me about the pet with petId 3")
    agent("When will my order be delivered? My order id is 2")

You will observe the following output:

I have access to the following tools:
1. **x_amz_bedrock_agentcore_search** – A search tool that returns a trimmed down list of tools based on a provided context/query
2. **api-gateway-target-1___Add_Pet** – Add a new pet to the available Pets
3. **api-gateway-target-1___GetPetById** – Retrieve a specific pet by its ID (requires petId parameter)
4. **api-gateway-target-1___List_Pets** – Retrieves all the available Pets (optional parameters: page, type)
5. **api-gateway-target-2___GetOrderById** – Retrieve a specific order by its ID (requires orderId parameter)
I’ll retrieve all the available pets for you.
Tool #1: api-gateway-target-1___List_Pets
“HTTP/1.1 200 OK”
Here are all the available pets:
1. **Pet ID 1** – Dog – $249.99
2. **Pet ID 2** – Cat – $124.99
3. **Pet ID 3** – Fish – $0.99
I’ll retrieve the details for pet ID 3.
Tool #2: api-gateway-target-1___GetPetById
“HTTP/1.1 200 OK”
Here are the details for pet ID 3:
– **Pet ID**: 3
– **Type**: Fish
– **Price**: $0.99
I’ll check the details of your order with ID 2 to see the delivery information.
Tool #3: api-gateway-target-2___GetOrderById
“HTTP/1.1 200 OK”
Based on your order details:
– **Order ID**: 2
– **Pet Category**: Cat
– **Price**: $124.99
– **Delivery Date**: 02-12-2025 (December 2nd, 2025)
Your cat order will be delivered on **December 2nd, 2025**.

Observability
Enable application logs and tracing for your AgentCore Gateway resource. The detailed logs help you monitor and troubleshoot the gateway, and include the tool calls performed by your agentic application, request parameters, responses, and any errors.
Example logs:

{
  "resource_arn": "arn:aws:bedrock-agentcore:us-west-2:<AWS_Account_Id>:gateway/sample-ac-gateway2-mgtqozexct",
  "event_timestamp": 1763621922275,
  "body": {
    "isError": false,
    "log": "Executing tool api-gateway-target-1___GetPetById from target W8BCF5VEAZ",
    "id": "3"
  },
  "account_id": "<AWS_Account_Id>",
  "request_id": "8a70f423-79ee-4168-9d68-b76ad3*****",
  "trace_id": "324a2ecc08631a55a02bb8f74104****",
  "span_id": "f58914982450ad9b",
  "timestamp": "1763621922275",
  "gateway_id": "sample-ac-gateway2-mgtqozexct"
}
{
  "resource_arn": "arn:aws:bedrock-agentcore:us-west-2:<AWS_Account_Id>:gateway/sample-ac-gateway2-mgtqozexct",
  "event_timestamp": 1763621922348,
  "body": {
    "isError": false,
    "responseBody": "{jsonrpc=2.0, id=3, result={isError=false, content=[{type=text, text={"id":3,"type":"fish","price":0.99}}]}}",
    "log": "Successfully processed request",
    "id": "3"
  },
  "account_id": "<AWS_Account_Id>",
  "request_id": "8a70f423-79ee-4168-9d68-b76ad3ef****",
  "trace_id": "324a2ecc08631a55a02bb8f7410****",
  "span_id": "f58914982450ad9b",
  "timestamp": "1763621922348",
  "gateway_id": "sample-ac-gateway2-mgtqozexct"
}

Along with this, AgentCore Gateway offers detailed CloudWatch metrics including the usage metrics (TargetType, IngressAuthType, EgressAuthType, RequestsPerSession), invocation metrics (Invocations, ConcurrentExecutions, Sessions), performance metrics (Latency, Duration, TargetExecutionTime), and error rates (Throttles, SystemErrors, UserErrors).

AgentCore Gateway also supports AWS X-Ray and OTEL conformant vended spans that customers can use to track invocations across different primitives that are being used.

To learn more, see the AgentCore Gateway Observability documentation.
Clean up
To avoid recurring charges, make sure to delete the resources created in this walkthrough by running the following code.

import boto3

gateway_client = boto3.client('bedrock-agentcore-control')

# Deleting the gateway target
target_response = gateway_client.delete_gateway_target(
    gatewayIdentifier='<Gateway_Id>',
    targetId='<Target_Id>'
)
print(target_response)

# Deleting the gateway
response = gateway_client.delete_gateway(
    gatewayIdentifier='<Gateway_Id>'
)
print(response)

Conclusion
AgentCore Gateway now supports Amazon API Gateway as a target, exposing REST APIs as MCP-compatible endpoints. You can bring your existing API infrastructure to agentic use cases while using your current security and observability tools.
Visit our developer documentation and workshop to learn more and get started today.

About the authors
With over six years at AWS, Sparsh Wadhwa brings deep expertise in serverless, event-driven architectures, and Generative AI to his work with ISV customers in India. As a Solutions Architect, he partners with Independent Software Vendors to reimagine their products for the cloud era—from modernizing legacy systems to embedding AI capabilities that differentiate their offerings. Sparsh believes the best solutions emerge from understanding both technical possibilities and business context.
Heeki Park is a Principal Solutions Architect at AWS. In his 9+ years at AWS, he helped enterprise customers think about how to build and operate cloud-native applications, adopt serverless and event-driven patterns, and build pragmatic generative AI applications. Heeki is an avid runner and enjoys analyzing activity data to measure improvement in cardiovascular fitness.
Dhawal Patel is a Principal Generative AI Tech lead at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to agentic AI, deep learning, and distributed computing.

Google Colab Integrates KaggleHub for One Click Access to Kaggle Datasets, Models and Competitions

Google is closing an old gap between Kaggle and Colab. Colab now has a built in Data Explorer that lets you search Kaggle datasets, models and competitions directly inside a notebook, then pull them in through KaggleHub without leaving the editor.

What does Colab Data Explorer actually ship?

Kaggle recently announced the feature, describing a panel in the Colab notebook editor that connects to Kaggle search.

From this panel you can:

Search Kaggle datasets, models and competitions

Access the feature from the left toolbar in Colab

Use integrated filters to refine the results, for example by resource type or relevance

The Colab Data Explorer lets you search Kaggle datasets, models and competitions directly from a Colab notebook, refine the results with integrated filters, and import data with a KaggleHub code snippet.

The old Kaggle to Colab pipeline was all setup work

Before this launch, most workflows that pulled Kaggle data into Colab followed a fixed sequence.

You created a Kaggle account, generated an API token, downloaded the kaggle.json credentials file, uploaded that file into the Colab runtime, set environment variables and then used the Kaggle API or command line interface to download datasets.
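For reference, a minimal sketch of that older workflow looks like the following; the dataset slug and file name are placeholders.

import os, json

# Load the kaggle.json credentials uploaded to the Colab runtime
with open("kaggle.json") as f:
    creds = json.load(f)
os.environ["KAGGLE_USERNAME"] = creds["username"]
os.environ["KAGGLE_KEY"] = creds["key"]

# Download and unzip a dataset with the Kaggle CLI, then read it with pandas
!kaggle datasets download -d some-owner/some-dataset --unzip -p data/

import pandas as pd
df = pd.read_csv("data/some_file.csv")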

The steps were well documented and reliable. They were also mechanical and easy to misconfigure, especially for beginners who had to debug missing credentials or incorrect paths before they could even run pandas.read_csv on a file. Many tutorials exist only to explain this setup.

Colab Data Explorer does not remove the need for Kaggle credentials. It changes how you reach Kaggle resources and how much code you must write before you can start analysis.

KaggleHub is the integration layer

KaggleHub is a Python library that provides a simple interface to Kaggle datasets, models and notebook outputs from Python environments.

The key properties, which matter for Colab users, are:

KaggleHub works in Kaggle notebooks and in external environments such as local Python and Colab

It authenticates using existing Kaggle API credentials when needed

It exposes resource centric functions such as model_download and dataset_download which take Kaggle identifiers and return paths or objects in the current environment

Colab Data Explorer uses this library as the loading mechanism. When you select a dataset or model in the panel, Colab shows a KaggleHub code snippet that you run inside the notebook to access that resource.

Once the snippet runs, the data is available in the Colab runtime. You can then read it with pandas, train models with PyTorch or TensorFlow or plug it into evaluation code, just as you would with any local files or data objects.
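A typical snippet looks like the following hedged example; the dataset handle and file name are placeholders, and Data Explorer generates the exact snippet for whichever resource you select.

import os
import kagglehub
import pandas as pd

# Download a dataset by its Kaggle handle and get a local path in the Colab runtime
path = kagglehub.dataset_download("some-owner/some-dataset")
print("Files downloaded to:", path)

df = pd.read_csv(os.path.join(path, "some_file.csv"))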


Cisco Released Cisco Time Series Model: Their First Open-Weights Foundation Model based on Decoder-only Transformer Architecture

Cisco and Splunk have introduced the Cisco Time Series Model, a univariate zero shot time series foundation model designed for observability and security metrics. It is released as an open weight checkpoint on Hugging Face under an Apache 2.0 license, and it targets forecasting workloads without task specific fine tuning. The model extends TimesFM 2.0 with an explicit multiresolution architecture that fuses coarse and fine history in one context window.

https://arxiv.org/pdf/2511.19841

Why observability needs multiresolution context?

Production metrics are not simple single scale signals. Weekly patterns, long term growth and saturation are visible only at coarse resolutions. Saturation events, traffic spikes and incident dynamics show up at 1 minute or 5 minute resolution. Common time series foundation models work at a single resolution with context windows between 512 and 4096 points, while TimesFM 2.5 extends this to 16384 points. For 1 minute data this still covers at most a couple of weeks and often less.

This is a problem in observability, where data platforms often retain older data only in aggregated form. Fine grained samples expire and survive only as 1 hour rollups. Cisco Time Series Model is built for this storage pattern. It treats coarse history as a first class input that improves forecasts at the fine resolution. The architecture operates directly on a multiresolution context instead of pretending that all inputs live on a single grid.

https://arxiv.org/pdf/2511.19841

Multiresolution input and forecasting objective

Formally, the model consumes a pair of contexts (x_c, x_f). The coarse context x_c and the fine context x_f each have length up to 512. The spacing of x_c is fixed at 60 times the spacing of x_f. A typical observability setup uses 512 hours of 1 hour aggregates and 512 minutes of 1 minute values. Both series terminate at the same forecast cut point. The model predicts a horizon of 128 points at the fine resolution, with a mean and a set of quantiles from 0.1 to 0.9.
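As a hedged sketch of how such a context pair can be assembled from a single 1 minute metric series, the helper below takes the last 512 raw minutes as the fine context and hourly means of the preceding history as the coarse context. Mean aggregation and the exact alignment at the cut point are assumptions, since the paper describes the interface rather than a preprocessing recipe.

import numpy as np

def build_multires_context(minute_series, cut, ratio=60, length=512):
    # Fine context: the most recent `length` one-minute samples before the forecast cut
    fine = minute_series[cut - length:cut]
    # Coarse context: hourly means over the history up to the last complete hour
    n_whole = cut - (cut % ratio)
    hourly = minute_series[:n_whole].reshape(-1, ratio).mean(axis=1)
    coarse = hourly[-length:]
    return coarse, fine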

Architecture, TimesFM core with resolution embeddings

Internally, Cisco Time Series Model reuses the TimesFM patch based decoder stack. The inputs are normalized, patched into non overlapping chunks, and passed through a residual embedding block. The transformer core consists of 50 decoder only layers. A final residual block maps tokens back to the horizon. The research team removes positional embeddings and instead relies on patch ordering, the multiresolution structure and a new resolution embedding to encode structure.

Two additions make the architecture multiresolution aware. A special token, often called ST in the report, is inserted between the coarse and fine token streams. It lives in sequence space and marks the boundary between resolutions. Resolution embeddings, often called RE, are added in model space. One embedding vector is used for all coarse tokens and another for all fine tokens. Ablation studies in the paper show that both components improve quality, especially in long context scenarios.

The decode procedure is also multiresolution. The model outputs mean and quantile forecasts for the fine resolution horizon. During long horizon decoding, newly predicted fine points are appended to the fine context. Aggregates of these predictions update the coarse context. This creates an autoregressive loop in which both resolutions evolve together during forecasting.
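A hedged sketch of that decode loop is shown below. Here model_step is a hypothetical stand-in for one forward pass of the model, and mean aggregation of new fine points into coarse points is an assumption based on the description above.

import numpy as np

def decode_long_horizon(model_step, coarse, fine, n_steps, horizon=128, ratio=60):
    x_c, x_f, buffer, preds = list(coarse), list(fine), [], []
    while len(preds) < n_steps:
        # One forward pass over the current multiresolution context (last 512 points each)
        mean, _quantiles = model_step(np.array(x_c[-512:]), np.array(x_f[-512:]))
        new = list(mean[:horizon])
        preds.extend(new)
        x_f.extend(new)          # append predictions to the fine context
        buffer.extend(new)
        while len(buffer) >= ratio:
            # Roll each completed block of `ratio` fine predictions into a new coarse point
            x_c.append(float(np.mean(buffer[:ratio])))
            buffer = buffer[ratio:]
    return np.array(preds[:n_steps])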

https://arxiv.org/pdf/2511.19841

Training data and recipe

Cisco Time Series Model is trained by continued pretraining on top of TimesFM weights. The final model has 500 million parameters. Training uses AdamW for biases, norms and embeddings, and Muon for the hidden layers, with cosine learning rate schedules. The loss combines mean squared error on the mean forecast with quantile loss over the quantiles from 0.1 to 0.9. The team trains for 20 epochs and picks the best checkpoint by validation loss.
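A minimal sketch of that objective, assuming the standard pinball form for the quantile term, looks like the following; it is illustrative only, not the released training code.

import jax.numpy as jnp

QUANTILES = jnp.arange(0.1, 1.0, 0.1)   # quantile levels 0.1 .. 0.9

def pinball(y_true, y_q, q):
    # Standard quantile (pinball) loss for a single quantile level q
    diff = y_true - y_q
    return jnp.mean(jnp.maximum(q * diff, (q - 1.0) * diff))

def forecast_loss(y_true, mean_pred, quantile_preds):
    # quantile_preds has shape (9, horizon): one row per quantile level
    mse = jnp.mean((y_true - mean_pred) ** 2)
    ql = sum(pinball(y_true, quantile_preds[i], q) for i, q in enumerate(QUANTILES))
    return mse + ql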

The dataset is large and skewed toward observability. The Splunk team reports about 400 million metrics time series from their own Splunk Observability Cloud deployments, collected at 1 minute resolution over 13 months and partly aggregated to 5 minute resolution. The research team states that the final corpus contains more than 300 billion unique data points, with about 35 percent 1 minute observability, 16.5 percent 5 minute observability, 29.5 percent GIFT Eval pretraining data, 4.5 percent Chronos datasets and 14.5 percent synthetic KernelSynth series.

Benchmark results on observability and GIFT Eval

The research team evaluates the model on two main benchmarks. The first is an observability dataset derived from Splunk metrics at 1 minute and 5 minute resolution. The second is a filtered version of GIFT Eval, where datasets that leak TimesFM 2.0 training data are removed.

On observability data at 1 minute resolution with 512 fine steps, Cisco Time Series Model using a 512 multiresolution context reduces mean absolute error from 0.6265 for TimesFM 2.5 and 0.6315 for TimesFM 2.0 to 0.4788, with similar improvements in mean absolute scaled error and continuous ranked probability score. Similar gains appear at 5 minute resolution. Across both resolutions, the model outperforms Chronos 2, Chronos Bolt, Toto and AutoARIMA baselines under the normalized metrics used in the paper.

On the filtered GIFT Eval benchmark, Cisco Time Series Model matches the base TimesFM 2.0 model and performs competitively with TimesFM-2.5, Chronos-2 and Toto. The key claim is not universal dominance but preservation of general forecasting quality while adding a strong advantage on long context windows and observability workloads.

https://arxiv.org/pdf/2511.19841

Key Takeaways

Cisco Time Series Model is a univariate zero shot time series foundation model that extends the TimesFM 2.0 decoder only backbone with a multiresolution architecture for observability and security metrics.

The model consumes a multiresolution context, with a coarse series and a fine series, each up to 512 steps long, where the coarse resolution is 60 times the fine resolution, and it predicts 128 fine resolution steps with mean and quantile outputs.

Cisco Time Series Model is trained on more than 300B data points, with more than half from observability, mixing Splunk machine data, GIFT Eval, Chronos datasets and synthetic KernelSynth series, and it has about 0.5B parameters.

On observability benchmarks at 1 minute and 5 minute resolutions, the model achieves lower error than TimesFM 2.0, Chronos and other baselines, while retaining competitive performance on the general purpose GIFT Eval benchmark.


A Coding Implementation of a Complete Hierarchical Bayesian Regression …

In this tutorial, we explore hierarchical Bayesian regression with NumPyro and walk through the entire workflow in a structured manner. We start by generating synthetic data, then we define a probabilistic model that captures both global patterns and group-level variations. Through each snippet, we set up inference using NUTS, analyze posterior distributions, and perform posterior predictive checks to understand how well our model captures the underlying structure. By approaching the tutorial step by step, we build an intuitive understanding of how NumPyro enables flexible, scalable Bayesian modeling. Check out the Full Codes here.

try:
    import numpyro
except ImportError:
    !pip install -q "llvmlite>=0.45.1" "numpyro[cpu]" matplotlib pandas

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import jax
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS, Predictive
from numpyro.diagnostics import hpdi

numpyro.set_host_device_count(1)

We set up our environment by installing NumPyro and importing all required libraries. We prepare JAX, NumPyro, and plotting tools so we have everything ready for Bayesian inference. As we run this cell, we ensure our Colab session is fully equipped for hierarchical modeling. Check out the Full Codes here.

def generate_data(key, n_groups=8, n_per_group=40):
    k1, k2, k3, k4 = random.split(key, 4)
    true_alpha = 1.0
    true_beta = 0.6
    sigma_alpha_g = 0.8
    sigma_beta_g = 0.5
    sigma_eps = 0.7
    group_ids = np.repeat(np.arange(n_groups), n_per_group)
    n = n_groups * n_per_group
    alpha_g = random.normal(k1, (n_groups,)) * sigma_alpha_g
    beta_g = random.normal(k2, (n_groups,)) * sigma_beta_g
    x = random.normal(k3, (n,)) * 2.0
    eps = random.normal(k4, (n,)) * sigma_eps
    a = true_alpha + alpha_g[group_ids]
    b = true_beta + beta_g[group_ids]
    y = a + b * x + eps
    df = pd.DataFrame({"y": np.array(y), "x": np.array(x), "group": group_ids})
    truth = dict(true_alpha=true_alpha, true_beta=true_beta,
                 sigma_alpha_group=sigma_alpha_g, sigma_beta_group=sigma_beta_g,
                 sigma_eps=sigma_eps)
    return df, truth

key = random.PRNGKey(0)
df, truth = generate_data(key)
x = jnp.array(df["x"].values)
y = jnp.array(df["y"].values)
groups = jnp.array(df["group"].values)
n_groups = int(df["group"].nunique())

We generate synthetic hierarchical data that mimics real-world group-level variation. We convert this data into JAX-friendly arrays so NumPyro can process it efficiently. By doing this, we lay the foundation for fitting a model that learns both global trends and group differences. Check out the Full Codes here.

def hierarchical_regression_model(x, group_idx, n_groups, y=None):
    mu_alpha = numpyro.sample("mu_alpha", dist.Normal(0.0, 5.0))
    mu_beta = numpyro.sample("mu_beta", dist.Normal(0.0, 5.0))
    sigma_alpha = numpyro.sample("sigma_alpha", dist.HalfCauchy(2.0))
    sigma_beta = numpyro.sample("sigma_beta", dist.HalfCauchy(2.0))
    with numpyro.plate("group", n_groups):
        alpha_g = numpyro.sample("alpha_g", dist.Normal(mu_alpha, sigma_alpha))
        beta_g = numpyro.sample("beta_g", dist.Normal(mu_beta, sigma_beta))
    sigma_obs = numpyro.sample("sigma_obs", dist.Exponential(1.0))
    alpha = alpha_g[group_idx]
    beta = beta_g[group_idx]
    mean = alpha + beta * x
    with numpyro.plate("data", x.shape[0]):
        numpyro.sample("y", dist.Normal(mean, sigma_obs), obs=y)

nuts = NUTS(hierarchical_regression_model, target_accept_prob=0.9)
mcmc = MCMC(nuts, num_warmup=1000, num_samples=1000, num_chains=1, progress_bar=True)
mcmc.run(random.PRNGKey(1), x=x, group_idx=groups, n_groups=n_groups, y=y)
samples = mcmc.get_samples()

We define our hierarchical regression model and launch the NUTS-based MCMC sampler. We allow NumPyro to explore the posterior space and learn parameters such as group intercepts and slopes. As this sampling completes, we obtain rich posterior distributions that reflect uncertainty at every level. Check out the Full Codes here.

def param_summary(arr):
    arr = np.asarray(arr)
    mean = arr.mean()
    lo, hi = hpdi(arr, prob=0.9)
    return mean, float(lo), float(hi)

for name in ["mu_alpha", "mu_beta", "sigma_alpha", "sigma_beta", "sigma_obs"]:
    m, lo, hi = param_summary(samples[name])
    print(f"{name}: mean={m:.3f}, HPDI=[{lo:.3f}, {hi:.3f}]")

predictive = Predictive(hierarchical_regression_model, samples, return_sites=["y"])
ppc = predictive(random.PRNGKey(2), x=x, group_idx=groups, n_groups=n_groups)
y_rep = np.asarray(ppc["y"])

group_to_plot = 0
mask = df["group"].values == group_to_plot
x_g = df.loc[mask, "x"].values
y_g = df.loc[mask, "y"].values
y_rep_g = y_rep[:, mask]

order = np.argsort(x_g)
x_sorted = x_g[order]
y_rep_sorted = y_rep_g[:, order]
y_med = np.median(y_rep_sorted, axis=0)
y_lo, y_hi = np.percentile(y_rep_sorted, [5, 95], axis=0)

plt.figure(figsize=(8, 5))
plt.scatter(x_g, y_g)
plt.plot(x_sorted, y_med)
plt.fill_between(x_sorted, y_lo, y_hi, alpha=0.3)
plt.show()

We analyze our posterior samples by computing summaries and performing posterior predictive checks. We visualize how well the model recreates observed data for a selected group. This step helps us understand how accurately our model captures the underlying generative process. Check out the Full Codes here.

alpha_g = np.asarray(samples["alpha_g"]).mean(axis=0)
beta_g = np.asarray(samples["beta_g"]).mean(axis=0)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].bar(range(n_groups), alpha_g)
axes[0].axhline(truth["true_alpha"], linestyle="--")
axes[1].bar(range(n_groups), beta_g)
axes[1].axhline(truth["true_beta"], linestyle="--")
plt.tight_layout()
plt.show()

We plot the estimated group-level intercepts and slopes to compare their learned patterns with the true values. We explore how each group behaves and how the model adapts to their differences. This final visualization brings together the complete picture of hierarchical inference.

In conclusion, we demonstrated how NumPyro allows us to model hierarchical relationships with clarity, efficiency, and strong expressive power. We observed how the posterior results reveal meaningful global and group-specific effects, and how predictive checks validate the model’s fit to the generated data. As we put everything together, we gain confidence in constructing, fitting, and interpreting hierarchical models using JAX-powered inference. This process strengthens our ability to apply Bayesian thinking to richer, more realistic datasets where multilevel structure is essential.

Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post A Coding Implementation of a Complete Hierarchical Bayesian Regression Workflow in NumPyro Using JAX-Powered Inference and Posterior Predictive Analysis appeared first on MarkTechPost.

OpenAGI Foundation Launches Lux: A Foundation Computer Use Model that …

How do you turn slow, manual click work across browsers and desktops into a reliable, automated system that can actually use a computer for you at scale? Lux is the latest example of computer use agents moving from research demo to infrastructure. The OpenAGI Foundation team has released Lux, a foundation model that operates real desktops and browsers and reports a score of 83.6 on the Online Mind2Web benchmark, which covers more than 300 real world computer use tasks. This puts it ahead of Google Gemini CUA at 69.0, OpenAI Operator at 61.3 and Anthropic Claude Sonnet 4 at 61.0.

https://agiopen.org/blog

What Lux Actually Does?

Lux is a computer use model, not a chat model with a browser plugin. It takes a natural language goal, views the screen, and outputs low level actions such as clicks, key presses and scroll events. It can drive browsers, editors, spreadsheets, email clients and other desktop applications because it works on rendered UI, not on application specific APIs.

From a developer point of view, Lux is available through the OpenAGI SDK and API console. The research team describes target workloads that include software QA flows, deep research runs, social media management, online store operations and bulk data entry. In all of these settings the agent needs to sequence dozens or hundreds of UI actions while staying aligned with a natural language task description.

https://agiopen.org/blog

Three Execution Modes For Different Control Levels

Lux ships with three execution modes that expose different tradeoffs between speed, autonomy and control.

Actor mode is the fast path. It runs around 1 second per step and is aimed at clearly specified tasks such as filling a form, pulling a report from a dashboard or extracting a small set of fields from a page. Think of it as a low latency macro engine that still understands natural language.

Thinker mode handles vague or multi step goals. It decomposes the high level instruction into smaller sub tasks and then executes them. Example workloads include multi page research, triage of long email queues or navigation of analytics interfaces where the exact click path is not specified in advance.

Tasker mode gives maximum determinism. The caller supplies an explicit Python list of steps that Lux executes one by one and it retries until the sequence completes or hits a hard failure. This allows teams to keep task graphs, guardrails and failure policies in their own code while delegating UI control to the model.

Tasker, Actor and Thinker are the three primary modes for procedural workflows, fast execution and complex goal solving.
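The article does not document the SDK surface, so the snippet below is a purely hypothetical sketch of what a Tasker-style call could look like; the lux_sdk module, the Tasker class, its constructor arguments and the run method are invented names used only to illustrate the idea of handing the model an explicit Python list of steps while keeping retries and guardrails in the caller's code.

# Hypothetical illustration only: `lux_sdk`, `Tasker` and `run` are invented names,
# not the real OpenAGI SDK API.
from lux_sdk import Tasker  # assumption: the SDK exposes a Tasker entry point

steps = [
    "Open the admin dashboard in the browser",
    "Filter orders by status = 'pending'",
    "Export the filtered table as CSV",
    "Attach the CSV to a new email draft addressed to ops@example.com",
]

tasker = Tasker(max_retries=3)   # retry and failure policy stay in caller code
result = tasker.run(steps)       # Lux executes the steps one by one
print(result.status, result.failed_step)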

Benchmarks, Latency And Cost

On Online Mind2Web, Lux reaches a success rate of 83.6 percent. The same benchmark reports 69.0 percent for Gemini CUA, 61.3 percent for OpenAI Operator and 61.0 percent for Claude Sonnet 4. The benchmark contains more than 300 web based tasks collected from real services, so it is a useful proxy for practical agents that drive browsers and web apps.

Latency and cost are where the numbers become important for engineering teams. The OpenAGI team reports that Lux completes each step in about 1 second, while OpenAI Operator takes around 3 seconds per step in the same evaluation setting. The research team also states that Lux is about 10 times cheaper per token than Operator. For any agent that can easily run hundreds of steps in a session, these constant factors determine whether a workload is viable in production.

Agentic Active Pre-training and Why OSGym Matters?

Lux is trained with a method that OpenAGI research team calls Agentic Active Pre-training. The team contrasts this with standard language model pre-training that passively ingests text from the internet. The idea is that Lux learns by acting in digital environments and refining its behavior through large scale interaction, rather than only minimizing token prediction loss on static logs. The optimization objective differs from classical reinforcement learning, and is set up to favor self driven exploration and understanding instead of a manually shaped reward.

This training setup depends on a data engine that can expose many operating system environments in parallel. OpenAGI team has already open sourced that engine as OSGym, under an MIT license that allows both research and commercial use. OSGym runs full operating system replicas, not only browser sandboxes, and supports tasks that span office software, browsers, development tools and multi application workflows.

Key Takeaways

Lux is a foundation computer use model that operates full desktops and browsers and reaches 83.6 percent success on the Online Mind2Web benchmark, ahead of Gemini CUA, OpenAI Operator and Claude Sonnet-4.

Lux exposes 3 modes, Actor, Thinker and Tasker, which cover low latency UI macros, multi step goal decomposition and deterministic scripted execution for production workflows.

Lux is reported to run around 1 second per step and to be about 10 times cheaper per token than OpenAI Operator, which matters for long horizon agents that run hundreds of actions per task.

Lux is trained with Agentic Active Pre-training, where the model learns by acting in environments, rather than only consuming static web text, which targets robust screen to action behavior instead of pure language modeling.

OSGym, the open source data engine behind Lux, can run more than 1,000 OS replicas and generate more than 1,400 multi turn trajectories per minute at low per replica cost, which gives teams a practical way to train and evaluate their own computer use agents.

Check out the Official Announcement, Project and Repo. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post OpenAGI Foundation Launches Lux: A Foundation Computer Use Model that Tops Online Mind2Web with OSGym At Scale appeared first on MarkTechPost.

Kernel Principal Component Analysis (PCA): Explained with an Example

Dimensionality reduction techniques like PCA work wonderfully when datasets are linearly separable—but they break down the moment nonlinear patterns appear. That’s exactly what happens with datasets such as two moons: PCA flattens the structure and mixes the classes together. 

Kernel PCA fixes this limitation by mapping the data into a higher-dimensional feature space where nonlinear patterns become linearly separable. In this article, we’ll walk through how Kernel PCA works and use a simple example to visually compare PCA vs. Kernel PCA, showing how a nonlinear dataset that PCA fails to separate becomes perfectly separable after applying Kernel PCA.

What is PCA and how is it different from Kernel PCA?

Principal Component Analysis (PCA) is a linear dimensionality-reduction technique that identifies the directions (principal components) along which the data varies the most. It works by computing orthogonal linear combinations of the original features and projecting the dataset onto the directions of maximum variance. 

These components are uncorrelated and ordered so that the first few capture most of the information in the data. PCA is powerful, but it comes with one important limitation: it can only uncover linear relationships in the data. When applied to nonlinear datasets—like the “two moons” example—it often fails to separate the underlying structure.

Kernel PCA extends PCA to handle nonlinear relationships. Instead of directly applying PCA in the original feature space, Kernel PCA first uses a kernel function (such as RBF, polynomial, or sigmoid) to implicitly project the data into a higher-dimensional feature space where the nonlinear structure becomes linearly separable. 

PCA is then performed in this transformed space using a kernel matrix, without explicitly computing the higher-dimensional projection. This “kernel trick” allows Kernel PCA to capture complex patterns that standard PCA cannot.
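To make the kernel trick concrete, the short sketch below computes the RBF kernel matrix explicitly with scikit-learn and double-centers it; this centered n x n matrix is what Kernel PCA eigendecomposes instead of a covariance matrix of explicit high-dimensional features. It is an illustrative sketch only, since sklearn's KernelPCA performs these steps internally.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def centered_kernel_matrix(X, gamma=15):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2), pairwise similarities in the implicit feature space
    K = rbf_kernel(X, X, gamma=gamma)
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    # Double-centering so the implicit features have zero mean in feature space
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Kernel PCA eigendecomposes this matrix; the top eigenvectors (scaled by
# 1/sqrt(eigenvalue)) give the projections plotted later in this article.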

We will now create a dataset that is nonlinear and then apply PCA to the dataset.

Code Implementation

Generating the dataset

We generate a nonlinear “two moons” dataset using make_moons, which is ideal for demonstrating why PCA fails and Kernel PCA succeeds.

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.02, random_state=123)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Applying PCA on the dataset

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.title("PCA")
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()

The PCA visualization shows that the two moon-shaped clusters remain intertwined even after dimensionality reduction. This happens because PCA is a strictly linear technique—it can only rotate, scale, or flatten the data along straight directions of maximum variance. 

Since the “two moons” dataset has a nonlinear structure, PCA is unable to separate the classes or untangle the curved shapes. As a result, the transformed data still looks almost identical to the original pattern, and the two classes remain overlapped in the projected space.

Applying Kernel PCA on the dataset

We now apply Kernel PCA using an RBF kernel. The kernel function implicitly projects the nonlinear dataset into a higher-dimensional feature space, and in that kernel space the two classes become linearly separable.

from sklearn.decomposition import KernelPCA

kpca = KernelPCA(kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)

plt.title("Kernel PCA")
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y)
plt.show()

The goal of PCA (and dimensionality reduction in general) is not just to compress the data—it’s to reveal the underlying structure in a way that preserves meaningful variation. In nonlinear datasets like the two-moons example, traditional PCA cannot “unfold” the curved shapes because it only applies linear transformations.

Kernel PCA, however, performs a nonlinear mapping before applying PCA, allowing the algorithm to untangle the moons into two clearly separated clusters. This separation is valuable because it makes downstream tasks like visualization, clustering, and even classification far more effective. When the data becomes linearly separable after transformation, simple models—such as linear classifiers—can successfully distinguish between the classes, something that would be impossible in the original or PCA-transformed space.

Challenges involved with Kernel PCA

While Kernel PCA is powerful for handling nonlinear datasets, it comes with several practical challenges. The biggest drawback is computational cost—because it relies on computing pairwise similarities between all data points, the algorithm has O(n²) time and memory complexity, making it slow and memory-heavy for large datasets. 

Another challenge is model selection: choosing the right kernel (RBF, polynomial, etc.) and tuning parameters like gamma can be tricky and often requires experimentation or domain expertise. 

Kernel PCA can also be harder to interpret, since the transformed components no longer correspond to intuitive directions in the original feature space. Finally, it is sensitive to missing values and outliers, which can distort the kernel matrix and degrade performance.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Kernel Principal Component Analysis (PCA): Explained with an Example appeared first on MarkTechPost.

How to Design a Fully Local Multi-Agent Orchestration System Using Tin …

In this tutorial, we explore how we can orchestrate a team of specialized AI agents locally using an efficient manager-agent architecture powered by TinyLlama. We walk through how we build structured task decomposition, inter-agent collaboration, and autonomous reasoning loops without relying on any external APIs. By running everything directly through the transformers library, we create a fully offline, lightweight, and transparent multi-agent system that we can customize, inspect, and extend. Through the snippets, we observe how each component, from task structures to agent prompts to result synthesis, comes together to form a coherent human-AI workflow that we control end-to-end. Check out the FULL CODES here.

!pip install transformers torch accelerate bitsandbytes -q

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import json
import re
from typing import List, Dict, Any
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class Task:
    id: str
    description: str
    assigned_to: str = None
    status: str = "pending"
    result: Any = None
    dependencies: List[str] = None

    def __post_init__(self):
        if self.dependencies is None:
            self.dependencies = []

@dataclass
class Agent:
    name: str
    role: str
    expertise: str
    system_prompt: str

We set up all the core imports and define the fundamental data structures needed to manage tasks and agents. We define Task and Agent as structured entities to cleanly orchestrate work. By doing this, we ensure that every part of the system has a consistent and reliable foundation. Check out the FULL CODES here.

AGENT_REGISTRY = {
    "researcher": Agent(
        name="researcher",
        role="Research Specialist",
        expertise="Information gathering, analysis, and synthesis",
        system_prompt="You are a research specialist. Provide thorough research on topics."
    ),
    "coder": Agent(
        name="coder",
        role="Software Engineer",
        expertise="Writing clean, efficient code with best practices",
        system_prompt="You are an expert programmer. Write clean, well-documented code."
    ),
    "writer": Agent(
        name="writer",
        role="Content Writer",
        expertise="Clear communication and documentation",
        system_prompt="You are a professional writer. Create clear, engaging content."
    ),
    "analyst": Agent(
        name="analyst",
        role="Data Analyst",
        expertise="Data interpretation and insights",
        system_prompt="You are a data analyst. Provide clear insights from data."
    )
}

class LocalLLM:
    def __init__(self, model_name: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Quantize to 4-bit only when a GPU is available
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16
        ) if torch.cuda.is_available() else None
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quantization_config,
            device_map="auto",
            low_cpu_mem_usage=True
        )
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def generate(self, prompt: str, max_tokens: int = 300) -> str:
        # TinyLlama chat template: system / user / assistant segments
        formatted_prompt = f"<|system|>\nYou are a helpful AI assistant.</s>\n<|user|>\n{prompt}</s>\n<|assistant|>\n"
        inputs = self.tokenizer(
            formatted_prompt,
            return_tensors="pt",
            truncation=True,
            max_length=1024,
            padding=True
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=0.7,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.tokenizer.pad_token_id,
                eos_token_id=self.tokenizer.eos_token_id,
                use_cache=True
            )
        full_response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Keep only the assistant's reply
        if "<|assistant|>" in full_response:
            return full_response.split("<|assistant|>")[-1].strip()
        return full_response[len(formatted_prompt):].strip()

We register all our specialized agents and implement the local LLM wrapper that powers the system. We load TinyLlama in an efficient 4-bit mode so we can run everything smoothly on Colab or local hardware. With this, we give ourselves a flexible and fully local way to generate responses for each agent. Check out the FULL CODES here.

class ManagerAgent:
    def __init__(self, model_name: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
        self.llm = LocalLLM(model_name)
        self.agents = AGENT_REGISTRY
        self.tasks: Dict[str, Task] = {}
        self.execution_log = []

    def log(self, message: str):
        timestamp = datetime.now().strftime("%H:%M:%S")
        log_entry = f"[{timestamp}] {message}"
        self.execution_log.append(log_entry)
        print(log_entry)

    def decompose_goal(self, goal: str) -> List[Task]:
        self.log(f" Decomposing goal: {goal}")
        agent_info = "\n".join([f"- {name}: {agent.expertise}" for name, agent in self.agents.items()])
        prompt = f"""Break down this goal into 3 specific subtasks. Assign each to the best agent.

Goal: {goal}

Available agents:
{agent_info}

Respond ONLY with a JSON array."""
        response = self.llm.generate(prompt, max_tokens=250)
        try:
            # Extract the first JSON array of objects from the model output
            json_match = re.search(r'\[\s*\{.*?\}\s*\]', response, re.DOTALL)
            if json_match:
                tasks_data = json.loads(json_match.group())
            else:
                raise ValueError("No JSON found")
        except Exception:
            tasks_data = self._create_default_tasks(goal)

        tasks = []
        for i, task_data in enumerate(tasks_data[:3]):
            task = Task(
                id=task_data.get('id', f'task_{i+1}'),
                description=task_data.get('description', f'Work on: {goal}'),
                assigned_to=task_data.get('assigned_to', list(self.agents.keys())[i % len(self.agents)]),
                dependencies=task_data.get('dependencies', [] if i == 0 else [f'task_{i}'])
            )
            self.tasks[task.id] = task
            tasks.append(task)
            self.log(f" ✓ {task.id}: {task.description[:50]}... → {task.assigned_to}")

        return tasks

We begin constructing the ManagerAgent class and focus on how we decompose a high-level goal into well-defined subtasks. We generate structured JSON-based tasks and automatically assign them to the right agent. By doing this, we allow the system to think step by step and organize work just like a human project manager. Check out the FULL CODES here.

    # Continuation of the ManagerAgent class
    def _create_default_tasks(self, goal: str) -> List[Dict]:
        if any(word in goal.lower() for word in ['code', 'program', 'implement', 'algorithm']):
            return [
                {"id": "task_1", "description": f"Research and explain the concept: {goal}", "assigned_to": "researcher", "dependencies": []},
                {"id": "task_2", "description": f"Write code implementation for: {goal}", "assigned_to": "coder", "dependencies": ["task_1"]},
                {"id": "task_3", "description": "Create documentation and examples", "assigned_to": "writer", "dependencies": ["task_2"]}
            ]
        return [
            {"id": "task_1", "description": f"Research: {goal}", "assigned_to": "researcher", "dependencies": []},
            {"id": "task_2", "description": "Analyze findings and structure content", "assigned_to": "analyst", "dependencies": ["task_1"]},
            {"id": "task_3", "description": "Write comprehensive response", "assigned_to": "writer", "dependencies": ["task_2"]}
        ]

    def execute_task(self, task: Task, context: Dict[str, Any] = None) -> str:
        self.log(f" Executing {task.id} with {task.assigned_to}")
        task.status = "in_progress"
        agent = self.agents[task.assigned_to]
        context_str = ""
        if context and task.dependencies:
            context_str = "\n\nContext from previous tasks:\n"
            for dep_id in task.dependencies:
                if dep_id in context:
                    context_str += f"- {context[dep_id][:150]}...\n"

        prompt = f"""{agent.system_prompt}

Task: {task.description}{context_str}

Provide a clear, concise response:"""
        result = self.llm.generate(prompt, max_tokens=250)
        task.result = result
        task.status = "completed"
        self.log(f" ✓ Completed {task.id}")
        return result

We define fallback task logic and the full execution flow for each task. We guide each agent with its own system prompt and provide contextual information to keep results coherent. This allows us to execute tasks intelligently while respecting dependency order. Check out the FULL CODES here.

    # Continuation of the ManagerAgent class
    def synthesize_results(self, goal: str, results: Dict[str, str]) -> str:
        self.log(" Synthesizing final results")
        results_text = "\n\n".join([f"Task {tid}:\n{res[:200]}" for tid, res in results.items()])
        prompt = f"""Combine these task results into one final coherent answer.

Original Goal: {goal}

Task Results:
{results_text}

Final comprehensive answer:"""
        return self.llm.generate(prompt, max_tokens=350)

    def execute_goal(self, goal: str) -> Dict[str, Any]:
        self.log(f"\n{'='*60}\n Starting Manager Agent\n{'='*60}")
        tasks = self.decompose_goal(goal)
        results = {}
        completed = set()
        max_iterations = len(tasks) * 2
        iteration = 0

        while len(completed) < len(tasks) and iteration < max_iterations:
            iteration += 1
            for task in tasks:
                if task.id in completed:
                    continue
                # Run a task only once all of its dependencies have finished
                deps_met = all(dep in completed for dep in task.dependencies)
                if deps_met:
                    result = self.execute_task(task, results)
                    results[task.id] = result
                    completed.add(task.id)

        final_output = self.synthesize_results(goal, results)
        self.log(f"\n{'='*60}\n Execution Complete!\n{'='*60}\n")

        return {
            "goal": goal,
            "tasks": [asdict(task) for task in tasks],
            "final_output": final_output,
            "execution_log": self.execution_log
        }

We synthesize the outputs from all subtasks and convert them into one unified final answer. We also implement an orchestration loop that ensures each task runs only after its dependencies are complete. This snippet shows how we bring everything together into a smooth multi-step reasoning pipeline. Check out the FULL CODES here.

def demo_basic():
    manager = ManagerAgent()
    goal = "Explain binary search algorithm with a simple example"
    result = manager.execute_goal(goal)
    print("\n" + "="*60)
    print("FINAL OUTPUT")
    print("="*60)
    print(result["final_output"])
    return result

def demo_coding():
    manager = ManagerAgent()
    goal = "Implement a function to find the maximum element in a list"
    result = manager.execute_goal(goal)
    print("\n" + "="*60)
    print("FINAL OUTPUT")
    print("="*60)
    print(result["final_output"])
    return result

def demo_custom(custom_goal: str):
    manager = ManagerAgent()
    result = manager.execute_goal(custom_goal)
    print("\n" + "="*60)
    print("FINAL OUTPUT")
    print("="*60)
    print(result["final_output"])
    return result

if __name__ == "__main__":
    print(" Manager Agent Tutorial - APIless Local Version")
    print("="*60)
    print("Using TinyLlama (1.1B) - Fast & efficient!\n")
    result = demo_basic()
    print("\n\n Try more:")
    print(" - demo_coding()")
    print(" - demo_custom('your goal here')")

We provide demonstration functions to easily test our system with different goals. We run sample tasks to observe how the manager decomposes, executes, and synthesizes work in real time. This gives us an interactive way to understand the entire workflow and refine it further.

In conclusion, we demonstrate how to design and operate a complete multi-agent orchestration system locally with minimal dependencies. We now understand how the manager breaks down goals, routes tasks to the right expert agents, collects their outputs, resolves dependencies, and synthesizes the final result. This implementation allows us to appreciate how modular, predictable, and powerful local agentic patterns can be when built from scratch.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How to Design a Fully Local Multi-Agent Orchestration System Using TinyLlama for Intelligent Task Decomposition and Autonomous Collaboration appeared first on MarkTechPost.

Apple Researchers Release CLaRa: A Continuous Latent Reasoning Framewo …

How do you keep RAG systems accurate and efficient when every query tries to stuff thousands of tokens into the context window and the retriever and generator are still optimized as 2 separate, disconnected systems? A team of researchers from Apple and the University of Edinburgh released CLaRa (Continuous Latent Reasoning), a retrieval augmented generation framework, along with three models: CLaRa-7B-Base, CLaRa-7B-Instruct and CLaRa-7B-E2E. CLaRa compresses documents into continuous memory tokens and then performs both retrieval and generation in that shared latent space. The goal is simple. Shorten context, avoid double encoding, and let the generator teach the retriever what actually matters for downstream answers.

https://arxiv.org/pdf/2511.18659

From raw documents to continuous memory tokens

CLaRa starts with a semantic compressor that attaches a small number of learned memory tokens to each document. During Salient Compressor Pretraining, SCP, the base model is a Mistral 7B style transformer with LoRA adapters that switch between a compressor role and a generator role. The final layer hidden states of the memory tokens become the compressed representation for that document.

SCP is trained on about 2M passages from Wikipedia 2021. A local Qwen-32B model generates 3 supervision signals for each passage. Simple QA pairs cover atomic facts. Complex QA pairs connect several facts in one question to enforce multi hop reasoning. Paraphrases reorder and compress the text while preserving semantics. A verification loop checks factual consistency and coverage and can regenerate missing questions or paraphrases for up to 10 rounds before accepting a sample.

Training uses 2 losses. A cross entropy term trains the generator to answer questions or produce paraphrases conditioned only on the memory tokens and an instruction prefix. A mean squared error term aligns the average hidden state of document tokens with the average hidden state of the memory tokens. The MSE loss gives modest but consistent gains of about 0.3 to 0.6 F1 points at compression ratios 32 and 128 and keeps compressed and original representations in the same semantic region.
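The paper's training code is not reproduced here, but a minimal sketch of how these two terms could be combined in PyTorch looks like the following; the tensor shapes, the detach on the document-side average and the unit weighting are assumptions, not the published recipe.

import torch
import torch.nn.functional as F

def scp_loss(gen_logits, answer_ids, doc_hidden, mem_hidden, mse_weight=1.0):
    # Cross entropy: answers or paraphrases generated from memory tokens + instruction prefix
    ce = F.cross_entropy(gen_logits.reshape(-1, gen_logits.size(-1)), answer_ids.reshape(-1))
    # MSE alignment: mean document-token hidden state vs mean memory-token hidden state
    # (detaching the document side, treating it as the target, is an assumption)
    mse = F.mse_loss(mem_hidden.mean(dim=1), doc_hidden.mean(dim=1).detach())
    return ce + mse_weight * mse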

https://arxiv.org/pdf/2511.18659

Joint retrieval and generation in a shared space

After offline compression, each document is represented only by its memory tokens. CLaRa then trains a query reasoner and an answer generator on top of the same backbone. The query reasoner is another LoRA adapter that maps an input question into the same number of memory tokens used for documents. Retrieval becomes pure embedding search. The system computes cosine similarity between the query embedding and each candidate document embedding.

The best compressed document embeddings for a query are concatenated with the query tokens and fed into the generator adapter. Training uses only a standard next token prediction loss on the final answer. There are no explicit relevance labels. The key trick is a differentiable top k selector implemented with a Straight Through estimator. During the forward pass the model uses hard top k selection. During the backward pass a softmax distribution over document scores allows gradients from the generator to flow into the query reasoner parameters.
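The exact implementation is not shown in the paper excerpt above, but a minimal PyTorch sketch of a straight-through top-k selector over document scores looks roughly like this; the shapes and the way hard and soft weights are mixed are illustrative assumptions rather than CLaRa's code.

import torch
import torch.nn.functional as F

def straight_through_topk(scores: torch.Tensor, k: int):
    """scores: (num_docs,) cosine similarities between the query and each candidate document."""
    probs = F.softmax(scores, dim=-1)                        # soft distribution, used for gradients
    topk = torch.topk(scores, k).indices
    hard = torch.zeros_like(probs).scatter_(-1, topk, 1.0)   # hard top-k mask, used in the forward pass
    # Forward pass sees `hard`; backward pass flows through `probs`.
    return hard + (probs - probs.detach())

# Example: weight compressed document embeddings by the selector output so that
# generator gradients reach the query reasoner through the soft path.
scores = torch.randn(20, requires_grad=True)
weights = straight_through_topk(scores, k=5)                 # (20,), k ones in the forward value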

The research team shows 2 effects in the gradient analysis. First, the retriever is encouraged to assign higher probability to documents that increase answer likelihood. Second, because retrieval and generation share the same compressed representations, generator gradients reshape the latent document space to make it easier to reason over. Logit lens analysis of the query embeddings recovers topic tokens such as “NFL” and “Oklahoma” for a question about the nephew of Ivory Lee Brown, even though those tokens are not in the raw query but are present in the supporting articles.

https://arxiv.org/pdf/2511.18659

Compression quality and QA accuracy

The compressor is evaluated on 4 QA datasets: Natural Questions, HotpotQA, MuSiQue and 2WikiMultihopQA. Under the Normal setting, where the system retrieves the top 5 Wikipedia 2021 documents per query, SCP-Mistral-7B at 4 times compression reaches an average F1 of 39.86. This is 5.37 points better than the hard compression baseline LLMLingua 2 and 1.13 points better than the best soft compression baseline PISCO.

Under the Oracle setting, where the gold document is guaranteed to be in the candidate set, SCP-Mistral-7B at 4 times compression reaches an average F1 of 66.76. That is 17.31 points above LLMLingua-2 and 5.35 points above PISCO. Even more interesting, the compressed representations outperform a BGE based text retriever plus full document Mistral-7B generator by about 2.36 average F1 points for Mistral and about 6.36 points for Phi 4 mini. Well trained soft compression can exceed full text RAG while cutting context length by factors from 4 to 128.

https://arxiv.org/pdf/2511.18659

Performance at very high compression ratios, above 32 in the Oracle setting, does drop, but the decline remains moderate in Normal retrieval conditions. The research team's explanation is that weak document relevance bottlenecks the system before compression quality does.

End to end QA and retrieval behavior

For end to end QA, CLaRa uses 20 candidate documents per query with compression ratios 4, 16 and 32. On the Normal setting, CLaRa-Mistral-7B with instruction initialized weights and 16 times compression reaches F1 equal to 50.89 on Natural Questions and 44.66 on 2WikiMultihopQA. This is comparable to DRO-Mistral-7B, which reads full uncompressed text, while using 16 times shorter document representations. On some datasets, CLaRa at 16 times compression slightly improves F1 over DRO, for example from 43.65 to 47.18 on 2Wiki.

In the Oracle setting, CLaRa-Mistral-7B exceeds 75 F1 on both Natural Questions and HotpotQA at 4 times compression. This shows that the generator can fully exploit accurate retrieval even when all evidence is stored only in compressed memory tokens. Instruction initialized CLaRa generally wins over pre-training initialized CLaRa in the Normal setting, while the gap narrows in Oracle, where retrieval noise is limited.

On the retrieval side, CLaRa used as a reranker under Oracle conditions delivers strong Recall at 5. With pretraining initialization at compression 4 on HotpotQA, CLaRa-Mistral-7B reaches Recall at 5 equal to 96.21. This beats the supervised BGE Reranker baseline at 85.93 by 10.28 points and even outperforms a fully supervised Sup Instruct retriever trained with contrastive relevance labels.

https://arxiv.org/pdf/2511.18659

What Apple has released?

Apple’s research team released 3 models on Hugging Face: CLaRa-7B-Base, CLaRa-7B-Instruct and CLaRa-7B-E2E. CLaRa-7B-Instruct is described as an instruction tuned unified RAG model with built in document compression at 16 and 128 times. It answers instruction style questions directly from compressed representations and uses Mistral-7B-Instruct v0.2 as the base model.

Key Takeaways

CLaRa replaces raw documents with a small set of continuous memory tokens learned via QA guided and paraphrase guided semantic compression, which preserves key reasoning signals even at 16 times and 128 times compression.

Retrieval and generation are trained in a single shared latent space, the query encoder and generator share the same compressed representations and are optimized together with one language modeling loss.

A differentiable top-k estimator lets gradients flow from answer tokens back into the retriever, which aligns document relevance with answer quality and removes the usual disjoint tuning loop for RAG systems.

On QA benchmarks like Natural Questions, HotpotQA, MuSiQue and 2WikiMultihopQA, CLaRa’s SCP compressor at 4 times compression outperforms strong text based baselines such as LLMLingua 2 and PISCO and can even beat full text BGE plus Mistral pipelines on average F1.

Apple has released 3 practical models, CLaRa-7B-Base, CLaRa-7B-Instruct and CLaRa-7B-E2E, along with the full training pipeline on GitHub.

Editorial Notes

CLaRa is an important step for retrieval augmented generation because it treats semantic document compression and joint optimization in a shared continuous space as first class citizens, not afterthoughts bolted onto a text only pipeline. It shows that embedding based compression with SCP, combined with end to end training via a differentiable top-k estimator and a single language modeling loss, can match or surpass text based RAG baselines while using far shorter contexts and simpler retrieval stacks. Overall, CLaRa demonstrates that unified continuous latent reasoning is a credible alternative to classic chunk and retrieve RAG for real world QA workloads.

Check out the Paper, Model Weights on HF and Repo. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Apple Researchers Release CLaRa: A Continuous Latent Reasoning Framework for Compression‑Native RAG with 16x–128x Semantic Document Compression appeared first on MarkTechPost.

How to Build a Meta-Cognitive AI Agent That Dynamically Adjusts Its Ow …

In this tutorial, we build an advanced meta-cognitive control agent that learns how to regulate its own depth of thinking. We treat reasoning as a spectrum, ranging from fast heuristics to deep chain-of-thought to precise tool-like solving, and we train a neural meta-controller to decide which mode to use for each task. By optimizing the trade-off between accuracy, computation cost, and a limited reasoning budget, we explore how an agent can monitor its internal state and adapt its reasoning strategy in real time. Through each snippet, we experiment, observe patterns, and understand how meta-cognition emerges when an agent learns to think about its own thinking. Check out the FULL CODE NOTEBOOK.

import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Run on GPU when available; the policy network and state tensors below expect `device`
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

OPS = ['+', '*']

def make_task():
    op = random.choice(OPS)
    if op == '+':
        a, b = random.randint(1, 99), random.randint(1, 99)
    else:
        a, b = random.randint(2, 19), random.randint(2, 19)
    return a, b, op

def true_answer(a, b, op):
    return a + b if op == '+' else a * b

def true_difficulty(a, b, op):
    if op == '+' and a <= 30 and b <= 30:
        return 0
    if op == '*' and a <= 10 and b <= 10:
        return 1
    return 2

def heuristic_difficulty(a, b, op):
    score = 0
    if op == '*':
        score += 0.6
    score += max(a, b) / 100.0
    return min(score, 1.0)

def fast_heuristic(a, b, op):
    if op == '+':
        base = a + b
        noise = random.choice([-2, -1, 0, 0, 0, 1, 2, 3])
    else:
        base = int(0.8 * a * b)
        noise = random.choice([-5, -3, 0, 0, 2, 5, 8])
    return base + noise, 0.5

def deep_chain_of_thought(a, b, op, verbose=False):
    if op == '+':
        x, y = a, b
        carry = 0
        pos = 1
        result = 0
        step = 0
        while x > 0 or y > 0 or carry:
            dx, dy = x % 10, y % 10
            s = dx + dy + carry
            carry, digit = divmod(s, 10)
            result += digit * pos
            x //= 10; y //= 10; pos *= 10
            step += 1
    else:
        result = 0
        step = 0
        for i, d in enumerate(reversed(str(b))):
            row = a * int(d) * (10 ** i)
            result += row
            step += 1
    return result, max(2.0, 0.4 * step)

def tool_solver(a, b, op):
    return eval(f"{a}{op}{b}"), 1.2

ACTION_NAMES = ["fast", "deep", "tool"]

We set up the world our meta-agent operates in. We generate arithmetic tasks, define ground-truth answers, estimate difficulty, and implement three different reasoning modes. As we run it, we observe how each solver behaves differently in terms of accuracy and computational cost, which form the foundation of the agent’s decision space. Check out the FULL CODE NOTEBOOK.

def encode_state(a, b, op, rem_budget, error_ema, last_action):
    a_n = a / 100.0
    b_n = b / 100.0
    op_plus = 1.0 if op == '+' else 0.0
    op_mul = 1.0 - op_plus
    diff_hat = heuristic_difficulty(a, b, op)
    rem_n = rem_budget / MAX_BUDGET
    last_onehot = [0.0, 0.0, 0.0]
    if last_action is not None:
        last_onehot[last_action] = 1.0
    feats = [
        a_n, b_n, op_plus, op_mul,
        diff_hat, rem_n, error_ema
    ] + last_onehot
    return torch.tensor(feats, dtype=torch.float32, device=device)

STATE_DIM = 10
N_ACTIONS = 3

class PolicyNet(nn.Module):
    def __init__(self, state_dim, hidden=48, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions)
        )

    def forward(self, x):
        return self.net(x)

policy = PolicyNet(STATE_DIM, hidden=48, n_actions=N_ACTIONS).to(device)
optimizer = optim.Adam(policy.parameters(), lr=3e-3)

We encode each task into a structured state that captures operands, operation type, predicted difficulty, remaining budget, and recent performance. We then define a neural policy network that maps this state to a probability distribution over actions. As we work through it, we see how the policy becomes the core mechanism through which the agent learns to regulate its thinking. Check out the FULL CODE NOTEBOOK.

GAMMA = 0.98
COST_PENALTY = 0.25
MAX_BUDGET = 25.0
EPISODES = 600
STEPS_PER_EP = 20
ERROR_EMA_DECAY = 0.9

def run_episode(train=True):
    log_probs = []
    rewards = []
    info = []
    rem_budget = MAX_BUDGET
    error_ema = 0.0
    last_action = None

    for _ in range(STEPS_PER_EP):
        a, b, op = make_task()
        state = encode_state(a, b, op, rem_budget, error_ema, last_action)
        logits = policy(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample() if train else torch.argmax(logits)
        act_idx = int(action.item())

        if act_idx == 0:
            pred, cost = fast_heuristic(a, b, op)
        elif act_idx == 1:
            pred, cost = deep_chain_of_thought(a, b, op, verbose=False)
        else:
            pred, cost = tool_solver(a, b, op)

        correct = (pred == true_answer(a, b, op))
        acc_reward = 1.0 if correct else 0.0
        budget_penalty = 0.0

        rem_budget -= cost
        if rem_budget < 0:
            budget_penalty = -1.5 * (abs(rem_budget) / MAX_BUDGET)

        step_reward = acc_reward - COST_PENALTY * cost + budget_penalty
        rewards.append(step_reward)

        if train:
            log_probs.append(dist.log_prob(action))

        err = 0.0 if correct else 1.0
        error_ema = ERROR_EMA_DECAY * error_ema + (1 - ERROR_EMA_DECAY) * err
        last_action = act_idx

        info.append({
            "correct": correct,
            "cost": cost,
            "difficulty": true_difficulty(a, b, op),
            "action": act_idx
        })

    if train:
        # Discounted returns with a mean baseline (REINFORCE with baseline)
        returns = []
        G = 0.0
        for r in reversed(rewards):
            G = r + GAMMA * G
            returns.append(G)
        returns = list(reversed(returns))
        returns_t = torch.tensor(returns, dtype=torch.float32, device=device)
        baseline = returns_t.mean()
        adv = returns_t - baseline
        loss = -(torch.stack(log_probs) * adv).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return rewards, info

We implement the heart of learning using the REINFORCE policy gradient algorithm. We run multi-step episodes, collect log-probabilities, accumulate rewards, and compute returns. As we execute this part, we watch the meta-controller adjust its strategy by reinforcing decisions that balance accuracy with cost. Check out the FULL CODE NOTEBOOK.

print("Training meta-cognitive controller...")
for ep in range(EPISODES):
    rewards, _ = run_episode(train=True)
    if (ep + 1) % 100 == 0:
        print(f" episode {ep+1:4d} | avg reward {np.mean(rewards):.3f}")

def evaluate(n_episodes=50):
    all_actions = {0: [0, 0, 0], 1: [0, 0, 0], 2: [0, 0, 0]}
    stats = {0: {"n": 0, "acc": 0, "cost": 0},
             1: {"n": 0, "acc": 0, "cost": 0},
             2: {"n": 0, "acc": 0, "cost": 0}}

    for _ in range(n_episodes):
        _, info = run_episode(train=False)
        for step in info:
            d = step["difficulty"]
            a_idx = step["action"]
            all_actions[d][a_idx] += 1
            stats[d]["n"] += 1
            stats[d]["acc"] += 1 if step["correct"] else 0
            stats[d]["cost"] += step["cost"]

    for d in [0, 1, 2]:
        if stats[d]["n"] == 0:
            continue
        n = stats[d]["n"]
        print(f"Difficulty {d}:")
        print("  action counts [fast, deep, tool]:", all_actions[d])
        print("  accuracy:", stats[d]["acc"] / n)
        print("  avg cost:", stats[d]["cost"] / n)
        print()

print("Policy behavior by difficulty:")
evaluate()

We train the meta-cognitive agent over hundreds of episodes and evaluate its behavior across difficulty levels. We observe how the policy evolves, using fast heuristics for simple tasks while resorting to deeper reasoning for harder ones. As we analyze the outputs, we understand how training shapes the agent’s reasoning choices. Check out the FULL CODE NOTEBOOK.

print("\nExample hard task with meta-selected thinking mode:")
a, b, op = 47, 18, '*'
state = encode_state(a, b, op, MAX_BUDGET, 0.3, None)
with torch.no_grad():
    logits = policy(state)
    act = int(torch.argmax(logits).item())

print(f"Task: {a} {op} {b}")
print("Chosen mode:", ACTION_NAMES[act])

if act == 1:
    pred, cost = deep_chain_of_thought(a, b, op, verbose=True)
elif act == 0:
    pred, cost = fast_heuristic(a, b, op)
    print("Fast heuristic:", pred)
else:
    pred, cost = tool_solver(a, b, op)
    print("Tool solver:", pred)

print("True:", true_answer(a, b, op), "| cost:", cost)

We inspect a detailed reasoning trace for a hard example chosen by the trained policy. We see the agent confidently pick a mode and walk through the reasoning steps, allowing us to witness its meta-cognitive behavior in action. As we test different tasks, we appreciate how the model adapts its thinking based on context.

In conclusion, we have seen how a neural controller can learn to dynamically choose the most effective reasoning pathway based on the task’s difficulty and the constraints of the moment. We observe how the agent gradually discovers when quick heuristics are sufficient, when deeper reasoning is necessary, and when calling a precise solver is worth the cost. Through this process, we experience how metacognitive control transforms decision-making, leading to more efficient and adaptable reasoning systems.

Check out the FULL CODE NOTEBOOK. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How to Build a Meta-Cognitive AI Agent That Dynamically Adjusts Its Own Reasoning Depth for Efficient Problem Solving appeared first on MarkTechPost.

AI Interview Series #4: Transformers vs Mixture of Experts (MoE)

Question:

MoE models contain far more parameters than Transformers, yet they can run faster at inference. How is that possible?

Difference between Transformers & Mixture of Experts (MoE)

Transformers and Mixture of Experts (MoE) models share the same backbone architecture—self-attention layers followed by feed-forward layers—but they differ fundamentally in how they use parameters and compute.

Feed-Forward Network vs Experts

Transformer: Each block contains a single large feed-forward network (FFN). Every token passes through this FFN, activating all parameters during inference.

MoE: Replaces the FFN with multiple smaller feed-forward networks, called experts. A routing network selects only a few experts (Top-K) per token, so only a small fraction of total parameters is active.

Parameter Usage

Transformer: All parameters across all layers are used for every token → dense compute.

MoE: Has more total parameters, but activates only a small portion per token → sparse compute. Example: Mixtral 8×7B has 46.7B total parameters, but uses only ~13B per token.

Inference Cost

Transformer: High inference cost due to full parameter activation. Scaling to models like GPT-4 or Llama 2 70B requires powerful hardware.

MoE: Lower inference cost because only K experts per layer are active. This makes MoE models faster and cheaper to run, especially at large scales.

Token Routing

Transformer: No routing. Every token follows the exact same path through all layers.

MoE: A learned router assigns tokens to experts based on softmax scores. Different tokens select different experts, and different layers may activate different experts, which increases specialization and model capacity. A minimal routing sketch follows below.
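To make the routing mechanics concrete, here is a minimal sketch of a top-k routed MoE layer in PyTorch; the layer sizes and the simple softmax-over-selected-experts weighting are illustrative choices, not a specific production implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # learned routing scores per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        logits = self.router(x)                        # (tokens, n_experts)
        topk_val, topk_idx = logits.topk(self.top_k, dim=-1)
        gates = F.softmax(topk_val, dim=-1)            # weights over the selected experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                 # each token visits only its top_k experts
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out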

Model Capacity

Transformer: To scale capacity, the only option is adding more layers or widening the FFN—both increase FLOPs heavily.

MoE: Can scale total parameters massively without increasing per-token compute. This enables “bigger brains at lower runtime cost.”

While MoE architectures offer massive capacity with lower inference cost, they introduce several training challenges. The most common issue is expert collapse, where the router repeatedly selects the same experts, leaving others under-trained. 

Load imbalance is another challenge—some experts may receive far more tokens than others, leading to uneven learning. To address this, MoE models rely on techniques like noise injection in routing, Top-K masking, and expert capacity limits. 

These mechanisms ensure all experts stay active and balanced, but they also make MoE systems more complex to train compared to standard Transformers.

AI Interview Series #3: Explain Federated Learning

The post AI Interview Series #4: Transformers vs Mixture of Experts (MoE) appeared first on MarkTechPost.

NVIDIA and Mistral AI Bring 10x Faster Inference for the Mistral 3 Fam …

NVIDIA announced today a significant expansion of its strategic collaboration with Mistral AI. This partnership coincides with the release of the new Mistral 3 frontier open model family, marking a pivotal moment where hardware acceleration and open-source model architecture have converged to redefine performance benchmarks.

This collaboration delivers a massive leap in inference speed: the new models now run up to 10x faster on NVIDIA GB200 NVL72 systems compared to the previous generation H200 systems. This breakthrough unlocks unprecedented efficiency for enterprise-grade AI, promising to solve the latency and cost bottlenecks that have historically plagued the large-scale deployment of reasoning models.

A Generational Leap: 10x Faster on Blackwell

As enterprise demand shifts from simple chatbots to high-reasoning, long-context agents, inference efficiency has become the critical bottleneck. The collaboration between NVIDIA and Mistral AI addresses this head-on by optimizing the Mistral 3 family specifically for the NVIDIA Blackwell architecture.

Where production AI systems must deliver both strong user experience (UX) and cost-efficient scale, the NVIDIA GB200 NVL72 provides up to 10x higher performance than the previous-generation H200. This is not merely a gain in raw speed; it translates to significantly higher energy efficiency. The system exceeds 5,000,000 tokens per second per megawatt (MW) at user interactivity rates of 40 tokens per second.

Image created by MarkTechPost.com, source: NVIDIA

For data centers grappling with power constraints, this efficiency gain is as critical as the performance boost itself. This generational leap ensures a lower per-token cost while maintaining the high throughput required for real-time applications.

A New Mistral 3 Family

The engine driving this performance is the newly released Mistral 3 family. This suite of models delivers industry-leading accuracy, efficiency, and customization capabilities, covering the spectrum from massive data center workloads to edge device inference.

Mistral Large 3: The Flagship MoE

At the top of the hierarchy sits Mistral Large 3, a state-of-the-art sparse Multimodal and Multilingual Mixture-of-Experts (MoE) model.

Total Parameters: 675 Billion

Active Parameters: 41 Billion

Context Window: 256K tokens

Trained on NVIDIA Hopper GPUs, Mistral Large 3 is designed to handle complex reasoning tasks, offering parity with top-tier closed models while retaining the flexibility of open weights.

Ministral 3: Dense Power at the Edge

Complementing the large model is the Ministral 3 series, a suite of small, dense, high-performance models designed for speed and versatility.

Sizes: 3B, 8B, and 14B parameters.

Variants: Base, Instruct, and Reasoning for each size (nine models total).

Context Window: 256K tokens across the board.

The Ministral 3 series excels on the GPQA Diamond accuracy benchmark, using about 100 fewer tokens while delivering higher accuracy.

Significant Engineering Behind the Speed: A Comprehensive Optimization Stack

The “10x” performance claim is driven by a comprehensive stack of optimizations co-developed by Mistral and NVIDIA engineers. The teams adopted an “extreme co-design” approach, merging hardware capabilities with model architecture adjustments.

TensorRT-LLM Wide Expert Parallelism (Wide-EP)

To fully exploit the massive scale of the GB200 NVL72, NVIDIA employed Wide Expert Parallelism within TensorRT-LLM. This technology provides optimized MoE GroupGEMM kernels, expert distribution, and load balancing.

Crucially, Wide-EP exploits the NVL72’s coherent memory domain and NVLink fabric. It is highly resilient to architectural variations across large MoEs. For instance, Mistral Large 3 utilizes roughly 128 experts per layer, about half as many as comparable models like DeepSeek-R1. Despite this difference, Wide-EP enables the model to realize the high-bandwidth, low-latency, non-blocking benefits of the NVLink fabric, ensuring that the model’s massive size does not result in communication bottlenecks.

Native NVFP4 Quantization

One of the most significant technical advancements in this release is the support for NVFP4, a quantization format native to the Blackwell architecture.

For Mistral Large 3, developers can deploy a compute-optimized NVFP4 checkpoint quantized offline using the open-source llm-compressor library.

This approach reduces compute and memory costs while strictly maintaining accuracy. It leverages NVFP4’s higher-precision FP8 scaling factors and finer-grained block scaling to control quantization error. The recipe specifically targets the MoE weights while keeping other components at original precision, allowing the model to deploy seamlessly on the GB200 NVL72 with minimal accuracy loss.
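
For orientation, the snippet below is a minimal sketch of what an offline NVFP4 recipe with llm-compressor can look like; the scheme name, target and ignore lists, and model ID are illustrative assumptions rather than the exact recipe used for Mistral Large 3, which targets the MoE weights specifically.

# Hypothetical sketch of offline NVFP4 quantization with llm-compressor.
# Assumes a recent llm-compressor release exposing an NVFP4 scheme; the
# model ID, scheme string, and ignore list are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mistral-Large-3"  # placeholder repository name

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Data-free post-training quantization: quantize linear layers to NVFP4 while
# keeping sensitive components such as the LM head at original precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained("Mistral-Large-3-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Mistral-Large-3-NVFP4")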

Disaggregated Serving with NVIDIA Dynamo

Mistral Large 3 utilizes NVIDIA Dynamo, a low-latency distributed inference framework, to disaggregate the prefill and decode phases of inference.

In traditional setups, the prefill phase (processing the input prompt) and the decode phase (generating the output) compete for resources. By rate-matching and disaggregating these phases, Dynamo significantly boosts performance for long-context workloads, such as 8K input/1K output configurations. This ensures high throughput even when utilizing the model’s massive 256K context window.

From Cloud to Edge: Ministral 3 Performance

The optimization efforts extend beyond the massive data centers. Recognizing the growing need for local AI, the Ministral 3 series is engineered for edge deployment, offering flexibility for a variety of needs.

RTX and Jetson Acceleration

The dense Ministral models are optimized for platforms like the NVIDIA GeForce RTX AI PC and NVIDIA Jetson robotics modules.

RTX 5090: The Ministral-3B variants can reach blistering inference speeds of 385 tokens per second on the NVIDIA RTX 5090 GPU. This brings workstation-class AI performance to local PCs, enabling fast iteration and greater data privacy.

Jetson Thor: For robotics and edge AI, developers can use the vLLM container on NVIDIA Jetson Thor. The Ministral-3-3B-Instruct model achieves 52 tokens per second for single concurrency, scaling up to 273 tokens per second with a concurrency of 8.
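
To make the local path concrete, the snippet below is a minimal offline-inference sketch with vLLM for the Ministral-3-3B-Instruct model; the exact Hugging Face repository name is an assumption, and on Jetson Thor the same code would run inside the vLLM container mentioned above.

# Minimal vLLM offline-inference sketch for a Ministral 3 Instruct model.
# The repository ID below is an illustrative assumption; check Hugging Face
# for the exact name, revision, and any required chat template.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Ministral-3-3B-Instruct")  # assumed repo name
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Summarize the trade-offs between dense and Mixture-of-Experts language models."],
    params,
)
print(outputs[0].outputs[0].text)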

Broad Framework Support

NVIDIA has collaborated with the open-source community to ensure these models are usable everywhere.

Llama.cpp & Ollama: NVIDIA collaborated with these popular frameworks to ensure faster iteration and lower latency for local development.

SGLang: NVIDIA collaborated with SGLang to create an implementation of Mistral Large 3 that supports both disaggregation and speculative decoding.

vLLM: NVIDIA worked with vLLM to expand support for kernel integrations, including speculative decoding (EAGLE), Blackwell support, and expanded parallelism.

Production-Ready with NVIDIA NIM

To streamline enterprise adoption, the new models will be available through NVIDIA NIM microservices.

Mistral Large 3 and Ministral-14B-Instruct are currently available through the NVIDIA API catalog and preview API. Soon, enterprise developers will be able to use downloadable NVIDIA NIM microservices. This provides a containerized, production-ready solution that allows enterprises to deploy the Mistral 3 family with minimal setup on any GPU-accelerated infrastructure.

This availability ensures that the specific “10x” performance advantage of the GB200 NVL72 can be realized in production environments without complex custom engineering, democratizing access to frontier-class intelligence.
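
For developers who want to try the hosted preview right away, the API catalog exposes an OpenAI-compatible endpoint. The sketch below assumes the standard integrate.api.nvidia.com base URL and uses an illustrative model identifier; check build.nvidia.com/mistralai for the exact name and to generate an API key.

# Sketch of calling the hosted preview through the OpenAI-compatible API.
# The model identifier is an illustrative assumption; the base URL follows
# the usual NVIDIA API catalog pattern.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

resp = client.chat.completions.create(
    model="mistralai/mistral-large-3",  # assumed model identifier
    messages=[{"role": "user", "content": "List three workloads that benefit from a 256K context window."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)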

Conclusion: A New Standard for Open Intelligence

The release of the NVIDIA-accelerated Mistral 3 open model family represents a major leap for AI in the open-source community. By offering frontier-level performance under an open source license, and backing it with a robust hardware optimization stack, Mistral and NVIDIA are meeting developers where they are.

From the massive scale of the GB200 NVL72 utilizing Wide-EP and NVFP4, to the edge-friendly density of Ministral on an RTX 5090, this partnership delivers a scalable, efficient path for artificial intelligence. With upcoming optimizations such as speculative decoding with multitoken prediction (MTP) and EAGLE-3 expected to push performance even further, the Mistral 3 family is poised to become a foundational element of the next generation of AI applications.

Available to test!

If you are a developer looking to benchmark these performance gains, you can download the Mistral 3 models directly from Hugging Face or test the deployment-free hosted versions on build.nvidia.com/mistralai to evaluate the latency and throughput for your specific use case.

Check out the Models on Hugging Face. You can find more details on the Corporate Blog and the Technical/Developer Blog.

Thanks to the NVIDIA AI team for the thought leadership and resources for this article. The NVIDIA AI team supported this content.
The post NVIDIA and Mistral AI Bring 10x Faster Inference for the Mistral 3 Family on GB200 NVL72 GPU Systems appeared first on MarkTechPost.

How We Learn Step-Level Rewards from Preferences to Solve Sparse-Rewar …

In this tutorial, we explore Online Process Reward Learning (OPRL) and demonstrate how we can learn dense, step-level reward signals from trajectory preferences to solve sparse-reward reinforcement learning tasks. We walk through each component, from the maze environment and reward-model network to preference generation, training loops, and evaluation, while observing how the agent gradually improves its behaviour through online preference-driven shaping. By running this end-to-end implementation, we gain a practical understanding of how OPRL enables better credit assignment, faster learning, and more stable policy optimization in challenging environments where the agent would otherwise struggle to discover meaningful rewards. Check out the FULL CODE NOTEBOOK.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
import matplotlib.pyplot as plt
from collections import deque
import random

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

class MazeEnv:
    def __init__(self, size=8):
        self.size = size
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        # Vertical wall in the middle column, with gaps at the top and bottom rows
        self.obstacles = set([(i, size // 2) for i in range(1, size - 2)])
        self.reset()

    def reset(self):
        self.pos = self.start
        self.steps = 0
        return self._get_state()

    def _get_state(self):
        # One-hot encoding of the agent's grid position
        state = np.zeros(self.size * self.size)
        state[self.pos[0] * self.size + self.pos[1]] = 1
        return state

    def step(self, action):
        moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]
        new_pos = (self.pos[0] + moves[action][0],
                   self.pos[1] + moves[action][1])
        if (0 <= new_pos[0] < self.size and
                0 <= new_pos[1] < self.size and
                new_pos not in self.obstacles):
            self.pos = new_pos
        self.steps += 1
        done = self.pos == self.goal or self.steps >= 60
        reward = 10.0 if self.pos == self.goal else 0.0  # sparse terminal reward
        return self._get_state(), reward, done

    def render(self):
        grid = [['.' for _ in range(self.size)] for _ in range(self.size)]
        for obs in self.obstacles:
            grid[obs[0]][obs[1]] = '█'
        grid[self.goal[0]][self.goal[1]] = 'G'
        grid[self.pos[0]][self.pos[1]] = 'A'
        return '\n'.join([''.join(row) for row in grid])

class ProcessRewardModel(nn.Module):
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Tanh()  # per-step process reward bounded in [-1, 1]
        )

    def forward(self, states):
        return self.net(states)

    def trajectory_reward(self, states):
        # Trajectory-level score = sum of per-step process rewards
        return self.forward(states).sum()

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU()
        )
        self.actor = nn.Linear(hidden, action_dim)  # action logits
        self.critic = nn.Linear(hidden, 1)          # state-value estimate

    def forward(self, state):
        features = self.backbone(state)
        return self.actor(features), self.critic(features)

We set up the entire foundation of our OPRL system by importing libraries, defining the maze environment, and building the reward and policy networks. We establish how states are represented, how obstacles block movement, and how the sparse reward structure works. We also design the core neural models that will later learn process rewards and drive the policy’s decisions. Check out the FULL CODE NOTEBOOK.
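
Before wiring the agent to the environment, a quick smoke test (an optional addition, not part of the original notebook) confirms that the maze renders correctly and that a random walk terminates within the 60-step limit.

# Optional sanity check: render the empty maze and take random actions until
# the episode ends, confirming obstacle handling and termination behave.
env = MazeEnv(size=8)
state = env.reset()
print(env.render())

done, total = False, 0.0
while not done:
    _, r, done = env.step(random.randint(0, 3))
    total += r
print(f"random-walk return: {total}, steps taken: {env.steps}")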

class OPRLAgent:
    def __init__(self, state_dim, action_dim, lr=3e-4):
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.reward_model = ProcessRewardModel(state_dim)
        self.policy_opt = Adam(self.policy.parameters(), lr=lr)
        self.reward_opt = Adam(self.reward_model.parameters(), lr=lr)
        self.trajectories = deque(maxlen=200)
        self.preferences = deque(maxlen=500)
        self.action_dim = action_dim

    def select_action(self, state, epsilon=0.1):
        # Epsilon-greedy exploration on top of the stochastic policy
        if random.random() < epsilon:
            return random.randint(0, self.action_dim - 1)
        state_t = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():
            logits, _ = self.policy(state_t)
        probs = F.softmax(logits, dim=-1)
        return torch.multinomial(probs, 1).item()

    def collect_trajectory(self, env, epsilon=0.1):
        states, actions, rewards = [], [], []
        state = env.reset()
        done = False
        while not done:
            action = self.select_action(state, epsilon)
            next_state, reward, done = env.step(action)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            state = next_state
        traj = {
            'states': torch.FloatTensor(np.array(states)),
            'actions': torch.LongTensor(actions),
            'rewards': torch.FloatTensor(rewards),
            'return': float(sum(rewards))
        }
        self.trajectories.append(traj)
        return traj

We begin constructing the OPRL agent by implementing action selection and trajectory collection. We use an ε-greedy strategy to ensure exploration and gather sequences of states, actions, and returns. As we run the agent through the maze, we store entire trajectories that will later serve as preference data for shaping the reward model. Check out the FULL CODE NOTEBOOK.

    # Continuation of the OPRLAgent class
    def generate_preference(self):
        if len(self.trajectories) < 2:
            return
        t1, t2 = random.sample(list(self.trajectories), 2)
        # Prefer the trajectory with the higher environment return
        label = 1.0 if t1['return'] > t2['return'] else 0.0
        self.preferences.append({'t1': t1, 't2': t2, 'label': label})

    def train_reward_model(self, n_updates=5):
        if len(self.preferences) < 32:
            return 0.0
        total_loss = 0.0
        for _ in range(n_updates):
            batch = random.sample(list(self.preferences), 32)
            loss = 0.0
            for item in batch:
                r1 = self.reward_model.trajectory_reward(item['t1']['states'])
                r2 = self.reward_model.trajectory_reward(item['t2']['states'])
                logit = r1 - r2
                pred_prob = torch.sigmoid(logit)  # Bradley-Terry preference probability
                label = item['label']
                loss += -(label * torch.log(pred_prob + 1e-8) +
                          (1 - label) * torch.log(1 - pred_prob + 1e-8))
            loss = loss / len(batch)
            self.reward_opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.reward_model.parameters(), 1.0)
            self.reward_opt.step()
            total_loss += loss.item()
        return total_loss / n_updates

We generate preference pairs from collected trajectories and train the process reward model using the Bradley–Terry formulation. We compare trajectory-level scores, compute probabilities, and update the reward model to reflect which behaviours appear better. This allows us to learn dense, differentiable, step-level rewards that guide the agent even when the environment itself is sparse. Check out the FULL CODE NOTEBOOK.
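
In symbols, the reward model Rφ scores a trajectory as the sum of its per-step process rewards, Rφ(τ) = Σₜ rφ(sₜ), and the probability that trajectory τ₁ is preferred over τ₂ is

P(τ₁ ≻ τ₂) = σ(Rφ(τ₁) − Rφ(τ₂))

The binary cross-entropy computed in train_reward_model is exactly the negative log-likelihood of this Bradley–Terry model, with torch.sigmoid(r1 - r2) playing the role of σ.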

    # Continuation of the OPRLAgent class
    def train_policy(self, n_updates=3, gamma=0.98):
        if len(self.trajectories) < 5:
            return 0.0
        total_loss = 0.0
        for _ in range(n_updates):
            traj = random.choice(list(self.trajectories))
            with torch.no_grad():
                process_rewards = self.reward_model(traj['states']).squeeze()
            # Shape the sparse environment reward with learned step-level rewards
            shaped_rewards = traj['rewards'] + 0.1 * process_rewards
            returns = []
            G = 0
            for r in reversed(shaped_rewards.tolist()):
                G = r + gamma * G
                returns.insert(0, G)
            returns = torch.FloatTensor(returns)
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
            logits, values = self.policy(traj['states'])
            log_probs = F.log_softmax(logits, dim=-1)
            action_log_probs = log_probs.gather(1, traj['actions'].unsqueeze(1))
            advantages = returns - values.squeeze().detach()
            policy_loss = -(action_log_probs.squeeze() * advantages).mean()
            value_loss = F.mse_loss(values.squeeze(), returns)
            entropy = -(F.softmax(logits, dim=-1) * log_probs).sum(-1).mean()
            loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
            self.policy_opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 1.0)
            self.policy_opt.step()
            total_loss += loss.item()
        return total_loss / n_updates

def train_oprl(episodes=500, render_interval=100):
    env = MazeEnv(size=8)
    agent = OPRLAgent(state_dim=64, action_dim=4, lr=3e-4)
    returns, reward_losses, policy_losses = [], [], []
    success_rate = []
    for ep in range(episodes):
        epsilon = max(0.05, 0.5 - ep / 1000)  # decay exploration over training
        traj = agent.collect_trajectory(env, epsilon)
        returns.append(traj['return'])
        if ep % 2 == 0 and ep > 10:
            agent.generate_preference()
        if ep > 20 and ep % 2 == 0:
            rew_loss = agent.train_reward_model(n_updates=3)
            reward_losses.append(rew_loss)
        if ep > 10:
            pol_loss = agent.train_policy(n_updates=2)
            policy_losses.append(pol_loss)
        success = 1 if traj['return'] > 5 else 0
        success_rate.append(success)
        if ep % render_interval == 0 and ep > 0:
            test_env = MazeEnv(size=8)
            agent.collect_trajectory(test_env, epsilon=0)
            print(test_env.render())
    return returns, reward_losses, policy_losses, success_rate

We train the policy using shaped rewards produced by the learned process reward model. We compute returns, advantages, value estimates, and entropy bonuses, enabling the agent to improve its strategy over time. We then build a full training loop in which exploration decays, preferences accumulate, and both the reward model and the policy are updated continuously. Check out the FULL CODE NOTEBOOK.

print("Training OPRL Agent on Sparse Reward Maze...\n")
returns, rew_losses, pol_losses, success = train_oprl(episodes=500, render_interval=250)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0, 0].plot(returns, alpha=0.3)
axes[0, 0].plot(np.convolve(returns, np.ones(20) / 20, mode='valid'), linewidth=2)
axes[0, 0].set_xlabel('Episode')
axes[0, 0].set_ylabel('Return')
axes[0, 0].set_title('Agent Performance')
axes[0, 0].grid(alpha=0.3)

success_smooth = np.convolve(success, np.ones(20) / 20, mode='valid')
axes[0, 1].plot(success_smooth, linewidth=2, color='green')
axes[0, 1].set_xlabel('Episode')
axes[0, 1].set_ylabel('Success Rate')
axes[0, 1].set_title('Goal Success Rate')
axes[0, 1].grid(alpha=0.3)

axes[1, 0].plot(rew_losses, linewidth=2, color='orange')
axes[1, 0].set_xlabel('Update Step')
axes[1, 0].set_ylabel('Loss')
axes[1, 0].set_title('Reward Model Loss')
axes[1, 0].grid(alpha=0.3)

axes[1, 1].plot(pol_losses, linewidth=2, color='red')
axes[1, 1].set_xlabel('Update Step')
axes[1, 1].set_ylabel('Loss')
axes[1, 1].set_title('Policy Loss')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("OPRL Training Complete!")
print("Process rewards, preference learning, reward shaping, and online updates demonstrated.")

We visualize the learning dynamics by plotting returns, success rates, reward-model loss, and policy loss. We monitor how the agent’s performance evolves as OPRL shapes the reward landscape. By the end of the visualization, we clearly see the impact of process rewards on solving a challenging, sparse-reward maze.

In conclusion, we see how OPRL transforms sparse terminal outcomes into rich online feedback that continuously guides the agent’s behaviour. We watch the process reward model learn preferences, shape the return signal, and accelerate the policy’s ability to reach the goal. With larger mazes, varying shaping strengths, or even real human preference feedback, we appreciate how OPRL provides a flexible and powerful framework for credit assignment in complex decision-making tasks. We finish with a clear, hands-on understanding of how OPRL operates and how we can extend it to more advanced agentic RL settings.

Check out the FULL CODE NOTEBOOK and Paper. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and subscribe to our Newsletter. Are you on Telegram? Now you can join us on Telegram as well.
The post How We Learn Step-Level Rewards from Preferences to Solve Sparse-Reward Environments Using Online Process Reward Learning appeared first on MarkTechPost.