i-genie, Author at i-genie.co.uk

AI Interview Series #2: Explain Some of the Common Model Context Proto …

Posted on November 17, 2025 by i-genie

In this part of the Interview Series, we’ll look at some of the common security vulnerabilities in the Model Context Protocol (MCP) — a framework designed to let LLMs safely interact with external tools and data sources. While MCP brings structure and transparency to how models access context, it also introduces new security risks if not properly managed. In this article, we’ll explore three key threats — MCP Tool Poisoning, Rug Pulls, and Tool Hijacking Attacks

Tool Poisoning

A Tool Poisoning Attack happens when an attacker inserts hidden malicious instructions inside an MCP tool’s metadata or description.

Users only see a clean, simplified tool description in the UI.

LLMs, however, see the full tool definition — including hidden prompts, backdoor commands, or manipulated instructions.

This mismatch allows attackers to silently influence the AI into harmful or unauthorized actions.

Tool Hijacking

A Tool Hijacking Attack happens when you connect multiple MCP servers to the same client, and one of them is malicious. The malicious server injects hidden instructions inside its own tool descriptions that try to redirect, override, or manipulate the behavior of tools provided by a trusted server.

In this case, Server B pretends to offer a harmless add() tool, but its hidden instructions try to hijack the email_sender tool exposed by Server A.

MCP Rug Pulls

An MCP Rug Pull happens when a server changes its tool definitions after the user has already approved them. It’s similar to installing a trusted app that later updates itself into malware — the client believes the tool is safe, but its behavior has silently changed behind the scenes.

Because users rarely re-review tool specs, this attack is extremely hard to detect.

AI Interview Series #1: Explain Some LLM Text Generation Strategies Used in LLMs

The post AI Interview Series #2: Explain Some of the Common Model Context Protocol (MCP) Security Vulnerabilities appeared first on MarkTechPost.

Comparing the Top 4 Agentic AI Browsers in 2025: Atlas vs Copilot Mode …

Posted on November 17, 2025 by i-genie

Agentic AI browsers are moving the model from ‘answering about the web’ to operating on the web. In 2025, four AI browsers define this space: OpenAI’s ChatGPT Atlas, Microsoft Edge with Copilot Mode, The Browser Company’s Dia, and Perplexity’s Comet. Each makes different design choices around autonomy, memory, and privacy. This article compares their architectures, capabilities, and risk profiles so various type of users can decide which browser aligns with their workflows.

What are Agentic Browsers?

Agentic browsers are not just ‘chat over a page’. They expose the browser’s DOM (Document Object Model), tab graph, and history to an AI model and allow it to:

Read and reason over multiple tabs

Maintain task context across time

Take actions such as navigating, filling forms, and completing workflows

OpenAI ChatGPT Atlas, Microsoft Edge Copilot Mode, The Browser Company’s Dia, and Perplexity’s Comet all do this, but with different tradeoffs in autonomy, memory, and security.

High-level comparison

Atlas is the most fully agentic: deep ChatGPT integration, rich browser control, strong but complex memory and privacy story.

Copilot Mode is an incremental but significant extension to Edge: unified Copilot, cross-tab reasoning, early ‘Actions’ for automation, still conservative compared with Atlas and Comet.

Dia is an AI-first browser built on Chromium, optimized for reading, writing, and structured workflows with privacy-first defaults and intentionally limited autonomy.

Comet is a highly agentic personal assistant browser with deep workflow automation, a local-data narrative, and currently the most aggressive legal and security risk profile.

The rest of the article unpacks these differences in a more technical way.

1. ChatGPT Atlas (OpenAI): AI-native browser with full agent mode

1.1 Architecture

Atlas is a dedicated AI browser built around ChatGPT rather than a standard Chromium shell with an extension. It runs on Chromium but wraps it in OpenAI’s OWL process architecture, which separates the rendering engine from the Atlas application and agent layer.

Key characteristics:

macOS only at launch, with Windows, iOS, and Android ‘coming soon’.

ChatGPT is exposed everywhere: omnibox, main panel, and a ChatGPT sidebar that can see the current page and tabs.

This gives Atlas a first-class API into:

Current tab DOM and visible content

Tab list and navigation history

User queries and previous conversation state

1.2 Agent mode: real browser control

Agent Mode is the key differentiator. For Plus / Pro / Business users, Atlas can execute multi-step workflows:

Open and close tabs, follow links, and switch sites

Fill out forms and online applications

Book reservations such as hotels and restaurants

Compare products across multiple sites and return structured summaries

Constraints:

Agent mode cannot access local files or the OS, and cannot download or execute local programs. It is sandboxed inside the browser.

Actions require explicit user consent; Atlas surfaces prompts like ‘Should I start clicking and filling these forms’ before executing workflows.

1.3 Memory and privacy

Atlas introduces browser memories:

It stores filtered summaries of visited pages and inferred user intent, not full page captures. Summaries are retained for about 30 days, enabling queries like ‘reopen the reports I read yesterday’ or ‘continue the Athens itinerary plan’.

Memories are opt-in and can be viewed, edited, or deleted. Memory can be disabled globally or on specific sites, and Atlas supports incognito.

OpenAI also added parental controls that let guardians disable both browser memories and agent mode for child accounts.

Critical points:

Atlas still needs to transmit page snippets and metadata to OpenAI’s servers for summarization, which means sensitive content can be exposed if protections fail.

Security researchers have already demonstrated prompt-injection attacks that exploit Atlas’s omnibox and agent context, confirming that highly agentic browsing increases the attack surface.

1.4 Pricing and fit

Atlas is free to install for ChatGPT users on macOS.

Agent Mode is only available on paid ChatGPT tiers (Plus, Pro, Business, Enterprise).

Fit:

Best for users who want maximum in-browser automation and are comfortable with cloud-centric data handling and a still-evolving security posture.

2. Copilot Mode in Microsoft Edge: tab-reasoning with controlled autonomy

2.1 Architecture

Copilot Mode is Microsoft’s AI layer inside Edge, not a separate browser. It exposes:

A unified Copilot box on new tabs for chat, search, and navigation

Deep integration with Edge context (open tabs, history, and some browser settings) when users opt in.

Microsoft also ties Copilot Mode into:

Journeys: topic-centric clusters over browsing history, which Copilot can summarize and re-open.

Copilot Actions: an early agentic layer capable of actions like clearing cache, unsubscribing from mailing lists, and booking reservations in preview.

2.2 Agentic behavior

Compared with Atlas:

Copilot Mode can reason across multiple tabs, summarize and compare them, and assist with structured tasks like trip planning or multi-site research.

Actions Preview extends this into partially agentic flows, such as booking a restaurant or filling forms, but current evaluations show inconsistent reliability and occasional ‘hallucinated’ completions of tasks that were not successfully executed.

Crucially, Copilot Mode remains more constrained than Atlas or Comet:

It does not expose an openly programmable DOM-level agent with free cursor control

Action templates are narrower and guarded, particularly for email and account-sensitive operations

2.3 Data, privacy, and enterprise posture

Edge with Copilot Mode is clearly aimed at enterprise adoption:

Copilot access to tab and history data is explicitly permissioned; users can disable history-based personalization, Copilot context, and Copilot Mode entirely.

Microsoft integrates Prompt Shields and Azure AI safety layers to mitigate prompt injection and jailbreak attempts.

Fit:

Appropriate where organizations want AI-assisted browsing and cross-tab reasoning while keeping automation scoped and more auditable than a fully agentic browser.

3. Dia (The Browser Company): AI-first, Chromium-based, privacy-forward

3.1 Architecture and UX

Dia is The Browser Company’s AI-centric successor to Arc, built on Chromium and currently available on macOS only.

Core design choices:

The canonical interaction is ‘chat with your tabs‘: Dia’s assistant can read open tabs, referenced tabs, and selections, and answer questions or transform content in place.

Dia includes a Skills system, where users define reusable prompt ‘scripts’ and workflows for tasks like note-taking or research templates.

Dia’s UX is optimized for:

Reading and understanding long-form content

Writing and editing in-page

Learning workflows (tutoring, flashcards, argument comparison)

3.2 Memory and ‘local-first’ privacy

Dia’s main differentiation is its privacy posture:

Browsing history, chats, bookmarks, and saved content are stored locally and encrypted, with data sent to servers only when required to answer a specific query.

The Memory feature stores summaries and learned preferences, but users can disable memory entirely in settings or control what contexts are shared.

The net effect is an AI browser that tries to behave more like a local knowledge layer with scoped cloud calls rather than a continuous telemetry stream.

3.3 Agentic scope and constraints

Dia is intentionally less agentic than Atlas or Comet:

The assistant can read and summarize pages, transform text, generate content, and run Skills over the current tab set.

Current public builds do not expose a general DOM automation agent capable of open-ended clicking and form submission across arbitrary sites.

In practice, Dia behaves as a high-context copilot rather than a fully autonomous web operator. This is aligned with the company’s positioning and with Atlassian’s stated intent after acquiring The Browser Company, which emphasizes individual knowledge worker workflows over transactional automation.

3.4 Pricing and availability

Dia now ships to all Mac users, no invite required, as of October 2025.

Free tier: Core AI chat, Skills, and Memory, with usage limits.

Dia Pro at $20/month unlocks effectively unlimited AI chat usage within terms of use.

Fit:

Strong for educational and writing-heavy workflows, for users who want AI-augmented browsing without handing an agent broad control over the web session.

4. Comet (Perplexity): highly agentic assistant browser with heavy risk surface

4.1 Architecture and capabilities

Comet is Perplexity’s AI browser built on Chromium, positioned as a personal AI assistant and ‘thinking partner‘ rather than a simple search UI.

The Comet Assistant can:

Summarize and explore any page

Execute multi-step workflows for research, coding, meeting prep, and e-commerce

Manage email and calendar via integrated connectors

Handle complex tasks like comparing products, reading reviews, and moving all the way to checkout.

Recent updates extend the agent to work longer and across larger jobs, emphasizing persistent, agentic behavior over many tabs and time periods.

4.2 Data model and privacy claims

Perplexity’s Comet Privacy Notice and product pages claim:

Browsing data, cookies, and saved credentials are stored locally on the device by default.

Users can delete browsing data and stored credentials from Comet settings, and manage cookie behavior.

Integration with 1Password keeps vaults end-to-end encrypted and opaque to Perplexity.

So the official architecture is a hybrid: local browser state with selective context uploads to Comet’s servers and Perplexity’s search models.

However, multiple independent reviews argue that despite these controls, the combination of: Deep integration with third-party services (Gmail, calendar, financial accounts) and high agent autonomy over those services produces a large effective privacy risk envelope, especially for corporate data.

4.3 Security incidents and legal pressure

Comet currently has the most visible security and legal issues among the four:

Indirect prompt-injection / ‘CometJacking‘: LayerX and other researchers showed that malicious URLs and embedded prompts could hijack Comet’s assistant, exfiltrating data from connected services and even performing fraudulent actions.

Although Perplexity has patched specific vulnerabilities, security audits from Brave, Guardio, and others still recommend extreme caution for sensitive workloads.

Amazon lawsuit: Amazon is suing Perplexity over Comet’s ‘agentic shopping’ behavior, alleging that automated shopping sessions accessed customer accounts and impersonated human browsing, violating platform rules and harming personalization systems.

4.4 Pricing and availability

As of October–November 2025, Comet is free to download globally; earlier Max-only and Pro-only restrictions have been removed.

Perplexity monetizes via Pro / Max subscriptions for higher model tiers and via Comet Plus (~$5 / month), which grants access to curated news and publisher content and is bundled into Pro / Max.

Fit:

Very strong for users who want maximum automation across research, communications, and purchases, and who are comfortable operating at the bleeding edge of the security and platform-policy risk curve.

Comparison Table

DimensionChatGPT Atlas (OpenAI)Edge + Copilot Mode (Microsoft)Dia (The Browser Company)Comet (Perplexity)Engine / platformChromium-based; Atlas shell with OWL architecture; macOS now, Windows / mobile planned Edge (Chromium) on Windows and macOS with optional Copilot Mode Chromium-based AI browser; macOS only, GA, no invite; Windows not yet announced Chromium-based browser with integrated Perplexity search and assistant; desktop global, mobile rolling outAgentic autonomyHigh: Agent Mode can click, navigate, fill forms, book reservations, and chain multi-step workflows inside the browser Medium: cross-tab reasoning and Actions; can perform some transactional steps but with limited scope and reliabilityLow–medium: chat, Skills, and memory over tabs; no general agent that freely manipulates arbitrary sites; autonomy intentionally constrained High: Comet Assistant executes long-running workflows across browsing, email, calendar, and e-commerce, including end-to-end shopping and planning flows Memory / personalizationBrowser memories retain summarized context for ~30 days; persistent task context across sessions, opt-in and user-controllableJourneys over history, context sharing for Copilot is opt-in; personalization tied to Microsoft account and privacy controls Local encrypted storage of history, chats, bookmarks; Dia Memory for personalization with ability to limit shared contextLocal-first browsing data plus cloud-side models; settings allow deleting local data and tuning collectionBest-fit use casesComplex research, automation-heavy workflows, and agent experiments where strong autonomy outweighs riskEveryday browsing with AI summaries and research assistance in Microsoft-centric environmentsLearning, writing, and planning where privacy and structured Skills are more important than full automationPower users who want a personal operator for browsing, communication, and shopping, and who will actively manage security and policy risk

Which browser to choose in 2025?

Pick Atlas when you want to explore the frontier of in-browser agents. It offers the richest action surface and memory model, at the cost of greater complexity in safety and compliance design.

Pick Edge + Copilot Mode when you need incremental AI assistance in a browser that already fits Microsoft-centric enterprise governance, and you prefer scoped agents over unconstrained ones.

Pick Dia when your primary workload is reading, learning, and writing, and you want strong local-first guarantees and explicit control over what information the model sees, with minimal automation.

Pick Comet only if you explicitly want a high-autonomy personal operator in your browser and are willing to track security advisories and platform policies closely.

References:

OpenAI – Introducing ChatGPT Atlashttps://openai.com/index/introducing-chatgpt-atlas/

OpenAI – How we built OWL, the new architecture behind our browserhttps://openai.com/index/building-chatgpt-atlas/

Microsoft – AI browser innovation with Copilot Mode in Edgehttps://www.microsoft.com/en-us/microsoft-copilot/for-individuals/do-more-with-ai/ai-for-daily-life/ai-browser-innovation-with-copilot-in-edge

Microsoft – Copilot Mode | Microsoft Edgehttps://www.microsoft.com/en-us/edge/copilot-mode

Dia Browser – Official sitehttps://www.diabrowser.com/

Dia Browser – Skills Galleryhttps://www.diabrowser.com/skills

9to5Mac – Dia, The Browser Company’s AI-powered browser, is now generally available on macOShttps://9to5mac.com/2025/10/08/dia-the-browser-companys-ai-powered-browser-is-now-generally-available-on-macos/

Perplexity – Comet Browser: a Personal AI Assistanthttps://www.perplexity.ai/comet/

1Password – Secure credentials on Comet with 1Passwordhttps://1password.com/partners/perplexity

Reuters – Amazon sues Perplexity over “agentic” shopping toolhttps://www.reuters.com/business/retail-consumer/perplexity-receives-legal-threat-amazon-over-agentic-ai-shopping-tool-2025-11-04/

The post Comparing the Top 4 Agentic AI Browsers in 2025: Atlas vs Copilot Mode vs Dia vs Comet appeared first on MarkTechPost.

Cerebras Releases MiniMax-M2-REAP-162B-A10B: A Memory Efficient Versio …

Posted on November 16, 2025 by i-genie

Cerebras has released MiniMax-M2-REAP-162B-A10B, a compressed Sparse Mixture-of-Experts (SMoE) Causal Language Model derived from MiniMax-M2, using the new Router weighted Expert Activation Pruning (REAP) method. The model keeps the behavior of the original 230B total, 10B active MiniMax M2, while pruning experts and reducing memory for deployment focused workloads such as coding agents and tool calling.

Architecture and core specifications

MiniMax-M2-REAP-162B-A10B has these key properties:

Base model: MiniMax-M2

Compression method: REAP, Router weighted Expert Activation Pruning

Total parameters: 162B

Active parameters per token: 10B

Layers: 62 transformer blocks

Attention heads per layer: 48

Experts: 180 experts, obtained by pruning a 256 expert configuration

Activated experts per token: 8

Context length: 196,608 tokens

License: modified MIT, derived from MiniMaxAI MiniMax M2

The SMoE design means that the model stores 162B parameters, but each token only routes through a small set of experts, so the effective compute cost per token is similar to a 10B dense model. MiniMax M2 itself is positioned as an MoE model built for coding and agentic workflows, with 230B total parameters and 10B active, which this checkpoint inherits.

How REAP compresses MiniMax-M2?

MiniMax-M2-REAP-162B-A10B is created by applying REAP uniformly across all MoE blocks of MiniMax M2, at a 30 percent expert pruning rate.

The REAP method defines a saliency score for each expert that combines:

Router gate values: How often and how strongly the router selects that expert

Expert activation norms: The magnitude of the expert output when active

Experts that contribute minimally to the layer output, under this combined criterion, are removed. The remaining experts keep their original weights and the router keeps separate gates for each of them. This is one shot compression, there is no extra fine tuning after pruning in the method definition.

A core theoretical result in the REAP’s research paper is that expert merging with summed gates causes functional subspace collapse. When experts are merged, the router loses its independent, input dependent control over those experts, so a single merged expert must approximate an input dependent mixture that was originally expressed through multiple experts. The research team proves that, whenever the router policy depends on the input and the experts are not identical, this introduces irreducible error. In contrast, pruning removes some experts but preserves independent control of the survivors, so the error scales with the gate weight of the removed experts.

Across a set of SMoE models in the 20B to 1T parameter range, REAP consistently outperforms expert merging and other pruning criteria on generative benchmarks such as code generation, mathematical reasoning and tool calling, especially at 50 percent compression.

Accuracy under 30 percent expert pruning

The MiniMax-M2-REAP-162B-A10B model gets compared on three checkpoints on standard coding, reasoning and agentic benchmarks:

MiniMax-M2 (230B, base model)

MiniMax-M2-REAP-172B-A10B, 25 percent pruning

MiniMax-M2-REAP-162B-A10B, 30 percent pruning

https://huggingface.co/cerebras/MiniMax-M2-REAP-162B-A10B

On coding benchmarks such as HumanEval, HumanEval Plus, MBPP and MBPP Plus, the 162B REAP model stays very close to the base model. HumanEval sits around 90% range, and MBPP stays in the 80% range, with the 172B and 162B models essentially tracking the original MiniMax-M2 within a few points.

On reasoning benchmarks such as AIME 25 and MATH 500, there are small shifts between the three models, but there is no collapse at 30 percent pruning and the 162B checkpoint remains competitive with the base model.

On tool calling and agentic evaluation, represented by τ2 bench in a telecom setting, the 162B REAP model again matches the base model within small variance. The model card explicitly states that this checkpoint keeps almost identical performance while being about 30 percent lighter in parameter count.

These results line up with the broader REAP study, which reports near lossless compression for code generation and tool calling on several large SMoE architectures when pruning experts using the REAP criterion.

Deployment, memory usage and observed throughput

Cerebras provides a direct vLLM serve example and positions MiniMax-M2-REAP-162B-A10B as a drop in model for the existing MiniMax M2 integration.

Copy CodeCopiedUse a different Browservllm serve cerebras/MiniMax-M2-REAP-162B-A10B
–tensor-parallel-size 8
–tool-call-parser minimax_m2
–reasoning-parser minimax_m2_append_think
–trust-remote-code
–enable_expert_parallel
–enable-auto-tool-choice

If the run hits memory limits, the card recommends lowering –max-num-seqs, for example to 64, to keep batch size in check on a given GPU.

Key Takeaways

SMoE architecture with efficient compute: MiniMax-M2-REAP-162B-A10B is a Sparse Mixture of Experts model with 162B total parameters and 10B active parameters per token, so the compute cost per token is close to a 10B dense model while keeping frontier scale capacity.

REAP expert pruning keeps behavior of MiniMax-M2: The model is produced by applying REAP Router weighted Expert Activation Pruning to MiniMax-M2 at roughly 30 percent expert pruning, pruning experts based on router gate values and expert activation norms while leaving surviving experts and router structure intact.

Near lossless accuracy at 30 percent compression: On coding benchmarks such as HumanEval and MBPP, and on reasoning benchmarks such as AIME25 and MATH 500, the 162B REAP variant tracks the 230B MiniMax-M2 and a 172B REAP variant within a few points, showing near lossless compression for code, reasoning and tool use.

Pruning outperforms expert merging for generative SMoE: The REAP study shows that pruning experts using a saliency criterion avoids the functional subspace collapse seen with expert merging in generative tasks, and performs better across large SMoE models in the 22B to about 1T parameter range.

Comparison Table

Image source: Marktechpost.com

Editorial Comments

Cerebras’ release of MiniMax-M2-REAP-162B-A10B is a strong signal that Router weighted Expert Activation Pruning is ready for real workloads, not just as a research curiosity. The checkpoint shows that a 30 percent expert pruning schedule can keep MiniMax-M2 230B-A10B behavior almost intact while cutting memory and preserving long context coding, reasoning and tool calling performance, which is exactly what SMoE researchers need for practical deployment. Overall, Cerebras is quietly turning expert pruning into production infrastructure for frontier class SMoE models.

Check out the Model Weights. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Cerebras Releases MiniMax-M2-REAP-162B-A10B: A Memory Efficient Version of MiniMax-M2 for Long Context Coding Agents appeared first on MarkTechPost.

MBZUAI Researchers Introduce PAN: A General World Model For Interactab …

Posted on November 16, 2025 by i-genie

Most text to video models generate a single clip from a prompt and then stop. They do not keep an internal world state that persists as actions arrive over time. PAN, a new model from MBZUAI’s Institute of Foundation Models, is designed to fill that gap by acting as a general world model that predicts future world states as video, conditioned on history and natural language actions.

https://arxiv.org/pdf/2511.09057

From video generator to interactive world simulator

PAN is defined as a general, interactable, long horizon world model. It maintains an internal latent state that represents the current world, then updates that state when it receives a natural language action such as ‘turn left and speed up’ or ‘move the robot arm to the red block.’ The model then decodes the updated state into a short video segment that shows the consequence of that action. This cycle repeats, so the same world state evolves across many steps.

This design allows PAN to support open domain, action conditioned simulation. It can roll out counterfactual futures for different action sequences. An external agent can query PAN as a simulator, compare predicted futures, and choose actions based on those predictions.

GLP architecture, separating what happens from how it looks

The base of PAN is the Generative Latent Prediction, GLP, architecture. GLP separates world dynamics from visual rendering. First, a vision encoder maps images or video frames into a latent world state. Second, an autoregressive latent dynamics backbone based on a large language model predicts the next latent state, conditioned on history and the current action. Third, a video diffusion decoder reconstructs the corresponding video segment from that latent state.

In PAN, the vision encoder and backbone are built on Qwen2.5-VL-7B-Instruct. The vision tower tokenizes frames into patches and produces structured embeddings. The language backbone runs over a history of world states and actions, plus learned query tokens, and outputs the latent representation of the next world state. These latents live in the shared multimodal space of the VLM, which helps ground the dynamics in both text and vision.

The video diffusion decoder is adapted from Wan2.1-T2V-14B, a diffusion transformer for high fidelity video generation. The research team trains this decoder with a flow matching objective, using one thousand denoising steps and a Rectified Flow formulation. The decoder conditions on both the predicted latent world state and the current natural language action, with a dedicated cross attention stream for the world state and another for the action text.

https://arxiv.org/pdf/2511.09057

Causal Swin DPM and sliding window diffusion

Naively chaining single shot video models by conditioning only on the last frame leads to local discontinuities and rapid quality degradation over long rollouts. PAN addresses this with Causal Swin DPM, which augments the Shift Window Denoising Process Model with chunk wise causal attention.

The decoder operates on a sliding temporal window that holds two chunks of video frames at different noise levels. During denoising, one chunk moves from high noise to clean frames and then leaves the window. A new noisy chunk enters at the other end. Chunk wise causal attention ensures that the later chunk can only attend to the earlier one, not to unseen future actions. This keeps transitions between chunks smooth and reduces error accumulation over long horizons.

PAN also adds controlled noise to the conditioning frame, rather than using a perfectly sharp frame. This suppresses incidental pixel details that do not matter for dynamics and encourages the model to focus on stable structure such as objects and layout.

https://arxiv.org/pdf/2511.09057

Training stack and data construction

PAN is trained in two stages. In the first stage, the research team adapts Wan2.1 T2V 14B into the Causal Swin DPM architecture. They train the decoder in BFloat16 with AdamW, a cosine schedule, gradient clipping, FlashAttention3 and FlexAttention kernels, and a hybrid sharded data parallel scheme across 960 NVIDIA H200 GPUs.

In the second stage, they integrate the frozen Qwen2.5 VL 7B Instruct backbone with the video diffusion decoder under the GLP objective. The vision language model remains frozen. The model learns query embeddings and the decoder so that predicted latents and reconstructed videos stay consistent. This joint training also uses sequence parallelism and Ulysses style attention sharding to handle long context sequences. Early stopping ends training after 1 epoch once validation converges, even though the schedule allows 5 epochs.

Training data comes from widely used publicly accessible video sources that cover everyday activities, human object interactions, natural environments, and multi agent scenarios. Long form videos are segmented into coherent clips using shot boundary detection. A filtering pipeline removes static or overly dynamic clips, low aesthetic quality, heavy text overlays, and screen recordings using rule based metrics, pretrained detectors, and a custom VLM filter. The research team then re-captions clips with dense, temporally grounded descriptions that emphasize motion and causal events.

Benchmarks, action fidelity, long horizon stability, planning

The research team evaluates the model along three axes, action simulation fidelity, long horizon forecast, and simulative reasoning and planning, against both open source and commercial video generators and world models. Baselines include WAN 2.1 and 2.2, Cosmos 1 and 2, V JEPA 2, and commercial systems such as KLING, MiniMax Hailuo, and Gen 3.

For action simulation fidelity, a VLM based judge scores how well the model executes language specified actions while maintaining a stable background. PAN reaches 70.3% accuracy on agent simulation and 47% on environment simulation, for an overall score of 58.6%. It achieves the highest fidelity among open source models and surpasses most commercial baselines.

For long horizon forecast, the research team measures Transition Smoothness and Simulation Consistency. Transition Smoothness uses optical flow acceleration to quantify how smooth motion is across action boundaries. Simulation Consistency uses metrics inspired by WorldScore to monitor degradation over extended sequences. PAN scores 53.6% on Transition Smoothness and 64.1% on Simulation Consistency and exceeds all baselines, including KLING and MiniMax, on these metrics.

For simulative reasoning and planning, PAN is used as an internal simulator inside an OpenAI-o3 based agent loop. In step wise simulation, PAN achieves 56.1% accuracy, the best among open source world models.

https://arxiv.org/pdf/2511.09057

Key Takwaways

PAN implements the Generative Latent Prediction architecture, combining a Qwen2.5-VL-7B based latent dynamics backbone with a Wan2.1-T2V-14B based video diffusion decoder, to unify latent world reasoning and realistic video generation.

The Causal Swin DPM mechanism introduces a sliding window, chunk wise causal denoising process that conditions on partially noised past chunks, which stabilizes long horizon video rollouts and reduces temporal drift compared to naive last frame conditioning.

PAN is trained in two stages, first adapting the Wan2.1 decoder to Causal Swin DPM on 960 NVIDIA H200 GPUs with a flow matching objective, then jointly training the GLP stack with a frozen Qwen2.5-VL backbone and learned query embeddings plus decoder.

The training corpus consists of large scale video action pairs from diverse domains, processed with segmentation, filtering, and dense temporal recaptioning, enabling PAN to learn action conditioned, long range dynamics instead of isolated short clips.

PAN achieves state of the art open source results on action simulation fidelity, long horizon forecasting, and simulative planning, with reported scores such as 70.3% agent simulation, 47% environment simulation, 53.6% transition smoothness, and 64.1% simulation consistency, while remaining competitive with leading commercial systems.

Comparison Table

DimensionPANCosmos video2world WFMWan2.1 T2V 14BV JEPA 2OrganizationMBZUAI Institute of Foundation ModelsNVIDIA ResearchWan AI and Open LaboratoryMeta AIPrimary roleGeneral world model for interactive, long horizon world simulation with natural language actionsWorld foundation model platform for Physical AI with video to world generation for control and navigationHigh quality text to video and image to video generator for general content creation and editingSelf supervised video model for understanding, prediction and planning tasksWorld model framingExplicit GLP world model, latent state, action, and next observation defined, focuses on simulative reasoning and planningDescribed as world foundation model that generates future video worlds from past video and control prompt, aimed at Physical AI, robotics, driving, navigationFramed as video generation model, not primarily as world model, no persistent internal world state described in docsJoint embedding predictive architecture for video, focuses on latent prediction rather than explicit generative supervision in observation spaceCore architectureGLP stack, vision encoder from Qwen2.5 VL 7B, LLM based latent dynamics backbone, video diffusion decoder with Causal Swin DPMFamily of diffusion based and autoregressive world models, with video2world generation, plus diffusion decoder and prompt upsampler based on a language modelSpatio temporal variational autoencoder and diffusion transformer T2V model at 14 billion parameters, supports multiple generative tasks and resolutionsJEPA style encoder plus predictor architecture that matches latent representations of consecutive video observationsBackbone and latent spaceMultimodal latent space from Qwen2.5 VL 7B, used both for encoding observations and for autoregressive latent prediction under actionsToken based video2world model with text prompt conditioning and optional diffusion decoder for refinement, latent space details depend on model variantLatent space from VAE plus diffusion transformer, driven mainly by text or image prompts, no explicit agent action sequence interfaceLatent space built from self supervised video encoder with predictive loss in representation space, not generative reconstruction lossAction or control inputNatural language actions in dialogue format, applied at every simulation step, model predicts next latent state and decodes video conditioned on action and historyControl input as text prompt and optionally camera pose for navigation and downstream tasks such as humanoid control and autonomous drivingText prompts and image inputs for content control, no explicit multi step agent action interface described as world model controlDoes not focus on natural language actions, used more as visual representation and predictor module inside larger agents or plannersLong horizon designCausal Swin DPM sliding window diffusion, chunk wise causal attention, conditioning on slightly noised last frame to reduce drift and maintain stable long horizon rolloutsVideo2world model generates future video given past window and prompt, supports navigation and long sequences but the paper does not describe a Causal Swin DPM style mechanismCan generate several seconds at 480 P and 720 P, focuses on visual quality and motion, long horizon stability is evaluated through Wan Bench but without explicit world state mechanismLong temporal reasoning comes from predictive latent modeling and self supervised training, not from generative video rollouts with explicit diffusion windowsTraining data focusLarge scale video action pairs across diverse physical and embodied domains, with segmentation, filtering and dense temporal recaptioning for action conditioned dynamicsMix of proprietary and public Internet videos focused on Physical AI categories such as driving, manipulation, human activity, navigation and nature dynamics, with a dedicated curation pipelineLarge open domain video and image corpora for general visual generation, with Wan Bench evaluation prompts, not targeted specifically at agent environment rolloutsLarge scale unlabelled video data for self supervised representation learning and prediction, details in V JEPA 2 paper

Editorial Comments

PAN is an important step because it operationalizes Generative Latent Prediction with production scale components such as Qwen2.5-VL-7B and Wan2.1-T2V-14B, then validates this stack on well defined benchmarks for action simulation, long horizon forecasting, and simulative planning. The training and evaluation pipeline is clearly documented by the research team, the metrics are reproducible, and the model is released within a transparent world modeling framework rather than as an opaque video demo. Overall, PAN shows how a vision language backbone plus diffusion video decoder can function as a practical world model instead of a pure generative toy.

Check out the Paper, Technical details and Project. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post MBZUAI Researchers Introduce PAN: A General World Model For Interactable Long Horizon Simulation appeared first on MarkTechPost.

How to Design a Fully Interactive, Reactive, and Dynamic Terminal-Base …

Posted on November 16, 2025 by i-genie

In this tutorial, we build an advanced interactive dashboard using Textual, and we explore how terminal-first UI frameworks can feel as expressive and dynamic as modern web dashboards. As we write and run each snippet, we actively construct the interface piece by piece, widgets, layouts, reactive state, and event flows, so we can see how Textual behaves like a live UI engine right inside Google Colab. By the end, we notice how naturally we can blend tables, trees, forms, and progress indicators into a cohesive application that feels fast, clean, and responsive. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser!pip install textual textual-web nest-asyncio

from textual.app import App, ComposeResult
from textual.containers import Container, Horizontal, Vertical
from textual.widgets import (
Header, Footer, Button, DataTable, Static, Input,
Label, ProgressBar, Tree, Select
)
from textual.reactive import reactive
from textual import on
from datetime import datetime
import random

class StatsCard(Static):
value = reactive(0)

def __init__(self, title: str, *args, **kwargs):
super().__init__(*args, **kwargs)
self.title = title

def compose(self) -> ComposeResult:
yield Label(self.title)
yield Label(str(self.value), id=”stat-value”)

def watch_value(self, new_value: int) -> None:
if self.is_mounted:
try:
self.query_one(“#stat-value”, Label).update(str(new_value))
except Exception:
pass

We set up the environment and import all the necessary components to build our Textual application. As we define the StatsCard widget, we establish a reusable component that reacts to changes in value and updates itself automatically. We begin to see how Textual’s reactive system lets us create dynamic UI elements with minimal effort. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass DataDashboard(App):
CSS = “””
Screen { background: $surface; }
#main-container { height: 100%; padding: 1; }
#stats-row { height: auto; margin-bottom: 1; }
StatsCard { border: solid $primary; height: 5; padding: 1; margin-right: 1; width: 1fr; }
#stat-value { text-style: bold; color: $accent; content-align: center middle; }
#control-panel { height: 12; border: solid $secondary; padding: 1; margin-bottom: 1; }
#data-section { height: 1fr; }
#left-panel { width: 30; border: solid $secondary; padding: 1; margin-right: 1; }
DataTable { height: 100%; border: solid $primary; }
Input { margin: 1 0; }
Button { margin: 1 1 1 0; }
ProgressBar { margin: 1 0; }
“””

BINDINGS = [
(“d”, “toggle_dark”, “Toggle Dark Mode”),
(“q”, “quit”, “Quit”),
(“a”, “add_row”, “Add Row”),
(“c”, “clear_table”, “Clear Table”),
]

total_rows = reactive(0)
total_sales = reactive(0)
avg_rating = reactive(0.0)

We define the DataDashboard class and configure global styles, key bindings, and reactive attributes. We decide how the app should look and behave right from the top, giving us full control over themes and interactivity. This structure helps us create a polished dashboard without writing any HTML or JS. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser def compose(self) -> ComposeResult:
yield Header(show_clock=True)

with Container(id=”main-container”):
with Horizontal(id=”stats-row”):
yield StatsCard(“Total Rows”, id=”card-rows”)
yield StatsCard(“Total Sales”, id=”card-sales”)
yield StatsCard(“Avg Rating”, id=”card-rating”)

with Vertical(id=”control-panel”):
yield Input(placeholder=”Product Name”, id=”input-name”)
yield Select(
[(“Electronics”, “electronics”),
(“Books”, “books”),
(“Clothing”, “clothing”)],
prompt=”Select Category”,
id=”select-category”
)
with Horizontal():
yield Button(“Add Row”, variant=”primary”, id=”btn-add”)
yield Button(“Clear Table”, variant=”warning”, id=”btn-clear”)
yield Button(“Generate Data”, variant=”success”, id=”btn-generate”)
yield ProgressBar(total=100, id=”progress”)

with Horizontal(id=”data-section”):
with Container(id=”left-panel”):
yield Label(“Navigation”)
tree = Tree(“Dashboard”)
tree.root.expand()
products = tree.root.add(“Products”, expand=True)
products.add_leaf(“Electronics”)
products.add_leaf(“Books”)
products.add_leaf(“Clothing”)
tree.root.add_leaf(“Reports”)
tree.root.add_leaf(“Settings”)
yield tree

yield DataTable(id=”data-table”)

yield Footer()

We compose the entire UI layout, arranging containers, cards, form inputs, buttons, a navigation tree, and a data table. As we structure these components, we watch the interface take shape exactly the way we envision it. This snippet lets us design the visual skeleton of the dashboard in a clean, declarative manner. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser def on_mount(self) -> None:
table = self.query_one(DataTable)
table.add_columns(“ID”, “Product”, “Category”, “Price”, “Sales”, “Rating”)
table.cursor_type = “row”
self.generate_sample_data(5)
self.set_interval(0.1, self.update_progress)

def generate_sample_data(self, count: int = 5) -> None:
table = self.query_one(DataTable)
categories = [“Electronics”, “Books”, “Clothing”]
products = {
“Electronics”: [“Laptop”, “Phone”, “Tablet”, “Headphones”],
“Books”: [“Novel”, “Textbook”, “Magazine”, “Comic”],
“Clothing”: [“Shirt”, “Pants”, “Jacket”, “Shoes”]
}

for _ in range(count):
category = random.choice(categories)
product = random.choice(products[category])
row_id = self.total_rows + 1
price = round(random.uniform(10, 500), 2)
sales = random.randint(1, 100)
rating = round(random.uniform(1, 5), 1)

table.add_row(
str(row_id),
product,
category,
f”${price}”,
str(sales),
str(rating)
)

self.total_rows += 1
self.total_sales += sales

self.update_stats()

def update_stats(self) -> None:
self.query_one(“#card-rows”, StatsCard).value = self.total_rows
self.query_one(“#card-sales”, StatsCard).value = self.total_sales

if self.total_rows > 0:
table = self.query_one(DataTable)
total_rating = sum(float(row[5]) for row in table.rows)
self.avg_rating = round(total_rating / self.total_rows, 2)
self.query_one(“#card-rating”, StatsCard).value = self.avg_rating

def update_progress(self) -> None:
progress = self.query_one(ProgressBar)
progress.advance(1)
if progress.progress >= 100:
progress.progress = 0

We implement all the logic for generating data, computing statistics, animating progress, and updating cards. We see how quickly we can bind backend logic to frontend components using Textual’s reactive model. This step makes the dashboard feel alive as numbers update instantly and progress bars animate smoothly. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser @on(Button.Pressed, “#btn-add”)
def handle_add_button(self) -> None:
name_input = self.query_one(“#input-name”, Input)
category = self.query_one(“#select-category”, Select).value

if name_input.value and category:
table = self.query_one(DataTable)
row_id = self.total_rows + 1
price = round(random.uniform(10, 500), 2)
sales = random.randint(1, 100)
rating = round(random.uniform(1, 5), 1)

table.add_row(
str(row_id),
name_input.value,
str(category),
f”${price}”,
str(sales),
str(rating)
)

self.total_rows += 1
self.total_sales += sales
self.update_stats()
name_input.value = “”

@on(Button.Pressed, “#btn-clear”)
def handle_clear_button(self) -> None:
table = self.query_one(DataTable)
table.clear()
self.total_rows = 0
self.total_sales = 0
self.avg_rating = 0
self.update_stats()

@on(Button.Pressed, “#btn-generate”)
def handle_generate_button(self) -> None:
self.generate_sample_data(10)

def action_toggle_dark(self) -> None:
self.dark = not self.dark

def action_add_row(self) -> None:
self.handle_add_button()

def action_clear_table(self) -> None:
self.handle_clear_button()

if __name__ == “__main__”:
import nest_asyncio
nest_asyncio.apply()
app = DataDashboard()
app.run()

We connect UI events to backend actions using button handlers, keyboard shortcuts, and app-level functions. As we run the app, we interact with a fully functional dashboard that responds instantly to every click and command. This snippet completes the application and demonstrates how easily Textual enables us to build dynamic, state-driven UIs.

In conclusion, we see the whole dashboard come together in a fully functional, interactive form that runs directly from a notebook environment. We experience firsthand how Textual lets us design terminal UIs with the structure and feel of web apps, while staying entirely in Python. This tutorial leaves us confident that we can extend this foundation, even adding charts, API feeds, and multi-page navigation, as we continue to experiment with Textual’s modern reactive UI capabilities.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How to Design a Fully Interactive, Reactive, and Dynamic Terminal-Based Data Dashboard Using Textual? appeared first on MarkTechPost.

OpenAI Researchers Train Weight Sparse Transformers to Expose Interpre …

Posted on November 15, 2025 by i-genie

If neural networks are now making decisions everywhere from code editors to safety systems, how can we actually see the specific circuits inside that drive each behavior? OpenAI has introduced a new mechanistic interpretability research study that trains language models to use sparse internal wiring, so that model behavior can be explained using small, explicit circuits.

https://cdn.openai.com/pdf/41df8f28-d4ef-43e9-aed2-823f9393e470/circuit-sparsity-paper.pdf

Training transformers to be weight sparse

Most transformer language models are dense. Each neuron reads from and writes to many residual channels, and features are often in superposition. This makes circuit level analysis difficult. Previous OpenAI work tried to learn sparse feature bases on top of dense models using sparse autoencoders. The new research work instead changes the base model so that the transformer itself is weight sparse.

The OpenAI team trains decoder only transformers with an architecture similar to GPT 2. After each optimizer step with AdamW optimizer, they enforce a fixed sparsity level on every weight matrix and bias, including token embeddings. Only the largest magnitude entries in each matrix are kept. The rest are set to zero. Over training, an annealing schedule gradually drives the fraction of non zero parameters down until the model reaches a target sparsity.

In the most extreme setting, roughly 1 in 1000 weights is non zero. Activations are also somewhat sparse. Around 1 in 4 activations are non zero at a typical node location. The effective connectivity graph is therefore very thin even when the model width is large. This encourages disentangled features that map cleanly onto the residual channels the circuit uses.

https://cdn.openai.com/pdf/41df8f28-d4ef-43e9-aed2-823f9393e470/circuit-sparsity-paper.pdf

Measuring interpretability through task specific pruning

To quantify whether these models are easier to understand, OpenAI team does not rely on qualitative examples alone. The research team define a suite of simple algorithmic tasks based on Python next token prediction. One example, single_double_quote, requires the model to close a Python string with the right quote character. Another example, set_or_string, requires the model to choose between .add and += based on whether a variable was initialized as a set or a string.

For each task, they search for the smallest subnetwork, called a circuit, that can still perform the task up to a fixed loss threshold. The pruning is node based. A node is an MLP neuron at a specific layer, an attention head, or a residual stream channel at a specific layer. When a node is pruned, its activation is replaced by its mean over the pretraining distribution. This is mean ablation.

The search uses continuous mask parameters for each node and a Heaviside style gate, optimized with a straight through estimator like surrogate gradient. The complexity of a circuit is measured as the count of active edges between retained nodes. The main interpretability metric is the geometric mean of edge counts across all tasks.

Example circuits in sparse transformers

On the single_double_quote task, the sparse models yield a compact and fully interpretable circuit. In an early MLP layer, one neuron behaves as a quote detector that activates on both single and double quotes. A second neuron behaves as a quote type classifier that distinguishes the two quote types. Later, an attention head uses these signals to attend back to the opening quote position and copy its type to the closing position.

In circuit graph terms, the mechanism uses 5 residual channels, 2 MLP neurons in layer 0, and 1 attention head in a later layer with a single relevant query key channel and a single value channel. If the rest of the model is ablated, this subgraph still solves the task. If these few edges are removed, the model fails on the task. The circuit is therefore both sufficient and necessary in the operational sense defined by the paper.

https://cdn.openai.com/pdf/41df8f28-d4ef-43e9-aed2-823f9393e470/circuit-sparsity-paper.pdf

For more complex behaviors, such as type tracking of a variable named current inside a function body, the recovered circuits are larger and only partially understood. The research team show an example where one attention operation writes the variable name into the token set() at the definition, and another attention operation later copies the type information from that token back into a later use of current. This still yields a relatively small circuit graph.

Key Takeaways

Weight-sparse transformers by design: OpenAI trains GPT-2 style decoder only transformers so that almost all weights are zero, around 1 in 1000 weights is non zero, enforcing sparsity across all weights and biases including token embeddings, which yields thin connectivity graphs that are structurally easier to analyze.

Interpretability is measured as minimal circuit size: The work defines a benchmark of simple Python next token tasks and, for each task, searches for the smallest subnetwork, in terms of active edges between nodes, that still reaches a fixed loss, using node level pruning with mean ablation and a straight through estimator style mask optimization.

Concrete, fully reverse engineered circuits emerge: On tasks such as predicting matching quote characters, the sparse model yields a compact circuit with a few residual channels, 2 key MLP neurons and 1 attention head that the authors can fully reverse engineer and verify as both sufficient and necessary for the behavior.

Sparsity delivers much smaller circuits at fixed capability: At matched pre-training loss levels, weight sparse models require circuits that are roughly 16 times smaller than those recovered from dense baselines, defining a capability interpretability frontier where increased sparsity improves interpretability while slightly reducing raw capability.

Editorial Comments

OpenAI’s work on weight sparse transformers is a pragmatic step toward making mechanistic interpretability operational. By enforcing sparsity directly in the base model, the paper turns abstract discussions of circuits into concrete graphs with measurable edge counts, clear necessity and sufficiency tests, and reproducible benchmarks on Python next token tasks. The models are small and inefficient, but the methodology is relevant for future safety audits and debugging workflows. This research treats interpretability as a first class design constraint rather than an after the fact diagnostic.

Check out the Paper, GitHub Repo and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post OpenAI Researchers Train Weight Sparse Transformers to Expose Interpretable Circuits appeared first on MarkTechPost.

How to Design an Advanced Multi-Agent Reasoning System with spaCy Feat …

Posted on November 15, 2025 by i-genie

In this tutorial, we build an advanced Agentic AI system using spaCy, designed to allow multiple intelligent agents to reason, collaborate, reflect, and learn from experience. We work through the entire pipeline step by step, observing how each agent processes tasks using planning, memory, communication, and semantic reasoning. By the end, we see how the system evolves into a dynamic multi-agent architecture capable of extracting entities, interpreting context, forming reasoning chains, and constructing knowledge graphs, all while continuously improving through reflection and episodic learning. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser!pip install spacy networkx matplotlib -q

import spacy
from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass, field
from collections import defaultdict, deque
from enum import Enum
import json
import hashlib
from datetime import datetime

class MessageType(Enum):
REQUEST = “request”
RESPONSE = “response”
BROADCAST = “broadcast”
QUERY = “query”

@dataclass
class Message:
sender: str
receiver: str
msg_type: MessageType
content: Dict[str, Any]
timestamp: float = field(default_factory=lambda: datetime.now().timestamp())
priority: int = 1
def get_id(self) -> str:
return hashlib.md5(f”{self.sender}{self.timestamp}”.encode()).hexdigest()[:8]

@dataclass
class AgentTask:
task_id: str
task_type: str
data: Any
priority: int = 1
dependencies: List[str] = field(default_factory=list)
metadata: Dict = field(default_factory=dict)

@dataclass
class Observation:
state: str
action: str
result: Any
confidence: float
timestamp: float = field(default_factory=lambda: datetime.now().timestamp())

class WorkingMemory:
def __init__(self, capacity: int = 10):
self.capacity = capacity
self.items = deque(maxlen=capacity)
self.attention_scores = {}
def add(self, key: str, value: Any, attention: float = 1.0):
self.items.append((key, value))
self.attention_scores[key] = attention
def recall(self, n: int = 5) -> List[Tuple[str, Any]]:
sorted_items = sorted(self.items, key=lambda x: self.attention_scores.get(x[0], 0), reverse=True)
return sorted_items[:n]
def get(self, key: str) -> Optional[Any]:
for k, v in self.items:
if k == key:
return v
return None

class EpisodicMemory:
def __init__(self):
self.episodes = []
self.success_patterns = defaultdict(int)
def store(self, observation: Observation):
self.episodes.append(observation)
if observation.confidence > 0.7:
pattern = f”{observation.state}→{observation.action}”
self.success_patterns[pattern] += 1
def query_similar(self, state: str, top_k: int = 3) -> List[Observation]:
scored = [(obs, self._similarity(state, obs.state)) for obs in self.episodes[-50:]]
scored.sort(key=lambda x: x[1], reverse=True)
return [obs for obs, _ in scored[:top_k]]
def _similarity(self, state1: str, state2: str) -> float:
words1, words2 = set(state1.split()), set(state2.split())
if not words1 or not words2:
return 0.0
return len(words1 & words2) / len(words1 | words2)

We establish all the core structures required for our agentic system. We import key libraries, define message and task formats, and build both working and episodic memory modules. As we define these foundations, we lay the groundwork for reasoning, storage, and communication. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass ReflectionModule:
def __init__(self):
self.performance_log = []
def reflect(self, task_type: str, confidence: float, result: Any) -> Dict[str, Any]:
self.performance_log.append({‘task’: task_type, ‘confidence’: confidence, ‘timestamp’: datetime.now().timestamp()})
recent = [p for p in self.performance_log if p[‘task’] == task_type][-5:]
avg_conf = sum(p[‘confidence’] for p in recent) / len(recent) if recent else 0.5
insights = {
‘performance_trend’: ‘improving’ if confidence > avg_conf else ‘declining’,
‘avg_confidence’: avg_conf,
‘recommendation’: self._get_recommendation(confidence, avg_conf)
}
return insights
def _get_recommendation(self, current: float, average: float) -> str:
if current < 0.4:
return “Request assistance from specialized agent”
elif current < average:
return “Review similar past cases for patterns”
else:
return “Continue with current approach”

class AdvancedAgent:
def __init__(self, name: str, specialty: str, nlp):
self.name = name
self.specialty = specialty
self.nlp = nlp
self.working_memory = WorkingMemory()
self.episodic_memory = EpisodicMemory()
self.reflector = ReflectionModule()
self.message_queue = deque()
self.collaboration_graph = defaultdict(int)
def plan(self, task: AgentTask) -> List[str]:
similar = self.episodic_memory.query_similar(str(task.data))
if similar and similar[0].confidence > 0.7:
return [similar[0].action]
return self._default_plan(task)
def _default_plan(self, task: AgentTask) -> List[str]:
return [‘analyze’, ‘extract’, ‘validate’]
def send_message(self, receiver: str, msg_type: MessageType, content: Dict):
msg = Message(self.name, receiver, msg_type, content)
self.message_queue.append(msg)
return msg
def receive_message(self, message: Message):
self.message_queue.append(message)
self.collaboration_graph[message.sender] += 1
def process(self, task: AgentTask) -> Dict[str, Any]:
raise NotImplementedError

class CognitiveEntityAgent(AdvancedAgent):
def process(self, task: AgentTask) -> Dict[str, Any]:
doc = self.nlp(task.data)
entities = defaultdict(list)
entity_contexts = []
for ent in doc.ents:
context_start = max(0, ent.start – 5)
context_end = min(len(doc), ent.end + 5)
context = doc[context_start:context_end].text
entities[ent.label_].append(ent.text)
entity_contexts.append({‘entity’: ent.text, ‘type’: ent.label_, ‘context’: context, ‘position’: (ent.start_char, ent.end_char)})
for ent_type, ents in entities.items():
attention = len(ents) / len(doc.ents) if doc.ents else 0
self.working_memory.add(f”entities_{ent_type}”, ents, attention)
confidence = min(len(entities) / 4, 1.0) if entities else 0.3
obs = Observation(state=f”entity_extraction_{len(doc)}tokens”, action=”extract_with_context”, result=len(entity_contexts), confidence=confidence)
self.episodic_memory.store(obs)
reflection = self.reflector.reflect(‘entity_extraction’, confidence, entities)
return {‘entities’: dict(entities), ‘contexts’: entity_contexts, ‘confidence’: confidence, ‘reflection’: reflection, ‘next_actions’: [‘semantic_analysis’, ‘knowledge_graph’] if confidence > 0.5 else []}

We construct the reflection engine and the base agent class, which provides every agent with reasoning, planning, and memory capabilities. We then implement the Cognitive Entity Agent, which processes text to extract entities with context and stores meaningful observations. As we run this part, we watch the agent learn from experience while dynamically adjusting its strategy. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass SemanticReasoningAgent(AdvancedAgent):
def process(self, task: AgentTask) -> Dict[str, Any]:
doc = self.nlp(task.data)
reasoning_chains = []
for sent in doc.sents:
chain = self._extract_reasoning_chain(sent)
if chain:
reasoning_chains.append(chain)
entity_memory = self.working_memory.recall(3)
semantic_clusters = self._cluster_by_semantics(doc)
confidence = min(len(reasoning_chains) / 3, 1.0) if reasoning_chains else 0.4
obs = Observation(state=f”semantic_analysis_{len(list(doc.sents))}sents”, action=”reason_and_cluster”, result=len(reasoning_chains), confidence=confidence)
self.episodic_memory.store(obs)
return {‘reasoning_chains’: reasoning_chains, ‘semantic_clusters’: semantic_clusters, ‘memory_context’: entity_memory, ‘confidence’: confidence, ‘next_actions’: [‘knowledge_integration’]}
def _extract_reasoning_chain(self, sent) -> Optional[Dict]:
subj, verb, obj = None, None, None
for token in sent:
if token.dep_ == ‘nsubj’:
subj = token
elif token.pos_ == ‘VERB’:
verb = token
elif token.dep_ in [‘dobj’, ‘attr’, ‘pobj’]:
obj = token
if subj and verb and obj:
return {‘subject’: subj.text, ‘predicate’: verb.lemma_, ‘object’: obj.text, ‘confidence’: 0.8}
return None
def _cluster_by_semantics(self, doc) -> List[Dict]:
clusters = []
nouns = [token for token in doc if token.pos_ in [‘NOUN’, ‘PROPN’]]
visited = set()
for noun in nouns:
if noun.i in visited:
continue
cluster = [noun.text]
visited.add(noun.i)
for other in nouns:
if other.i != noun.i and other.i not in visited:
if noun.similarity(other) > 0.5:
cluster.append(other.text)
visited.add(other.i)
if len(cluster) > 1:
clusters.append({‘concepts’: cluster, ‘size’: len(cluster)})
return clusters

We design the Semantic Reasoning Agent, which analyzes sentence structures, forms reasoning chains, and groups concepts based on semantic similarity. We integrate working memory to enrich the understanding the agent builds. As we execute this, we see how the system moves from surface-level extraction to deeper inference. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass KnowledgeGraphAgent(AdvancedAgent):
def process(self, task: AgentTask) -> Dict[str, Any]:
doc = self.nlp(task.data)
graph = {‘nodes’: set(), ‘edges’: []}
for sent in doc.sents:
entities = list(sent.ents)
if len(entities) >= 2:
for ent in entities:
graph[‘nodes’].add((ent.text, ent.label_))
root = sent.root
if root.pos_ == ‘VERB’:
for i in range(len(entities) – 1):
graph[‘edges’].append({‘from’: entities[i].text, ‘relation’: root.lemma_, ‘to’: entities[i+1].text, ‘sentence’: sent.text[:100]})
graph[‘nodes’] = list(graph[‘nodes’])
confidence = min(len(graph[‘edges’]) / 5, 1.0) if graph[‘edges’] else 0.3
obs = Observation(state=f”knowledge_graph_{len(graph[‘nodes’])}nodes”, action=”construct_graph”, result=len(graph[‘edges’]), confidence=confidence)
self.episodic_memory.store(obs)
return {‘graph’: graph, ‘node_count’: len(graph[‘nodes’]), ‘edge_count’: len(graph[‘edges’]), ‘confidence’: confidence, ‘next_actions’: []}

class MetaController:
def __init__(self):
self.nlp = spacy.load(‘en_core_web_sm’)
self.agents = {
‘cognitive_entity’: CognitiveEntityAgent(‘CognitiveEntity’, ‘entity_analysis’, self.nlp),
‘semantic_reasoning’: SemanticReasoningAgent(‘SemanticReasoner’, ‘reasoning’, self.nlp),
‘knowledge_graph’: KnowledgeGraphAgent(‘KnowledgeBuilder’, ‘graph_construction’, self.nlp)
}
self.task_history = []
self.global_memory = WorkingMemory(capacity=20)
def execute_with_planning(self, text: str) -> Dict[str, Any]:
initial_task = AgentTask(task_id=”task_001″, task_type=”cognitive_entity”, data=text, metadata={‘source’: ‘user_input’})
results = {}
task_queue = [initial_task]
iterations = 0
max_iterations = 10
while task_queue and iterations < max_iterations:
task = task_queue.pop(0)
agent = self.agents.get(task.task_type)
if not agent or task.task_type in results:
continue
result = agent.process(task)
results[task.task_type] = result
self.global_memory.add(task.task_type, result, result[‘confidence’])
for next_action in result.get(‘next_actions’, []):
if next_action in self.agents and next_action not in results:
next_task = AgentTask(task_id=f”task_{iterations+1:03d}”, task_type=next_action, data=text, dependencies=[task.task_id])
task_queue.append(next_task)
iterations += 1
self.task_history.append({‘results’: results, ‘iterations’: iterations, ‘timestamp’: datetime.now().isoformat()})
return results
def generate_insights(self, results: Dict[str, Any]) -> str:
report = “=” * 70 + “n”
report += ” ADVANCED AGENTIC AI SYSTEM – ANALYSIS REPORTn”
report += “=” * 70 + “nn”
for agent_type, result in results.items():
agent = self.agents[agent_type]
report += f” {agent.name}n”
report += f” Specialty: {agent.specialty}n”
report += f” Confidence: {result[‘confidence’]:.2%}n”
if ‘reflection’ in result:
report += f” Performance: {result[‘reflection’].get(‘performance_trend’, ‘N/A’)}n”
report += ” Key Findings:n”
report += json.dumps({k: v for k, v in result.items() if k not in [‘reflection’, ‘next_actions’]}, indent=6) + “nn”
report += ” System-Level Insights:n”
report += f” Total iterations: {len(self.task_history)}n”
report += f” Active agents: {len(results)}n”
report += f” Global memory size: {len(self.global_memory.items)}n”
return report

We implement the Knowledge Graph Agent, enabling the system to connect entities through relations extracted from text. We then build the Meta-Controller, which coordinates all agents, manages planning, and handles multi-step execution. As we use this component, we watch the system behave like a true multi-agent pipeline with dynamic flow control. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserif __name__ == “__main__”:
sample_text = “””
Artificial intelligence researchers at OpenAI and DeepMind are developing
advanced language models. Sam Altman leads OpenAI in San Francisco, while
Demis Hassabis heads DeepMind in London. These organizations collaborate
with universities like MIT and Stanford. Their research focuses on machine
learning, neural networks, and reinforcement learning. The breakthrough
came when transformers revolutionized natural language processing in 2017.
“””
controller = MetaController()
results = controller.execute_with_planning(sample_text)
print(controller.generate_insights(results))
print(“Advanced multi-agent analysis complete with reflection and learning!”)

We run the entire agentic system end-to-end on a sample text. We execute planning, call each agent in sequence, and generate a comprehensive analysis report. As we reach this stage, we see the full power of the multi-agent architecture working together in real time.

In conclusion, we developed a comprehensive multi-agent reasoning framework that operates on real-world text using spaCy, integrating planning, learning, and memory into a cohesive workflow. We observe how each agent contributes a unique layer of understanding, and we see the Meta-Controller orchestrate them to generate rich, interpretable insights. Lastly, we recognize the flexibility and extensibility of this agentic design, and we feel confident that we can now adapt it to more complex tasks, larger datasets, or even integrate language models to further enhance the system’s intelligence.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How to Design an Advanced Multi-Agent Reasoning System with spaCy Featuring Planning, Reflection, Memory, and Knowledge Graphs appeared first on MarkTechPost.

Comparing the Top 6 Agent-Native Rails for the Agentic Internet: MCP, …

Posted on November 15, 2025 by i-genie

As AI agents move from single-app copilots to autonomous systems that browse, transact, and coordinate with each other, a new infrastructure layer is emerging underneath them. This article compares six key “agent-native rails” — MCP, A2A, AP2, ACP, x402, and Kite — focusing on how they standardize tool access, inter-agent communication, payment authorization, and settlement, and what that means for engineers designing secure, commerce-capable agentic systems.

The agent stack is around six trending agentic ‘rails’:

MCP – standard interface for tools and data.

A2A – transport and lifecycle for agent-to-agent calls.

AP2 – trust and mandates for agent-initiated payments.

ACP – interaction model for agentic checkout and commerce flows.

x402 – HTTP-native, on-chain payment protocol for APIs and agents.

Kite – L1 + state channels for high-frequency agent payments and policy-enforced autonomy.

They are complementary, not competing: MCP and A2A wire agents to context and each other, AP2/ACP encode commercial intent, and x402/Kite handle settlement.

The 6 rails at a glance

RailLayerPrimary roleTransport / substrateMCP (Model Context Protocol)Tools & dataStandard interface to tools, data sources, promptsJSON-RPC over stdio / process, HTTP / SSEA2A (Agent2Agent)Agent meshDiscovery and task lifecycle between agentsJSON-RPC 2.0 over HTTPS, optional SSE streamsAP2 (Agent Payments Protocol)Payment control planeVerifiable mandates and roles for agent paymentsProtocol-agnostic over existing rails, including blockchains like SuiACP (Agentic Commerce Protocol)Commerce flowsShared language for catalog, offers, checkout stateProtocol spec + HTTP APIs, open standard co-developed by OpenAI and Stripex402Settlement railInternet-native, per-request payments for APIs and agentsHTTP 402 with on-chain stablecoins such as USDCKiteL1 + state channelsAgent-centric chain with identity and streaming micropaymentsL1 chain + off-chain state-channel rails for agents

The rest of the article unpacks each rail along four axes:

Capabilities

Security posture

Ecosystem traction

OS / runtime integration trajectory

1. MCP: tool and context rail

Capabilities

The Model Context Protocol is an open protocol for connecting LLM applications to external tools and data. It defines a client–server architecture:

MCP clients (agents, IDEs, chat UIs) connect to

MCP servers that expose tools, resources, and prompts via a standardized JSON-RPC schema.

Tools are strongly typed (name + JSON schema for parameters and results) and can wrap arbitrary systems: HTTP APIs, databases, file operations, internal services, etc.

The same protocol works across transports (stdio for local processes, HTTP/SSE for remote servers), which is why multiple runtimes can consume the same MCP servers.

Security posture

MCP is deliberately agnostic about identity and payments. Security is inherited from the host:

Servers can run locally or remotely and may have full access to files, networks, and cloud APIs.

The main risks are classic: arbitrary code execution in tools, prompt injection, over-privileged credentials, and exfiltration of sensitive data.

Security guidance from Red Hat and others focuses on:

Least-privilege credentials per MCP server.

Sandboxing tools where possible.

Strong review and signing of server configurations.

Logging and audit for tool calls.

MCP itself does not give you access control semantics like ‘this agent can call this tool only under policy P’; those are layered on by hosts and IAM systems.

Ecosystem traction

MCP moved from Anthropic-only to ecosystem standard quickly:

Anthropic launched MCP and open-sourced the spec and TypeScript schemas.

OpenAI added full MCP client support in ChatGPT Developer Mode and the platform ‘Connectors’ system.

Microsoft integrated MCP into VS Code, Visual Studio, GitHub Copilot, and Copilot for Azure, including an “Azure MCP server.”

LangChain and LangGraph ship langchain-mcp-adapters for treating MCP tools as first-class LangChain tools.

Cloudflare runs a catalog of managed remote MCP servers and exposes them via its Agents SDK.

MCP is now effectively the ‘USB-C port’ for agent tools across IDEs, browsers, cloud agents, and edge runtimes

2. A2A: agent-to-agent protocol

Capabilities

The Agent2Agent (A2A) protocol is an open standard for inter-agent communication and task handoff. The spec defines:

A2A client – initiates tasks on behalf of a user or system.

A2A server (remote agent) – exposes a JSON-RPC endpoint that executes tasks.

Agent cards – JSON metadata at well-known paths (for example, /.well-known/agent-card.json) describing capabilities, endpoint, and auth.

Transport is standardized:

JSON-RPC 2.0 over HTTPS for requests and responses.

Optional SSE streams for long-running or streaming tasks.

This gives agents a common ‘RPC fabric’ independent of vendor or framework.

Security posture

At the protocol layer, A2A leans on common web primitives:

HTTPS with standard auth (API keys, OAuth-like tokens, mTLS) negotiated based on agent cards.

JSON-RPC 2.0 message format; parser correctness is a concern, since bugs in JSON-RPC handling become a security vector.

Red Hat and other analyses highlight:

Keep JSON-RPC libraries patched.

Protect against replay and downgrade attacks at the HTTP / TLS layer.

Treat agent-to-agent traffic like service-mesh traffic: identity, authz, and rate-limiting matter.

The protocol does not itself decide which agents should talk; that is a policy question for the platform.

Ecosystem traction

Google introduced A2A and is driving it as an interoperability layer for agents across enterprise platforms.

The A2A open-source org maintains the reference spec and implementation.

Amazon Bedrock AgentCore Runtime now supports A2A as a first-class protocol, with documented contract requirements.

Third-party frameworks (for example, CopilotKit) are adopting A2A for cross-agent and app-agent communication.

3. AP2: payment control layer

Capabilities

Agent Payments Protocol (AP2) is Google’s open standard for agent-initiated payments. Its core problem statement: when an AI agent pays, how do we know it had permission, the payment matches user intent, and someone is clearly accountable?

AP2 introduces:

Mandates – cryptographically signed digital contracts that encode who can pay, under which limits, for what kinds of transactions.

Role separation – payer agents, merchants, issuers, networks, and wallets each have explicit protocol roles.

Rail-agnostic design – AP2 can authorize payments over cards, bank transfers, or programmable blockchains such as Sui.

The protocol is designed to compose with A2A and MCP: A2A handles the messaging, MCP connects to tools, AP2 governs the payment semantics.

Security posture

Security is the main reason AP2 exists:

Mandates are signed using modern public-key cryptography and can be independently verified.

The protocol explicitly targets authorization, authenticity, and accountability: did the agent have permission, does the action match user intent, and who is liable if something goes wrong.

Ecosystem traction

AP2 is still early but already has meaningful backing:

Google announced AP2 with more than 60 organizations across ecommerce, payments, banking, and crypto as collaborators or early supporters.

Cohorts include networks like Mastercard and American Express, wallets and PSPs such as PayPal, and crypto players including Coinbase.

4. ACP: commerce interaction model

Capabilities

The Agentic Commerce Protocol (ACP), co-developed by OpenAI and Stripe, is the interaction model underlying ChatGPT Instant Checkout. It gives agents and merchants a shared language for:

Product discovery (catalog and offers).

Configuration (variants, shipping options).

Checkout state (selected item, price, shipping, terms).

Fulfillment and post-purchase status.

ACP is designed to:

Work across processors and business types without forcing backend rewrites.

Keep merchants as the merchant of record for fulfillment, returns, and support, even when the interaction starts in an agent.

Security posture

In ACP deployments:

Payments are handled by processors such as Stripe; ACP itself focuses on the structure of the commerce interaction, not on cryptography.

OpenAI’s Instant Checkout uses limited-scope payment credentials and explicit confirmation steps in the ChatGPT UI, which makes agent-initiated purchases visible to the user.

ACP does not replace anti-fraud, KYC, or PCI responsibilities; those remain with the PSPs and merchants.

Ecosystem traction

OpenAI and Stripe have open-sourced ACP and are actively recruiting merchants and platforms.

Instant Checkout is live for Etsy sellers, with Shopify merchants and additional regions coming next, and multiple press reports highlight ACP as the underlying protocol.

Salesforce has announced ACP-based integrations for its Agentforce Commerce stack.

ACP is essentially becoming the agent-side ‘checkout API‘ for multiple commerce ecosystems.

5. x402: HTTP-native settlement

Capabilities

x402 is Coinbase’s open payment protocol for AI agents and APIs. It revives HTTP status code 402 Payment Required as the trigger for machine-initiated, per-request payments.

Key properties:

Instant, automatic stablecoin payments over HTTP, primarily using USDC on chains like Base.

Clients (agents, apps) can pay for API calls, content, or services without accounts or sessions, by programmatically responding to 402 challenges.

Designed for both human and machine consumers, but the machine-to-machine case is explicitly emphasized.

Security posture

Settlement is on-chain, so the usual blockchain guarantees (and risks) apply: immutability, transparent balances, but exposure to contract bugs and key theft.

Coinbase runs the compliant infrastructure (KYT, sanctions screening, etc.) behind its managed offering.

There are no chargebacks; dispute handling must be layered at ACP/AP2 or application level.

Ecosystem traction

Coinbase and Cloudflare announced the x402 Foundation to push x402 as an open standard for internet payments, targeting both agents and human-facing APIs.

Cloudflare integrated x402 into its Agents SDK and MCP integration, so Workers and agents can offer paywalled endpoints and call x402 servers with a single wrapper.

6. Kite: agent-native L1 and state channels

Capabilities

Kite is an AI-oriented L1 chain and payment rail designed for agentic commerce. It states:

State-channel based micropayments– agents open off-chain channels and stream tiny payments with instant finality, settling periodically on-chain.

Agent-centric identity and constraints– cryptographic identity is used to bind agents and users, with protocol-level spend constraints and policy enforcement.

PoAI-oriented design– the chain is explicitly tuned for the AI-agent economy, not generic DeFi.

Security posture

Kite inherits L1 security concerns (consensus safety, smart-contract correctness) plus state-channel specifics:

Off-chain channels must be protected against fraud (for example, outdated state publication) and key compromise.

Policy constraints are enforced at protocol level; if implemented correctly, this can significantly reduce the chance of runaway spending by agents.

Because the design is agent-specific, there is less ‘legacy baggage’ than in generalized DeFi chains, but also less battle-tested code.

Ecosystem traction

PayPal Ventures and others have publicly backed Kite as part of the agentic commerce stack.

Crypto and infra publications describe it as a complementary rail to x402, optimized for streaming, high-frequency interactions between agents.

The ecosystem is still young compared to mainstream L1s, but it is clearly positioned as an ‘AI-payments L1,’ not a general-purpose chain.

How the rails compose in real systems

A realistic agentic workflow will touch several of these rails:

Tooling and data

An IDE agent, OS agent, or backend agent connects to internal APIs, file systems, and monitoring systems via MCP servers.

Multi-agent orchestration

The primary agent delegates specialized tasks (for example, cost optimization, legal review, marketing ops) to other agents via A2A.

Commerce flow

For purchasing, the agent enters an ACP flow with a merchant: fetch catalog, configure a product, receive a priced offer, confirm checkout state.

Payment authorization

The user has previously granted an AP2 mandate to a wallet-backed payment agent, specifying limits and scope. The commerce or orchestration agent requests payment via that AP2-capable payment agent.

Settlement

Depending on the scenario, the payment agent may:

Use traditional rails (card, bank) under AP2, or

Use x402 for per-call on-chain payments to an API, or

Use Kite state channels for streaming micro-transactions between agents.

This composition preserves separation of concerns:

MCP & A2A: who talks to whom, and about what.

AP2 & ACP: how intent, consent, and liability for commerce are encoded.

x402 & Kite: how value is actually moved at low latency.

References:

Model Context Protocol – official sitehttps://modelcontextprotocol.io/

Anthropic: “Introducing the Model Context Protocol”https://www.anthropic.com/news/model-context-protocol

Claude Docs: “Model Context Protocol (MCP)”https://docs.claude.com/en/docs/mcp

OpenAI Docs: “Connectors and MCP servers”https://platform.openai.com/docs/guides/tools-connectors-mcp

OpenAI Docs: “MCP Server Documentation”https://platform.openai.com/docs/mcp

LangChain MCP Adapters – GitHubhttps://github.com/langchain-ai/langchain-mcp-adapters

LangChain Docs: “Model Context Protocol (MCP)”https://docs.langchain.com/oss/python/langchain/mcp

npm package: @langchain/mcp-adaptershttps://www.npmjs.com/package/%40langchain/mcp-adapters

Azure AI Foundry: “Create an MCP Server with Azure AI Agent Service”https://devblogs.microsoft.com/foundry/integrating-azure-ai-agents-mcp/

Azure AI Foundry Docs: “Connect to Model Context Protocol servers (preview)”https://learn.microsoft.com/en-us/azure/ai-foundry/agents/how-to/tools/model-context-protocol

Azure AI Foundry MCP Server – May 2025 updatehttps://devblogs.microsoft.com/foundry/azure-ai-foundry-mcp-server-may-2025/

Windows AI Foundry (MCP integration in Windows)https://developer.microsoft.com/en-us/windows/ai/

The Verge: “Windows is getting support for the ‘USB-C of AI apps’”https://www.theverge.com/news/669298/microsoft-windows-ai-foundry-mcp-support

Agent2Agent (A2A) Protocol – official specificationhttps://a2a-protocol.org/latest/specification/

Google Developers Blog: “Announcing the Agent2Agent Protocol (A2A)”https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/

IBM Think: “What is A2A protocol (Agent2Agent)?”https://www.ibm.com/think/topics/agent2agent-protocol

Amazon Bedrock: “Deploy A2A servers in AgentCore Runtime”https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-a2a.html

Amazon Bedrock: “A2A protocol contract”https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-a2a-protocol-contract.html

AWS News: “Amazon Bedrock AgentCore is now generally available”https://aws.amazon.com/about-aws/whats-new/2025/10/amazon-bedrock-agentcore-available/

Google Cloud Blog: “Announcing Agent Payments Protocol (AP2)”https://cloud.google.com/blog/products/ai-machine-learning/announcing-agents-to-payments-ap2-protocol

AP2 overview / technical details (Google / partner materials)https://cloud.google.com/blog/products/ai-machine-learning/announcing-agents-to-payments-ap2-protocol

Coinbase x402 + AP2 launch with Googlehttps://www.coinbase.com/developer-platform/discover/launches/google_x402

Omni (Swedish) coverage: “Google teamar upp med betaljättar – vill låta AI-agenter shoppa åt dig”https://omni.se/a/RzkWqO

OpenAI: “Buy it in ChatGPT: Instant Checkout and the Agentic Commerce Protocol”https://openai.com/index/buy-it-in-chatgpt/

OpenAI Developer Docs: “Agentic Commerce Protocol – Get started”https://developers.openai.com/commerce/guides/get-started/

Stripe Newsroom: “Stripe powers Instant Checkout in ChatGPT and releases the Agentic Commerce Protocol”https://stripe.com/newsroom/news/stripe-openai-instant-checkout

TechRadar Pro: “You can now buy things through ChatGPT with a single click”https://www.techradar.com/pro/you-can-now-buy-things-through-chatgpt-with-a-single-click-if-youre-one-of-the-lucky-ones

Reuters: “OpenAI partners with Etsy, Shopify on ChatGPT payment checkout”https://www.reuters.com/world/americas/openai-partners-with-etsy-shopify-chatgpt-checkout-2025-09-29/

Salesforce Press Release: “Salesforce Announces Support for Agentic Commerce Protocol with Stripe and OpenAI”https://www.salesforce.com/news/press-releases/2025/10/14/stripe-openai-agentic-commerce-protocol-announcement/

Salesforce Investor News: “Salesforce and OpenAI Partner Across Enterprise Work and Commerce”https://investor.salesforce.com/news/news-details/2025/Salesforce-and-OpenAI-Partner-Across-Enterprise-Work-and-Commerce/default.aspx

Salesforce: Agentforce Commercehttps://www.salesforce.com/commerce/

Coinbase Developer Platform: “x402: The internet-native payment protocol”https://www.coinbase.com/developer-platform/products/x402

Base Docs: “Building Autonomous Payment Agents with x402”https://docs.base.org/base-app/agents/x402-agents

Cloudflare Agents Docs: “x402 · Cloudflare Agents docs”https://developers.cloudflare.com/agents/x402/

Cloudflare Blog: “Launching the x402 Foundation with Coinbase, and support for x402 transactions”https://blog.cloudflare.com/x402/

Cloudflare x402 tag pagehttps://blog.cloudflare.com/tag/x402/

Zuplo Blog: “Autonomous API & MCP Server Payments with x402”https://zuplo.com/blog/mcp-api-payments-with-x402

Kite whitepaper: “Building Trustless Payment Infrastructure for Agentic AI”https://gokite.ai/kite-whitepaper

Kite: “Whitepaper”https://gokite.ai/whitepaper

Kite Docs: “Introduction & Mission”https://docs.gokite.ai/get-started-why-kite/introduction-and-mission

PayPal Newsroom: “Kite Raises $18M in Series A Funding To Enforce Trust in the Agentic Web”https://newsroom.paypal-corp.com/2025-09-02-Kite-Raises-18M-in-Series-A-Funding-To-Enforce-Trust-in-the-Agentic-Web

PayPal Ventures: “The state of agentic commerce and why we invested in Kite AI”https://paypal.vc/news/news-details/2025/The-state-of-agentic-commerce-and-why-we-invested-in-Kite-AI-2025-LroAXfplpA/default.aspx

Binance Research: “Kite enables an agentic internet…”https://www.binance.com/en-KZ/research/projects/kite

Phemex Academy: “What Is Kite (KITE)? Guide to the AI Agent Economy”https://phemex.com/academy/what-is-kite-ai-agent-economy

Finextra: “PayPal leads funding round in agentic AI firm Kite”https://www.finextra.com/newsarticle/46535/paypal-leads-funding-round-in-agentic-ai-firm-kite

Plug and Play Tech Center: “How Kite is Building the Infrastructure for the Agentic Internet”https://www.plugandplaytechcenter.com/venture-capital/investment-announcements/kite-investment

PYMNTS: “PayPal Ventures-Backed Kite Nets $18M for Agentic AI”https://www.pymnts.com/news/investment-tracker/2025/paypal-backed-kite-raises-18-million-for-agentic-web/

GlobeNewswire: “Kite announces investment from Coinbase Ventures…”https://www.globenewswire.com/news-release/2025/10/27/3174837/0/en/Kite-announces-investment-from-Coinbase-Ventures-to-Advance-Agentic-Payments-with-the-x402-Protocol.html

Keycard – official sitehttps://www.keycard.ai/

Keycard: product page (alternate URL)https://www.keycard.sh/

Help Net Security: “Keycard emerges from stealth with identity and access platform for AI agents”https://www.helpnetsecurity.com/2025/10/22/keycard-ai-agents-identity-access-platform/

GlobeNewswire: “Keycard Launches to Solve the AI Agent Identity and Access Problem…”https://www.globenewswire.com/news-release/2025/10/21/3170297/0/en/Keycard-Launches-to-Solve-the-AI-Agent-Identity-and-Access-Problem-With-38-Million-in-Funding-From-Andreessen-Horowitz-Boldstart-Ventures-and-Acrew-Capital.html

The post Comparing the Top 6 Agent-Native Rails for the Agentic Internet: MCP, A2A, AP2, ACP, x402, and Kite appeared first on MarkTechPost.

Build a biomedical research agent with Biomni tools and Amazon Bedrock …

Posted on November 15, 2025 by i-genie

This post is co-authored with the Biomni group from Stanford.
Biomedical researchers spend approximately 90% of their time manually processing massive volumes of scattered information. This is evidenced by Genentech’s challenge of processing 38 million biomedical publications in PubMed, public repositories like the Human Protein Atlas, and their internal repository of hundreds of millions of cells across hundreds of diseases. There is a rapid proliferation of specialized databases and analytical tools across different modalities including genomics, proteomics, and pathology. Researchers must stay current with the large landscape of tools, leaving less time for the hypothesis-driven work that drives breakthrough discoveries.
AI agents powered by foundation models offer a promising solution by autonomously planning, executing, and adapting complex research tasks. Stanford researchers built Biomni that exemplifies this potential. Biomni is a general-purpose biomedical AI agent that integrates 150 specialized tools, 105 software packages, and 59 databases to execute sophisticated analyses such as gene prioritization, drug repurposing, and rare disease diagnosis.
However, deploying such agents in production requires robust infrastructure capable of handling computationally intensive workflows and multiple concurrent users while maintaining security and performance standards. Amazon Bedrock AgentCore is a set of comprehensive services to deploy and operate highly capable agents using any framework or model, with enterprise-grade security and scalability.
In this post, we show you how to implement a research agent using AgentCore with access to over 30 specialized biomedical database tools from Biomni, thereby accelerating scientific discovery while maintaining enterprise-grade security and production scale. The code for this solution is available in the open-source toolkit repository of starter agents for life sciences on Amazon Web Services (AWS). The step by step instruction helps you deploy your own tools and infrastructure, along with AgentCore components, and examples.
Prototype-to-production complexity gap
Moving from a local biomedical research prototype to a production system accessible by multiple research teams requires addressing complex infrastructure challenges.
Agent deployment with enterprise security
Enterprise security challenges include OAuth-based authentication, secure tool sharing through scalable gateways, comprehensive observability for research audit trails, and automatic scaling to handle concurrent research workloads. Many promising prototypes fail to reach production because of the complexity of implementing these enterprise-grade requirements while maintaining the specialized domain expertise needed for accurate biomedical analysis.
Session-aware research context management
Biomedical research workflows often span multiple conversations and require persistent memory of previous analyses, experimental parameters, and research preferences across extended research sessions. Research agents must maintain contextual awareness of ongoing projects, remember specific protein targets, experimental conditions, and analytical preferences. All that must be done while facilitating proper session isolation between different researchers and research projects in a multi-tenant production environment.
Scalable tool gateway
Implementing a reusable tool gateway that can handle concurrent requests from research agent, proper authentication, and consistent performance becomes critical at scale. The gateway must enable agents to discover and use tools through secure endpoints, help agents find the right tools through contextual search capabilities, and manage both inbound authentication (verifying agent identity) and outbound authentication (connecting to external biomedical databases) in a unified service. Without this architecture, research teams face authentication complexity and reliability issues that prevent effective scaling.
Solution overview
We use Strands Agents, an open source agent framework, to build a research agent with local tool implementation for PubMed biomedical literature search. We extended the agent’s capabilities by integrating Biomni database tools, providing access to over 30 specialized biomedical databases.
The overall architecture is shown in the following diagram.

The AgentCore Gateway service centralizes Biomni database tools as more secure, reusable endpoints with semantic search capabilities. AgentCore Memory service maintains contextual awareness across research sessions using specialized strategies for research context. Security is handled by AgentCore Identity service, which manages authentication for both users and tool access control. Deployment is streamlined with the AgentCore Runtime service, providing scalable, managed deployment with session isolation. Finally, the AgentCore Observability service enables comprehensive monitoring and auditing of research workflows that are critical for scientific reproducibility.
Step 1 – Creating tools such as the Biomni database tools using AgentCore Gateway
In real-world use cases, we need to connect agents to different data sources. Each agent might duplicate the same tools, leading to extensive code, inconsistent behavior, and maintenance nightmares. AgentCore Gateway service streamlines this process by centralizing tools into reusable, secure endpoints that agents can access. Combined with the AgentCore Identity service for authentication, AgentCore Gateway creates an enterprise-grade tool sharing infrastructure. To give more context to the agent with reusable tools, we provided access to over 30 specialized public database APIs through the Biomni tools registered on the gateway. The gateway exposes Biomni’s database tools through the Model Context Protocol (MCP), allowing the research agent to discover and invoke these tools alongside local tools like PubMed. It handles authentication, rate limiting, and error handling, providing a seamless research experience.

def create_gateway(gateway_name: str, api_spec: list) -> dict:
# JWT authentication with Cognito
auth_config = {
“customJWTAuthorizer”: {
“allowedClients”: [
get_ssm_parameter(“/app/researchapp/agentcore/machine_client_id”)
],
“discoveryUrl”:
get_ssm_parameter(“/app/researchapp/agentcore/cognito_discovery_url”),
}
}

# Enable semantic search for BioImm tools
search_config = {“hcp”: {“searchType”: “SEMANTIC”}}

# Create the gateway
gateway = bedrock_agent_client.create_gateway(
name=gateway_name,
collectionexecution_role_arn,
protocolType=”MCP”,
authorizerType=”CUSTOM_JWT”,
authorizerConfiguration=auth_config,
protocolConfiguration=search_config,
description=”My App Template AgentCore Gateway”,
)

We use an
AWS Lambda function to host the Biomni integration code. The Lambda function is automatically configured as an MCP target in the AgentCore Gateway. The Lambda function exposes its available tools through the API specification (
api_spec.json).
# Gateway Target Configuration
lambda_target_config = {
“mcp”: {
“lambda”: {
“lambdaArn”: get_ssm_parameter(“/app/researchapp/agentcore/lambda_arn”),
“toolSchema”: {“inlinePayload”: api_spec},
}
}
}

# Create the target
create_target_response = gateway_client.create_gateway_target(
gatewayIdentifier=gateway_id,
name=”LambdaUsingSDK”,
description=”Lambda Target using SDK”,
targetConfiguration=lambda_target_config,
credentialProviderConfigurations=[{
“credentialProviderType”: “GATEWAY_IAM_ROLE”
}],
)
The full list of Biomni database tools included on the gateway are listed in the following table:

Group
Tool
Description

Protein and structure databases
UniProt
Query the UniProt REST API for comprehensive protein sequence and functional information

AlphaFold
Query the AlphaFold Database API for AI-predicted protein structure predictions

InterPro
Query the InterPro REST API for protein domains, families, and functional sites

PDB (Protein Data Bank)
Query the RCSB PDB database for experimentally determined protein structures

STRING
Query the STRING protein interaction database for protein-protein interaction networks

EMDB (Electron Microscopy Data Bank)
Query for 3D macromolecular structures determined by electron microscopy

Genomics and variants
ClinVar
Query NCBI’s ClinVar database for clinically relevant genetic variants and their interpretations

dbSNP
Query the NCBI dbSNP database for single nucleotide polymorphisms and genetic variations

gnomAD
Query gnomAD for population-scale genetic variant frequencies and annotations

Ensembl
Query the Ensembl REST API for genome annotations, gene information, and comparative genomics

UCSC Genome Browser
Query the UCSC Genome Browser API for genomic data and annotations

Expression and omics
GEO (Gene Expression Omnibus)
Query NCBI’s GEO for RNA-seq, microarray, and other gene expression datasets

PRIDE
Query the PRIDE database for proteomics identifications and mass spectrometry data

Reactome
Query the Reactome database for biological pathways and molecular interactions

Clinical and drug data
cBioPortal
Query the cBioPortal REST API for cancer genomics data and clinical information

ClinicalTrials.gov
Query ClinicalTrials.gov API for information about clinical studies and trials

OpenFDA
Query the OpenFDA API for FDA drug, device, and food safety data

GtoPdb (Guide to PHARMACOLOGY)
Query the Guide to PHARMACOLOGY database for drug targets and pharmacological data

Disease and phenotype
OpenTargets
Query the OpenTargets Platform API for disease-target associations and drug discovery data

Monarch Initiative
Query the Monarch Initiative API for phenotype and disease information across species

GWAS Catalog
Query the GWAS Catalog API for genome-wide association study results

RegulomeDB
Query the RegulomeDB database for regulatory variant annotations and functional predictions

Specialized databases
JASPAR
Query the JASPAR REST API for transcription factor binding site profiles and motifs

WoRMS (World Register of Marine Species)
Query the WoRMS REST API for marine species taxonomic information

Paleobiology Database (PBDB)
Query the PBDB API for fossil occurrence and taxonomic data

MPD (Mouse Phenome Database)
Query the Mouse Phenome Database for mouse strain phenotype data

Synapse
Query Synapse REST API for biomedical datasets and collaborative research data

The following are examples of how individual tools get triggered through the MCP from our test suite:

# Protein and Structure Analysis
“Use uniprot tool to find information about human insulin protein”
# → Triggers uniprot MCP tool with protein query parameters
“Use alphafold tool for structure predictions for uniprot_id P01308”
# → Triggers alphafold MCP tool for 3D structure prediction
“Use pdb tool to find protein structures for insulin”
# → Triggers pdb MCP tool for crystallographic structures
# Genetic Variation Analysis
“Use clinvar tool to find pathogenic variants in BRCA1 gene”
# → Triggers clinvar MCP tool with gene variant parameters
“Use gnomad tool to find population frequencies for BRCA2 variants”
# → Triggers gnomad MCP tool for population genetics data

As the tool collection grows, the agent can use built-in semantic search capabilities to discover and select tools based on the task context. This improves agent performance and reducing development complexity at scale. For example, the user asks, “tell me about HER2 variant rs1136201.” Instead of listing all 30 or more tools from the gateway back to the agent, semantic search returns ‘n’ most relevant tools. For example, Ensembl, Gwas catalog, ClinVar, and Dbsnp to the agent. The agent now uses a smaller subset of tools as input to the model to return a more efficient and faster response.
The following graphic illustrates using AgentCore Gateway for tool search.

You can now test your deployed AgentCore gateway using the following test scripts and compare how semantic search narrows down the list of relevant tools based on the search query.
uv run tests/test_gateway.py –prompt “What tools are available?”
uv run tests/test_gateway.py –prompt “Find information about human insulin protein” –use-search
Step 2- Strands research agent with a local tool
The following code snippet shows model initialization, implementing the PubMed local tool that’s declared using the Strands @tool decorator. We’ve implemented the PubMed tool in research_tools.py that calls PubMed APIs to enable biomedical literature search capabilities within the agent’s execution context.

PubMed Tool Creation

from agent.agent_config.tools.PubMed import PubMed

@tool(
name=”Query_pubmed”,
description=(
“Query PubMed for relevant biomedical literature based on the user’s query. ”
“This tool searches PubMed abstracts and returns relevant studies with ”
“titles, links, and summaries.”
),
)
def query_pubmed(query: str) -> str:
“””
Query PubMed for relevant biomedical literature based on the user’s query.

This tool searches PubMed abstracts and returns relevant studies with
titles, links, and summaries.

Args:
query: The search query for PubMed literature

Returns:
str: Formatted results from PubMed search
“””
pubmed = PubMed()

print(f”nPubMed Query: {query}n”)
result = pubmed.run(query)
print(f”nPubMed Results: {result}n”)

return result

Create the Strands research agent with the local tool and Claude Sonnet 4 Interleaved Thinking.

class ResearchAgent:
def __init__(
self,
bearer_token: str,
memory_hook: MemoryHook = None,
session_manager: AgentCoreMemorySessionManager = None,
bedrock_model_id: str = “us.anthropic.claude-sonnet-4-20250514-v1.0”,
#bedrock_model_id: str = “openai.gpt-oss-120b-1.0”, # Alternative
system_prompt: str = None,
tools: List[callable] = None,
):

self.model_id = bedrock_model_id
# For Anthropic Sonnet 4 interleaved thinking
self.model = BedrockModel(
model_id=self.model_id,
additional_request_fields={
“anthropic_beta”: [“interleaved-thinking-2025-05-14”],
“thinking”: {“type”: “enabled”, “budget_tokens”: 8000},
},
)

self.system_prompt = (
system_prompt
if system_prompt
else “””
You are a **Comprehensive Biomedical Research Agent** specialized in conducting
systematic literature reviews and multi-database analyses to answer complex biomedical research
questions. Your primary mission is to synthesize evidence from both published literature
(PubMed) and real-time database queries to provide comprehensive, evidence-based insights for
pharmaceutical research, drug discovery, and clinical decision-making.

Your core capabilities include literature analysis and extracting data from 30+ specialized
biomedical databases** through the Bioimm gateway, enabling comprehensive data analysis. The
database tool categories include genomics and genetics, protein structure and function, pathways
and system biology, clinical and pharmacological data, expression and omics data and other
specialized databases.
“””
)

In addition, we implemented citations that use a structured system prompt to enforce numbered in-text citations [1], [2], [3] with standardized reference formats for both academic literature and database queries, marking sure every data source is properly attributed. This allows researchers to quickly access and reference the scientific literature that supports their biomedical research queries and findings.

“””
<citation_requirements>
– ALWAYS use numbered in-text citations [1], [2], [3], etc. when referencing any data source
– Provide a numbered “References” section at the end with full source details
– For academic literature: format as “1. Author et al. Title. Journal. Year. ID: [PMID/DOI], available at: [URL]”
– For database sources: format as “1. Database Name (Tool: tool_name), Query: [query_description], Retrieved: [current_date]”
– Use numbered in-text citations throughout your response to support all claims and data points
– Each tool query and each literature source must be cited with its own unique reference number
– When tools return academic papers, cite them using the academic format with full bibliographic details
– Structure: Format each reference on a separate line with proper numbering – NO bullet points
– Present the References section as a clean numbered list, not a confusing paragraph
– Maintain sequential numbering across all reference types in a single “References” section
</citation_requirements>
“””

You can now test your agent locally:
uv run tests/test_agent_locally.py –prompt “Find information about human insulin protein”
uv run tests/test_agent_locally.py –prompt “Find information about human insulin protein” –use-search
Step 3 – Add Persistent Memory for contextual research assistance
The research agent implements the AgentCore Memory service with three strategies: semantic for factual research context, user_preference for research methodologies, and summary for session continuity. The AgentCore Memory session manager is integrated with Strands session management and retrieves relevant context before queries and save interactions after responses. This enables the agent to remember research preferences, ongoing projects, and domain expertise across sessions without manual context re-establishment.
# Test memory functionality with research conversations
python tests/test_memory.py load-conversation<br />python tests/test_memory.py load-prompt “My preferred response format is detailed explanations”
Step 4 – Deploy with AgentCore Runtime
To deploy our agent, we use AgentCore Runtime to configure and launch the research agent as a managed service. The deployment process configures the runtime with the agent’s main entrypoint (agent/main.py), assigns an IAM execution role for AWS service access, and supports both OAuth and IAM authentication modes. After deployment, the runtime becomes a scalable, serverless agent that can be invoked using API calls. The agent automatically handles session management, memory persistence, and tool orchestration while providing secure access to the Biomni gateway and local research tools.
agentcore configure –entrypoint agent/main.py -er arn:aws:iam::<Account-Id>:role/<Role> –name researchapp<AgentName>
For more information about deploying with AgentCore Runtime, see Get started with AgentCore Runtime in the Amazon Bedrock AgentCore Developer Guide.
Agents in action
The following are three representative research scenarios that showcase the agent’s capabilities across different domains: drug mechanism analysis, genetic variant investigation, and pathway exploration. For each query, the agent autonomously determines which combination of tools to use, formulates appropriate sub-queries, analyzes the returned data, and synthesizes a comprehensive research report with proper citations. The accompanying demo video shows the complete agent workflow, including tools selection, reasoning, and response generation.

Conduct a comprehensive analysis of trastuzumab (Herceptin) mechanism of action and resistance mechanisms you’ll need:

HER2 protein structure and binding sites
Downstream signaling pathways affected
Known resistance mechanisms from clinical data
Current clinical trials investigating combination therapies
Biomarkers for treatment response predictionQuery relevant databases to provide a comprehensive research report.

Analyze the clinical significance of BRCA1 variants in breast cancer risk and treatment response. Investigate:

Population frequencies of pathogenic BRCA1 variants
Clinical significance and pathogenicity classifications
Associated cancer risks and penetrance estimates
Treatment implications (PARP inhibitors, platinum agents)
Current clinical trials for BRCA1-positive patients Use multiple databases to provide comprehensive evidence

The following video is a demonstration of a biomedical research agent:

Scalability and observability
One of the most critical challenges in deploying sophisticated AI agents is making sure they scale reliably while maintaining comprehensive visibility into their operations. Biomedical research workflows are inherently unpredictable—a single genomic analysis might process thousands of files, while a literature review could span millions of publications. Traditional infrastructure struggles with these dynamic workloads, particularly when handling sensitive research data that requires strict isolation between different research projects.In this deployment, we use Amazon Bedrock AgentCore Observability to visualize each step in the agent workflow. You can use this service to inspect an agent’s execution path, audit intermediate outputs, and debug performance bottlenecks and failures. For biomedical research, this level of transparency is not just helpful—it’s essential for regulatory compliance and scientific reproducibility.
Sessions, traces, and spans form a three-tiered hierarchical relationship in the observability framework. A session contains multiple traces, with each trace representing a discrete interaction within the broader context of the session. Each trace contains multiple spans that capture fine-grained operations. The following screenshot shoes the usage of one agent: Number of sessions, token usage, and error rate in production

The following screenshot shows the agents in production and their usage (number of Sessions, number of invocations)

The built-in dashboards show performance bottlenecks and identify why certain interactions might fail, enabling continuous improvement and reducing the mean time to detect (MTTD) and mean time to repair (MTTR). For biomedical applications where failed analyses can delay critical research timelines, this rapid issue resolution capability makes sure that research momentum is maintained.
Future direction
While this implementation focuses on only a subset of tools, the AgentCore Gateway architecture is designed for extensibility. Research teams can seamlessly add new tools without requiring code changes by using the MCP protocol. Newly registered tools are automatically discoverable by agents allowing your research infrastructure to evolve alongside the rapidly changing tool sets.
For computational analysis that requires code execution, the AgentCore Code Interpreter service can be integrated into the research workflow. With AgentCore Code Interpreter the research agent can retrieve data and execute Python-based analysis using domain-specific libraries like BioPython, scikit-learn, or custom genomics packages.
Future extensions could support multiple research agents to collaborate on complex projects, with specialized agents for literature review, experimental design, data analysis, and result interpretation working together through multi-agent collaboration. Organizations can also develop specialized research agents tailored to specific therapeutic areas, disease domains, or research methodologies that share the same enterprise infrastructure and tool gateway.
Looking ahead with Biomni
“Biomni today is already useful for academic research and open exploration. But to enable real discovery—like advancing drug development—we need to move beyond prototypes and make the system enterprise-ready. Embedding Biomni into the workflows of biotech and pharma is essential to turn research potential into tangible impact.
That’s why we are excited to integrate the open-source environment with Amazon Bedrock AgentCore, bridging the gap from research to production. Looking ahead, we’re also excited about extending these capabilities with the Biomni A1 agent architecture and the Biomni-R0 model, which will unlock even more sophisticated biomedical reasoning and analysis. At the same time, Biomni will remain a thriving open-source environment, where researchers and industry teams alike can contribute tools, share workflows, and push the frontier of biomedical AI together with AgentCore.”
Conclusion
This implementation demonstrates how organizations can use Amazon Bedrock AgentCore to transform biomedical research prototypes into production-ready systems. By integrating Biomni’s comprehensive collection of over 150 specialized tools through the AgentCore Gateway service, we illustrate how teams can create enterprise-grade tool sharing infrastructure that scales across multiple research domains.The combination of Biomni’s biomedical tools with the enterprise infrastructure of Bedrock AgentCore organizations can build research agents that maintain scientific rigor while meeting production requirements for security, scalability, and observability. Biomni’s diverse tool collection—spanning genomics, proteomics, and clinical databases—exemplifies how specialized research capabilities can be centralized and shared across research teams through a secure gateway architecture.
To begin building your own biomedical research agent with Biomni tools, explore the implementation by visiting our GitHub repository for the complete code and documentation. You can follow the step-by-step implementation guide to set up your research agent with local tools, gateway integration, and Bedorck AgentCore deployment. As your needs evolve, you can extend the system with your organization’s proprietary databases and analytical tools. We encourage you to join the growing environment of life sciences AI agents and tools by sharing your extensions and improvements.

About the authors
Hasan Poonawala is a Senior AI/ML Solutions Architect at AWS, working with Healthcare and Life Sciences customers. Hasan helps design, deploy and scale Generative AI and Machine learning applications on AWS. He has over 15 years of combined work experience in machine learning, software development and data science on the cloud. In his spare time, Hasan loves to explore nature and spend time with friends and family.
Pierre de Malliard is a Senior AI/ML Solutions Architect at Amazon Web Services and supports customers in the Healthcare and Life Sciences Industry. He is currently based in New York City.
Necibe Ahat is a Senior AI/ML Specialist Solutions Architect at AWS, working with Healthcare and Life Sciences customers. Necibe helps customers to advance their generative AI and machine learning journey. She has a background in computer science with 15 years of industry experience helping customers ideate, design, build and deploy solutions at scale. She is a passionate inclusion and diversity advocate.
Kexin Huang is a final-year PhD student in Computer Science at Stanford University, advised by Prof. Jure Leskovec. His research applies AI to enable interpretable and deployable biomedical discoveries, addressing core challenges in multi-modal modeling, uncertainty, and reasoning. His work has appeared in Nature Medicine, Nature Biotechnology, Nature Chemical Biology, Nature Biomedical Engineering and top ML venues (NeurIPS, ICML, ICLR), earning six best paper awards. His research has been highlighted by Forbes, WIRED, and MIT Technology Review, and he has contributed to AI research at Genentech, GSK, Pfizer, IQVIA, Flatiron Health, Dana-Farber, and Rockefeller University.

Make your web apps hands-free with Amazon Nova Sonic

Posted on November 15, 2025 by i-genie

Graphical user interfaces have carried the torch for decades, but today’s users increasingly expect to talk to their applications. Amazon Nova Sonic is a state-of-the-art foundation model from Amazon Bedrock, that helps enable this shift by providing natural, low-latency, bidirectional speech conversations over a simple streaming API. Users can collaborate with the applications through voice and embedded intelligence rather than merely operating them.
In this post we show how we added a true voice-first experience to a reference application—the Smart Todo App—turning routine task management into a fluid, hands-free conversation.
Rethinking user interaction through collaborative AI voice agents
Important usability enhancements are often deprioritized—not because they aren’t valuable, but because they’re difficult to implement within traditional mouse-and-keyboard interfaces. Features like intelligent batch actions, personalized workflows, or voice-guided assistance are frequently debated but deferred due to UI complexity. This is about voice as an additional, general-purpose interaction mode—not a replacement for device-specific controls or an accessibility-only solution. Voice enables new interaction patterns, it also benefits users of assistive technologies, such as screen readers, by offering an additional, inclusive way to interact with the application.
Amazon Nova Sonic goes far beyond one-shot voice commands. The model can plan multistep workflows, call backend tools, and keep context across turns so that your application can collaborate with the users.
The following table shows voice interactions from different application domains, like task management, CRM, and help desk.

Voice interaction (example phrase)
Intent / goal
System action / behavior
Confirmation / UX

Mark all my tasks as complete.
Bulk-complete tasks
Find user’s open tasks → mark complete → archive if configured
All 12 open tasks are marked complete.

Create a plan for preparing the Q3 budget: break it into steps, assign owners, and set deadlines.
Create multistep workflow
Generate plan → create tasks → assign owners → set deadlines → surface review options
Plan created with 6 tasks. Notify owners?

Find enterprise leads in APAC with ARR over $1M and draft personalized outreach.
Build targeted prospect list and draft outreach
Query CRM → assemble filtered list → draft personalized messages for review
Drafted 24 personalized outreach messages. Review and send?

Prioritize all P1 tickets opened in the last 24 hours and assign them to on-call.
Triage and assign
Filter tickets → set priority → assign to on-call → log changes
12 P1 tickets prioritized and assigned to the on-call team.

Amazon Nova Sonic understands the intent, invokes the required APIs, and confirms the results—no forms required. This helps to create an environment where productivity is multiplied, and context becomes the interface. It’s not about replacing traditional UI, it’s about unlocking new capabilities through voice.
The sample application at a glance
With the Smart Todo reference application, users can create to-do lists and manage notes within those lists. The application offers a focused yet flexible interface for task tracking and note organization. With the addition of voice, the application becomes a hands-free experience that unlocks more natural and productive interactions. In Smart Todo App, users can say:

“Add a note to follow up on the project charter.”
“Archive all completed tasks.”

Behind each command are focused actions—like creating a new note, organizing content, or updating task status—executed through speech in a way that feels natural and efficient.
How Amazon Nova Sonic bidirectional APIs work
Amazon Nova Sonic implements a real-time, bidirectional streaming architecture. After a session is initiated with InvokeModelWithBidirectionalStream, audio input and model responses flow simultaneously over an open stream:

Session Start – Client sends a sessionStart event with model configuration (for example, temperature and topP).
Prompt and Content Start – Client sends structured events indicating whether upcoming data is audio, text, or tool input.
Audio Streaming – Microphone audio is streamed as base64-encoded audio input events.
Model Responses – As the model processes input, it streams the following responses asynchronously:

Automatic speech recognition (ASR) results
Tool use invocations
Text responses
Audio output for playback

Session Close – Conversations are explicitly closed by sending contentEnd, promptEnd, and sessionEnd events.

Nova Sonic Architecture Diagram

You can use this event-driven approach to interrupt the assistant (barge-in), enable multi-turn conversations, and support real-time adaptability.
Solution architecture
For this solution, we use a serverless application architecture pattern, where the UI is a React single page application. The React single page application is integrated with backend web APIs running on server-side containers. The Smart Todo App is deployed using a scalable and security-aware AWS architecture that’s designed to support real-time voice interactions. The following image provides an architecture overview of AWS services working together to support bidirectional streaming needs of a voice enabled application.

Key AWS services include:

Amazon Bedrock – Powers real-time, bidirectional speech interactions through the Amazon Nova Sonic foundation model.
Amazon CloudFront – A content delivery network (CDN) that distributes the application globally with low latency. It routes /(root) traffic to the React application hosted on an Amazon S3 bucket and /api and /novasonic traffic to the Application Load Balancer.
AWS Fargate for Amazon Amazon Elastic Container Service (Amazon ECS) – Runs the backend containerized services for WebSocket handling and REST APIs capable of supporting long lived bidirectional streams.
Application Load Balancer (ALB) – Forwards web traffic /api (HTTPS REST API calls) to backend ECS services, handling Smart Todo App APIs, and /novasonic (WebSocket connections) to ECS services managing real-time voice streaming with Amazon Nova Sonic.
Amazon Virtual Private Cloud (Amazon VPC) – Provides network isolation and security for backend services. The Public Subnets host the Application Load Balancer (ALB) and Private Subnets host ECS Fargate tasks running WebSocket and REST APIs.
NAT Gateway allows Amazon ECS tasks in private subnets to more securely connect to the internet for operations like Cognito JWT token verification endpoints.
Amazon Simple Storage Service (Amazon S3) –Hosts React frontend for user interactions
AWS WAF – Helps protect the Application Load Balancer (ALB) from malicious traffic and enforces security rules at the application layer.
Amazon Cognito – Manages authentication and issues tokens.
Amazon DynamoDB – Stores application data such as to-do lists and notes.

The following image illustrates how the user requests are served with support for low-latency bidirectional streaming.

Request Workflow

Deploying the solution
To evaluate this solution, we provided sample code of a Smart Todo App available at GitHub repository.
Smart Todo App consists of multiple independent Node.js projects, including a CDK infrastructure project, a React frontend application, and backend API services. The deployment workflow makes sure that the components are correctly built and integrated with AWS services like Amazon Cognito, Amazon DynamoDB, and Amazon Bedrock.
Prerequisites

AWS account with appropriate permissions that facilitate security best practices, including least-privilege permissions.
Docker Engine installed locally and running to build container image locally.
AWS CLI configured with AWS admin credentials.
Node.js >= 20.x and npm installed.
Amazon Nova Sonic enabled in Amazon Bedrock. For more information, see Add or remove access to Amazon Bedrock foundation models.

Deployment steps

Clone the following repository:

git clone https://github.com/aws-samples/sample-amazon-q-developer-vibe-coded-projects.git
cd NovaSonicVoiceAssistant

For first-time deployment, use the following automated script:

npm run deploy:first-time

This script will:

Install the dependencies using npm (node package manager)
Build the components and container image using locally installed docker engine
Deploy the infrastructure using CDK (CDK BootStrap ==> CDK Synth ==> CDK Deploy)
Update environment variables with Amazon Cognito settings
Rebuild the UI with updated environment variables
Deploy the final infrastructure (CDK Deploy)

Verifying deployment
After deployment is successful, complete the following steps:

Access the Amazon CloudFront URL provided in the CDK outputs. Note: The URL shown in the image is for reference only, every deployment will get a unique URL.

Successful deployment screen shot

Create a new user by signing up using the Create Account section.

Create User and Log in

Test the voice functionality to verify the integration with Amazon Nova Sonic. The following image illustrates a conversation between the signed-in user and the Amazon Bedrock agent. The AI agent is able to invoke existing APIs, and the UI is updated in real time to reflect agent’s actions.

Granting Microphone access to the application

Voice interaction in Smart Todo App

Clean up
You can remove the stacks with the following command.

# move to the infra folder, assuming you are in the project’s root folder
cd infra
# Removes the AWS stack
npm run destroy

Next steps
Voice isn’t just an accessibility add-on—it’s becoming the primary interface for complex workflows. Turns out talking is faster than selecting—especially when your app talks back.
Try these resources to get started.

Sample Code repo – A working Amazon Nova Sonic integration you can run locally. See how real-time voice interactions, intent handling, and multistep flows are implemented end to end.
Amazon Nova Sonic hands-on workshop – A guided lab that walks you through deploying Amazon Nova Sonic in your AWS account and testing voice-native features.
Amazon Nova Sonic docs – Provides API reference, streaming examples, and best practices to help you design and deploy voice-driven workflows.
Contact your AWS account team to learn more about how AI-driven solutions can transform your operations.

About the authors
Manu Mishra is a Senior Solutions Architect at AWS, specializing in artificial intelligence, data and analytics, and security. His expertise spans strategic oversight and hands-on technical leadership, where he reviews and guides the work of both internal and external customers. Manu collaborates with AWS customers to shape technical strategies that drive impactful business outcomes, providing alignment between technology and organizational goals.
AK Soni is a Senior Technical Account Manager with AWS Enterprise Support, where he empowers enterprise customers to achieve their business goals by offering proactive guidance on implementing innovative cloud and AI/ML-based solutions aligned with industry best practices. With over 19 years of experience in enterprise application architecture and development, he uses his expertise in generative AI technologies to enhance business operations and overcome existing technological limitations.
Raj Bagwe is a Senior Solutions Architect at Amazon Web Services, based in San Francisco, California. With over 6 years at AWS, he helps customers navigate complex technological challenges and specializes in Cloud Architecture, Security and Migrations. In his spare time, he coaches a robotics team and plays volleyball. He can be reached at X handle @rajesh_bagwe.

Harnessing the power of generative AI: Druva’s multi-agent copilot f …

Posted on November 15, 2025 by i-genie

This post is co-written with David Gildea and Tom Nijs from Druva.
Generative AI is transforming the way businesses interact with their customers and revolutionizing conversational interfaces for complex IT operations. Druva, a leading provider of data security solutions, is at the forefront of this transformation. In collaboration with Amazon Web Services (AWS), Druva is developing a cutting-edge generative AI-powered multi-agent copilot that aims to redefine the customer experience in data security and cyber resilience.
Powered by Amazon Bedrock and using advanced large language models (LLMs), this innovative solution will provide Druva’s customers with an intuitive, conversational interface to access data management, security insights, and operational support across their product suite. By harnessing the power of generative AI and agentic AI, Druva aims to streamline operations, increase customer satisfaction, and enhance the overall value proposition of its data security and cyber resilience solutions.
In this post, we examine the technical architecture behind this AI-powered copilot, exploring how it processes natural language queries, maintains context across complex workflows, and delivers secure, accurate responses to streamline data protection operations.
Challenges and opportunities
Druva wants to effectively serve enterprises moving beyond traditional query-based AI and into agentic systems and meet their complex data management and security needs with greater speed, simplicity, and confidence.
Comprehensive data security necessitates tracking a high volume of data and metrics to identify potential cyber threats. As threats evolve, it can be difficult for customers to stay abreast of new data anomalies to hunt for within their organization’s data, but missing any threat signals can lead to unauthorized access to sensitive information. For example, a global financial services company managing more than 500 servers across multiple regions currently spends hours manually checking logs across dozens of systems when backup fails. With an AI-powered copilot, they could simply ask, “Why did my backups fail last night?” and instantly receive an analysis showing that a specific policy update caused conflicts in their European data centers, along with a step-by-step remediation, reducing investigation time from hours to minutes. This solution not only reduces the volume of support requests and accelerates the time to resolution, but also unlocks greater operational efficiency for end users.
By reimagining how users engage with the system—from AI-powered workflows to smarter automation—Druva saw a clear opportunity to deliver a more seamless customer experience that strengthens customer satisfaction, loyalty, and long-term success.
The key opportunities for Druva in implementing a generative AI-powered multi-agent copilot include:

Simplified user experience: By providing a natural language interface, the copilot can simplify complex data protection tasks and help users access the information they need quickly.
Intelligent Troubleshooting: The copilot can leverage AI capabilities to analyze data from various sources, identify the root causes of backup failures, and provide personalized recommendations for resolution.
Streamlined Policy Management: The multi-agent copilot can guide users through the process of creating, modifying, and implementing data protection policies, reducing the potential for human errors and improving compliance.
Proactive Support: By continuously monitoring data protection environments, the copilot can proactively identify potential issues and provide guidance to help prevent failures or optimize performance.
Scalable and Efficient Operations: The AI-powered solution can handle a large volume of customer inquiries and tasks simultaneously, reducing the burden on Druva’s support team so that they can focus on more complex and strategic initiatives.

Solution overview
The proposed solution for Druva’scopilot leverages a sophisticated architecture that combines the power of Amazon Bedrock (including Amazon Bedrock Knowledge Bases), LLMs, and a dynamic API selection process to deliver an intelligent and efficient user experience. In the following diagram, we demonstrate the end-to-end architecture and various sub-components.

At the core of the system is the supervisor agent, which serves as the central coordination component of the multi-agent system. This agent is responsible for overseeing the entire conversation flow, delegating tasks to specialized sub-agents, and maintaining seamless communication between the various components.
The user interacts with the supervisor agent through a user interface, submitting natural language queries related to data protection, backup management, and troubleshooting. The supervisor agent analyzes the user’s input and routes the request to the appropriate sub-agents based on the nature of the query.
The data agent is responsible for retrieving relevant information from Druva’s systems by interacting with the GET APIs. This agent fetches data such as scheduled backup jobs, backup status, and other pertinent details to provide the user with accurate and up-to-date information.
The help agent assists users by providing guidance on best practices, step-by-step instructions, and troubleshooting tips. This agent draws upon an extensive knowledge base, which includes detailed API documentation, user manuals, and frequently asked questions, to deliver context-specific assistance to users.
When a user needs to perform critical actions, such as initiating a backup job or modifying data protection policies, the action agent comes into play. This agent interacts with the POST API endpoints to execute the necessary operations, making sure that the user’s requirements are met promptly and accurately.
To make sure that the multi-agent copilot operates with the most suitable APIs and parameters, the solution incorporates a dynamic API selection process. In the following diagram, we highlight the various AWS services used to implement dynamic API selection, with which both the data agent and the action agent are equipped. Bedrock Knowledge Bases contains comprehensive information about available APIs, their functionalities, and optimal usage patterns. Once an input query is received, we use semantic search to retrieve the top K relevant APIs. This semantic search capability enables the system to adapt to the specific context of each user request, enhancing the Copilot’s accuracy, efficiency, and scalability. Once the appropriate APIs are identified, the agent prompts the LLM to parse the top K relevant APIs and finalize the API selection along with the required parameters. This step makes sure that the copilot is fully equipped to run the user’s request effectively.

Finally, the selected API is invoked, and the multi-agent copilot carries out the desired action or retrieves the requested information. The user receives a clear and concise response, along with relevant recommendations or guidance, through the user interface.
Throughout the interaction, users can provide additional information or explicit approvals by using the user feedback node before the copilot performs critical actions. With this human-in-the-loop approach, the system operates with the necessary safeguards and maintains user control over sensitive operations.
Evaluation
The evaluation process for Druva’s generative AI-powered multi-agent copilot focuses on assessing the performance and effectiveness of each critical component of the system. By thoroughly testing individual components such as dynamic API selection, isolated tests on individual agents, and end-to-end functionality, the copilot delivers accurate, reliable, and efficient results to its users.
Evaluation methodology:

Unit testing: Isolated tests are conducted for each component (individual agents, data extraction, API selection) to verify their functionality, performance, and error handling capabilities.
Integration Testing: Tests are performed to validate the seamless integration and communication between the various components of the multi-agent copilot, maintaining data flow and control flow integrity.
System Testing: End-to-end tests are executed on the complete system, simulating real-world user scenarios and workflows to assess the overall functionality, performance, and user experience.

Evaluation results
Choosing the right model for the right task is critical to the system’s performance. The dynamic tool selection represents one of the most critical parts of the system—invoking the correct API is essential for end-to-end solution success. A single incorrect API call can lead to fetching wrong data, which cascades into erroneous results throughout the multi-agent system. To optimize the dynamic tool selection component, various Nova and Anthropic models were tested and benchmarked against the ground truth created using Sonnet 3.7.
The findings showed that even smaller models like Nova Lite and Haiku 3 were able to select the correct API every time. However, these smaller models struggled with parameter parsing such as calling the API with the correct parameters relative to the input question. When parameter parsing accuracy was taken into account, the overall API selection accuracy dropped to 81% for Nova Micro, 88% for Nova Lite, and 93% for Nova Pro. The performance of Haiku 3, Haiku 3.5, and Sonnet 3.5 was comparable, ranging from 91% to 92%. Nova Pro provided an optimal tradeoff between accuracy and latency with an average response time of just over one second. In contrast, Sonnet 3.5 had a latency of eight seconds, although this could be attributed to Sonnet 3.5’s more verbose output, generating an average of 291 tokens compared to Nova Pro’s 86 tokens. The prompts could potentially be optimized to make Sonnet 3.5’s output more concise, thus reducing the latency.
For end-to-end testing of real world scenarios, it is essential to engage human subject matter expert evaluators familiar with the system to assess performance based on completeness, accuracy, and relevance of the solutions. Across 11 challenging questions during the initial development phase, the system achieved scores averaging 3.3 out of 5 across these dimensions. This represented solid performance considering the evaluation was conducted in the early stages of development, providing a strong foundation for future improvements.
By focusing on evaluating each critical component and conducting rigorous end-to-end testing, Druva has made sure that the generative AI-powered multi-agent copilot meets the highest standards of accuracy, reliability, and efficiency. The insights gained from this evaluation process have guided the continuous improvement and optimization of the copilot.

“Druva is at the forefront of leveraging advanced AI technologies to revolutionize the way organizations protect and manage their critical data. Our Generative AI-powered Multi-agent Copilot is a testament to our commitment to delivering innovative solutions that simplify complex processes and enhance customer experiences. By collaborating with the AWS Generative AI Innovation Center, we are embarking on a transformative journey to create an interactive, personalized, and efficient end-to-end experience for our customers. We are excited to harness the power of Amazon Bedrock and our proprietary data to continue reimagining the future of data security and cyber resilience.”- David Gildea, VP of Generative AI at Druva

Conclusion
Druva’s generative AI-powered multi-agent copilot showcases the immense potential of combining structured and unstructured data sources using AI to create next-generation virtual copilots. This innovative approach sets Druva apart from traditional data protection vendors by transforming hours-long manual investigations into instant, AI-powered conversational insights, with 90% of routine data protection tasks executable through natural language interactions, fundamentally redefining customer expectations in the data security space. For organizations in the data security and protection space, this technology enables more efficient operations, enhanced customer engagement, and data-driven decision-making. The insights and intelligence provided by the copilot empower Druva’s stakeholders, including customers, support teams, partners, and executives, to make informed decisions faster, reducing average time-to-resolution for data security issues by up to 70% and accelerating backup troubleshooting from hours to minutes. Although this project focuses on the data protection industry, the underlying principles and methodology can be applied across various domains. With careful design, testing, and continuous improvement, organizations in any industry can benefit from AI-powered copilots that contextualize their data, documents, and content to deliver intelligent and personalized experiences.
This implementation leverages Amazon Bedrock AgentCore Runtime and Amazon Bedrock AgentCore Gateway to provide robust agent orchestration and management capabilities. This approach has the potential to provide intelligent automation and data search capabilities through customizable agents, transforming user interactions with applications to be more natural, efficient, and effective. For those interested in implementing similar functionalities, explore Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases and Amazon Bedrock AgentCore as a fully managed AWS solution.

About the authors
David Gildea With over 25 years of experience in cloud automation and emerging technologies, David has led transformative projects in data management and cloud infrastructure. As the founder and former CEO of CloudRanger, he pioneered innovative solutions to optimize cloud operations, later leading to its acquisition by Druva. Currently, David leads the Labs team in the Office of the CTO, spearheading R&D into Generative AI initiatives across the organization, including projects like Dru Copilot, Dru Investigate, and Amazon Q. His expertise spans technical research, commercial planning, and product development, making him a prominent figure in the field of cloud technology and generative AI.
Tom Nijs is an experienced backend and AI engineer at Druva, driven by a passion for both learning and sharing knowledge. As the Lead Architect for Druva’s Labs team, he channels this passion into developing cutting-edge solutions, leading projects such as Dru Copilot, Dru Investigate, and Dru AI Labs. With a core focus on optimizing systems and harnessing the power of AI, Tom is dedicated to helping teams and developers turn groundbreaking ideas into reality.
Gauhar Bains is a Deep Learning Architect at the AWS Generative AI Innovation Center, where he designs and delivers innovative GenAI solutions for enterprise customers. With a passion for leveraging cutting-edge AI technologies, Gauhar specializes in developing agentic AI applications, and implementing responsible AI practices across diverse industries.
Ayushi Gupta is a Senior Technical Account Manager at AWS who partners with organizations to architect optimal cloud solutions. She specializes in ensuring business-critical applications operate reliably while balancing performance, security, and cost efficiency. With a passion for GenAI innovation, Ayushi helps customers leverage cloud technologies that deliver measurable business value while maintaining robust data protection and compliance standards.
Marius Moisescu is a Machine Learning Engineer at the AWS Generative AI Innovation Center. He works with customers to develop agentic applications. His interests are deep research agents and evaluation of multi agent architectures.
Ahsan Ali is an Senior Applied Scientist at the Amazon Generative AI Innovation Center, where he works with customers from different industry verticals to solve their urgent and expensive problems using Generative AI.
Sandy Farr is an Applied Science Manager at the AWS Generative AI Innovation Center, where he leads a team of scientists, deep learning architects and software engineers to deliver innovative GenAI solutions for AWS customers. Sandy holds a PhD in Physics and has over a decade of experience developing AI/ML, NLP and GenAI solutions for large organizations.
Govindarajan Varadan is a Manager of the Solutions Architecture team at Amazon Web Services (AWS) based out of Silicon Valley in California. He works with AWS customers to help them achieve their business objectives through innovative applications of AI at scale.
Saeideh Shahrokh Esfahani is an Applied Scientist at the Amazon Generative AI Innovation Center, where she focuses on transforming cutting-edge AI technologies into practical solutions that address real-world challenges.

How to Build a Fully Self-Verifying Data Operations AI Agent Using Loc …

Posted on November 14, 2025 by i-genie

In this tutorial, we build a self-verifying DataOps AIAgent that can plan, execute, and test data operations automatically using local Hugging Face models. We design the agent with three intelligent roles: a Planner that creates an execution strategy, an Executor that writes and runs code using pandas, and a Tester that validates the results for accuracy and consistency. By using Microsoft’s Phi-2 model locally in Google Colab, we ensure that the workflow remains efficient, reproducible, and privacy-preserving while demonstrating how LLMs can automate complex data-processing tasks end-to-end. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser!pip install -q transformers accelerate bitsandbytes scipy
import json, pandas as pd, numpy as np, torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig

MODEL_NAME = “microsoft/phi-2″

class LocalLLM:
def __init__(self, model_name=MODEL_NAME, use_8bit=False):
print(f”Loading model: {model_name}”)
self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
model_kwargs = {“device_map”: “auto”, “trust_remote_code”: True}
if use_8bit and torch.cuda.is_available():
model_kwargs[“quantization_config”] = BitsAndBytesConfig(load_in_8bit=True)
else:
model_kwargs[“torch_dtype”] = torch.float32 if not torch.cuda.is_available() else torch.float16
self.model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
self.pipe = pipeline(“text-generation”, model=self.model, tokenizer=self.tokenizer,
max_new_tokens=512, do_sample=True, temperature=0.3, top_p=0.9,
pad_token_id=self.tokenizer.eos_token_id)
print(“✓ Model loaded successfully!n”)

def generate(self, prompt, system_prompt=””, temperature=0.3):
if system_prompt:
full_prompt = f”Instruct: {system_prompt}nn{prompt}nOutput:”
else:
full_prompt = f”Instruct: {prompt}nOutput:”
output = self.pipe(full_prompt, temperature=temperature, do_sample=temperature>0,
return_full_text=False, eos_token_id=self.tokenizer.eos_token_id)
result = output[0][‘generated_text’].strip()
if “Instruct:” in result:
result = result.split(“Instruct:”)[0].strip()
return result

We install the required libraries and load the Phi-2 model locally using Hugging Face Transformers. We create a LocalLLM class that initializes the tokenizer and model, supports optional quantization, and defines a generate method to produce text outputs. We ensure that the model runs smoothly on both CPU and GPU, making it ideal for use on Colab. Check out the FULL CODES here.

Copy CodeCopiedUse a different BrowserPLANNER_PROMPT = “””You are a Data Operations Planner. Create a detailed execution plan as valid JSON.

Return ONLY a JSON object (no other text) with this structure:
{“steps”: [“step 1″,”step 2″],”expected_output”:”description”,”validation_criteria”:[“criteria 1″,”criteria 2″]}”””

EXECUTOR_PROMPT = “””You are a Data Operations Executor. Write Python code using pandas.

Requirements:
– Use pandas (imported as pd) and numpy (imported as np)
– Store final result in variable ‘result’
– Return ONLY Python code, no explanations or markdown”””

TESTER_PROMPT = “””You are a Data Operations Tester. Verify execution results.

Return ONLY a JSON object (no other text) with this structure:
{“passed”:true,”issues”:[“any issues found”],”recommendations”:[“suggestions”]}”””

class DataOpsAgent:
def __init__(self, llm=None):
self.llm = llm or LocalLLM()
self.history = []

def _extract_json(self, text):
try:
return json.loads(text)
except:
start, end = text.find(‘{‘), text.rfind(‘}’)+1
if start >= 0 and end > start:
try:
return json.loads(text[start:end])
except:
pass
return None

We define the system prompts for the Planner, Executor, and Tester roles of our DataOps Agent. We then initialize the DataOpsAgent class with helper methods and a JSON extraction utility to parse structured responses. We prepare the foundation for the agent’s reasoning and execution pipeline. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser def plan(self, task, data_info):
print(“n” + “=”*60)
print(“PHASE 1: PLANNING”)
print(“=”*60)
prompt = f”Task: {task}nnData Information:n{data_info}nnCreate an execution plan as JSON with steps, expected_output, and validation_criteria.”
plan_text = self.llm.generate(prompt, PLANNER_PROMPT, temperature=0.2)
self.history.append((“PLANNER”, plan_text))
plan = self._extract_json(plan_text) or {“steps”:[task],”expected_output”:”Processed data”,”validation_criteria”:[“Result generated”,”No errors”]}
print(f”n Plan Created:”)
print(f” Steps: {len(plan.get(‘steps’, []))}”)
for i, step in enumerate(plan.get(‘steps’, []), 1):
print(f” {i}. {step}”)
print(f” Expected: {plan.get(‘expected_output’, ‘N/A’)}”)
return plan

def execute(self, plan, data_context):
print(“n” + “=”*60)
print(“PHASE 2: EXECUTION”)
print(“=”*60)
steps_text = ‘n’.join(f”{i}. {s}” for i, s in enumerate(plan.get(‘steps’, []), 1))
prompt = f”Task Steps:n{steps_text}nnData available: DataFrame ‘df’n{data_context}nnWrite Python code to execute these steps. Store final result in ‘result’ variable.”
code = self.llm.generate(prompt, EXECUTOR_PROMPT, temperature=0.1)
self.history.append((“EXECUTOR”, code))
if ““`python” in code: code = code.split(““`python”)[1].split(““`”)[0]
elif ““`” in code: code = code.split(““`”)[1].split(““`”)[0]
lines = []
for line in code.split(‘n’):
s = line.strip()
if s and (not s.startswith(‘#’) or ‘import’ in s):
lines.append(line)
code = ‘n’.join(lines).strip()
print(f”n Generated Code:n” + “-“*60)
for i, line in enumerate(code.split(‘n’)[:15],1):
print(f”{i:2}. {line}”)
if len(code.split(‘n’))>15: print(f” … ({len(code.split(‘n’))-15} more lines)”)
print(“-“*60)
return code

We implement the Planning and Execution phases of the agent. We let the Planner create detailed task steps and validation criteria, and then the Executor generates corresponding Python code based on pandas to perform the task. We visualize how the agent autonomously transitions from reasoning to generating actionable code. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef test(self, plan, result, execution_error=None):
print(“n” + “=”*60)
print(“PHASE 3: TESTING & VERIFICATION”)
print(“=”*60)
result_desc = f”EXECUTION ERROR: {execution_error}” if execution_error else f”Result type: {type(result).__name__}n”
if not execution_error:
if isinstance(result, pd.DataFrame):
result_desc += f”Shape: {result.shape}nColumns: {list(result.columns)}nSample:n{result.head(3).to_string()}”
elif isinstance(result, (int,float,str)):
result_desc += f”Value: {result}”
else:
result_desc += f”Value: {str(result)[:200]}”
criteria_text = ‘n’.join(f”- {c}” for c in plan.get(‘validation_criteria’, []))
prompt = f”Validation Criteria:n{criteria_text}nnExpected: {plan.get(‘expected_output’, ‘N/A’)}nnActual Result:n{result_desc}nnEvaluate if result meets criteria. Return JSON with passed (true/false), issues, and recommendations.”
test_result = self.llm.generate(prompt, TESTER_PROMPT, temperature=0.2)
self.history.append((“TESTER”, test_result))
test_json = self._extract_json(test_result) or {“passed”:execution_error is None,”issues”:[“Could not parse test result”],”recommendations”:[“Review manually”]}
print(f”n✓ Test Results:n Status: {‘ PASSED’ if test_json.get(‘passed’) else ‘ FAILED’}”)
if test_json.get(‘issues’):
print(” Issues:”)
for issue in test_json[‘issues’][:3]:
print(f” • {issue}”)
if test_json.get(‘recommendations’):
print(” Recommendations:”)
for rec in test_json[‘recommendations’][:3]:
print(f” • {rec}”)
return test_json

def run(self, task, df=None, data_info=None):
print(“n SELF-VERIFYING DATA-OPS AGENT (Local HF Model)”)
print(f”Task: {task}n”)
if data_info is None and df is not None:
data_info = f”Shape: {df.shape}nColumns: {list(df.columns)}nSample:n{df.head(2).to_string()}”
plan = self.plan(task, data_info)
code = self.execute(plan, data_info)
result, error = None, None
try:
local_vars = {‘pd’: pd, ‘np’: np, ‘df’: df}
exec(code, local_vars)
result = local_vars.get(‘result’)
except Exception as e:
error = str(e)
print(f”n Execution Error: {error}”)
test_result = self.test(plan, result, error)
return {‘plan’: plan,’code’: code,’result’: result,’test’: test_result,’history’: self.history}

We focus on the Testing and Verification phase of our workflow. We let the agent evaluate its own output against predefined validation criteria and summarize the outcome as a structured JSON. We then integrate all three phases, planning, execution, and testing, into a single self-verifying pipeline that ensures complete automation. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef demo_basic(agent):
print(“n” + “#”*60)
print(“# DEMO 1: Sales Data Aggregation”)
print(“#”*60)
df = pd.DataFrame({‘product’:[‘A’,’B’,’A’,’C’,’B’,’A’,’C’],
‘sales’:[100,150,200,80,130,90,110],
‘region’:[‘North’,’South’,’North’,’East’,’South’,’West’,’East’]})
task = “Calculate total sales by product”
output = agent.run(task, df)
if output[‘result’] is not None:
print(f”n Final Result:n{output[‘result’]}”)
return output

def demo_advanced(agent):
print(“n” + “#”*60)
print(“# DEMO 2: Customer Age Analysis”)
print(“#”*60)
df = pd.DataFrame({‘customer_id’:range(1,11),
‘age’:[25,34,45,23,56,38,29,41,52,31],
‘purchases’:[5,12,8,3,15,7,9,11,6,10],
‘spend’:[500,1200,800,300,1500,700,900,1100,600,1000]})
task = “Calculate average spend by age group: young (under 35) and mature (35+)”
output = agent.run(task, df)
if output[‘result’] is not None:
print(f”n Final Result:n{output[‘result’]}”)
return output

if __name__ == “__main__”:
print(” Initializing Local LLM…”)
print(“Using CPU mode for maximum compatibilityn”)
try:
llm = LocalLLM(use_8bit=False)
agent = DataOpsAgent(llm)
demo_basic(agent)
print(“nn”)
demo_advanced(agent)
print(“n” + “=”*60)
print(” Tutorial Complete!”)
print(“=”*60)
print(“nKey Features:”)
print(” • 100% Local – No API calls required”)
print(” • Uses Phi-2 from Microsoft (2.7B params)”)
print(” • Self-verifying 3-phase workflow”)
print(” • Runs on free Google Colab CPU/GPU”)
except Exception as e:
print(f”n Error: {e}”)
print(“Troubleshooting:n1. pip install -q transformers accelerate scipyn2. Restart runtimen3. Try a different model”)

We built two demo examples to test the agent’s capabilities using simple sales and customer datasets. We initialize the model, execute the Data-Ops workflow, and observe the full cycle from planning to validation. We conclude the tutorial by summarizing key benefits and encouraging further experimentation with local models.

In conclusion, we created a fully autonomous and self-verifying DataOps system powered by a local Hugging Face model. We experience how each stage, planning, execution, and testing, seamlessly interacts to produce reliable results without relying on any cloud APIs. This workflow highlights the strength of local LLMs, such as Phi-2, for lightweight automation and inspires us to expand this architecture for more advanced data pipelines, validation frameworks, and multi-agent data systems in the future.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How to Build a Fully Self-Verifying Data Operations AI Agent Using Local Hugging Face Models for Automated Planning, Execution, and Testing appeared first on MarkTechPost.

How Powerful are Diffusion LLMs? Rethinking Generation with Any-Proces …

Posted on November 14, 2025 by i-genie

How powerful are Diffusion LLMs compared to classic autoregressive LLMs, once you treat generation as an algorithm with time and space complexity, not just as a decoding trick? A new research paper from a team researchers from Toyota Technological Institute at Chicago and MIT gives a formal answer. This new research compares Auto-Regressive Models (ARM), Masked Diffusion Models (MDM), and a new family called Any-Process MDM (AP-MDM), using complexity theory and controlled reasoning tasks.

https://arxiv.org/pdf/2510.06190

ARM vs MDM: Same Expressivity, Different Parallel Time

ARM uses next token prediction in a strict left to right order. Prior work already shows that with enough intermediate steps, ARM is Turing complete, so it can represent any computable function in principle, given enough context and compute.

MDM, the discrete diffusion style used in diffusion LLMs, works on a masked sequence. The model starts from a fully masked sequence and iteratively unmasks tokens. It can update many positions in parallel and in any order. MDM is modeled as an encoder only Transformer with context length (S(n)) and decoding steps (T(n)) for an input of size (n).

The research team shows:

MDM can simulate any PRAM (Parallel Random Access Machine) algorithm with parallel time (T(n)) using (O(T(n))) diffusion steps and context (S(n)) proportional to total work.

This makes MDM Turing complete and lets it match ideal parallel time on problems in NC, such as graph connectivity and some context free language tasks, where ARM needs time linear in sequence length.

Diffusion LLMs therefore gain efficiency on parallelizable problems, not extra expressive power by themselves.

Any-Order Generation Has Limited Benefits

A natural question is whether Any-Order Generation is strictly more powerful than left to right generation.

To isolate this, the research team defines an Any-Order MDM (AO-MDM) and a corresponding Masked ARM with the same architecture and similar token budget, but decoding in a fixed left to right way over a sequence padded with masks.

The main result:

Any computation performed by AO-MDM with one token per step and context (S(n)) can be reorganized into a left to right schedule and simulated by a Masked ARM with sequence length (O(S(n))) plus a constant number of extra layers.

In other words, once you control for parallelism and architecture, any order generation alone does not expand the class of problems beyond what ARM can already handle.

Both ARM and AO-MDM also share a space limitation. With context length (S(n)), they cannot efficiently solve problems that require more than roughly (S(n)3) serial time. With polynomial context, they are effectively limited to problems in the class P and cannot handle general NP hard tasks just by test time scaling.

Any-Process Generation and AP-MDM

To go beyond these limits, the research team proposes Any-Process Generation, instantiated as Any-Process MDM (AP-MDM).

AP-MDM keeps the masked diffusion view but extends the transition function with three extra operations, in addition to the usual unmask:

remask: turn an already decoded token back into the mask token M

insert: insert a new mask token at a chosen position

delete: delete a mask token that is no longer needed

These are controlled by a 3 bit vector per position (ct,i = (ct,i[1], ct,i[2], ct,i[3]). The same Transformer backbone predicts both content logits and these control bits.

remask uses the first bit to decide whether to overwrite a position with M, which enables backtracking and self correction.

insert and delete use the second and third bits to add or remove mask tokens, so the sequence length can grow or shrink during decoding.

Architecturally, AP-MDM only adds three small linear heads on top of an encoder only Transformer, so it is easy to add on top of existing MDM style diffusion LLMs.

https://arxiv.org/pdf/2510.06190

The key theoretical result:

AP-MDM can simulate any PRAM algorithm with optimal parallel time and optimal space, using context proportional to the true space (S(n)) rather than total work. With polynomial context, AP-MDM can realize computations in PSPACE, while standard MDM and ARM under the same context budget are restricted to P.

The research team also tried to prove that there exists a constant depth AP-MDM whose generation process cannot be simulated by any constant depth ARM or Masked ARM, under standard complexity assumptions.

Empirical Results: Sudoku, Dyck, Graphs, Parity

The experiments match the theory and make the differences concrete.

Sudoku

Sudoku, generalized to (n2 x n2) grids, is NP complete.

AP-MDM reaches 99.28 percent accuracy with about 1.2 million parameters and only 100 training instances.

An ARM baseline with ordering reaches 87.18 percent using 1.8 million training instances and about 5 times more parameters.

The best AO-MDM baseline reaches 89.49 percent under the same large data regime.

https://arxiv.org/pdf/2510.06190

This shows that editing operations, especially remask, are crucial to exploit test time scaling on hard reasoning tasks.

Dyck languages and coding style constraints

The research also analyzes two sided Dyck k languages, which model matched parentheses and are a core abstraction for code syntax. It proves that fixed ARM models cannot ensure valid generation for arbitrary lengths, while there exists an AP-MDM that generates exactly the Dyck language using insert and remask.

This matches how coding tasks require structure aware edits under global constraints, for example balanced brackets and consistent scopes.

Graph generation and structural editing

For graph editing tasks under global constraints, AP-MDM uses insert, delete and remask to implement a sequence of structured edits over a graph representation. The reported accuracy stays near perfect as graph size scales, while ARM degrades as the graph gets larger.

Parity and length generalization

On parity, AP-MDM learns a local elimination rule by repeatedly deleting pairs of bits, driven by remask and delete. It is trained only on length 2 sequences, then achieves 100 percent generalization to arbitrary lengths. ARM baselines struggle to reach similar generalization even with much longer training sequences.

https://arxiv.org/pdf/2510.06190

Key Takeaways

Any order Masked Diffusion Models are as expressive as autoregressive models once you fix architecture and parallelism, they mainly provide parallel efficiency rather than new computational power.

Masked Diffusion Models can simulate PRAM algorithms and achieve exponential speedup on parallelizable tasks in NC, but with polynomial context they remain effectively limited to problems in class P, similar to autoregressive models.

Any Process MDM extends diffusion LLMs with remask, insert and delete operations, implemented via a three bit control vector per token, and can simulate PRAM with both optimal parallel time and optimal space, reaching PSPACE level expressivity under polynomial context.

On hard reasoning tasks such as generalized Sudoku, Dyck languages, graph editing and parity, AP MDM shows strong empirical advantages, for example achieving about 99.28 percent Sudoku accuracy with only 100 training instances and a much smaller parameter budget than autoregressive and any order MDM baselines.

For domains like coding, mathematics and AI4Science that involve structured edits and revision histories, AP MDM aligns better with the underlying generation processes than next token prediction, and its editing operations are provably hard to simulate with constant depth autoregressive models.

Editorial Comments

Any-Process MDM is an important step because it treats generation as a full algorithm, not just a decoding order. The research work shows that Masked Diffusion Models already match PRAM parallel time, but remain in P under polynomial context, similar to autoregressive models. By adding remask, insert and delete, AP-MDM reaches PSPACE-level expressivity with polynomial context and achieves strong empirical gains on Sudoku, Dyck, graph editing and parity. Overall, AP-MDM makes a strong case that future frontier LLMs should adopt edit-based Any-Process Generation, not just faster autoregression.

Check out the Paper and Repo. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How Powerful are Diffusion LLMs? Rethinking Generation with Any-Process Masked Diffusion Models appeared first on MarkTechPost.

How to Build a Fully Functional Custom GPT-style Conversational AI Loc …

Posted on November 14, 2025 by i-genie

In this tutorial, we build our own custom GPT-style chat system from scratch using a local Hugging Face model. We start by loading a lightweight instruction-tuned model that understands conversational prompts, then wrap it inside a structured chat framework that includes a system role, user memory, and assistant responses. We define how the agent interprets context, constructs messages, and optionally uses small built-in tools to fetch local data or simulated search results. By the end, we have a fully functional, conversational model that behaves like a personalized GPT running. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser!pip install transformers accelerate sentencepiece –quiet
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List, Tuple, Optional
import textwrap, json, os

We begin by installing the essential libraries and importing the required modules. We ensure that the environment has all necessary dependencies, such as transformers, torch, and sentencepiece, ready for use. This setup allows us to work seamlessly with Hugging Face models inside Google Colab. Check out the FULL CODES here.

Copy CodeCopiedUse a different BrowserMODEL_NAME = “microsoft/Phi-3-mini-4k-instruct”
BASE_SYSTEM_PROMPT = (
“You are a custom GPT running locally. ”
“Follow user instructions carefully. ”
“Be concise and structured. ”
“If something is unclear, say it is unclear. ”
“Prefer practical examples over corporate examples unless explicitly asked. ”
“When asked for code, give runnable code.”
)
MAX_NEW_TOKENS = 256

We configure our model name, define the system prompt that governs the assistant’s behavior, and set token limits. We establish how our custom GPT should respond, concise, structured, and practical. This section defines the foundation of our model’s identity and instruction style. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserprint(“Loading model…”)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token_id is None:
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
device_map=”auto”
)
model.eval()
print(“Model loaded.”)

We load the tokenizer and model from Hugging Face into memory and prepare them for inference. We automatically adjust the device mapping based on available hardware, ensuring GPU acceleration if possible. Once loaded, our model is ready to generate responses. Check out the FULL CODES here.

Copy CodeCopiedUse a different BrowserConversationHistory = List[Tuple[str, str]]
history: ConversationHistory = [(“system”, BASE_SYSTEM_PROMPT)]

def wrap_text(s: str, w: int = 100) -> str:
return “n”.join(textwrap.wrap(s, width=w))

def build_chat_prompt(history: ConversationHistory, user_msg: str) -> str:
prompt_parts = []
for role, content in history:
if role == “system”:
prompt_parts.append(f”<|system|>n{content}n”)
elif role == “user”:
prompt_parts.append(f”<|user|>n{content}n”)
elif role == “assistant”:
prompt_parts.append(f”<|assistant|>n{content}n”)
prompt_parts.append(f”<|user|>n{user_msg}n”)
prompt_parts.append(“<|assistant|>n”)
return “”.join(prompt_parts)

We initialize the conversation history, starting with a system role, and create a prompt builder to format messages. We define how user and assistant turns are arranged in a consistent conversational structure. This ensures the model always understands the dialogue context correctly. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef local_tool_router(user_msg: str) -> Optional[str]:
msg = user_msg.strip().lower()
if msg.startswith(“search:”):
query = user_msg.split(“:”, 1)[-1].strip()
return f”Search results about ‘{query}’:n- Key point 1n- Key point 2n- Key point 3″
if msg.startswith(“docs:”):
topic = user_msg.split(“:”, 1)[-1].strip()
return f”Documentation extract on ‘{topic}’:n1. The agent orchestrates tools.n2. The model consumes output.n3. Responses become memory.”
return None

We add a lightweight tool router that extends our GPT’s capability to simulate tasks like search or documentation retrieval. We define logic to detect special prefixes such as “search:” or “docs:” in user queries. This simple agentic design gives our assistant contextual awareness. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef generate_reply(history: ConversationHistory, user_msg: str) -> str:
tool_context = local_tool_router(user_msg)
if tool_context:
user_msg = user_msg + “nnUseful context:n” + tool_context
prompt = build_chat_prompt(history, user_msg)
inputs = tokenizer(prompt, return_tensors=”pt”).to(model.device)
with torch.no_grad():
output_ids = model.generate(
**inputs,
max_new_tokens=MAX_NEW_TOKENS,
do_sample=True,
top_p=0.9,
temperature=0.6,
pad_token_id=tokenizer.eos_token_id
)
decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
reply = decoded.split(“<|assistant|>”)[-1].strip() if “<|assistant|>” in decoded else decoded[len(prompt):].strip()
history.append((“user”, user_msg))
history.append((“assistant”, reply))
return reply

def save_history(history: ConversationHistory, path: str = “chat_history.json”) -> None:
data = [{“role”: r, “content”: c} for (r, c) in history]
with open(path, “w”) as f:
json.dump(data, f, indent=2)

def load_history(path: str = “chat_history.json”) -> ConversationHistory:
if not os.path.exists(path):
return [(“system”, BASE_SYSTEM_PROMPT)]
with open(path, “r”) as f:
data = json.load(f)
return [(item[“role”], item[“content”]) for item in data]

We define the primary reply generation function, which combines history, context, and model inference to produce coherent outputs. We also add functions to save and load past conversations for persistence. This snippet forms the operational core of our custom GPT. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserprint(“n— Demo turn 1 —“)
demo_reply_1 = generate_reply(history, “Explain what this custom GPT setup is doing in 5 bullet points.”)
print(wrap_text(demo_reply_1))

print(“n— Demo turn 2 —“)
demo_reply_2 = generate_reply(history, “search: agentic ai with local models”)
print(wrap_text(demo_reply_2))

def interactive_chat():
print(“nChat ready. Type ‘exit’ to stop.”)
while True:
try:
user_msg = input(“nUser: “).strip()
except EOFError:
break
if user_msg.lower() in (“exit”, “quit”, “q”):
break
reply = generate_reply(history, user_msg)
print(“nAssistant:n” + wrap_text(reply))

# interactive_chat()
print(“nCustom GPT initialized successfully.”)

We test the entire setup by running demo prompts and displaying generated responses. We also create an optional interactive chat loop to converse directly with the assistant. By the end, we confirm that our custom GPT runs locally and responds intelligently in real time.

In conclusion, we designed and executed a custom conversational agent that mirrors GPT-style reasoning without relying on any external services. We saw how local models can be made interactive through prompt orchestration, lightweight tool routing, and conversational memory management. This approach enables us to understand the internal logic behind commercial GPT systems. It empowers us to experiment with our own rules, behaviors, and integrations in a transparent and fully offline manner.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How to Build a Fully Functional Custom GPT-style Conversational AI Locally Using Hugging Face Transformers appeared first on MarkTechPost.

How to Build an End-to-End Interactive Analytics Dashboard Using PyGWa …

Posted on November 13, 2025 by i-genie

In this tutorial, we explore the advanced capabilities of PyGWalker, a powerful tool for visual data analysis that integrates seamlessly with pandas. We begin by generating a realistic e-commerce dataset enriched with time, demographic, and marketing features to mimic real-world business data. We then prepare multiple analytical views, including daily sales, category performance, and customer segment summaries. Finally, we use PyGWalker to interactively explore patterns, correlations, and trends across these dimensions through intuitive drag-and-drop visualizations. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser!pip install pygwalker pandas numpy scikit-learn

import pandas as pd
import numpy as np
import pygwalker as pyg
from datetime import datetime, timedelta

We begin by setting up our environment, installing all necessary dependencies, and importing essential libraries, including pandas, numpy, and pygwalker. We ensure that everything is ready for building our interactive data exploration workflow in Colab. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef generate_advanced_dataset():
np.random.seed(42)
start_date = datetime(2022, 1, 1)
dates = [start_date + timedelta(days=x) for x in range(730)]
categories = [‘Electronics’, ‘Clothing’, ‘Home & Garden’, ‘Sports’, ‘Books’]
products = {
‘Electronics’: [‘Laptop’, ‘Smartphone’, ‘Headphones’, ‘Tablet’, ‘Smartwatch’],
‘Clothing’: [‘T-Shirt’, ‘Jeans’, ‘Dress’, ‘Jacket’, ‘Sneakers’],
‘Home & Garden’: [‘Furniture’, ‘Lamp’, ‘Rug’, ‘Plant’, ‘Cookware’],
‘Sports’: [‘Yoga Mat’, ‘Dumbbell’, ‘Running Shoes’, ‘Bicycle’, ‘Tennis Racket’],
‘Books’: [‘Fiction’, ‘Non-Fiction’, ‘Biography’, ‘Science’, ‘History’]
}
n_transactions = 5000
data = []
for _ in range(n_transactions):
date = np.random.choice(dates)
category = np.random.choice(categories)
product = np.random.choice(products[category])
base_prices = {
‘Electronics’: (200, 1500),
‘Clothing’: (20, 150),
‘Home & Garden’: (30, 500),
‘Sports’: (25, 300),
‘Books’: (10, 50)
}
price = np.random.uniform(*base_prices[category])
quantity = np.random.choice([1, 1, 1, 2, 2, 3], p=[0.5, 0.2, 0.15, 0.1, 0.03, 0.02])
customer_segment = np.random.choice([‘Premium’, ‘Standard’, ‘Budget’], p=[0.2, 0.5, 0.3])
age_group = np.random.choice([’18-25′, ’26-35′, ’36-45′, ’46-55′, ’56+’])
region = np.random.choice([‘North’, ‘South’, ‘East’, ‘West’, ‘Central’])
month = date.month
seasonal_factor = 1.0
if month in [11, 12]:
seasonal_factor = 1.5
elif month in [6, 7]:
seasonal_factor = 1.2
revenue = price * quantity * seasonal_factor
discount = np.random.choice([0, 5, 10, 15, 20, 25], p=[0.4, 0.2, 0.15, 0.15, 0.07, 0.03])
marketing_channel = np.random.choice([‘Organic’, ‘Social Media’, ‘Email’, ‘Paid Ads’])
base_satisfaction = 4.0
if customer_segment == ‘Premium’:
base_satisfaction += 0.5
if discount > 15:
base_satisfaction += 0.3
satisfaction = np.clip(base_satisfaction + np.random.normal(0, 0.5), 1, 5)
data.append({
‘Date’: date, ‘Category’: category, ‘Product’: product, ‘Price’: round(price, 2),
‘Quantity’: quantity, ‘Revenue’: round(revenue, 2), ‘Customer_Segment’: customer_segment,
‘Age_Group’: age_group, ‘Region’: region, ‘Discount_%’: discount,
‘Marketing_Channel’: marketing_channel, ‘Customer_Satisfaction’: round(satisfaction, 2),
‘Month’: date.strftime(‘%B’), ‘Year’: date.year, ‘Quarter’: f’Q{(date.month-1)//3 + 1}’
})
df = pd.DataFrame(data)
df[‘Profit_Margin’] = round(df[‘Revenue’] * (1 – df[‘Discount_%’]/100) * 0.3, 2)
df[‘Days_Since_Start’] = (df[‘Date’] – df[‘Date’].min()).dt.days
return df

We design a function to generate a comprehensive e-commerce dataset that mirrors real-world business conditions. We include product categories, customer demographics, seasonal effects, and satisfaction levels, ensuring that our data is diverse and analytically rich. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserprint(“Generating advanced e-commerce dataset…”)
df = generate_advanced_dataset()
print(f”nDataset Overview:”)
print(f”Total Transactions: {len(df)}”)
print(f”Date Range: {df[‘Date’].min()} to {df[‘Date’].max()}”)
print(f”Total Revenue: ${df[‘Revenue’].sum():,.2f}”)
print(f”nColumns: {list(df.columns)}”)
print(“nFirst few rows:”)
print(df.head())

We execute the dataset generation function and display key insights, including total transactions, revenue range, and sample records. We get a clear snapshot of the data’s structure and confirm that it’s suitable for detailed analysis. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdaily_sales = df.groupby(‘Date’).agg({
‘Revenue’: ‘sum’, ‘Quantity’: ‘sum’, ‘Customer_Satisfaction’: ‘mean’
}).reset_index()

category_analysis = df.groupby(‘Category’).agg({
‘Revenue’: [‘sum’, ‘mean’], ‘Quantity’: ‘sum’, ‘Customer_Satisfaction’: ‘mean’, ‘Profit_Margin’: ‘sum’
}).reset_index()
category_analysis.columns = [‘Category’, ‘Total_Revenue’, ‘Avg_Order_Value’,
‘Total_Quantity’, ‘Avg_Satisfaction’, ‘Total_Profit’]

segment_analysis = df.groupby([‘Customer_Segment’, ‘Region’]).agg({
‘Revenue’: ‘sum’, ‘Customer_Satisfaction’: ‘mean’
}).reset_index()

print(“n” + “=”*50)
print(“DATASET READY FOR PYGWALKER VISUALIZATION”)
print(“=”*50)

We perform data aggregations to prepare multiple analytical perspectives, including time-based trends, category-level summaries, and performance metrics for customer segments. We organize this information to make it easily visualizable in PyGWalker. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserprint(“n Launching PyGWalker Interactive Interface…”)
walker = pyg.walk(
df,
spec=”./pygwalker_config.json”,
use_kernel_calc=True,
theme_key=’g2′
)

print(“n PyGWalker is now running!”)
print(” Try creating these visualizations:”)
print(” – Revenue trend over time (line chart)”)
print(” – Category distribution (pie chart)”)
print(” – Price vs Satisfaction scatter plot”)
print(” – Regional sales heatmap”)
print(” – Discount effectiveness analysis”)

We launch the PyGWalker interactive interface to visually explore our dataset. We create meaningful charts, uncover trends in sales, satisfaction, and pricing, and observe how interactive visualization enhances our analytical understanding.

Data View

Visualization

Chat with Data

In conclusion, we developed a comprehensive data visualization workflow using PyGWalker, encompassing dataset generation, feature engineering, multidimensional analysis, and interactive exploration. We experience how PyGWalker transforms raw tabular data into rich, exploratory dashboards without needing complex code or BI tools. Through this exercise, we strengthen our ability to derive insights quickly, experiment visually, and connect data storytelling directly to practical business understanding.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How to Build an End-to-End Interactive Analytics Dashboard Using PyGWalker Features for Insightful Data Exploration appeared first on MarkTechPost.