Vision-RAG vs Text-RAG: A Technical Comparison for Enterprise Search

Most RAG failures originate at retrieval, not generation. Text-first pipelines lose layout semantics, table structure, and figure grounding during PDF→text conversion, degrading recall and precision before an LLM ever runs. Vision-RAG—retrieving rendered pages with vision-language embeddings—directly targets this bottleneck and shows material end-to-end gains on visually rich corpora.

Pipelines (and where they fail)

Text-RAG. PDF → (parser/OCR) → text chunks → text embeddings → ANN index → retrieve → LLM. Typical failure modes: OCR noise, multi-column flow breakage, table cell structure loss, and missing figure/chart semantics—documented by table- and doc-VQA benchmarks created to measure exactly these gaps.

Vision-RAG. PDF → page raster(s) → VLM embeddings (often multi-vector with late-interaction scoring) → ANN index → retrieve → VLM/LLM consumes high-fidelity crops or full pages. This preserves layout and figure-text grounding; recent systems (ColPali, VisRAG, VDocRAG) validate the approach.

What current evidence supports

Document-image retrieval works and is simpler. ColPali embeds page images and uses late-interaction matching; on the ViDoRe benchmark it outperforms modern text pipelines while remaining end-to-end trainable.

End-to-end lift is measurable. VisRAG reports 25–39% end-to-end improvement over text-RAG on multimodal documents when both retrieval and generation use a VLM.

Unified image format for real-world docs. VDocRAG shows that keeping documents in a unified image format (tables, charts, PPT/PDF) avoids parser loss and improves generalization; it also introduces OpenDocVQA for evaluation.

Resolution drives reasoning quality. High-resolution support in VLMs (e.g., Qwen2-VL/Qwen2.5-VL) is explicitly tied to SoTA results on DocVQA/MathVista/MTVQA; fidelity matters for ticks, superscripts, stamps, and small fonts.

Costs: vision context is often an order of magnitude heavier—because of tokens

Vision inputs inflate token counts via tiling, not necessarily per-token price. For GPT-4o-class models, total tokens ≈ base + (tile_tokens × tiles), so a 1–2 MP page can cost roughly 10× as much as a small text chunk. Anthropic recommends capping images at about 1.15 MP (~1.6k tokens) for responsiveness. By contrast, Google Gemini 2.5 Flash-Lite prices text, image, and video at the same per-token rate, but large images still consume many more tokens. Engineering implication: adopt selective fidelity (crop > downsample > full page).
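
To make the budget concrete, here is a minimal sketch of tile-based image token accounting. The constants (512 px tiles, 85 base tokens, 170 tokens per tile) follow commonly published GPT-4o-class guidance and are assumptions to check against your provider's current docs.

import math

def image_tokens(width_px: int, height_px: int,
                 tile_px: int = 512, base_tokens: int = 85,
                 tokens_per_tile: int = 170) -> int:
    # Assumed tile-based accounting: total ≈ base + tiles * tokens_per_tile.
    tiles = math.ceil(width_px / tile_px) * math.ceil(height_px / tile_px)
    return base_tokens + tiles * tokens_per_tile

# A ~2 MP page render vs. a small text chunk (~300 tokens).
page = image_tokens(1240, 1754)   # A4 page rendered at ~150 DPI -> 12 tiles
print(page, round(page / 300, 1))  # several times the tokens of a text chunk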

Design rules for production Vision-RAG

Align modalities across embeddings. Use encoders trained for text-image alignment (CLIP-family or VLM retrievers) and, in practice, dual-index: cheap text recall for coverage plus vision rerank for precision. ColPali’s late-interaction (MaxSim-style) scoring is a strong default for page images.
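
As an illustration of late-interaction scoring, the sketch below computes a ColBERT/ColPali-style MaxSim score between multi-vector query embeddings and page-patch embeddings; the shapes and normalization are assumptions for illustration, not the exact ColPali implementation.

import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim), page_emb: (num_patches, dim)
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    p = torch.nn.functional.normalize(page_emb, dim=-1)
    sim = q @ p.T                        # (num_query_tokens, num_patches)
    return sim.max(dim=-1).values.sum()  # MaxSim: best patch per query token, summed

# Toy example: 12 query tokens, 1024 page patches, 128-dim embeddings.
score = maxsim_score(torch.randn(12, 128), torch.randn(1024, 128))
print(float(score))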

Feed high-fidelity inputs selectively. Coarse-to-fine: run BM25/DPR, take top-k pages to a vision reranker, then send only ROI crops (tables, charts, stamps) to the generator. This preserves crucial pixels without exploding tokens under tile-based accounting.
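
A minimal sketch of that coarse-to-fine flow, with stub retrieval, reranking, and cropping functions standing in for real BM25/DPR, vision-reranker, and ROI-detection components (all names and structures are illustrative assumptions):

from typing import List, Dict

def text_recall(query: str, pages: List[Dict], k: int = 50) -> List[Dict]:
    # Stub lexical recall: keyword overlap on extracted page text (replace with BM25/DPR).
    q = set(query.lower().split())
    scored = [(len(q & set(p["text"].lower().split())), p) for p in pages]
    return [p for _, p in sorted(scored, key=lambda t: -t[0])[:k]]

def vision_rerank(query: str, candidates: List[Dict], k: int = 5) -> List[Dict]:
    # Stub: replace with a page-image retriever (e.g., MaxSim over patch embeddings).
    return candidates[:k]

def roi_crops(page: Dict) -> List[Dict]:
    # Stub: replace with a layout detector returning table/chart/stamp regions.
    return [{"page_id": page["id"], "bbox": bbox} for bbox in page.get("regions", [])]

def retrieve_context(query: str, pages: List[Dict]) -> List[Dict]:
    coarse = text_recall(query, pages)    # cheap, high recall
    fine = vision_rerank(query, coarse)   # precise, page images
    return [crop for page in fine for crop in roi_crops(page)]  # only the pixels that matter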

Engineer for real documents.
• Tables: if you must parse, use table-structure models (e.g., PubTables-1M/TATR); otherwise prefer image-native retrieval.
• Charts/diagrams: expect tick- and legend-level cues; resolution must retain these. Evaluate on chart-focused VQA sets.
• Whiteboards/rotations/multilingual: page rendering avoids many OCR failure modes; multilingual scripts and rotated scans survive the pipeline.
• Provenance: store page hashes and crop coordinates alongside embeddings to reproduce the exact visual evidence used in answers (see the sketch below).
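
For the provenance point above, a minimal sketch of recording reproducible visual evidence (a SHA-256 page hash plus crop coordinates) next to each embedding; the record layout is an assumption, not a fixed schema:

import hashlib, json

def evidence_record(page_bytes: bytes, page_id: str, bbox: tuple) -> dict:
    # bbox = (x0, y0, x1, y1) in rendered-page pixel coordinates.
    return {
        "page_id": page_id,
        "page_sha256": hashlib.sha256(page_bytes).hexdigest(),
        "crop_bbox": list(bbox),
    }

record = evidence_record(b"...rendered page PNG bytes...", "doc42_p3", (120, 540, 880, 910))
print(json.dumps(record, indent=2))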

Text-RAG vs. Vision-RAG, dimension by dimension:
• Ingest pipeline. Text-RAG: PDF → parser/OCR → text chunks → text embeddings → ANN. Vision-RAG: PDF → page render(s) → VLM page/crop embeddings (often multi-vector, late interaction) → ANN; ColPali is a canonical implementation.
• Primary failure modes. Text-RAG: parser drift, OCR noise, multi-column flow breakage, table structure loss, missing figure/chart semantics; benchmarks exist because these errors are common. Vision-RAG: preserves layout/figures; failures shift to resolution/tiling choices and cross-modal alignment; VDocRAG formalizes “unified image” processing to avoid parsing loss.
• Retriever representation. Text-RAG: single-vector text embeddings; rerank via lexical or cross-encoders. Vision-RAG: page-image embeddings with late interaction (MaxSim-style) capture local regions and improve page-level retrieval on ViDoRe.
• End-to-end gains. Text-RAG: baseline. Vision-RAG: +25–39% E2E on multimodal docs when both retrieval and generation are VLM-based (VisRAG).
• Where it excels. Text-RAG: clean, text-dominant corpora; low latency/cost. Vision-RAG: visually rich/structured docs (tables, charts, stamps, rotated scans, multilingual typography); unified page context helps QA.
• Resolution sensitivity. Text-RAG: not applicable beyond OCR settings. Vision-RAG: reasoning quality tracks input fidelity (ticks, small fonts); high-resolution document VLMs (e.g., the Qwen2-VL family) emphasize this.
• Cost model (inputs). Text-RAG: tokens ≈ characters; cheap retrieval contexts. Vision-RAG: image tokens grow with tiling (OpenAI base+tiles formula; Anthropic guidance ~1.15 MP ≈ ~1.6k tokens); even when per-token price is equal (Gemini 2.5 Flash-Lite), high-res pages consume far more tokens.
• Cross-modal alignment need. Text-RAG: not required. Vision-RAG: critical; text-image encoders must share geometry for mixed queries; ColPali/ViDoRe demonstrate effective page-image retrieval aligned to language tasks.
• Benchmarks to track. Text-RAG: DocVQA (doc QA), PubTables-1M (table structure) for parsing-loss diagnostics. Vision-RAG: ViDoRe (page retrieval), VisRAG (pipeline), VDocRAG (unified-image RAG).
• Evaluation approach. Text-RAG: IR metrics plus text QA; may miss figure-text grounding issues. Vision-RAG: joint retrieval+generation on visually rich suites (e.g., OpenDocVQA under VDocRAG) to capture crop relevance and layout grounding.
• Operational pattern. Text-RAG: one-stage retrieval; cheap to scale. Vision-RAG: coarse-to-fine (text recall → vision rerank → ROI crops to generator) keeps token costs bounded while preserving fidelity; tiling math and pricing inform budgets.
• When to prefer. Text-RAG: contracts/templates, code/wikis, normalized tabular data (CSV/Parquet). Vision-RAG: real-world enterprise docs with heavy layout/graphics; compliance workflows needing pixel-exact provenance (page hash + crop coords).
• Representative systems. Text-RAG: DPR/BM25 + cross-encoder rerank. Vision-RAG: ColPali (ICLR 2025) vision retriever; VisRAG pipeline; VDocRAG unified image framework.

When is Text-RAG still the right default?

Clean, text-dominant corpora (contracts with fixed templates, wikis, code)

Strict latency/cost constraints for short answers

Data already normalized (CSV/Parquet)—skip pixels and query the table store

Evaluation: measure retrieval + generation jointly

Add multimodal RAG benchmarks to your harness—e.g., M²RAG (multi-modal QA, captioning, fact-verification, reranking), REAL-MM-RAG (real-world multi-modal retrieval), and RAG-Check (relevance + correctness metrics for multi-modal context). These catch failure cases (irrelevant crops, figure-text mismatch) that text-only metrics miss.
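
A minimal sketch of evaluating retrieval and generation jointly rather than in isolation; the retriever, generator, and judging helpers are stubs you would replace with your own components:

def evaluate_joint(samples, retrieve, generate, judge_answer, judge_evidence):
    # samples: dicts with "question", "gold_answer", and "gold_page_ids".
    hits, answer_ok, evidence_ok = 0, 0, 0
    for s in samples:
        crops = retrieve(s["question"])
        hits += any(c["page_id"] in s["gold_page_ids"] for c in crops)
        answer = generate(s["question"], crops)
        answer_ok += judge_answer(answer, s["gold_answer"])
        evidence_ok += judge_evidence(crops, s)          # catches irrelevant crops
    n = len(samples)
    return {"retrieval_hit@k": hits / n, "answer_acc": answer_ok / n,
            "evidence_precision": evidence_ok / n}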

Summary

Text-RAG remains efficient for clean, text-only data. Vision-RAG is the practical default for enterprise documents with layout, tables, charts, stamps, scans, and multilingual typography. Teams that (1) align modalities, (2) deliver selective high-fidelity visual evidence, and (3) evaluate with multimodal benchmarks consistently get higher retrieval precision and better downstream answers—now backed by ColPali (ICLR 2025), VisRAG’s 25–39% E2E lift, and VDocRAG’s unified image-format results.

References:

https://arxiv.org/abs/2407.01449

https://github.com/illuin-tech/vidore-benchmark

https://huggingface.co/vidore

https://arxiv.org/abs/2410.10594

https://github.com/OpenBMB/VisRAG

https://huggingface.co/openbmb/VisRAG-Ret

https://arxiv.org/abs/2504.09795

https://openaccess.thecvf.com/content/CVPR2025/papers/Tanaka_VDocRAG_Retrieval-Augmented_Generation_over_Visually-Rich_Documents_CVPR_2025_paper.pdf

https://cvpr.thecvf.com/virtual/2025/poster/34926

https://vdocrag.github.io/

https://arxiv.org/abs/2110.00061

https://openaccess.thecvf.com/content/CVPR2022/papers/Smock_PubTables-1M_Towards_Comprehensive_Table_Extraction_From_Unstructured_Documents_CVPR_2022_paper.pdf

https://huggingface.co/datasets/bsmock/pubtables-1m

https://arxiv.org/abs/2007.00398

https://www.docvqa.org/datasets

https://qwenlm.github.io/blog/qwen2-vl/

https://arxiv.org/html/2409.12191v1

https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct

https://arxiv.org/abs/2203.10244

https://arxiv.org/abs/2504.05506

https://aclanthology.org/2025.findings-acl.978.pdf

https://arxiv.org/pdf/2504.05506

https://openai.com/api/pricing/

https://docs.claude.com/en/docs/build-with-claude/vision

https://docs.claude.com/en/docs/build-with-claude/token-counting

https://ai.google.dev/gemini-api/docs/pricing

https://arxiv.org/abs/2502.17297

https://openreview.net/forum?id=1oCZoWvb8i

https://github.com/NEUIR/M2RAG

https://arxiv.org/abs/2502.12342

https://aclanthology.org/2025.acl-long.1528/

https://aclanthology.org/2025.acl-long.1528.pdf

https://huggingface.co/collections/ibm-research/real-mm-rag-bench-67d2dc0ddf2dfafe66f09d34

https://research.ibm.com/publications/real-mm-rag-a-real-world-multi-modal-retrieval-benchmark

https://arxiv.org/abs/2501.03995

https://platform.openai.com/docs/guides/images-vision


How to Master Advanced TorchVision v2 Transforms, MixUp, CutMix, and Modern CNN Training for State-of-the-Art Computer Vision?

In this tutorial, we explore advanced computer vision techniques using TorchVision’s v2 transforms, modern augmentation strategies, and powerful training enhancements. We walk through the process of building an augmentation pipeline, applying MixUp and CutMix, designing a modern CNN with attention, and implementing a robust training loop. By running everything seamlessly in Google Colab, we position ourselves to understand and apply state-of-the-art practices in deep learning with clarity and efficiency. Check out the FULL CODES here.

!pip install torch torchvision torchaudio --quiet
!pip install matplotlib pillow numpy --quiet

import torch
import torchvision
from torchvision import transforms as T
from torchvision.transforms import v2
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import requests
from io import BytesIO

print(f"PyTorch version: {torch.__version__}")
print(f"TorchVision version: {torchvision.__version__}")

We begin by installing the libraries and importing all the essential modules for our workflow. We set up PyTorch, TorchVision v2 transforms, and supporting tools like NumPy, PIL, and Matplotlib, so we are ready to build and test advanced computer vision pipelines. Check out the FULL CODES here.

class AdvancedAugmentationPipeline:
    def __init__(self, image_size=224, training=True):
        self.image_size = image_size
        self.training = training
        base_transforms = [
            v2.ToImage(),
            v2.ToDtype(torch.uint8, scale=True),
        ]
        if training:
            self.transform = v2.Compose([
                *base_transforms,
                v2.Resize((image_size + 32, image_size + 32)),
                v2.RandomResizedCrop(image_size, scale=(0.8, 1.0), ratio=(0.9, 1.1)),
                v2.RandomHorizontalFlip(p=0.5),
                v2.RandomRotation(degrees=15),
                v2.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
                v2.RandomGrayscale(p=0.1),
                v2.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),
                v2.RandomPerspective(distortion_scale=0.1, p=0.3),
                v2.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
                v2.ToDtype(torch.float32, scale=True),
                v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
            ])
        else:
            self.transform = v2.Compose([
                *base_transforms,
                v2.Resize((image_size, image_size)),
                v2.ToDtype(torch.float32, scale=True),
                v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
            ])

    def __call__(self, image):
        return self.transform(image)

We define an advanced augmentation pipeline that adapts to both training and validation modes. We apply powerful TorchVision v2 transforms, such as cropping, flipping, color jittering, blurring, perspective, and affine transformations, during training, while keeping validation preprocessing simple with resizing and normalization. This way, we ensure that we enrich the training data for better generalization while maintaining consistent and stable evaluation. Check out the FULL CODES here.

class AdvancedMixupCutmix:
    def __init__(self, mixup_alpha=1.0, cutmix_alpha=1.0, prob=0.5):
        self.mixup_alpha = mixup_alpha
        self.cutmix_alpha = cutmix_alpha
        self.prob = prob

    def mixup(self, x, y):
        batch_size = x.size(0)
        lam = np.random.beta(self.mixup_alpha, self.mixup_alpha) if self.mixup_alpha > 0 else 1
        index = torch.randperm(batch_size)
        mixed_x = lam * x + (1 - lam) * x[index, :]
        y_a, y_b = y, y[index]
        return mixed_x, y_a, y_b, lam

    def cutmix(self, x, y):
        batch_size = x.size(0)
        lam = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) if self.cutmix_alpha > 0 else 1
        index = torch.randperm(batch_size)
        y_a, y_b = y, y[index]
        bbx1, bby1, bbx2, bby2 = self._rand_bbox(x.size(), lam)
        x[:, :, bbx1:bbx2, bby1:bby2] = x[index, :, bbx1:bbx2, bby1:bby2]
        lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (x.size()[-1] * x.size()[-2]))
        return x, y_a, y_b, lam

    def _rand_bbox(self, size, lam):
        W = size[2]
        H = size[3]
        cut_rat = np.sqrt(1. - lam)
        cut_w = int(W * cut_rat)
        cut_h = int(H * cut_rat)
        cx = np.random.randint(W)
        cy = np.random.randint(H)
        bbx1 = np.clip(cx - cut_w // 2, 0, W)
        bby1 = np.clip(cy - cut_h // 2, 0, H)
        bbx2 = np.clip(cx + cut_w // 2, 0, W)
        bby2 = np.clip(cy + cut_h // 2, 0, H)
        return bbx1, bby1, bbx2, bby2

    def __call__(self, x, y):
        if np.random.random() > self.prob:
            return x, y, y, 1.0
        if np.random.random() < 0.5:
            return self.mixup(x, y)
        else:
            return self.cutmix(x, y)


class ModernCNN(nn.Module):
    def __init__(self, num_classes=10, dropout=0.3):
        super(ModernCNN, self).__init__()
        self.conv1 = self._conv_block(3, 64)
        self.conv2 = self._conv_block(64, 128, downsample=True)
        self.conv3 = self._conv_block(128, 256, downsample=True)
        self.conv4 = self._conv_block(256, 512, downsample=True)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.attention = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.Sigmoid()
        )
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(dropout / 2),
            nn.Linear(256, num_classes)
        )

    def _conv_block(self, in_channels, out_channels, downsample=False):
        stride = 2 if downsample else 1
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.gap(x)
        x = torch.flatten(x, 1)
        attention_weights = self.attention(x)
        x = x * attention_weights
        return self.classifier(x)

We strengthen our training with a unified MixUp/CutMix module, where we stochastically blend images or patch-swap regions and compute label interpolation with the exact pixel ratio. We pair this with a modern CNN that stacks progressive conv blocks, applies global average pooling, and uses a learned attention gate before a dropout-regularized classifier, so we improve generalization while keeping inference straightforward. Check out the FULL CODES here.

class AdvancedTrainer:
    def __init__(self, model, device='cuda' if torch.cuda.is_available() else 'cpu'):
        self.model = model.to(device)
        self.device = device
        self.mixup_cutmix = AdvancedMixupCutmix()
        self.optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
        self.scheduler = optim.lr_scheduler.OneCycleLR(
            self.optimizer, max_lr=1e-2, epochs=10, steps_per_epoch=100
        )
        self.criterion = nn.CrossEntropyLoss()

    def mixup_criterion(self, pred, y_a, y_b, lam):
        return lam * self.criterion(pred, y_a) + (1 - lam) * self.criterion(pred, y_b)

    def train_epoch(self, dataloader):
        self.model.train()
        total_loss = 0
        correct = 0
        total = 0
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.to(self.device), target.to(self.device)
            data, target_a, target_b, lam = self.mixup_cutmix(data, target)
            self.optimizer.zero_grad()
            output = self.model(data)
            if lam != 1.0:
                loss = self.mixup_criterion(output, target_a, target_b, lam)
            else:
                loss = self.criterion(output, target)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            self.optimizer.step()
            self.scheduler.step()
            total_loss += loss.item()
            _, predicted = output.max(1)
            total += target.size(0)
            if lam != 1.0:
                correct += (lam * predicted.eq(target_a).sum().item() +
                            (1 - lam) * predicted.eq(target_b).sum().item())
            else:
                correct += predicted.eq(target).sum().item()
        return total_loss / len(dataloader), 100. * correct / total

We orchestrate training with AdamW, OneCycleLR, and dynamic MixUp/CutMix so we stabilize optimization and boost generalization. We compute an interpolated loss when mixing, clip gradients for safety, and step the scheduler each batch, so we track loss/accuracy per epoch in a single tight loop. Check out the FULL CODES here.

def demo_advanced_techniques():
    batch_size = 16
    num_classes = 10
    sample_data = torch.randn(batch_size, 3, 224, 224)
    sample_labels = torch.randint(0, num_classes, (batch_size,))
    transform_pipeline = AdvancedAugmentationPipeline(training=True)
    model = ModernCNN(num_classes=num_classes)
    trainer = AdvancedTrainer(model)
    print("Advanced Deep Learning Tutorial Demo")
    print("=" * 50)
    print("\n1. Advanced Augmentation Pipeline:")
    augmented = transform_pipeline(Image.fromarray((sample_data[0].permute(1, 2, 0).numpy() * 255).astype(np.uint8)))
    print(f" Original shape: {sample_data[0].shape}")
    print(f" Augmented shape: {augmented.shape}")
    print(f" Applied transforms: Resize, Crop, Flip, ColorJitter, Blur, Perspective, etc.")
    print("\n2. MixUp/CutMix Augmentation:")
    mixup_cutmix = AdvancedMixupCutmix()
    mixed_data, target_a, target_b, lam = mixup_cutmix(sample_data, sample_labels)
    print(f" Mixed batch shape: {mixed_data.shape}")
    print(f" Lambda value: {lam:.3f}")
    print(f" Technique: {'MixUp' if lam > 0.7 else 'CutMix'}")
    print("\n3. Modern CNN Architecture:")
    model.eval()
    with torch.no_grad():
        output = model(sample_data)
    print(f" Input shape: {sample_data.shape}")
    print(f" Output shape: {output.shape}")
    print(f" Features: Residual blocks, Attention, Global Average Pooling")
    print(f" Parameters: {sum(p.numel() for p in model.parameters()):,}")
    print("\n4. Advanced Training Simulation:")
    dummy_loader = [(sample_data, sample_labels)]
    loss, acc = trainer.train_epoch(dummy_loader)
    print(f" Training loss: {loss:.4f}")
    print(f" Training accuracy: {acc:.2f}%")
    print(f" Learning rate: {trainer.scheduler.get_last_lr()[0]:.6f}")
    print("\n Tutorial completed successfully!")
    print("This code demonstrates state-of-the-art techniques in deep learning:")
    print("• Advanced data augmentation with TorchVision v2")
    print("• MixUp and CutMix for better generalization")
    print("• Modern CNN architecture with attention")
    print("• Advanced training loop with OneCycleLR")
    print("• Gradient clipping and weight decay")


if __name__ == "__main__":
    demo_advanced_techniques()

We run a compact end-to-end demo where we visualize our augmentation pipeline, apply MixUp/CutMix, and double-check the ModernCNN with a forward pass. We then simulate one training epoch on dummy data to verify loss, accuracy, and learning-rate scheduling, so we confirm the full stack works before scaling to a real dataset.

In conclusion, we have successfully developed and tested a comprehensive workflow that integrates advanced augmentations, innovative CNN design, and modern training strategies. By experimenting with TorchVision v2, MixUp, CutMix, attention mechanisms, and OneCycleLR, we not only strengthen model performance but also deepen our understanding of cutting-edge techniques.



Alibaba’s Qwen3-Max: Production-Ready Thinking Mode, 1T+ Parameters, and Day-One Coding/Agentic Bench Signals

Alibaba has released Qwen3-Max, a trillion-parameter Mixture-of-Experts (MoE) model positioned as its most capable foundation model to date, with an immediate public on-ramp via Qwen Chat and Alibaba Cloud’s Model Studio API. The launch moves Qwen’s 2025 cadence from preview to production and centers on two variants: Qwen3-Max-Instruct for standard reasoning/coding tasks and Qwen3-Max-Thinking for tool-augmented “agentic” workflows.

What’s new at the model level?

Scale & architecture: Qwen3-Max crosses the 1-trillion-parameter mark with an MoE design (sparse activation per token). Alibaba positions the model as its largest and most capable to date; public briefings and coverage consistently describe it as a 1T-parameter class system rather than another mid-scale refresh.

Training/runtime posture: Qwen3-Max uses a sparse Mixture-of-Experts design and was pretrained on ~36T tokens (~2× Qwen2.5). The corpus skews toward multilingual, coding, and STEM/reasoning data. Post-training follows Qwen3’s four-stage recipe: long CoT cold-start → reasoning-focused RL → thinking/non-thinking fusion → general-domain RL. Alibaba confirms >1T parameters for Max; treat token counts/routing as team-reported until a formal Max tech report is published.

Access: Qwen Chat showcases the general-purpose UX, while Model Studio exposes inference and “thinking mode” toggles (notably, incremental_output=true is required for Qwen3 thinking models). Model listings and pricing sit under Model Studio with regioned availability.

Benchmarks: coding, agentic control, math

Coding (SWE-Bench Verified). Qwen3-Max-Instruct is reported at 69.6 on SWE-Bench Verified. That places it above some non-thinking baselines (e.g., DeepSeek V3.1 non-thinking) and slightly below Claude Opus 4 non-thinking in at least one roundup. Treat these as point-in-time numbers; SWE-Bench evaluations move quickly with harness updates.

Agentic tool use (Tau2-Bench). Qwen3-Max posts 74.8 on Tau2-Bench—an agent/tool-calling evaluation—beating named peers in the same report. Tau2 is designed to test decision-making and tool routing, not just text accuracy, so gains here are meaningful for workflow automation.

Math & advanced reasoning (AIME25, etc.). The Qwen3-Max-Thinking track (with tool use and a “heavy” runtime configuration) is described as near-perfect on key math benchmarks (e.g., AIME25) in multiple secondary sources and earlier preview coverage. Until an official technical report drops, treat “100%” claims as vendor-reported or community-replicated, not peer-reviewed.

https://qwen.ai/

Why two tracks—Instruct vs. Thinking?

Instruct targets conventional chat/coding/reasoning with tight latency, while Thinking enables longer deliberation traces and explicit tool calls (retrieval, code execution, browsing, evaluators), aimed at higher-reliability “agent” use cases. Critically, Alibaba’s API docs formalize the runtime switch: Qwen3 thinking models only operate with streaming incremental output enabled; commercial defaults are false, so callers must explicitly set it. This is a small but consequential contract detail if you’re instrumenting tools or chain-of-thought-like rollouts.
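
For illustration, here is a hedged sketch of what setting that switch could look like through Model Studio's OpenAI-compatible endpoint; the base URL, model id, and the extra_body flags (incremental_output, enable_thinking) are assumptions inferred from the description above and must be verified against the current Model Studio documentation:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # assumption: Model Studio API key
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

# Assumption: thinking models require streaming with incremental output enabled.
stream = client.chat.completions.create(
    model="qwen3-max",  # illustrative model id
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    stream=True,
    extra_body={"incremental_output": True, "enable_thinking": True},  # assumed flags
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)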

How to reason about the gains (signal vs. noise)?

Coding: A 60–70 SWE-Bench Verified score range typically reflects non-trivial repository-level reasoning and patch synthesis under evaluation harness constraints (e.g., environment setup, flaky tests). If your workloads hinge on repo-scale code changes, these deltas matter more than single-file coding toys.

Agentic: Tau2-Bench emphasizes multi-tool planning and action selection. Improvements here usually translate into fewer brittle hand-crafted policies in production agents, provided your tool APIs and execution sandboxes are robust.

Math/verification: “Near-perfect” math numbers from heavy/thinky modes underscore the value of extended deliberation plus tools (calculators, validators). Portability of those gains to open-ended tasks depends on your evaluator design and guardrails.

Summary

Qwen3-Max is not a teaser—it’s a deployable 1T-parameter MoE with documented thinking-mode semantics and reproducible access paths (Qwen Chat, Model Studio). Treat day-one benchmark wins as directionally strong but continue local evals; the hard, verifiable facts are scale (≈36T tokens, >1T params) and the API contract for tool-augmented runs (incremental_output=true). For teams building coding and agentic systems, this is ready for hands-on trials and internal gating against SWE-/Tau2-style suites.

Check out the Technical details, API and Qwen Chat.


Coding Implementation to End-to-End Transformer Model Optimization with Hugging Face Optimum, ONNX Runtime, and Quantization

In this tutorial, we walk through how we use Hugging Face Optimum to optimize Transformer models and make them faster while maintaining accuracy. We begin by setting up DistilBERT on the SST-2 dataset, and then we compare different execution engines, including plain PyTorch and torch.compile, ONNX Runtime, and quantized ONNX. By doing this step by step, we get hands-on experience with model export, optimization, quantization, and benchmarking, all inside a Google Colab environment. Check out the FULL CODES here.

!pip -q install "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate

from pathlib import Path
import os, time, numpy as np, torch
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import QuantizationConfig

os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
ORT_DIR = Path("onnx-distilbert")
Q_DIR = Path("onnx-distilbert-quant")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH = 16
MAXLEN = 128
N_WARM = 3
N_ITERS = 8

print(f"Device: {DEVICE} | torch={torch.__version__}")
We begin by installing the required libraries and setting up our environment for Hugging Face Optimum with ONNX Runtime. We configure paths, batch size, and iteration settings, and we confirm whether we run on CPU or GPU. Check out the FULL CODES here.

ds = load_dataset("glue", "sst2", split="validation[:20%]")
texts, labels = ds["sentence"], ds["label"]
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def make_batches(texts, max_len=MAXLEN, batch=BATCH):
    for i in range(0, len(texts), batch):
        yield tokenizer(texts[i:i+batch], padding=True, truncation=True,
                        max_length=max_len, return_tensors="pt")

def run_eval(predict_fn, texts, labels):
    preds = []
    for toks in make_batches(texts):
        preds.extend(predict_fn(toks))
    return metric.compute(predictions=preds, references=labels)["accuracy"]

def bench(predict_fn, texts, n_warm=N_WARM, n_iters=N_ITERS):
    for _ in range(n_warm):
        for toks in make_batches(texts[:BATCH*2]):
            predict_fn(toks)
    times = []
    for _ in range(n_iters):
        t0 = time.time()
        for toks in make_batches(texts):
            predict_fn(toks)
        times.append((time.time() - t0) * 1000)
    return float(np.mean(times)), float(np.std(times))

We load an SST-2 validation slice and prepare tokenization, an accuracy metric, and batching. We define run_eval to compute accuracy from any predictor and bench to warm up and time end-to-end inference. With these helpers, we fairly compare different engines using identical data and batching. Check out the FULL CODES here.

torch_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE).eval()

@torch.no_grad()
def pt_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = torch_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()

pt_ms, pt_sd = bench(pt_predict, texts)
pt_acc = run_eval(pt_predict, texts, labels)
print(f"[PyTorch eager] {pt_ms:.1f}±{pt_sd:.1f} ms | acc={pt_acc:.4f}")

compiled_model = torch_model
compile_ok = False
try:
    compiled_model = torch.compile(torch_model, mode="reduce-overhead", fullgraph=False)
    compile_ok = True
except Exception as e:
    print("torch.compile unavailable or failed -> skipping:", repr(e))

@torch.no_grad()
def ptc_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = compiled_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()

if compile_ok:
    ptc_ms, ptc_sd = bench(ptc_predict, texts)
    ptc_acc = run_eval(ptc_predict, texts, labels)
    print(f"[torch.compile] {ptc_ms:.1f}±{ptc_sd:.1f} ms | acc={ptc_acc:.4f}")

We load the baseline PyTorch classifier, define a pt_predict helper, and benchmark/score it on SST-2. We then attempt torch.compile for just-in-time graph optimizations and, if successful, run the same benchmarks to compare speed and accuracy under an identical setup. Check out the FULL CODES here.

provider = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"
ort_model = ORTModelForSequenceClassification.from_pretrained(
    MODEL_ID, export=True, provider=provider, cache_dir=ORT_DIR
)

@torch.no_grad()
def ort_predict(toks):
    logits = ort_model(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()

ort_ms, ort_sd = bench(ort_predict, texts)
ort_acc = run_eval(ort_predict, texts, labels)
print(f"[ONNX Runtime] {ort_ms:.1f}±{ort_sd:.1f} ms | acc={ort_acc:.4f}")

Q_DIR.mkdir(parents=True, exist_ok=True)
quantizer = ORTQuantizer.from_pretrained(ORT_DIR)
qconfig = QuantizationConfig(approach="dynamic", per_channel=False, reduce_range=True)
quantizer.quantize(model_input=ORT_DIR, quantization_config=qconfig, save_dir=Q_DIR)

ort_quant = ORTModelForSequenceClassification.from_pretrained(Q_DIR, provider=provider)

@torch.no_grad()
def ortq_predict(toks):
    logits = ort_quant(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()

oq_ms, oq_sd = bench(ortq_predict, texts)
oq_acc = run_eval(ortq_predict, texts, labels)
print(f"[ORT Quantized] {oq_ms:.1f}±{oq_sd:.1f} ms | acc={oq_acc:.4f}")

We export the model to ONNX, run it with ONNX Runtime, then apply dynamic quantization with Optimum’s ORTQuantizer and benchmark both to see how latency improves while accuracy stays comparable. Check out the FULL CODES here.

pt_pipe = pipeline("sentiment-analysis", model=torch_model, tokenizer=tokenizer,
                   device=0 if DEVICE == "cuda" else -1)
ort_pipe = pipeline("sentiment-analysis", model=ort_model, tokenizer=tokenizer, device=-1)
samples = [
    "What a fantastic movie—performed brilliantly!",
    "This was a complete waste of time.",
    "I'm not sure how I feel about this one.",
]
print("\nSample predictions (PT | ORT):")
for s in samples:
    a = pt_pipe(s)[0]["label"]
    b = ort_pipe(s)[0]["label"]
    print(f"- {s}\n  PT={a} | ORT={b}")

import pandas as pd
rows = [["PyTorch eager", pt_ms, pt_sd, pt_acc],
        ["ONNX Runtime", ort_ms, ort_sd, ort_acc],
        ["ORT Quantized", oq_ms, oq_sd, oq_acc]]
if compile_ok:
    rows.insert(1, ["torch.compile", ptc_ms, ptc_sd, ptc_acc])
df = pd.DataFrame(rows, columns=["Engine", "Mean ms (↓)", "Std ms", "Accuracy"])
display(df)

print("""
Notes:
- BetterTransformer is deprecated on transformers>=4.49, hence omitted.
- For larger gains on GPU, also try FlashAttention2 models or FP8 with TensorRT-LLM.
- For CPU, tune threads: set OMP_NUM_THREADS/MKL_NUM_THREADS; try NUMA pinning.
- For static (calibrated) quantization, use QuantizationConfig(approach='static') with a calibration set.
""")

We sanity-check predictions with quick sentiment pipelines and print PyTorch vs ONNX labels side by side. We then assemble a summary table to compare latency and accuracy across engines, inserting torch.compile results when available. We conclude with practical notes, allowing us to extend the workflow to other backends and quantization modes.

In conclusion, we can clearly see how Optimum helps us bridge the gap between standard PyTorch models and production-ready, optimized deployments. We achieve speedups with ONNX Runtime and quantization while retaining accuracy, and we also explore how torch.compile provides gains directly within PyTorch. This workflow demonstrates a practical approach to balancing performance and efficiency for Transformer models, providing a foundation that can be further extended with advanced backends, such as OpenVINO or TensorRT.




Google AI Introduces the Public Preview of Chrome DevTools MCP: Making Your Coding Agent Control and Inspect a Live Chrome Browser

Google has released a public preview of “Chrome DevTools MCP,” a Model Context Protocol (MCP) server that lets AI coding agents control and inspect a real Chrome instance—recording performance traces, inspecting the DOM and CSS, executing JavaScript, reading console output, and automating user flows. The launch directly targets a well-known limitation in code-generating agents: they usually cannot observe the runtime behavior of the pages they create or modify. By wiring agents into Chrome’s DevTools via MCP, Google is turning static suggestion engines into loop-closed debuggers that run measurements in the browser before proposing fixes.

What exactly is Chrome DevTools MCP?

MCP is an open protocol for connecting LLMs to tools and data. Google’s DevTools MCP acts as a specialized server that exposes Chrome’s debugging surface to MCP-compatible clients. Google’s developer blog positions this as “bringing the power of Chrome DevTools to AI coding assistants,” with concrete workflows like initiating a performance trace (e.g., performance_start_trace) against a target URL, then having the agent analyze the resulting trace to suggest optimizations (for example, diagnosing high Largest Contentful Paint).

Capabilities and tool surface

The official GitHub repository documents a broad tool set. Beyond performance tracing (performance_start_trace, performance_stop_trace, performance_analyze_insight), agents can run navigation primitives (navigate_page, new_page, wait_for), simulate user input (click, fill, drag, hover), and interrogate runtime state (list_console_messages, evaluate_script, list_network_requests, get_network_request). Screenshot and snapshot utilities provide visual and DOM-state capture to support diffs and regressions. The server uses Puppeteer under the hood for reliable automation and waiting semantics, and it speaks to Chrome via the Chrome DevTools Protocol (CDP).
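
To ground that tool surface, here is a hedged sketch of driving the server from a Python MCP client over stdio. The client calls follow the standard MCP Python SDK quickstart; the tool names come from the list above, but the exact argument schemas (e.g., the "url" key) are assumptions to verify against the repository's documentation.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server = StdioServerParameters(command="npx", args=["chrome-devtools-mcp@latest"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # inspect the exposed tool surface
            # Assumed argument schema, for illustration only:
            await session.call_tool("navigate_page", {"url": "https://example.com"})
            await session.call_tool("performance_start_trace", {})

asyncio.run(main())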

Installation

Setup is intentionally minimal for MCP clients. Google recommends adding a single config stanza that shells out to npx, always tracking the latest server build:

{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["chrome-devtools-mcp@latest"]
    }
  }
}

This server integrates with multiple agent front ends: Gemini CLI, Claude Code, Cursor, and GitHub Copilot’s MCP support. For VS Code/Copilot, the repo documents a code --add-mcp one-liner; for Claude Code, a claude mcp add command mirrors the same npx target. The package targets Node.js ≥22 and current Chrome.

Example agent workflows

Google’s announcement highlights pragmatic prompts that demonstrate end-to-end loops: verify a proposed fix in a live browser; analyze network failures (e.g., CORS or blocked image requests); simulate user behaviors like form submission to reproduce bugs; inspect layout issues by reading DOM/CSS in context; and run automated performance audits to reduce LCP and other Core Web Vitals. These are all operations agents can now validate with actual measurements rather than heuristics.

https://developer.chrome.com/blog/chrome-devtools-mcp?hl=en

Summary

Chrome DevTools MCP’s public preview is a practical inflection point for agentic frontend tooling: it grounds AI assistants in real browser telemetry—performance traces, DOM/CSS state, network and console data—so recommendations are driven by measurements rather than guesswork. The first-party server, shipped by the Chrome DevTools team, is installable via npx and targets MCP-capable clients, with Chrome/CDP under the hood. Expect shorter diagnose-fix loops for regressions and flaky UI flows, plus tighter validation of performance work.

Check out the Technical details and GitHub Page.



Meet VoXtream: An Open-Sourced Full-Stream Zero-Shot TTS Model for Real-Time Use that Begins Speaking from the First Word

Real-time agents, live dubbing, and simultaneous translation die by a thousand milliseconds. Most “streaming” TTS (Text to Speech) stacks still wait for a chunk of text before they emit sound, so the human hears a beat of silence before the voice starts. VoXtream—released by KTH’s Speech, Music and Hearing group—attacks this head-on: it begins speaking after the first word, outputs audio in 80 ms frames, and reports 102 ms first-packet latency (FPL) on a modern GPU (with PyTorch compile).

What exactly is “full-stream” TTS and how is it different from “output streaming”?

Output-streaming systems decode speech in chunks but still require the entire input text upfront; the clock starts late. Full-stream systems consume text as it arrives (word-by-word from an LLM) and emit audio in lockstep. VoXtream implements the latter: it ingests a word stream and generates audio frames continuously, eliminating input-side buffering while maintaining low per-frame compute. The architecture explicitly targets first-word onset rather than only steady-state throughput.

https://arxiv.org/pdf/2509.15969

How does VoXtream start speaking without waiting for future words?

The core trick is a dynamic phoneme look-ahead inside an incremental Phoneme Transformer (PT). PT may peek up to 10 phonemes to stabilize prosody, but it does not wait for that context; generation can start immediately after the first word enters the buffer. This avoids fixed look-ahead windows that add onset delay.

What’s the model stack under the hood?

VoXtream is a single, fully-autoregressive (AR) pipeline with three transformers:

Phoneme Transformer (PT): decoder-only, incremental; dynamic look-ahead ≤ 10 phonemes; phonemization via g2pE at the word level.

Temporal Transformer (TT): AR predictor over Mimi codec semantic tokens plus a duration token that encodes a monotonic phoneme-to-audio alignment (“stay/go” and {1, 2} phonemes per frame). Mimi runs at 12.5 Hz (→ 80 ms frames).

Depth Transformer (DT): AR generator for the remaining Mimi acoustic codebooks, conditioned on TT outputs and a ReDimNet speaker embedding for zero-shot voice prompting. The Mimi decoder reconstructs the waveform frame-by-frame, enabling continuous emission.

Mimi’s streaming codec design and dual-stream tokenization are well documented; VoXtream uses its first codebook as “semantic” context and the rest for high-fidelity reconstruction.

Is it actually fast in practice—or just “fast on paper”?

The repository includes a benchmark script that measures both FPL and real-time factor (RTF). On an A100, the research team reports 171 ms / 1.00 RTF without compile and 102 ms / 0.17 RTF with compile; on an RTX 3090, 205 ms / 1.19 RTF uncompiled and 123 ms / 0.19 RTF compiled.
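
A minimal sketch of how FPL and RTF are typically measured for a streaming TTS; synthesize_stream is a stand-in for the model's frame generator, not VoXtream's actual API:

import time

def measure_streaming_tts(synthesize_stream, text, sample_rate=24000):
    t0 = time.perf_counter()
    first_packet_latency = None
    audio_samples = 0
    for frame in synthesize_stream(text):        # each frame covers ~80 ms of audio
        if first_packet_latency is None:
            first_packet_latency = time.perf_counter() - t0
        audio_samples += len(frame)
    wall_time = time.perf_counter() - t0
    rtf = wall_time / (audio_samples / sample_rate)  # < 1.0 means faster than real time
    return first_packet_latency, rtf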

How does it compare to today’s popular streaming baselines?

The research team evaluates short-form output streaming and full-stream scenarios. On LibriSpeech-long full-stream (where text arrives word-by-word), VoXtream shows lower WER (3.24 %) than CosyVoice2 (6.11 %) and a significant naturalness preference for VoXtream in listener studies (p ≤ 5e-10), while CosyVoice2 scores higher on speaker-similarity—consistent with its flow-matching decoder. In runtime, VoXtream has the lowest FPL among the compared public streaming systems, and with compile it operates >5× faster than real time (RTF ≈ 0.17).

https://arxiv.org/pdf/2509.15969

Why does this AR design beat diffusion/flow stacks on onset?

Diffusion/flow vocoders typically generate audio in chunks, so even if the text-audio interleaving is clever, the vocoder imposes a floor on first-packet latency. VoXtream keeps every stage AR and frame-synchronous—PT→TT→DT→Mimi decoder—so the first 80 ms packet emerges after one pass through the stack rather than a multi-step sampler. The introduction surveys prior interleaved and chunked approaches and explains how NAR flow-matching decoders used in IST-LM and CosyVoice2 impede low FPL despite strong offline quality.

Did they get here with huge data—or something smaller and cleaner?

VoXtream trains on a ~9k-hour mid-scale corpus: roughly 4.5k h Emilia and 4.5k h HiFiTTS-2 (22 kHz subset). The team diarized to remove multi-speaker clips, filtered transcripts using ASR, and applied NISQA to drop low-quality audio. Everything is resampled to 24 kHz, and the dataset card spells out the preprocessing pipeline and alignment artifacts (Mimi tokens, MFA alignments, duration labels, and speaker templates).

Are the headline quality metrics holding up outside cherry-picked clips?

Table 1 (zero-shot TTS) shows VoXtream is competitive on WER, UTMOS (MOS predictor), and speaker similarity across SEED-TTS test-en and LibriSpeech test-clean; the research team also runs an ablation: adding the CSM Depth Transformer and speaker encoder notably improves similarity without a significant WER penalty relative to a stripped baseline. The subjective study uses a MUSHRA-like protocol and a second-stage preference test tailored to full-stream generation.


Where does this land in the TTS landscape?

As per the research paper, it positions VoXtream among recent interleaved AR + NAR vocoder approaches and LM-codec stacks. The core contribution isn’t a new codec or a giant model—it’s a latency-focused AR arrangement plus a duration-token alignment that preserves input-side streaming. If you build live agents, the important trade-off is explicit: a small drop in speaker similarity vs. order-of-magnitude lower FPL than chunked NAR vocoders in full-stream conditions.

Check out the paper, the model on Hugging Face, the GitHub page, and the project page.



Running deep research AI agents on Amazon Bedrock AgentCore

AI agents are evolving beyond basic single-task helpers into more powerful systems that can plan, critique, and collaborate with other agents to solve complex problems. Deep Agents—a recently introduced framework built on LangGraph—bring these capabilities to life, enabling multi-agent workflows that mirror real-world team dynamics. The challenge, however, is not just building such agents but also running them reliably and securely in production. This is where Amazon Bedrock AgentCore Runtime comes in. By providing a secure, serverless environment purpose-built for AI agents and tools, Runtime makes it possible to deploy Deep Agents at enterprise scale without the heavy lifting of managing infrastructure.
In this post, we demonstrate how to deploy Deep Agents on AgentCore Runtime. As shown in the following figure, AgentCore Runtime scales any agent and provides session isolation by allocating a new microVM for each new session.

What is Amazon Bedrock AgentCore?
Amazon Bedrock AgentCore is both framework-agnostic and model-agnostic, giving you the flexibility to deploy and operate advanced AI agents securely and at scale. Whether you’re building with Strands Agents, CrewAI, LangGraph, LlamaIndex, or another framework—and running them on a large language model (LLM)—AgentCore provides the infrastructure to support them. Its modular services are purpose-built for dynamic agent workloads, with tools to extend agent capabilities and controls required for production use. By alleviating the undifferentiated heavy lifting of building and managing specialized agent infrastructure, AgentCore lets you bring your preferred framework and model and deploy without rewriting code.
Amazon Bedrock AgentCore offers a comprehensive suite of capabilities designed to transform local agent prototypes into production-ready systems. These include persistent memory for maintaining context in and across conversations, access to existing APIs using Model Context Protocol (MCP), seamless integration with corporate authentication systems, specialized tools for web browsing and code execution, and deep observability into agent reasoning processes. In this post, we focus specifically on the AgentCore Runtime component.
Core capabilities of AgentCore Runtime
AgentCore Runtime provides a serverless, secure hosting environment specifically designed for agentic workloads. It packages code into a lightweight container with a simple, consistent interface, making it equally well-suited for running agents, tools, MCP servers, or other workloads that benefit from seamless scaling and integrated identity management. AgentCore Runtime offers extended execution times of up to 8 hours for complex reasoning tasks, handles large payloads for multimodal content, and implements consumption-based pricing that charges only during active processing—not while waiting for LLM or tool responses. Each user session runs in complete isolation within dedicated micro virtual machines (microVMs), maintaining security and helping to prevent cross-session contamination between agent interactions. The runtime works with many frameworks (for example: LangGraph, CrewAI, Strands, and so on) and many foundation model providers, while providing built-in corporate authentication, specialized agent observability, and unified access to the broader AgentCore environment through a single SDK.
Real-world example: Deep Agents integration
In this post we’re going to deploy the recently released Deep Agents implementation example on AgentCore Runtime—showing just how little effort it takes to get the latest agent innovations up and running.

The sample implementation in the preceding diagram includes:

A research agent that conducts deep internet searches using the Tavily API
A critique agent that reviews and provides feedback on generated reports
A main orchestrator that manages the workflow and handles file operations

Deep Agents uses LangGraph’s state management to create a multi-agent system with:

Built-in task planning through a write_todos tool that helps agents break down complex requests
Virtual file system where agents can read/write files to maintain context across interactions
Sub-agent architecture allowing specialized agents to be invoked for specific tasks while maintaining context isolation
Recursive reasoning with high recursion limits (more than 1,000) to handle complex, multi-step workflows

This architecture enables Deep Agents to handle research tasks that require multiple rounds of information gathering, synthesis, and refinement. The key integration points in our code showcase how agents work with AgentCore. The beauty is in its simplicity—we only need to add a couple of lines of code to make an agent AgentCore-compatible:

# 1. Import the AgentCore runtime
from bedrock_agentcore.runtime import BedrockAgentCoreApp
app = BedrockAgentCoreApp()

# 2. Decorate your agent function with @app.entrypoint
@app.entrypoint
async def langgraph_bedrock(payload):
    # Your existing agent logic remains unchanged
    user_input = payload.get("prompt")

    # Call your agent as before
    stream = agent.astream(
        {"messages": [HumanMessage(content=user_input)]},
        stream_mode="values"
    )

    # Stream responses back
    async for chunk in stream:
        yield chunk

# 3. Add the runtime starter at the bottom
if __name__ == "__main__":
    app.run()

That’s it! The rest of the code—model initialization, API integrations, and agent logic—remains exactly as it was. AgentCore handles the infrastructure while your agent handles the intelligence. This integration pattern works for most Python agent frameworks, making AgentCore truly framework-agnostic.
Deploying to AgentCore Runtime: Step-by-step
Let’s walk through the actual deployment process using the AgentCore Starter ToolKit, which dramatically simplifies the deployment workflow.
Prerequisites
Before you begin, make sure you have:

Python 3.10 or higher
AWS credentials configured
Amazon Bedrock AgentCore SDK installed

Step 1: IAM permissions
There are two different AWS Identity and Access Management (IAM) permissions you need to consider when deploying an agent in an AgentCore Runtime—the role you, as a developer use to create AgentCore resources and the execution role that an agent needs to run in an AgentCore Runtime. While the latter role can now be auto-created by the AgentCore Starter Toolkit (auto_create_execution_role=True), the former must be defined as described in IAM Permissions for AgentCore Runtime.
Step 2: Add a wrapper to your agent
As shown in the preceding Deep Agents example, add the AgentCore imports and decorator to your existing agent code.
Step 3: Deploy using the AgentCore starter toolkit
The starter toolkit provides a three-step deployment process:

from bedrock_agentcore_starter_toolkit import Runtime

# Step 1: Configure
agentcore_runtime = Runtime()
config_response = agentcore_runtime.configure(
    entrypoint="hello.py",        # contains the code we showed earlier in the post
    execution_role=role_arn,      # or auto-create
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region="us-west-2",
    agent_name="deepagents-research"
)

# Step 2: Launch
launch_result = agentcore_runtime.launch()
print(f"Agent deployed! ARN: {launch_result['agent_arn']}")

# Step 3: Invoke
response = agentcore_runtime.invoke({
    "prompt": "Research the latest developments in quantum computing"
})

Step 4: What happens behind the scenes
When you run the deployment, the starter kit automatically:

Generates an optimized Docker file with Python 3.13-slim base image and OpenTelemetry instrumentation
Builds your container with the dependencies from requirements.txt
Creates an Amazon Elastic Container Registry (Amazon ECR) repository (if auto_create_ecr=True) and pushes your image
Deploys to AgentCore Runtime and monitors the deployment status
Configures networking and observability with Amazon CloudWatch and AWS X-Ray integration

The entire process typically takes 2–3 minutes, after which your agent is ready to handle requests at scale. Each new session is launched in its own fresh AgentCore Runtime microVM, maintaining complete environment isolation.
The starter kit generates a configuration file (.bedrock_agentcore.yaml) that captures your deployment settings, making it straightforward to redeploy or update your agent later.
Invoking your deployed agent
After deployment, you have two options for invoking your agent:
Option 1: Using the starter kit (shown in Step 3)

response = agentcore_runtime.invoke({
    "prompt": "Research the latest developments in quantum computing"
})

Option 2: Using boto3 SDK directly

import boto3
import json

agentcore_client = boto3.client('bedrock-agentcore', region_name='us-west-2')
response = agentcore_client.invoke_agent_runtime(
    agentRuntimeArn=agent_arn,
    qualifier="DEFAULT",
    payload=json.dumps({
        "prompt": "Analyze the impact of AI on healthcare in 2024"
    })
)

# Handle streaming response
for event in response['completion']:
    if 'chunk' in event:
        print(event['chunk']['bytes'].decode('utf-8'))

Deep Agents in action
As the code executes in Bedrock AgentCore Runtime, the primary agent orchestrates specialized sub-agents—each with its own purpose, prompt, and tool access—to solve complex tasks more effectively. In this case, the orchestrator prompt (research_instructions) sets the plan:

Write the question to question.txt
Fan out to one or more research-agent calls (each on a single sub-topic) using the internet_search tool
Synthesize findings into final_report.md
Call critique-agent to evaluate gaps and structure
Optionally loop back to more research/edits until quality is met

Here it is in action:

Clean up
When finished, don’t forget to de-allocate the provisioned AgentCore Runtime along with the container repository that was created during the process:

agentcore_control_client = boto3.client(
    'bedrock-agentcore-control', region_name=region)
ecr_client = boto3.client('ecr', region_name=region)

runtime_delete_response = agentcore_control_client.delete_agent_runtime(
    agentRuntimeId=launch_result.agent_id,
)
response = ecr_client.delete_repository(
    repositoryName=launch_result.ecr_uri.split('/')[1], force=True)

Conclusion
Amazon Bedrock AgentCore represents a paradigm shift in how we deploy AI agents. By abstracting away infrastructure complexity while maintaining framework and model flexibility, AgentCore enables developers to focus on building sophisticated agent logic rather than managing deployment pipelines. Our Deep Agents deployment demonstrates that even complex, multi-agent systems with external API integrations can be deployed with minimal code changes. The combination of enterprise-grade security, built-in observability, and serverless scaling makes AgentCore the best choice for production AI agent deployments. Specifically for deep research agents, AgentCore offers the following unique capabilities that you can explore:

AgentCore Runtime can handle asynchronous processing and long running (up to 8 hours) agents. Asynchronous tasks allow your agent to continue processing after responding to the client and handle long-running operations without blocking responses. Your background research sub-agent could be asynchronously researching for hours.
AgentCore Runtime works with AgentCore Memory, enabling capabilities such as building upon previous findings, remembering research preferences, and maintaining complex investigation context without losing progress between sessions.
You can use AgentCore Gateway to extend your deep research to include proprietary insights from enterprise services and data sources. By exposing these differentiated resources as MCP tools, your agents can quickly take advantage and combine that with publicly available knowledge.

Ready to deploy your agents to production? Here’s how to get started:

Install the AgentCore starter kit: pip install bedrock-agentcore-starter-toolkit
Experiment: Deploy your code by following this step by step guide.

The era of production-ready AI agents is here. With AgentCore, the journey from prototype to production has never been shorter.

About the authors
Vadim Omeltchenko is a Sr. AI/ML Solutions Architect who is passionate about helping AWS customers innovate in the cloud. His prior IT experience was predominantly on the ground.
Eashan Kaushik is a Specialist Solutions Architect AI/ML at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.
Shreyas Subramanian is a Principal data scientist and helps customers by using Machine Learning to solve their business challenges using the AWS platform. Shreyas has a background in large scale optimization and Machine Learning, and in use of Machine Learning and Reinforcement Learning for accelerating optimization tasks.
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, the flagship generative AI offering from AWS for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS Certifications, including the ML Specialty Certification.

Integrate tokenization with Amazon Bedrock Guardrails for secure data …

This post is co-written by Mark Warner, Principal Solutions Architect for Thales, Cyber Security Products.
As generative AI applications make their way into production environments, they integrate with a wider range of business systems that process sensitive customer data. This integration introduces new challenges around protecting personally identifiable information (PII) while maintaining the ability to recover original data when legitimately needed by downstream applications. Consider a financial services company implementing generative AI across different departments. The customer service team needs an AI assistant that can access customer profiles and provide personalized responses that include contact information, for example: “We’ll send your new card to your address at 123 Main Street.” Meanwhile, the fraud analysis team requires the same customer data but must analyze patterns without exposing actual PII, working only with protected representations of sensitive information.
Amazon Bedrock Guardrails helps detect sensitive information, such as PII, in standard format in input prompts or model responses. Sensitive information filters give organizations control over how sensitive data is handled, with options to block requests containing PII or mask the sensitive information with generic placeholders like {NAME} or {EMAIL}. This capability helps organizations comply with data protection regulations while still using the power of large language models (LLMs).
Although masking effectively protects sensitive information, it creates a new challenge: the loss of data reversibility. When guardrails replace sensitive data with generic masks, the original information becomes inaccessible to downstream applications that might need it for legitimate business processes. This limitation can impact workflows where both security and functional data are required.
Tokenization offers a complementary approach to this challenge. Unlike masking, tokenization replaces sensitive data with format-preserving tokens that are mathematically unrelated to the original information but maintain its structure and usability. These tokens can be securely reversed back to their original values when needed by authorized systems, creating a path for secure data flows throughout an organization’s environment.
In this post, we show you how to integrate Amazon Bedrock Guardrails with third-party tokenization services to protect sensitive data while maintaining data reversibility. By combining these technologies, organizations can implement stronger privacy controls while preserving the functionality of their generative AI applications and related systems. The solution described in this post demonstrates how to combine Amazon Bedrock Guardrails with tokenization services from Thales CipherTrust Data Security Platform to create an architecture that protects sensitive data without sacrificing the ability to process that data securely when needed. This approach is particularly valuable for organizations in highly regulated industries that need to balance innovation with compliance requirements.
Amazon Bedrock Guardrails APIs
This section describes the key components and workflow for the integration between Amazon Bedrock Guardrails and a third-party tokenization service.
Amazon Bedrock Guardrails provides two distinct approaches for implementing content safety controls:

Direct integration with model invocation through APIs like InvokeModel and Converse, where guardrails automatically evaluate inputs and outputs as part of the model inference process.
Standalone evaluation through the ApplyGuardrail API, which decouples guardrails assessment from model invocation, allowing evaluation of text against defined policies.

This post uses the ApplyGuardrail API for tokenization integration because it separates content assessment from model invocation, allowing tokenization processing to be inserted between these steps. This separation creates the necessary space in the workflow to replace guardrail masks with format-preserving tokens before model invocation, or to tokenize the model response before it is handed to downstream applications.
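For contrast, the direct-integration path attaches the guardrail to the model call itself. The following minimal sketch uses the Converse API with a guardrailConfig so that inputs and outputs are evaluated as part of inference; the guardrail ID, version, and model ID are placeholders, and the remainder of this post uses the standalone ApplyGuardrail path instead:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Direct integration: the guardrail is evaluated as part of the model call.
# The guardrail ID/version and model ID below are placeholders.
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize my recent transactions."}]}],
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",
        "guardrailVersion": "1"
    }
)
print(response["output"]["message"]["content"][0]["text"])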
The solution extends the typical ApplyGuardrail API implementation by inserting tokenization processing between guardrail evaluation and model invocation, as follows (a minimal glue-code sketch follows the list):

The application calls the ApplyGuardrail API to assess the user input for sensitive information.
If no sensitive information is detected (action = “NONE”), the application proceeds to model invocation via the InvokeModel API.
If sensitive information is detected (action = “ANONYMIZED”):

The application captures the detected PII and its positions.
It calls a tokenization service to convert these entities into format-preserving tokens.
It replaces the generic guardrail masks with these tokens.
The application then invokes the foundation model with the tokenized content.

For model responses:

The application applies guardrails to check the output from the model for sensitive information.
It tokenizes detected PII before passing the response to downstream systems.
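
To make the sequencing concrete, the following minimal sketch wires the input path together using the helper functions defined later in this post (invoke_guardrail, thales_ciphertrust_tokenizer, and process_guardrail_output). The model ID is a placeholder, the model call is shown with the Converse API for brevity, and error handling is omitted:

import boto3

def handle_user_query(user_query):
    """Illustrative glue code for the input path of the workflow described above."""
    bedrock_runtime = boto3.client("bedrock-runtime")

    # Step 1: assess the input with the guardrail
    guardrail_response = invoke_guardrail(user_query)

    if guardrail_response.get("action") == "NONE":
        # No sensitive information detected; use the input as-is
        prompt = user_query
    else:
        # Steps 2-3: tokenize detected PII and swap the generic masks for tokens
        protected = thales_ciphertrust_tokenizer(guardrail_response)
        prompt = process_guardrail_output(protected, guardrail_response)[0]["text"]

    # Step 4: invoke the foundation model with the tokenized content (placeholder model ID)
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}]
    )
    return response["output"]["message"]["content"][0]["text"]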

Solution overview
To illustrate how this workflow delivers value in practice, consider a financial advisory application that helps customers understand their spending patterns and receive personalized financial recommendations. In this example, three distinct application components work together to provide secure, AI-powered financial insights:

Customer gateway service – This trusted frontend orchestrator receives customer queries that often contain sensitive information. For example, a customer might ask: “Hi, this is j.smith@example.com. Based on my last five transactions on acme.com, and my current balance of $2,342.18, should I consider their new credit card offer?”
Financial analysis engine – This AI-powered component analyzes financial patterns and generates recommendations but doesn’t need access to actual customer PII. It works with anonymized or tokenized information.
Response processing service – This trusted service handles the final customer communication, including detokenizing sensitive information before presenting results to the customer.

The following diagram illustrates the workflow for integrating Amazon Bedrock Guardrails with tokenization services in this financial advisory application. AWS Step Functions orchestrates the sequential process of PII detection, tokenization, AI model invocation, and detokenization across the three key components (customer gateway service, financial analysis engine, and response processing service) using AWS Lambda functions.

The workflow operates as follows:

The customer gateway service (for this example, through Amazon API Gateway) receives the user input containing sensitive information.
It calls the ApplyGuardrail API to identify PII or other sensitive information that should be anonymized or blocked.
For detected sensitive elements (such as user names or merchant names), it calls the tokenization service to generate format-preserving tokens.
The input with tokenized values is passed to the financial analysis engine for processing. (For example, “Hi, this is [[TOKEN_123]]. Based on my last five transactions on [[TOKEN_456]] and my current balance of $2,342.18, should I consider their new credit card offer?”)
The financial analysis engine invokes an LLM on Amazon Bedrock to generate financial advice using the tokenized data.
The model response, potentially containing tokenized values, is sent to the response processing service.
This service calls the tokenization service to detokenize the tokens, restoring the original sensitive values.
The final, detokenized response is delivered to the customer.

This architecture maintains data confidentiality throughout the processing flow while preserving the information’s utility. The financial analysis engine works with structurally valid but cryptographically protected data, allowing it to generate meaningful recommendations without exposing sensitive customer information. Meanwhile, the trusted components at the entry and exit points of the workflow can access the actual data when necessary, creating a secure end-to-end solution.
In the following sections, we provide a detailed walkthrough of implementing the integration between Amazon Bedrock Guardrails and tokenization services.
Prerequisites
To implement the solution described in this post, you must have the following components configured in your environment:

An AWS account with Amazon Bedrock enabled in your target AWS Region.
Appropriate AWS Identity and Access Management (IAM) permissions configured following least privilege principles with specific actions enabled: bedrock:CreateGuardrail, bedrock:ApplyGuardrail, and bedrock-runtime:InvokeModel.
For AWS Organizations, verify Amazon Bedrock access is permitted by service control policies.
A Python 3.7+ environment with the boto3 library installed. For information about installing the boto3 library, refer to AWS SDK for Python (Boto3).
AWS credentials configured for programmatic access using the AWS Command Line Interface (AWS CLI). For more details, refer to Configuring settings for the AWS CLI.
This implementation requires a deployed tokenization service accessible through REST API endpoints. Although this walkthrough demonstrates integration with Thales CipherTrust, the pattern adapts to tokenization providers offering protect and unprotect API operations. Make sure network connectivity exists between your application environment and both AWS APIs and your tokenization service endpoints, along with valid authentication credentials for accessing your chosen tokenization service. For information about setting up Thales CipherTrust specifically, refer to How Thales Enables PCI DSS Compliance with a Tokenization Solution on AWS.

Configure Amazon Bedrock Guardrails
Configure Amazon Bedrock Guardrails for PII detection and masking through the Amazon Bedrock console or programmatically using the AWS SDK. Sensitive information filter policies can anonymize or redact information from model requests or responses:

import boto3

def create_bedrock_guardrail():
    """
    Create a guardrail in Amazon Bedrock for financial applications with PII protection.
    """
    bedrock = boto3.client('bedrock')

    response = bedrock.create_guardrail(
        name="FinancialServiceGuardrail",
        description="Guardrail for financial applications with PII protection",
        sensitiveInformationPolicyConfig={
            'piiEntitiesConfig': [
                {
                    'type': 'URL',
                    'action': 'ANONYMIZE',
                    'inputAction': 'ANONYMIZE',
                    'outputAction': 'ANONYMIZE',
                    'inputEnabled': True,
                    'outputEnabled': True
                },
                {
                    'type': 'EMAIL',
                    'action': 'ANONYMIZE',
                    'inputAction': 'ANONYMIZE',
                    'outputAction': 'ANONYMIZE',
                    'inputEnabled': True,
                    'outputEnabled': True
                },
                {
                    'type': 'NAME',
                    'action': 'ANONYMIZE',
                    'inputAction': 'ANONYMIZE',
                    'outputAction': 'ANONYMIZE',
                    'inputEnabled': True,
                    'outputEnabled': True
                }
            ]
        },
        blockedInputMessaging="I can't provide information with PII data.",
        blockedOutputsMessaging="I can't generate content with PII data."
    )

    return response
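
The response from create_guardrail includes the new guardrail’s identifier and a working draft version. To reference a stable version from the ApplyGuardrail API later, you can also publish a numbered version; the following brief sketch assumes default credentials and Region:

import boto3

bedrock = boto3.client('bedrock')

guardrail = create_bedrock_guardrail()
guardrail_id = guardrail['guardrailId']

# Publish an immutable, numbered version of the guardrail (the draft stays editable)
version_response = bedrock.create_guardrail_version(
    guardrailIdentifier=guardrail_id,
    description='Initial version for the financial services assistant'
)
print(guardrail_id, version_response['version'])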

Integrate the tokenization workflow
This section implements the tokenization workflow by first detecting PII entities with the ApplyGuardrail API, then replacing the generic masks with format-preserving tokens from your tokenization service.
Apply guardrails to detect PII entities
Use the ApplyGuardrail API to validate input text from the user and detect PII entities:

import boto3
from botocore.exceptions import ClientError

def invoke_guardrail(user_query):
    """
    Apply Amazon Bedrock Guardrails to validate input text and detect PII entities.

    Args:
        user_query (str): The user's input text to be checked.

    Returns:
        dict: The response from the ApplyGuardrail API.

    Raises:
        ClientError: If there's an error applying the guardrail.
    """
    try:
        bedrock_runtime = boto3.client('bedrock-runtime')

        response = bedrock_runtime.apply_guardrail(
            guardrailIdentifier='your-guardrail-id',    # Replace with your actual guardrail ID
            guardrailVersion='your-guardrail-version',  # Replace with your actual version
            source="INPUT",
            content=[{"text": {"text": user_query}}]
        )

        return response
    except ClientError as e:
        print(f"Error applying guardrail: {e}")
        raise

Invoke tokenization service
The response from the ApplyGuardrail API includes the list of PII entities matching the sensitive information policy. Parse those entities and invoke the tokenization service to generate the tokens.
The following example code uses the Thales CipherTrust tokenization service:

import json
import requests
from botocore.exceptions import ClientError

def thales_ciphertrust_tokenizer(guardrail_response):
    """
    Process PII entities detected by the guardrail and tokenize them using Thales CipherTrust.

    Args:
        guardrail_response (dict): The response from the ApplyGuardrail API.

    Returns:
        list: List of dictionaries containing original values, types, and tokenized responses.

    Raises:
        ClientError: If there's an error invoking Thales CipherTrust.
    """
    try:
        protected_results = []

        for assessment in guardrail_response.get("assessments", []):
            pii_entities = assessment.get("sensitiveInformationPolicy", {}).get("piiEntities", [])

            for entity in pii_entities:
                sensitive_value = entity.get("match")
                entity_type = entity.get("type")

                if sensitive_value:
                    # Prepare payload for the Thales CipherTrust tokenization service
                    crdp_payload = {
                        "protection_policy_name": "plain-alpha-internal",
                        "DATA_KEY": sensitive_value,
                    }

                    url_str = "http://your-ciphertrust-cname:8090/v1/protect"  # Replace with your actual CipherTrust URL
                    headers = {"Content-Type": "application/json"}

                    # Invoke the Thales CipherTrust tokenization service
                    response = requests.post(url_str, headers=headers, data=json.dumps(crdp_payload))
                    response.raise_for_status()
                    response_json = response.json()

                    protected_results.append({
                        "original_value": sensitive_value,
                        "type": entity_type,
                        "protection_response": response_json
                    })

        return protected_results
    except requests.RequestException as e:
        print(f"Error invoking Thales CipherTrust: {e}")
        # Re-raise as a botocore ClientError with a well-formed error response
        raise ClientError(
            {"Error": {"Code": "TokenizationError",
                       "Message": f"Error invoking Thales CipherTrust: {e}"}},
            "Protect"
        ) from e

Replace guardrail masks with tokens
Next, substitute the generic guardrail masks with the tokens generated by the Thales CipherTrust tokenization service. This enables downstream applications to work with structurally valid data while maintaining security and reversibility.

def process_guardrail_output(protected_results, guardrail_response):
    """
    Process guardrail output by replacing placeholders with protected values.

    Args:
        protected_results (list): List of protected data tokenized by Thales CipherTrust.
        guardrail_response (dict): Guardrail response dictionary.

    Returns:
        list: List of modified output items with placeholders replaced by tokens.

    Raises:
        ValueError: If input parameters are invalid.
        Exception: For any unexpected errors during processing.
    """
    try:
        # Validate input types
        if not isinstance(protected_results, list) or not isinstance(guardrail_response, dict):
            raise ValueError("Invalid input parameters")

        # Extract protection map: PII type -> token value
        protection_map = {res['type'].upper(): res['protection_response']['protected_data']
                          for res in protected_results}

        # Process outputs
        modified_outputs = []
        for output_item in guardrail_response.get('outputs', []):
            if 'text' in output_item:
                modified_text = output_item['text']

                # Replace all placeholders in one pass
                for pii_type, protected_value in protection_map.items():
                    modified_text = modified_text.replace(f"{{{pii_type}}}", protected_value)

                modified_outputs.append({"text": modified_text})

        return modified_outputs
    except (ValueError, KeyError) as e:
        print(f"Error processing guardrail output: {e}")
        raise
    except Exception as e:
        print(f"Unexpected error while processing guardrail output: {e}")
        raise

This process transforms user inputs containing information that matches the sensitive information policy defined in Amazon Bedrock Guardrails into unique, reversible tokenized versions.
The following example input contains PII elements:

“Hi, this is john.smith@example.com. Based on my last five transactions on acme.com, and my current balance of $2,342.18, should I consider their new credit card offer?”

The following is an example of the sanitized user input:

“Hi, this is 1001000GC5gDh1.D8eK71@EjaWV.lhC. Based on my last five transactions on 1001000WcFzawG.Jc9Tfc, and my current balance of $2,342.18, should I consider their new credit card offer?”

Downstream application processing
The sanitized input is ready to be used by generative AI applications, including model invocations on Amazon Bedrock. In response to the tokenized input, an LLM invoked by the financial analysis engine would produce a relevant analysis that maintains the secure token format:

“Based on your recent transactions at 1001000WcFzawG.Jc9Tfc and your current account status, I can confirm that the new credit card offer would provide approximately $33 in monthly rewards based on your spending patterns. With annual benefits of around $394 against the $55 annual fee, this card would be beneficial for your profile, 1001000GC5gDh1.D8eK71@EjaWV.lhC.”

When authorized systems need to recover original values, tokens are detokenized. With Thales CipherTrust, this is accomplished using the Detokenize API, which requires the same parameters as in the previous tokenize action. This completes the secure data flow while preserving the ability to recover original information when needed.
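For illustration, a detokenization call against the same CRDP service could look like the following sketch; the endpoint path (/v1/reveal) and payload field names mirror the protect call shown earlier and are assumptions that may differ in your CipherTrust deployment:

import json
import requests

def thales_ciphertrust_detokenize(token_value):
    # Payload mirrors the protect call; field names are assumptions for illustration
    payload = {
        "protection_policy_name": "plain-alpha-internal",
        "protected_data": token_value
    }
    url_str = "http://your-ciphertrust-cname:8090/v1/reveal"  # Replace with your actual CipherTrust URL
    headers = {"Content-Type": "application/json"}

    response = requests.post(url_str, headers=headers, data=json.dumps(payload))
    response.raise_for_status()
    return response.json()  # contains the original (detokenized) value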
Clean up
As you follow the approach described in this post, you will create new AWS resources in your account. To avoid incurring additional charges, delete these resources when you no longer need them.
To clean up your resources, complete the following steps:

Delete the guardrails you created. For instructions, refer to Delete your guardrail.
If you implemented the tokenization workflow using Lambda, API Gateway, or Step Functions as described in this post, remove the resources you created.
This post assumes a tokenization solution is already available in your account. If you deployed a third-party tokenization solution (such as Thales CipherTrust) to test this implementation, refer to that solution’s documentation for instructions to properly decommission these resources and stop incurring charges.

Conclusion
This post demonstrated how to combine Amazon Bedrock Guardrails with tokenization to enhance handling of sensitive information in generative AI workflows. By integrating these technologies, organizations can protect PII during processing while maintaining data utility and reversibility for authorized downstream applications.
The implementation illustrated uses Thales CipherTrust Data Security Platform for tokenization, but the architecture supports many tokenization solutions. To learn more about a serverless approach to building custom tokenization capabilities, refer to Building a serverless tokenization solution to mask sensitive data.
This solution provides a practical framework for builders to use the full potential of generative AI with appropriate safeguards. By combining the content safety mechanisms of Amazon Bedrock Guardrails with the data reversibility of tokenization, you can implement responsible AI workflows that align with your application requirements and organizational policies while preserving the functionality needed for downstream systems.
To learn more about implementing responsible AI practices on AWS, see Transform responsible AI from theory into practice.

About the Authors
Nizar Kheir is a Senior Solutions Architect at AWS with more than 15 years of experience spanning various industry segments. He currently works with public sector customers in France and across EMEA to help them modernize their IT infrastructure and foster innovation by harnessing the power of the AWS Cloud.
Mark Warner is a Principal Solutions Architect for Thales, Cyber Security Products division. He works with companies in various industries such as finance, healthcare, and insurance to improve their security architectures. His focus is assisting organizations with reducing risk, increasing compliance, and streamlining data security operations to reduce the probability of a breach.

Microsoft Brings MCP to Azure Logic Apps (Standard) in Public Preview, …

Microsoft has released a public preview that enables Azure Logic Apps (Standard) to run as Model Context Protocol (MCP) servers, exposing Logic Apps workflows as agent tools discoverable and callable by MCP-capable clients (e.g., VS Code + Copilot).

What’s actually shipping

Remote MCP server on Logic Apps (Standard): You configure a Standard logic app to host an MCP endpoint (/api/mcp) and surface HTTP Request/Response workflows as tools. Authentication is front-doored by Easy Auth; MCP endpoints default to OAuth 2.0. VS Code (≥1.102) includes GA MCP client support for testing.

API Center registration path (preview): You can also create/register MCP servers in Azure API Center, where selected managed connector actions become tools with cataloging and governance.

https://learn.microsoft.com/en-us/azure/logic-apps/set-up-model-context-protocol-server-standard

Key requirements and transport details

Workflow shape: Tools must be implemented as HTTP Request trigger (“When a HTTP request is received”) plus a Response action.

Auth & access control: By default, MCP uses OAuth 2.0; Easy Auth enforces client/identity/tenant restrictions. During setup, App Service authentication must allow unauthenticated requests (the MCP flow still performs OAuth).

Transports: Streamable HTTP works out of the box. SSE additionally requires VNET integration and the host.json setting Runtime.Backend.EdgeWorkflowRuntimeTriggerListener.AllowCrossWorkerCommunication=true.

Enablement switch: MCP APIs are enabled by adding extensions.workflow.McpServerEndpoints.enable=true in host.json.

API Center path: preview limitations that matter

When creating MCP servers via API Center backed by Logic Apps, the current preview imposes the following limits:

Start with an empty Standard logic app resource.

One connector per MCP server.

Built-in service-provider and custom connectors aren’t supported in this path (managed connectors only).

One action per tool.

These constraints materially affect tool granularity and server layout in larger estates.

Why Standard (single-tenant) is the target

Standard runs on the single-tenant Logic Apps runtime (on Azure Functions), supports multiple workflows per app, and integrates directly with virtual networks and private endpoints—all relevant for exposing private systems safely to agents and for predictable throughput/latency. By contrast, Consumption is multitenant, single-workflow per app, and pay-per-execution.

Tooling semantics and discoverability

Microsoft recommends adding trigger descriptions, parameter schemas/descriptions, and required markers to improve agent tool selection and invocation reliability. These annotations are read by MCP clients and influence calling behavior.

Connectors and enterprise reach

Organizations can front existing workflows and a large catalog of Logic Apps connectors (cloud and on-prem) through MCP, turning them into callable agent tools; Microsoft explicitly cites “more than 1,400 connectors.”

Operations, governance, and testing

Run history plus Application Insights/Log Analytics are available for diagnostics and auditability. VS Code provides quick client validation via MCP: Add Server, including OAuth sign-in and tool enumeration. Registering via API Center brings discovery/governance to MCP servers across teams.

Production notes (preview)

SSE requires both VNET and the cross-worker setting; without these, use streamable HTTP.

Easy Auth must be configured precisely (including the “allow unauthenticated” toggle) or client sign-in flows will fail despite OAuth expectations.

Throttling, idempotency, and schema versioning remain your responsibility when wrapping connectors as tools (not new, but now in the agent path). InfoQ highlights similar operational concerns from early adopters.

Summary

The preview cleanly MCP-enables Logic Apps (Standard): you expose HTTP-based workflows as OAuth-protected tools; you can catalog them in API Center; and you can reach private systems through single-tenant networking. For teams already invested in Logic Apps, this is a low-friction, standards-aligned route to operationalize enterprise agent tooling—just mind the API Center limits, SSE prerequisites, and Easy Auth nuances during rollout.



The post Microsoft Brings MCP to Azure Logic Apps (Standard) in Public Preview, Turning Connectors into Agent Tools appeared first on MarkTechPost.

Perplexity Launches an AI Email Assistant Agent for Gmail and Outlook, …

Perplexity introduced “Email Assistant,” an AI agent that plugs into Gmail and Outlook to draft replies in your voice, auto-label and prioritize messages, and coordinate meetings end-to-end (availability checks, time suggestions, and calendar invites). The feature is restricted to Perplexity’s Max plan and is live today.

What it does

Email Assistant adds an agent to any thread (via cc) that handles the back-and-forth typical of scheduling. It reads availability, proposes times, and issues invites, while also surfacing daily priorities and generating reply drafts aligned to the user’s tone. Launch support covers Gmail and Outlook with one-click setup links.

https://www.perplexity.ai/assistant

How it plugs into calendars and mail

Perplexity has been shipping native connectors for Google and Microsoft stacks; the current changelog notes that Gmail/Gcal/Outlook connections support email search and “create calendar invites directly within Perplexity,” which is what the Email Assistant automates from within a live thread. Practically, users enroll, then send or cc assistant@perplexity.com to delegate scheduling and triage tasks.


Security posture

Perplexity specifies SOC 2 and GDPR compliance and says user data is not used for training. For teams evaluating agents in regulated environments, that implies standard audit controls and data-handling boundaries, but as always, production rollouts should validate data-access scopes and DLP posture in the target tenant.

Competitive context

Email Assistant overlaps with Microsoft Copilot for Outlook and Google Gemini for Gmail (summaries/assists). Perplexity’s differentiator is agentic handling of the entire negotiation loop inside email threads plus cross-account connectors already present in its Comet stack. That makes it a realistic drop-in for users who prefer an external agent rather than suite-native assistants.

Early read for implementers

Integration path: Connect Gmail/Outlook, then cc the agent on threads that need scheduling; use it for triage queries and auto-drafts.

Workflow coverage: Auto-labels for “needs reply” vs. FYI; daily summaries; draft-in-your-style replies; invite creation.

Boundary conditions: Max-only; launch support limited to Gmail/Outlook; verify calendar write permissions and compliance needs per domain.

Summary

Perplexity’s Email Assistant is a concrete agentic workflow for inboxes: cc it, let it negotiate times, send invites, and keep your triage queue lean—currently gated to Max subscribers and Gmail/Outlook environments.



The post Perplexity Launches an AI Email Assistant Agent for Gmail and Outlook, Aimed at Scheduling, Drafting, and Inbox Triage appeared first on MarkTechPost.

Alibaba Qwen Team Just Released FP8 Builds of Qwen3-Next-80B-A3B (Inst …

Alibaba’s Qwen team has just released FP8-quantized checkpoints for its new Qwen3-Next-80B-A3B models in two post-training variants—Instruct and Thinking—aimed at high-throughput inference with ultra-long context and MoE efficiency. The FP8 repos mirror the BF16 releases but package “fine-grained FP8” weights (block size 128) and deployment notes for sglang and vLLM nightly builds. Benchmarks in the cards remain those of the original BF16 models; FP8 is provided “for convenience and performance,” not as a separate evaluation run.

What’s in the A3B stack

Qwen3-Next-80B-A3B is a hybrid architecture combining Gated DeltaNet (a linear/conv-style attention surrogate) with Gated Attention, interleaved with an ultra-sparse Mixture-of-Experts (MoE). The 80B total parameter budget activates ~3B params per token via 512 experts (10 routed + 1 shared). The layout is specified as 48 layers arranged into 12 blocks: 3×(Gated DeltaNet → MoE) followed by 1×(Gated Attention → MoE). Native context is 262,144 tokens, validated up to ~1,010,000 tokens using RoPE scaling (YaRN). Hidden size is 2048; attention uses 16 Q heads and 2 KV heads at head dim 256; DeltaNet uses 32 V and 16 QK linear heads at head dim 128.

Qwen team reports the 80B-A3B base model outperforms Qwen3-32B on downstream tasks at ~10% of its training cost and delivers ~10× inference throughput beyond 32K context—driven by low activation in MoE and multi-token prediction (MTP). The Instruct variant is non-reasoning (no <think> tags), whereas the Thinking variant enforces reasoning traces by default and is optimized for complex problems.

FP8 releases: what actually changed

The FP8 model cards state the quantization is “fine-grained fp8” with block size 128. Deployment differs slightly from BF16: both sglang and vLLM require current main/nightly builds, with example commands provided for 256K context and optional MTP. The Thinking FP8 card also recommends a reasoning parser flag (e.g., --reasoning-parser deepseek-r1 in sglang, deepseek_r1 in vLLM). These releases retain Apache-2.0 licensing.
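
For a rough sense of what offline serving looks like, the following Python sketch loads the FP8 Instruct checkpoint with vLLM. It is a sketch only: the repository name, context length, and parallelism settings are assumptions based on the model card, and a current main/nightly vLLM build is required as noted above.

from vllm import LLM, SamplingParams

# Assumed FP8 repo id; confirm the exact name on the official Qwen model card
llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct-FP8",
    tensor_parallel_size=4,   # adjust to your GPU count
    max_model_len=262144      # native context per the model card
)

prompts = ["Summarize the trade-offs of FP8 quantization for sparse MoE models."]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=512))
print(outputs[0].outputs[0].text)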

Benchmarks (reported on BF16 weights)

The Instruct FP8 card reproduces Qwen’s BF16 comparison table, putting Qwen3-Next-80B-A3B-Instruct on par with Qwen3-235B-A22B-Instruct-2507 on several knowledge/reasoning/coding benchmarks, and ahead on long-context workloads (up to 256K). The Thinking FP8 card lists AIME’25, HMMT’25, MMLU-Pro/Redux, and LiveCodeBench v6, where Qwen3-Next-80B-A3B-Thinking surpasses earlier Qwen3 Thinking releases (30B A3B-2507, 32B) and claims wins over Gemini-2.5-Flash-Thinking on multiple benchmarks.

Training and post-training signals

The series is trained on ~15T tokens before post-training. Qwen highlights stability additions (zero-centered, weight-decayed layer norm, etc.) and uses GSPO in RL post-training for the Thinking model to handle the hybrid attention + high-sparsity MoE combination. MTP is used to speed inference and improve pretraining signal.

Why FP8 matters

On modern accelerators, FP8 activations/weights reduce memory bandwidth pressure and resident footprint versus BF16, allowing larger batch sizes or longer sequences at similar latency. Because A3B routes only ~3B parameters per token, the combination of FP8 + MoE sparsity compounds throughput gains in long-context regimes, particularly when paired with speculative decoding via MTP as exposed in the serving flags. That said, quantization interacts with routing and attention variants; real-world acceptance rates for speculative decoding and end-task accuracy can vary with engine and kernel implementations—hence Qwen’s guidance to use current sglang/vLLM and to tune speculative settings.

Summary

Qwen’s FP8 releases make the 80B/3B-active A3B stack practical to serve at 256K context on mainstream engines, preserving the hybrid-MoE design and MTP path for high throughput. The model cards keep benchmarks from BF16, so teams should validate FP8 accuracy and latency on their own stacks, especially with reasoning parsers and speculative settings. Net outcome: lower memory bandwidth and improved concurrency without architectural regressions, positioned for long-context production workloads.

Check out the Qwen3-Next-80B-A3B models in two post-training variants—Instruct and Thinking.

The post Alibaba Qwen Team Just Released FP8 Builds of Qwen3-Next-80B-A3B (Instruct & Thinking), Bringing 80B/3B-Active Hybrid-MoE to Commodity GPUs appeared first on MarkTechPost.

Rapid ML experimentation for enterprises with Amazon SageMaker AI and …

This post was written with Sarah Ostermeier from Comet.
As enterprise organizations scale their machine learning (ML) initiatives from proof of concept to production, the complexity of managing experiments, tracking model lineage, and managing reproducibility grows exponentially. This is primarily because data scientists and ML engineers constantly explore different combinations of hyperparameters, model architectures, and dataset versions, generating massive amounts of metadata that must be tracked for reproducibility and compliance. As the ML model development scales across multiple teams and regulatory requirements intensify, tracking experiments becomes even more complex. With increasing AI regulations, particularly in the EU, organizations now require detailed audit trails of model training data, performance expectations, and development processes, making experiment tracking a business necessity and not just a best practice.
Amazon SageMaker AI provides the managed infrastructure enterprises need to scale ML workloads, handling compute provisioning, distributed training, and deployment without infrastructure overhead. However, teams still need robust experiment tracking, model comparison, and collaboration capabilities that go beyond basic logging.
Comet is a comprehensive ML experiment management platform that automatically tracks, compares, and optimizes ML experiments across the entire model lifecycle. It provides data scientists and ML engineers with powerful tools for experiment tracking, model monitoring, hyperparameter optimization, and collaborative model development. It also offers Opik, Comet’s open source platform for LLM observability and development.
Comet is available in SageMaker AI as a Partner AI App, as a fully managed experiment management capability, with enterprise-grade security, seamless workflow integration, and a straightforward procurement process through AWS Marketplace.
The combination addresses the needs of an enterprise ML workflow end-to-end, where SageMaker AI handles infrastructure and compute, and Comet provides the experiment management, model registry, and production monitoring capabilities that teams require for regulatory compliance and operational efficiency. In this post, we demonstrate a complete fraud detection workflow using SageMaker AI with Comet, showcasing reproducibility and audit-ready logging needed by enterprises today.
Enterprise-ready Comet on SageMaker AI
Before proceeding to setup instructions, organizations must identify their operating model and based on that, decide how Comet is going to be set up. We recommend implementing Comet using a federated operating model. In this architecture, Comet is centrally managed and hosted in a shared services account, and each data science team maintains fully autonomous environments. Each operating model comes with their own sets of benefits and limitations. For more information, refer to SageMaker Studio Administration Best Practices.
Let’s dive into the setup of Comet in SageMaker AI. Large enterprises generally have the following personas:

Administrators – Responsible for setting up the common infrastructure services and environment for use case teams
Users – ML practitioners from use case teams who use the environments set up by the platform team to solve their business problems

In the following sections, we go through each persona’s journey.
Comet works well with both SageMaker AI and Amazon SageMaker. SageMaker AI provides the Amazon SageMaker Studio integrated development environment (IDE), and SageMaker provides the Amazon SageMaker Unified Studio IDE. For this post, we use SageMaker Studio.
Administrator journey
In this scenario, the administrator receives a request from a team working on a fraud detection use case to provision an ML environment with a fully managed training and experimentation setup. The administrator’s journey includes the following steps:

Follow the prerequisites to set up Partner AI Apps. This sets up permissions for administrators, allowing Comet to assume a SageMaker AI execution role on behalf of the users and additional privileges for managing the Comet subscription through AWS Marketplace.
On the SageMaker AI console, under Applications and IDEs in the navigation pane, choose Partner AI Apps, then choose View details for Comet.

The details are shown, including the contract pricing model for Comet and infrastructure tier estimated costs.

Comet provides subscription options ranging from a 1-month to a 36-month contract. With this contract, users can access Comet in SageMaker. Based on the number of users, the admin can choose the appropriate instance size for the Comet dashboard server. Comet supports 5–500 users running more than 100 experiment jobs.

Choose Go to Marketplace to subscribe to be redirected to the Comet listing on AWS Marketplace.
Choose View purchase options.

In the subscription form, provide the required details.

When the subscription is complete, the admin can start configuring Comet.

While deploying Comet, add the project lead of the fraud detection use case team as an admin for the Comet dashboard.

It takes a few minutes for the Comet server to be deployed. For more details on this step, refer to Partner AI App provisioning.

Set up a SageMaker AI domain following the steps in Use custom setup for Amazon SageMaker AI. As a best practice, provide a pre-signed domain URL for the use case team member to directly access the Comet UI without logging in to the SageMaker console.
Add the team members to this domain and enable access to Comet while configuring the domain.

Now the SageMaker AI domain is ready for users to log in to and start working on the fraud detection use case.
User journey
Now let’s explore the journey of an ML practitioner from the fraud detection use case. The user completes the following steps:

Log in to the SageMaker AI domain through the pre-signed URL.

You will be redirected to the SageMaker Studio IDE. Your user name and AWS Identity and Access Management (IAM) execution role are preconfigured by the admin.

Create a JupyterLab Space following the JupyterLab user guide.
You can start working on the fraud detection use case by spinning up a Jupyter notebook.

The admin has also set up required access to the data through an Amazon Simple Storage Service (Amazon S3) bucket.

To access Comet APIs, install the comet_ml library and configure the required environment variables as described in Set up the Amazon SageMaker Partner AI Apps SDKs.
To access the Comet UI, choose Partner AI Apps in the SageMaker Studio navigation pane and choose Open for Comet.

Now, let’s walk through the use case implementation.
Solution overview
This use case highlights common enterprise challenges: working with imbalanced datasets (in this example, only 0.17% of transactions are fraudulent), requiring multiple experiment iterations, and maintaining full reproducibility for regulatory compliance. To follow along, refer to the Comet documentation and Quickstart guide for additional setup and API details.
For this use case, we use the Credit Card Fraud Detection dataset. The dataset contains credit card transactions with binary labels representing fraudulent (1) or legitimate (0) transactions. In the following sections, we walk through some of the important sections of the implementation. The entire code of the implementation is available in the GitHub repository.
Prerequisites
As a prerequisite, configure the necessary imports and environment variables for the Comet and SageMaker integration:

import os

# Comet ML for experiment tracking
import comet_ml
from comet_ml import Experiment, API, Artifact
from comet_ml.integration.sagemaker import log_sagemaker_training_job_v1

# Environment variables required for the Partner AI App integration
os.environ["AWS_PARTNER_APP_AUTH"] = "true"
os.environ["AWS_PARTNER_APP_ARN"] = "<Your_AWS_PARTNER_APP_ARN>"
# From the Comet details page, choose Open Comet; in the top-right corner, choose user -> API Key
os.environ["COMET_API_KEY"] = "<Your_Comet_API_Key>"

# Comet ML configuration
COMET_WORKSPACE = '<your-comet-workspace-name>'
COMET_PROJECT_NAME = '<your-comet-project-name>'

Prepare the dataset
One of Comet’s key enterprise features is automatic dataset versioning and lineage tracking. This capability provides full auditability of what data was used to train each model, which is critical for regulatory compliance and reproducibility. Start by loading the dataset:

# Create a Comet Artifact to track our raw dataset
dataset_artifact = Artifact(
    name="fraud-dataset",
    artifact_type="dataset",
    aliases=["raw"]
)

# Add the raw dataset file to the artifact as a remote (S3) asset
dataset_artifact.add_remote(s3_data_path, metadata={
    "dataset_stage": "raw",
    "dataset_split": "not_split",
    "preprocessing": "none"
})

Start a Comet experiment
With the dataset artifact created, you can now start tracking the ML workflow. Creating a Comet experiment automatically begins capturing code, installed libraries, system metadata, and other contextual information in the background. You can log the dataset artifact created earlier in the experiment. See the following code:

# Create a new Comet experiment
experiment_1 = comet_ml.Experiment(
    project_name=COMET_PROJECT_NAME,
    workspace=COMET_WORKSPACE,
)

# Log the dataset artifact to this experiment for lineage tracking
experiment_1.log_artifact(dataset_artifact)

Preprocess the data
The next steps are standard preprocessing steps, including removing duplicates, dropping unneeded columns, splitting into train/validation/test sets, and standardizing features using scikit-learn’s StandardScaler. We wrap the processing code in preprocess.py and run it as a SageMaker Processing job. See the following code:

import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Run SageMaker processing job
processor = SKLearnProcessor(
    framework_version='1.0-1',
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.t3.medium'
)

processor.run(
    code='preprocess.py',
    inputs=[ProcessingInput(source=s3_data_path, destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output', destination=f's3://{bucket_name}/{processed_data_prefix}')]
)

After you submit the processing job, SageMaker AI launches the compute instances, processes and analyzes the input data, and releases the resources upon completion. The output of the processing job is stored in the S3 bucket specified.
Next, create a new version of the dataset artifact to track the processed data. Comet automatically versions artifacts with the same name, maintaining complete lineage from raw to processed data.

# Create an updated version of the 'fraud-dataset' Artifact for the preprocessed data
preprocessed_dataset_artifact = Artifact(
    name="fraud-dataset",
    artifact_type="dataset",
    aliases=["preprocessed"],
    metadata={
        "description": "Credit card fraud detection dataset",
        "fraud_percentage": f"{fraud_percentage:.3f}%",
        "dataset_stage": "preprocessed",
        "preprocessing": "StandardScaler + train/val/test split",
    }
)

# Add our train, validation, and test dataset files as remote assets
preprocessed_dataset_artifact.add_remote(
    uri=f's3://{bucket_name}/{processed_data_prefix}',
    logical_path='split_data'
)

# Log the updated dataset to the experiment to track the updates
experiment_1.log_artifact(preprocessed_dataset_artifact)

The Comet and SageMaker AI experiment workflow
Data scientists prefer rapid experimentation; therefore, we organized the workflow into reusable utility functions that can be called multiple times with different hyperparameters while maintaining consistent logging and evaluation across all runs. In this section, we showcase the utility functions along with a brief snippet of the code inside the function:

train() – Spins up a SageMaker model training job using the SageMaker built-in XGBoost algorithm:

# Create SageMaker estimator
estimator = Estimator(
    image_uri=xgboost_image,
    role=execution_role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=model_output_path,
    sagemaker_session=sagemaker_session_obj,
    hyperparameters=hyperparameters_dict,
    max_run=1800  # Maximum training time in seconds
)

# Start training
estimator.fit({
    'train': train_channel,
    'validation': val_channel
})

log_training_job() – Captures the training metadata and metrics and links the model asset to the experiment for complete traceability:

# Log SageMaker training job to Comet
log_sagemaker_training_job_v1(
    estimator=training_estimator,
    experiment=api_experiment
)

log_model_to_comet() – Links model artifacts to Comet, captures the training metadata, and links the model asset to the experiment for complete traceability:

experiment.log_remote_model(
    model_name=model_name,
    uri=model_artifact_path,
    metadata=metadata
)

deploy_and_evaluate_model() – Performs model deployment and evaluation, and metric logging:

# Deploy to endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge"
)

# Log metrics and visualizations to Comet
experiment.log_metrics(metrics)
experiment.log_confusion_matrix(matrix=cm, labels=['Normal', 'Fraud'])

# Log ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob_as_np_array)
experiment.log_curve("roc_curve", x=fpr, y=tpr)

The complete prediction and evaluation code is available in the GitHub repository.
Run the experiments
Now you can run multiple experiments by calling the utility functions with different configurations and compare experiments to find the most optimal settings for the fraud detection use case.
For the first experiment, we establish a baseline using standard XGBoost hyperparameters:

# Define hyperparameters for first experiment
hyperparameters_v1 = {
    'objective': 'binary:logistic',  # Binary classification
    'num_round': 100,                # Number of boosting rounds
    'eval_metric': 'auc',            # Evaluation metric
    'learning_rate': 0.15,           # Learning rate
    'booster': 'gbtree'              # Booster algorithm
}

# Train the model
estimator_1 = train(
    model_output_path=f"s3://{bucket_name}/{model_output_prefix}/1",
    execution_role=role,
    sagemaker_session_obj=sagemaker_session,
    hyperparameters_dict=hyperparameters_v1,
    train_channel_loc=train_channel_location,
    val_channel_loc=validation_channel_location
)

# Log the training job and model artifact
log_training_job(experiment_key=experiment_1.get_key(), training_estimator=estimator_1)
log_model_to_comet(
    experiment=experiment_1,
    model_name="fraud-detection-xgb-v1",
    model_artifact_path=estimator_1.model_data,
    metadata=metadata
)

# Deploy and evaluate
deploy_and_evaluate_model(
    experiment=experiment_1,
    estimator=estimator_1,
    X_test_scaled=X_test_scaled,
    y_test=y_test
)

While running a Comet experiment from a Jupyter notebook, we need to end the experiment to make sure everything is captured and persisted in the Comet server. See the following code: experiment_1.end()
When the baseline experiment is complete, you can run additional experiments with different hyperparameters. Check out the notebook to see the details of both experiments.
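As an illustration only (the notebook contains the actual configuration used), a second run might adjust the hyperparameters to address the class imbalance and reuse the same utility functions:

# Hypothetical second configuration; see the notebook for the exact values used
hyperparameters_v2 = {
    'objective': 'binary:logistic',
    'num_round': 200,
    'eval_metric': 'auc',
    'learning_rate': 0.05,
    'max_depth': 6,
    'scale_pos_weight': 578,  # approximate ratio of legitimate to fraudulent transactions
    'booster': 'gbtree'
}

experiment_2 = comet_ml.Experiment(project_name=COMET_PROJECT_NAME, workspace=COMET_WORKSPACE)

estimator_2 = train(
    model_output_path=f"s3://{bucket_name}/{model_output_prefix}/2",
    execution_role=role,
    sagemaker_session_obj=sagemaker_session,
    hyperparameters_dict=hyperparameters_v2,
    train_channel_loc=train_channel_location,
    val_channel_loc=validation_channel_location
)
log_training_job(experiment_key=experiment_2.get_key(), training_estimator=estimator_2)
deploy_and_evaluate_model(experiment=experiment_2, estimator=estimator_2,
                          X_test_scaled=X_test_scaled, y_test=y_test)
experiment_2.end()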
When the second experiment is complete, navigate to the Comet UI to compare these two experiment runs.
View Comet experiments in the UI
To access the UI, you can locate the URL in the SageMaker Studio IDE or by executing the code provided in the notebook: experiment_2.url
The following screenshot shows the Comet experiments UI. The experiment details are for illustration purposes only and do not represent a real-world fraud detection experiment.

This concludes the fraud detection experiment.
Clean up
For the experimentation part, SageMaker processing and training infrastructure is ephemeral in nature and shuts down automatically when the job is complete. However, you must still manually clean up a few resources to avoid unnecessary costs:

Shut down the SageMaker JupyterLab Space after use. For instructions, refer to Idle shutdown.
The Comet subscription renews based on the contract chosen. Cancel the contract when there is no further requirement to renew the Comet subscription.

Advantages of SageMaker and Comet integration
Having demonstrated the technical workflow, let’s examine the broader advantages this integration provides.
Streamlined model development
The Comet and SageMaker combination reduces the manual overhead of running ML experiments. While SageMaker handles infrastructure provisioning and scaling, Comet’s automatic logging captures hyperparameters, metrics, code, installed libraries, and system performance from your training jobs without additional configuration. This helps teams focus on model development rather than experiment bookkeeping.
Comet’s visualization capabilities extend beyond basic metric plots. Built-in charts enable rapid experiment comparison, and custom Python panels support domain-specific analysis tools for debugging model behavior, optimizing hyperparameters, or creating specialized visualizations that standard tools can’t provide.
Enterprise collaboration and governance
For enterprise teams, the combination creates a mature platform for scaling ML projects across regulated environments. SageMaker provides consistent, secure ML environments, and Comet enables seamless collaboration with complete artifact and model lineage tracking. This helps avoid costly mistakes that occur when teams can’t recreate previous results.
Complete ML lifecycle integration
Unlike point solutions that only address training or monitoring, Comet paired with SageMaker supports your complete ML lifecycle. Models can be registered in Comet’s model registry with full version tracking and governance. SageMaker handles model deployment, and Comet maintains the lineage and approval workflows for model promotion. Comet’s production monitoring capabilities track model performance and data drift after deployment, creating a closed loop where production insights inform your next round of SageMaker experiments.
Conclusion
In this post, we showed how to use SageMaker and Comet together to spin up fully managed ML environments with reproducibility and experiment tracking capabilities.
To enhance your SageMaker workflows with comprehensive experiment management, deploy Comet directly in your SageMaker environment through the AWS Marketplace, and share your feedback in the comments.
For more information about the services and features discussed in this post, refer to the following resources:

Set up Partner AI Apps
Comet Quickstart
GitHub notebook
Comet Documentation
Opik open source platform for LLM observability

About the authors
Vikesh Pandey is a Principal GenAI/ML Specialist Solutions Architect at AWS, helping large financial institutions adopt and scale generative AI and ML workloads. He is the author of the book “Generative AI for financial services.” He carries more than 15 years of experience building enterprise-grade applications on generative AI/ML and related technologies. In his spare time, he plays an unnamed sport with his son that lies somewhere between football and rugby.
Naufal Mir is a Senior GenAI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning workloads to SageMaker. He previously worked at financial services institutes developing and operating systems at scale. Outside of work, he enjoys ultra endurance running and cycling.
Sarah Ostermeier is a Technical Product Marketing Manager at Comet. She specializes in bringing Comet’s GenAI and ML developer products to the engineers who need them through technical content, educational resources, and product messaging. She has previously worked as an ML engineer, data scientist, and customer success manager, helping customers implement and scale AI solutions. Outside of work she enjoys traveling off the beaten path, writing about AI, and reading science fiction.

Understanding the Universal Tool Calling Protocol (UTCP)

The Universal Tool Calling Protocol (UTCP) is a lightweight, secure, and scalable way for AI agents and applications to find and call tools directly, without the need for additional wrapper servers.

Key Features

Lightweight and secure – Allows tools to be accessed directly, avoiding unnecessary middle layers.

Scalable – Can support a large number of tools and providers without losing performance.

Modular design – Version 1.0.0 introduces a plugin-based core, making the protocol easier to extend, test, and package.

Built on Pydantic models – Provides simple, well-defined data structures that make implementation straightforward.

The Problem with Current Approaches

Traditional solutions for integrating tools often require:

Building and maintaining wrapper servers for every tool

Routing all traffic through a central protocol or service

Reimplementing authentication and security for each tool

Accepting additional latency and complexity

These steps add friction for developers and slow down execution.

The UTCP Solution

UTCP offers a better alternative by:

Defining a clear, language-agnostic standard for describing tools and their interfaces

Allowing agents to connect directly to tools using their native communication protocols

Providing an architecture that lets developers add:

New communication protocols (HTTP, SSE, CLI, etc.)

Alternative storage systems

Custom search strategies

All of this can be done without modifying the core library.

By eliminating the need for wrapper servers or other heavy middle layers, UTCP streamlines the way AI agents and applications connect with tools. It reduces latency and overall complexity, since requests no longer have to pass through extra infrastructure. Authentication and security become simpler as well, because UTCP allows agents to use the tool’s existing mechanisms rather than duplicating them in an intermediary service. This leaner approach also makes it easier to build, test, and maintain integrations, while naturally supporting growth as the number of tools and providers increases.

How It Works

UTCP makes tool integration simple and predictable. First, an AI agent discovers your tools by fetching a UTCP manual, which contains definitions and metadata for every capability you expose. Next, the agent learns how to call these tools by reading the manual and understanding the associated call templates. Once the definitions are clear, the agent can invoke your APIs directly using their native communication protocols. Finally, your API processes the request and returns a normal response. This process ensures seamless interoperability without extra middleware or custom translation layers.

Source: https://www.utcp.io/
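
The following sketch makes that flow concrete in Python. It is illustrative only: the manual structure is simplified and does not reproduce the exact UTCP schema, the endpoint URL is hypothetical, and a real client would use the UTCP library’s own discovery and call-template handling.

import requests

# Simplified, illustrative stand-in for a UTCP manual entry (not the exact schema)
manual = {
    "tools": [{
        "name": "get_weather",
        "description": "Return current weather for a city",
        "call_template": {
            "protocol": "http",
            "method": "GET",
            "url": "https://api.example.com/weather",  # hypothetical endpoint
            "query_params": ["city"]
        }
    }]
}

def call_tool(tool, **kwargs):
    """Call the tool directly over its native protocol, as UTCP intends (no wrapper server)."""
    template = tool["call_template"]
    response = requests.request(
        template["method"],
        template["url"],
        params={k: v for k, v in kwargs.items() if k in template["query_params"]}
    )
    response.raise_for_status()
    return response.json()

# result = call_tool(manual["tools"][0], city="Berlin")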

Architecture Overview

Version 1.0 of UTCP introduces a modular, plugin-based architecture designed for scalability and flexibility. At its core are manuals, which define tools and their metadata, as well as call templates that specify how to interact with each tool over different protocols. 

The UTCP Client acts as the engine for discovering tools and executing calls. Around this core is a plugin system that supports protocol adapters, custom communication methods, tool repositories, and search strategies. This separation of concerns makes it easy to extend the system or customize it for a particular environment without altering its foundation.

How is UTCP different from MCP?

UTCP and MCP both help AI agents connect with external tools, but they focus on different needs. UTCP enables direct calls to APIs, CLIs, WebSockets, and other interfaces through simple JSON manuals, keeping infrastructure light and latency low. MCP provides a more structured layer, wrapping tools behind dedicated servers and standardizing communication with JSON-RPC.

Key points:

Architecture: UTCP connects agents straight to tools; MCP uses a server layer for routing.

Performance & Overhead: UTCP minimizes hops; MCP centralizes calls but adds a layer of processing.

Infrastructure: UTCP requires only manuals and a discovery endpoint, while MCP relies on servers for wrapping and routing.

Protocol Support: UTCP works across HTTP, WebSocket, CLI, SSE, and more; MCP focuses on JSON-RPC transport.

Security & Auth: UTCP uses the tool’s existing mechanisms, while MCP manages access inside its servers.

Flexibility: UTCP supports hybrid deployments through its MCP plugin, while MCP offers centralized management and monitoring.

Both approaches are useful: UTCP is ideal for lightweight, flexible integrations, while MCP suits teams wanting a standardized gateway with built-in control.

Conclusion

UTCP is a versatile solution for both tool providers and AI developers. It lets API owners, SaaS providers, and enterprise teams expose services like REST or GraphQL endpoints to AI agents in a simple, secure way. At the same time, developers building agents or applications can use UTCP to connect effortlessly with internal or external tools. By removing complexity and overhead, it streamlines integration and makes it easier for software to access powerful capabilities.


Meta AI Proposes ‘Metacognitive Reuse’: Turning LLM Chains-of-Thought into a Procedural Handbook that Cuts Tokens by 46%

Meta researchers introduced a method that compresses repeated reasoning patterns into short, named procedures—“behaviors”—and then conditions models to use them at inference or distills them via fine-tuning. The result: up to 46% fewer reasoning tokens on MATH while matching or improving accuracy, and up to 10% accuracy gains in a self-improvement setting on AIME, without changing model weights. The work frames this as procedural memory for LLMs—how to reason, not just what to recall—implemented with a curated, searchable “behavior handbook.”

https://arxiv.org/pdf/2509.13237

What problem does this solve?

Long chain-of-thought (CoT) traces repeatedly re-derive common sub-procedures (e.g., inclusion–exclusion, base conversions, geometric angle sums). That redundancy burns tokens, adds latency, and can crowd out exploration. Meta’s idea is to abstract recurring steps into concise, named behaviors (name + one-line instruction) recovered from prior traces via an LLM-driven reflection pipeline, then reuse them during future reasoning. On math benchmarks (MATH-500; AIME-24/25), this reduces output length substantially while preserving or improving solution quality.

How does the pipeline work?

Three roles, one handbook:

Metacognitive Strategist (R1-Llama-70B):

1) solves a problem to produce a trace, 2) reflects on the trace to identify generalizable steps, and 3) emits behaviors as (behavior_name → instruction) entries. These populate a behavior handbook (procedural memory).

Teacher (LLM B): generates behavior-conditioned responses used to build training corpora.

Student (LLM C): consumes behaviors in-context (inference) or is fine-tuned on behavior-conditioned data.

Retrieval is topic-based on MATH and embedding-based (BGE-M3 + FAISS) on AIME.

Prompts: The team provides explicit prompts for solution, reflection, behavior extraction, and behavior-conditioned inference (BCI). In BCI, the model is instructed to reference behaviors explicitly in its reasoning, encouraging consistently short, structured derivations.
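A minimal sketch of the retrieval-plus-prompting side of this pipeline, assuming a generic sentence encoder (as a stand-in for BGE-M3) and a FAISS index over handbook entries; the handbook contents and prompt wording below are illustrative, not the paper's exact behaviors or prompts.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical handbook entries in the (behavior_name -> instruction) form described above.
handbook = {
    "behavior_inclusion_exclusion_principle": "Avoid double counting by subtracting intersections.",
    "behavior_translate_verbal_to_equation": "Formalize word problems as explicit equations.",
    "behavior_distance_from_point_to_line": "Use |Ax+By+C|/sqrt(A^2+B^2) for point-line distance.",
}
names = list(handbook)

# Stand-in embedder (the paper uses BGE-M3; any sentence encoder illustrates the idea).
encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
vecs = encoder.encode([f"{n}: {handbook[n]}" for n in names], normalize_embeddings=True)

index = faiss.IndexFlatIP(vecs.shape[1])  # inner product == cosine on normalized vectors
index.add(np.asarray(vecs, dtype="float32"))

def behavior_conditioned_prompt(question: str, k: int = 2) -> str:
    """Retrieve the top-k behaviors and prepend them to the problem (BCI)."""
    q = encoder.encode([question], normalize_embeddings=True).astype("float32")
    _, ids = index.search(q, k)
    hints = "\n".join(f"- {names[i]}: {handbook[names[i]]}" for i in ids[0])
    return (
        "Use the following behaviors where helpful and cite them by name:\n"
        f"{hints}\n\nProblem: {question}\nSolution:"
    )

print(behavior_conditioned_prompt("How many integers below 100 are divisible by 2 or 3?"))
```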

What are the evaluation modes?

Behavior-Conditioned Inference (BCI): Retrieve K relevant behaviors and prepend them to the prompt.

Behavior-Guided Self-Improvement: Extract behaviors from a model’s own earlier attempts and feed them back as hints for revision.

Behavior-Conditioned SFT (BC-SFT): Fine-tune students on teacher outputs that already follow behavior-guided reasoning, so the behavior usage becomes parametric (no retrieval at test time).

Key results (MATH, AIME-24/25)

Token efficiency: On MATH-500, BCI reduces reasoning tokens by up to 46% versus the same model without behaviors, while matching or improving accuracy. This holds for both R1-Llama-70B and Qwen3-32B students across token budgets (2,048–16,384).

Self-improvement gains: On AIME-24, behavior-guided self-improvement beats a critique-and-revise baseline at nearly every budget, with up to 10% higher accuracy as budgets increase, indicating better test-time scaling of accuracy (not just shorter traces).

BC-SFT quality lift: Across Llama-3.1-8B-Instruct, Qwen2.5-14B-Base, Qwen2.5-32B-Instruct, and Qwen3-14B, BC-SFT consistently outperforms standard SFT and the original base model in accuracy across budgets, while remaining more token-efficient. Importantly, the advantage is not explained by an easier training corpus: teacher correctness rates in the two training sets (original vs. behavior-conditioned) are close, yet BC-SFT students generalize better on AIME-24/25.

Why does this work?

The handbook stores procedural knowledge (how-to strategies), distinct from classic RAG’s declarative knowledge (facts). By converting verbose derivations into short, reusable steps, the model skips re-derivation and reallocates compute to novel subproblems. Behavior prompts serve as structured hints that bias the decoder toward efficient, correct trajectories; BC-SFT then internalizes these trajectories so that behaviors are implicitly invoked without prompt overhead.

What’s inside a “behavior”?

Behaviors range from domain-general reasoning moves to precise mathematical tools, e.g.,

behavior_inclusion_exclusion_principle: avoid double counting by subtracting intersections;

behavior_translate_verbal_to_equation: formalize word problems systematically;

behavior_distance_from_point_to_line: apply |Ax+By+C|/√(A²+B²) for tangency checks.

During BCI, the student explicitly cites behaviors when they’re used, making traces auditable and compact.
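As a quick concrete check of the last behavior, the point-to-line distance it names is the standard formula, which a few lines of Python can verify:

```python
import math

def distance_point_to_line(A: float, B: float, C: float, x0: float, y0: float) -> float:
    """Distance from point (x0, y0) to the line Ax + By + C = 0."""
    return abs(A * x0 + B * y0 + C) / math.sqrt(A**2 + B**2)

# Tangency check example: the line 3x + 4y - 25 = 0 touches the circle x^2 + y^2 = 25
# exactly when its distance from the center (0, 0) equals the radius 5.
assert math.isclose(distance_point_to_line(3, 4, -25, 0, 0), 5.0)
```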

Retrieval and cost considerations

On MATH, behaviors are retrieved by topic; on AIME, top-K behaviors are selected via BGE-M3 embeddings and FAISS. While BCI introduces extra input tokens (the behaviors), input tokens are pre-computable and non-autoregressive, and are often billed cheaper than output tokens on commercial APIs. Since BCI shrinks output tokens, the overall cost can drop while latency improves. BC-SFT eliminates retrieval at test time entirely.
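A back-of-the-envelope illustration of that cost argument, with placeholder per-token prices (not any vendor's actual rates): even after adding behavior tokens to the input, a roughly 46% cut in output tokens can lower the total bill.

```python
# Placeholder prices (USD per 1K tokens) -- illustrative only, not real vendor pricing.
PRICE_IN, PRICE_OUT = 0.003, 0.015  # output tokens commonly cost several times more than input

def cost(input_tokens: float, output_tokens: float) -> float:
    return input_tokens / 1000 * PRICE_IN + output_tokens / 1000 * PRICE_OUT

baseline = cost(input_tokens=500,  output_tokens=4000)         # long, re-derived chain of thought
bci      = cost(input_tokens=1300, output_tokens=4000 * 0.54)  # +800 behavior tokens, ~46% fewer output tokens
print(f"baseline ${baseline:.4f}  vs  BCI ${bci:.4f}")         # BCI is cheaper despite the longer prompt
```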


Summary

Meta’s behavior-handbook approach operationalizes procedural memory for LLMs: it abstracts recurring reasoning steps into reusable “behaviors,” applies them via behavior-conditioned inference or distills them with BC-SFT, and empirically delivers up to 46% fewer reasoning tokens with accuracy that holds or improves (≈10% gains in self-correction regimes). The method is straightforward to integrate—an index, a retriever, optional fine-tuning—and surfaces auditable traces, though scaling beyond math and managing a growing behavior corpus remain open engineering problems.



IBM and ETH Zürich Researchers Unveil Analog Foundation Models to Tackle Noise in In-Memory AI Hardware

IBM researchers, together with ETH Zürich, have unveiled a new class of Analog Foundation Models (AFMs) designed to bridge the gap between large language models (LLMs) and Analog In-Memory Computing (AIMC) hardware. AIMC has long promised a radical leap in efficiency—running models with a billion parameters in a footprint small enough for embedded or edge devices—thanks to dense non-volatile memory (NVM) that combines storage and computation. But the technology’s Achilles’ heel has been noise: performing matrix-vector multiplications directly inside NVM devices yields non-deterministic errors that cripple off-the-shelf models.

Why does analog computing matter for LLMs?

Unlike GPUs or TPUs that shuttle data between memory and compute units, AIMC performs matrix-vector multiplications directly inside memory arrays. This design removes the von Neumann bottleneck and delivers massive improvements in throughput and power efficiency. Prior studies showed that combining AIMC with 3D NVM and Mixture-of-Experts (MoE) architectures could, in principle, support trillion-parameter models on compact accelerators. That could make foundation-scale AI feasible on devices well beyond data-centers.

https://arxiv.org/pdf/2505.09663

What makes Analog In-Memory Computing (AIMC) so difficult to use in practice?

The biggest barrier is noise. AIMC computations suffer from device variability, DAC/ADC quantization, and runtime fluctuations that degrade model accuracy. Unlike quantization on GPUs—where errors are deterministic and manageable—analog noise is stochastic and unpredictable. Earlier research found ways to adapt small networks like CNNs and RNNs (<100M parameters) to tolerate such noise, but LLMs with billions of parameters consistently broke down under AIMC constraints.

How do Analog Foundation Models address the noise problem?

The IBM team introduces Analog Foundation Models, which integrate hardware-aware training to prepare LLMs for analog execution. Their pipeline uses:

Noise injection during training to simulate AIMC randomness.

Iterative weight clipping to stabilize distributions within device limits.

Learned static input/output quantization ranges aligned with real hardware constraints.

Distillation from pre-trained LLMs using 20B tokens of synthetic data.

These methods, implemented with AIHWKIT-Lightning, allow models like Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct to sustain performance comparable to weight-quantized 4-bit / activation 8-bit baselines under analog noise. In evaluations across reasoning and factual benchmarks, AFMs outperformed both quantization-aware training (QAT) and post-training quantization (SpinQuant).
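A simplified PyTorch-style sketch of the first two ingredients, noise injection and weight clipping, applied inside a linear layer's forward pass. The noise model, scale, and clip bound below are illustrative stand-ins, not the calibrated behavior that AIHWKIT-Lightning simulates.

```python
import torch
import torch.nn as nn

class NoisyClippedLinear(nn.Linear):
    """Simplified hardware-aware linear layer: clip weights to a fixed range and add
    multiplicative Gaussian noise during training to mimic AIMC non-idealities.
    The noise scale and clip bound are illustrative, not the paper's calibrated values."""

    def __init__(self, in_features: int, out_features: int, noise_std: float = 0.02, clip: float = 2.5):
        super().__init__(in_features, out_features)
        self.noise_std = noise_std
        self.clip = clip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Weight clipping: keep weights inside the representable device range.
        w = self.weight.clamp(-self.clip, self.clip)
        if self.training:
            # Noise injection: stochastic perturbation of the effective weights.
            w = w * (1 + torch.randn_like(w) * self.noise_std)
        return nn.functional.linear(x, w, self.bias)

layer = NoisyClippedLinear(16, 8)
out = layer(torch.randn(4, 16))  # forward pass with analog-style noise while training
```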

Do these models work only for analog hardware?

No. An unexpected outcome is that AFMs also perform strongly on low-precision digital hardware. Because AFMs are trained to tolerate noise and clipping, they handle simple post-training round-to-nearest (RTN) quantization better than existing methods. This makes them useful not just for AIMC accelerators, but also for commodity digital inference hardware.
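Round-to-nearest quantization is simple enough to show directly. Below is a per-tensor symmetric variant in Python; it is one common formulation, not necessarily the exact configuration used in the paper's evaluation.

```python
import torch

def rtn_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Per-tensor symmetric round-to-nearest quantization, then dequantize back to float."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for INT4
    scale = w.abs().max() / qmax                    # map the largest weight to the top level
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                                # dequantized weights used at inference

w = torch.randn(256, 256)
w_q = rtn_quantize(w, bits=4)
print((w - w_q).abs().mean())                       # average error RTN leaves behind
```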

Can performance scale with more compute at inference time?

Yes. The researchers tested test-time compute scaling on the MATH-500 benchmark, generating multiple answers per query and selecting the best via a reward model. AFMs showed better scaling behavior than QAT models, with accuracy gaps shrinking as more inference compute was allocated. This is consistent with AIMC’s strengths—low-power, high-throughput inference rather than training.
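Schematically, the test-time scaling setup is a best-of-N loop: sample several candidate solutions and keep the one a reward model scores highest. In the sketch below, generate and reward are placeholders for whatever sampling model and verifier are used.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 8) -> str:
    """Test-time compute scaling: draw n candidate solutions and return the one
    the reward model prefers. `generate` and `reward` are placeholders for the
    actual sampling and scoring models."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(prompt, answer))

# Usage: best_of_n(problem_text, generate=model_sample_fn, reward=reward_model_fn, n=16)
```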


How does this impact the future of Analog In-Memory Computing (AIMC)?

The research team provides the first systematic demonstration that large LLMs can be adapted to AIMC hardware without catastrophic accuracy loss. While training AFMs is resource-heavy and reasoning tasks like GSM8K still show accuracy gaps, the results are a milestone. The combination of energy efficiency, robustness to noise, and cross-compatibility with digital hardware makes AFMs a promising direction for scaling foundation models beyond GPU limits.

Summary

The introduction of Analog Foundation Models marks a critical milestone for scaling LLMs beyond the limits of digital accelerators. By making models robust to the unpredictable noise of analog in-memory computing, the research team shows that AIMC can move from a theoretical promise to a practical platform. While training costs remain high and reasoning benchmarks still show gaps, this work establishes a path toward energy-efficient, large-scale models running on compact hardware, pushing foundation models closer to edge deployment.

