A Coding Implementation to Build Neural Memory Agents with Differentiable Memory, Meta-Learning, and Experience Replay for Continual Adaptation in Dynamic Environments

In this tutorial, we explore how neural memory agents can learn continuously without forgetting past experiences. We design a memory-augmented neural network that integrates a Differentiable Neural Computer (DNC) with experience replay and meta-learning to adapt quickly to new tasks while retaining prior knowledge. By implementing this approach in PyTorch, we demonstrate how content-based memory addressing and prioritized replay enable the model to overcome catastrophic forgetting and maintain performance across multiple learning tasks.

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
from dataclasses import dataclass

@dataclass
class MemoryConfig:
    memory_size: int = 128
    memory_dim: int = 64
    num_read_heads: int = 4
    num_write_heads: int = 1
We begin by importing all the essential libraries and defining the configuration class for our neural memory system. Here, we set parameters such as memory size, dimensionality, and the number of read/write heads that shape how the differentiable memory behaves throughout training. This setup acts as the foundation upon which our memory-augmented architecture is built.

class NeuralMemoryBank(nn.Module):
    def __init__(self, config: MemoryConfig):
        super().__init__()
        self.memory_size = config.memory_size
        self.memory_dim = config.memory_dim
        self.num_read_heads = config.num_read_heads
        self.register_buffer('memory', torch.zeros(config.memory_size, config.memory_dim))
        self.register_buffer('usage', torch.zeros(config.memory_size))

    def content_addressing(self, key, beta):
        key_norm = F.normalize(key, dim=-1)
        mem_norm = F.normalize(self.memory, dim=-1)
        similarity = torch.matmul(key_norm, mem_norm.t())
        return F.softmax(beta * similarity, dim=-1)

    def write(self, write_key, write_vector, erase_vector, write_strength):
        write_weights = self.content_addressing(write_key, write_strength)
        erase = torch.outer(write_weights.squeeze(), erase_vector.squeeze())
        self.memory = (self.memory * (1 - erase)).detach()
        add = torch.outer(write_weights.squeeze(), write_vector.squeeze())
        self.memory = (self.memory + add).detach()
        self.usage = (0.99 * self.usage + write_weights.squeeze()).detach()

    def read(self, read_keys, read_strengths):
        reads = []
        for i in range(self.num_read_heads):
            weights = self.content_addressing(read_keys[i], read_strengths[i])
            read_vector = torch.matmul(weights, self.memory)
            reads.append(read_vector)
        return torch.cat(reads, dim=-1)

class MemoryController(nn.Module):
    def __init__(self, input_dim, hidden_dim, memory_config: MemoryConfig):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.memory_config = memory_config
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        total_read_dim = memory_config.num_read_heads * memory_config.memory_dim
        self.read_keys = nn.Linear(hidden_dim, memory_config.num_read_heads * memory_config.memory_dim)
        self.read_strengths = nn.Linear(hidden_dim, memory_config.num_read_heads)
        self.write_key = nn.Linear(hidden_dim, memory_config.memory_dim)
        self.write_vector = nn.Linear(hidden_dim, memory_config.memory_dim)
        self.erase_vector = nn.Linear(hidden_dim, memory_config.memory_dim)
        self.write_strength = nn.Linear(hidden_dim, 1)
        self.output = nn.Linear(hidden_dim + total_read_dim, input_dim)

    def forward(self, x, memory_bank, hidden=None):
        lstm_out, hidden = self.lstm(x.unsqueeze(0), hidden)
        controller_state = lstm_out.squeeze(0)
        read_k = self.read_keys(controller_state).view(self.memory_config.num_read_heads, -1)
        read_s = F.softplus(self.read_strengths(controller_state))
        write_k = self.write_key(controller_state)
        write_v = torch.tanh(self.write_vector(controller_state))
        erase_v = torch.sigmoid(self.erase_vector(controller_state))
        write_s = F.softplus(self.write_strength(controller_state))
        read_vectors = memory_bank.read(read_k, read_s)
        memory_bank.write(write_k, write_v, erase_v, write_s)
        combined = torch.cat([controller_state, read_vectors], dim=-1)
        output = self.output(combined)
        return output, hidden

We implement the Neural Memory Bank and the Memory Controller, which together form the core of the agent’s differentiable memory mechanism. The Neural Memory Bank stores and retrieves information through content-based addressing, while the controller network dynamically interacts with this memory using read and write operations. This setup enables the agent to recall relevant information and adapt to new inputs efficiently.

class ExperienceReplay:
    def __init__(self, capacity=10000, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)

    def push(self, experience, priority=1.0):
        self.buffer.append(experience)
        self.priorities.append(priority ** self.alpha)

    def sample(self, batch_size, beta=0.4):
        if len(self.buffer) == 0:
            return [], []
        probs = np.array(self.priorities)
        probs = probs / probs.sum()
        indices = np.random.choice(len(self.buffer), min(batch_size, len(self.buffer)), p=probs, replace=False)
        samples = [self.buffer[i] for i in indices]
        weights = (len(self.buffer) * probs[indices]) ** (-beta)
        weights = weights / weights.max()
        return samples, torch.FloatTensor(weights)

class MetaLearner(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def adapt(self, support_x, support_y, memory_bank, num_steps=5, lr=0.01):
        # Simplified MAML-style inner loop: the adapted parameters are returned
        # without modifying the base model (they can be applied externally).
        adapted_params = {name: param.clone() for name, param in self.model.named_parameters()}
        for _ in range(num_steps):
            pred, _ = self.model(support_x, memory_bank)
            loss = F.mse_loss(pred, support_y)
            grads = torch.autograd.grad(loss, self.model.parameters(), create_graph=True)
            adapted_params = {name: param - lr * grad for (name, param), grad in zip(adapted_params.items(), grads)}
        return adapted_params

We design the Experience Replay and Meta-Learner components to strengthen the agent’s ability to learn continuously. The replay buffer enables the model to revisit past experiences through prioritized sampling, thereby reducing forgetting, while the Meta-Learner utilizes MAML-style adaptation for rapid learning on new tasks. Together, these modules bring stability and flexibility to the agent’s training process.

class ContinualLearningAgent:
    def __init__(self, input_dim=64, hidden_dim=128):
        self.config = MemoryConfig()
        self.memory_bank = NeuralMemoryBank(self.config)
        self.controller = MemoryController(input_dim, hidden_dim, self.config)
        self.replay_buffer = ExperienceReplay(capacity=5000)
        self.meta_learner = MetaLearner(self.controller)
        self.optimizer = torch.optim.Adam(self.controller.parameters(), lr=0.001)
        self.task_history = []

    def train_step(self, x, y, use_replay=True):
        self.optimizer.zero_grad()
        pred, _ = self.controller(x, self.memory_bank)
        current_loss = F.mse_loss(pred, y)
        self.replay_buffer.push((x.detach().clone(), y.detach().clone()), priority=current_loss.item() + 1e-6)
        total_loss = current_loss
        if use_replay and len(self.replay_buffer.buffer) > 16:
            samples, weights = self.replay_buffer.sample(8)
            for (replay_x, replay_y), weight in zip(samples, weights):
                with torch.enable_grad():
                    replay_pred, _ = self.controller(replay_x, self.memory_bank)
                    replay_loss = F.mse_loss(replay_pred, replay_y)
                    total_loss = total_loss + 0.3 * replay_loss * weight
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.controller.parameters(), 1.0)
        self.optimizer.step()
        return total_loss.item()

    def evaluate(self, test_data):
        self.controller.eval()
        total_error = 0
        with torch.no_grad():
            for x, y in test_data:
                pred, _ = self.controller(x, self.memory_bank)
                total_error += F.mse_loss(pred, y).item()
        self.controller.train()
        return total_error / len(test_data)

We construct a Continual Learning Agent that integrates memory, controller, replay, and meta-learning into a single, adaptive framework. In this step, we define how the agent trains on each batch, replays past data, and evaluates its performance. The implementation ensures that the model can retain prior knowledge while learning new information without catastrophic forgetting.

def create_task_data(task_id, num_samples=100):
    torch.manual_seed(task_id)
    x = torch.randn(num_samples, 64)
    if task_id == 0:
        y = torch.sin(x.mean(dim=1, keepdim=True).expand(-1, 64))
    elif task_id == 1:
        y = torch.cos(x.mean(dim=1, keepdim=True).expand(-1, 64)) * 0.5
    else:
        y = torch.tanh(x * 0.5 + task_id)
    return [(x[i], y[i]) for i in range(num_samples)]

def run_continual_learning_demo():
    print("Neural Memory Agent - Continual Learning Demo")
    print("=" * 60)
    agent = ContinualLearningAgent()
    num_tasks = 4
    results = {'tasks': [], 'without_memory': [], 'with_memory': []}
    for task_id in range(num_tasks):
        print(f"\nLearning Task {task_id + 1}/{num_tasks}")
        train_data = create_task_data(task_id, num_samples=50)
        test_data = create_task_data(task_id, num_samples=20)
        for epoch in range(20):
            total_loss = 0
            for x, y in train_data:
                loss = agent.train_step(x, y, use_replay=(task_id > 0))
                total_loss += loss
            if epoch % 5 == 0:
                avg_loss = total_loss / len(train_data)
                print(f"  Epoch {epoch:2d}: Loss = {avg_loss:.4f}")
        print("\nEvaluation on all tasks:")
        for eval_task_id in range(task_id + 1):
            eval_data = create_task_data(eval_task_id, num_samples=20)
            error = agent.evaluate(eval_data)
            print(f"  Task {eval_task_id + 1}: Error = {error:.4f}")
            if eval_task_id == task_id:
                results['tasks'].append(eval_task_id + 1)
                results['with_memory'].append(error)
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    ax = axes[0]
    memory_matrix = agent.memory_bank.memory.detach().numpy()
    im = ax.imshow(memory_matrix, aspect='auto', cmap='viridis')
    ax.set_title('Neural Memory Bank State', fontsize=14, fontweight='bold')
    ax.set_xlabel('Memory Dimension')
    ax.set_ylabel('Memory Slots')
    plt.colorbar(im, ax=ax)
    ax = axes[1]
    ax.plot(results['tasks'], results['with_memory'], marker='o', linewidth=2, markersize=8, label='With Memory Replay')
    ax.set_title('Continual Learning Performance', fontsize=14, fontweight='bold')
    ax.set_xlabel('Task Number')
    ax.set_ylabel('Test Error')
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('neural_memory_results.png', dpi=150, bbox_inches='tight')
    print("\nResults saved to 'neural_memory_results.png'")
    plt.show()
    print("\n" + "=" * 60)
    print("Key Insights:")
    print("  • Memory bank stores compressed task representations")
    print("  • Experience replay mitigates catastrophic forgetting")
    print("  • Agent maintains performance on earlier tasks")
    print("  • Content-based addressing enables efficient retrieval")

if __name__ == "__main__":
    run_continual_learning_demo()

We conduct a comprehensive demonstration of the continual learning process, generating synthetic tasks to evaluate the agent’s adaptability across multiple environments. As we train and visualize the results, we observe how memory replay improves stability and maintains accuracy across tasks. The experiment concludes with graphical insights that highlight how differentiable memory enhances the agent’s long-term learning capability.

In conclusion, we built and trained a neural memory agent capable of continual adaptation across evolving tasks. We observed how the differentiable memory enables efficient storage and retrieval of learned representations, while the replay mechanism reinforces stability and knowledge retention. By combining these components with meta-learning, we saw how such agents pave the way for more resilient, self-adapting neural systems that can remember, reason, and evolve without losing what they’ve already mastered.


AI Interview Series #1: Explain Some LLM Text Generation Strategies Used in LLMs

Every time you prompt an LLM, it doesn’t generate a complete answer all at once — it builds the response one word (or token) at a time. At each step, the model predicts the probability of what the next token could be based on everything written so far. But knowing probabilities alone isn’t enough — the model also needs a strategy to decide which token to actually pick next.

Different strategies can completely change how the final output looks — some make it more focused and precise, while others make it more creative or varied. In this article, we’ll explore four popular text generation strategies used in LLMs: Greedy Search, Beam Search, Nucleus Sampling, and Temperature Sampling — explaining how each one works.

Greedy Search

Greedy Search is the simplest decoding strategy where, at each step, the model picks the token with the highest probability given the current context. While it’s fast and easy to implement, it doesn’t always produce the most coherent or meaningful sequence — similar to making the best local choice without considering the overall outcome. Because it only follows one path in the probability tree, it can miss better sequences that require short-term trade-offs. As a result, greedy search often leads to repetitive, generic, or dull text, making it unsuitable for open-ended text generation tasks.
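
To make this concrete, here is a minimal greedy decoding sketch in PyTorch. It assumes a Hugging Face style causal language model whose forward pass returns an object with a logits tensor of shape (batch, sequence, vocabulary); the model, input_ids, and eos_token_id are placeholders, so treat this as an illustration rather than a library-specific recipe.

import torch

def greedy_decode(model, input_ids, max_new_tokens=50, eos_token_id=None):
    # input_ids: tensor of shape (1, seq_len) holding the prompt tokens
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits[:, -1, :]       # next-token distribution
        next_token = logits.argmax(dim=-1, keepdim=True)     # always take the single most likely token
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if eos_token_id is not None and next_token.item() == eos_token_id:
            break
    return input_ids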

Beam Search

Beam Search is an improved decoding strategy over greedy search that keeps track of multiple possible sequences (called beams) at each generation step instead of just one. It expands the top K most probable sequences, allowing the model to explore several promising paths in the probability tree and potentially discover higher-quality completions that greedy search might miss. The parameter K (beam width) controls the trade-off between quality and computation — larger beams produce better text but are slower. 

While beam search works well in structured tasks like machine translation, where accuracy matters more than creativity, it tends to produce repetitive, predictable, and less diverse text in open-ended generation. This happens because the algorithm favors high-probability continuations, leading to less variation and “neural text degeneration,” where the model overuses certain words or phrases.

https://arxiv.org/pdf/1904.09751

Worked example (Greedy Search vs Beam Search on a small probability tree):

Greedy Search (K=1) always takes the highest local probability:

T2: Chooses “slow” (0.6) over “fast” (0.4).

Resulting path: “The slow dog barks.” (Final Probability: 0.1680)

Beam Search (K=2) keeps both “slow” and “fast” paths alive:

At T3, it realizes the path starting with “fast” has a higher potential for a good ending.

Resulting path: “The fast cat purrs.” (Final Probability: 0.1800)

Beam Search successfully explores a path that had a slightly lower probability early on, leading to a better overall sentence score.
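
A compact beam search sketch, under the same causal-LM assumptions as the greedy example above, is shown below. It keeps the beam_width highest scoring sequences by cumulative log probability; length normalization and end-of-sequence handling are omitted for brevity.

import torch
import torch.nn.functional as F

def beam_search(model, input_ids, beam_width=2, max_new_tokens=20):
    beams = [(input_ids, 0.0)]                                # (sequence, cumulative log prob)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            with torch.no_grad():
                log_probs = F.log_softmax(model(seq).logits[:, -1, :], dim=-1)
            top_lp, top_ids = log_probs.topk(beam_width, dim=-1)
            for lp, tok in zip(top_lp[0], top_ids[0]):
                candidates.append((torch.cat([seq, tok.view(1, 1)], dim=-1), score + lp.item()))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                                        # highest-scoring sequence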

Top-p Sampling (Nucleus Sampling)

Top-p Sampling (Nucleus Sampling) is a probabilistic decoding strategy that dynamically adjusts how many tokens are considered for generation at each step. Instead of picking from a fixed number of top tokens like in top-k sampling, top-p sampling selects the smallest set of tokens whose cumulative probability adds up to a chosen threshold p (for example, 0.7). These tokens form the “nucleus,” from which the next token is randomly sampled after normalizing their probabilities. 

This allows the model to balance diversity and coherence — sampling from a broader range when many tokens have similar probabilities (flat distribution) and narrowing down to the most likely tokens when the distribution is sharp (peaky). As a result, top-p sampling produces more natural, varied, and contextually appropriate text compared to fixed-size methods like greedy or beam search.
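
Here is a minimal sketch of nucleus sampling over a single next-token logits vector; the function name and the default p value are illustrative choices.

import torch
import torch.nn.functional as F

def top_p_sample(logits, p=0.7):
    # logits: tensor of shape (vocab_size,) for the next token
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    keep = cumulative - sorted_probs < p                      # smallest set whose mass reaches p
    keep[0] = True                                            # always keep the most likely token
    nucleus = sorted_probs * keep
    nucleus = nucleus / nucleus.sum()                         # renormalize over the nucleus
    choice = torch.multinomial(nucleus, 1)
    return sorted_idx[choice]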

Temperature Sampling

Temperature Sampling controls the level of randomness in text generation by adjusting the temperature parameter (t) in the softmax function that converts logits into probabilities. A lower temperature (t < 1) makes the distribution sharper, increasing the chance of selecting the most probable tokens — resulting in more focused but often repetitive text. At t = 1, the model samples directly from its natural probability distribution, known as pure or ancestral sampling. 

Higher temperatures (t > 1) flatten the distribution, introducing more randomness and diversity but at the cost of coherence. In practice, temperature sampling allows fine-tuning the balance between creativity and precision: low temperatures yield deterministic, predictable outputs, while higher ones generate more varied and imaginative text. 

The optimal temperature often depends on the task — for instance, creative writing benefits from higher values, while technical or factual responses perform better with lower ones.
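
A minimal temperature sampling sketch looks like this; the commented usage lines with a fake 32,000 token vocabulary are purely illustrative.

import torch
import torch.nn.functional as F

def temperature_sample(logits, t=1.0):
    # t < 1 sharpens the distribution, t > 1 flattens it, t = 1 is pure (ancestral) sampling
    probs = F.softmax(logits / t, dim=-1)
    return torch.multinomial(probs, 1)

# logits = torch.randn(32000)                 # fake next-token logits for illustration
# focused = temperature_sample(logits, t=0.3)
# creative = temperature_sample(logits, t=1.5)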


StepFun AI Releases Step-Audio-EditX: A New Open-Source 3B LLM-Grade Audio Editing Model Excelling at Expressive and Iterative Audio Editing

How can speech editing become as direct and controllable as simply rewriting a line of text? StepFun AI has open sourced Step-Audio-EditX, a 3B parameter LLM based audio model that turns expressive speech editing into a token level text like operation, instead of a waveform level signal processing task.

https://arxiv.org/pdf/2511.03601

Why do developers care about controllable TTS?

Most zero shot TTS systems copy emotion, style, accent, and timbre directly from a short reference audio. They can sound natural, but control is weak. Style prompts in text help only for in domain voices, and the cloned voice often ignores the requested emotion or speaking style.

Past work tries to disentangle factors with extra encoders, adversarial losses, or complex architectures. Step-Audio-EditX keeps a relatively entangled representation and instead changes the data and post training objective. The model learns control by seeing many pairs and triplets where text is fixed, but one attribute changes with a large margin.

Architecture, dual codebook tokenizer plus compact audio LLM

Step-Audio-EditX reuses the Step-Audio dual codebook tokenizer. Speech is mapped into two token streams, a linguistic stream at 16.7 Hz with a 1024 entry codebook, and a semantic stream at 25 Hz with a 4096 entry codebook. Tokens are interleaved with a 2 to 3 ratio. The tokenizer keeps prosody and emotion information, so it is not fully disentangled.
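
To make the 2 to 3 interleaving concrete, here is a toy sketch of merging the two token streams. The exact ordering used by the Step-Audio tokenizer is not specified in the summary above, so treat this only as an illustration of the ratio, with hypothetical function and variable names.

def interleave_dual_codebook(linguistic_tokens, semantic_tokens):
    # Alternate 2 linguistic tokens (16.7 Hz stream) with 3 semantic tokens (25 Hz stream)
    merged, li, si = [], 0, 0
    while li < len(linguistic_tokens) or si < len(semantic_tokens):
        merged.extend(linguistic_tokens[li:li + 2])
        merged.extend(semantic_tokens[si:si + 3])
        li += 2
        si += 3
    return merged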

On top of this tokenizer, the StepFun research team builds a 3B parameter audio LLM. The model is initialized from a text LLM, then trained on a blended corpus with a 1 to 1 ratio of pure text and dual codebook audio tokens in chat style prompts. The audio LLM reads text tokens, audio tokens, or both, and always generates dual codebook audio tokens as output.

A separate audio decoder handles reconstruction. A diffusion transformer based flow matching module predicts Mel spectrograms from audio tokens, reference audio, and a speaker embedding, and a BigVGANv2 vocoder converts Mel spectrograms to waveform. The flow matching module is trained on about 200000 hours of high quality speech, which improves pronunciation and timbre similarity.

https://arxiv.org/pdf/2511.03601

Large margin synthetic data instead of complicated encoders

The key idea is large margin learning. The model is post trained on triplets and quadruplets that keep text fixed and change only one attribute with a clear gap.

For zero shot TTS, Step-Audio-EditX uses a high quality in house dataset, mainly Chinese and English, with a small amount of Cantonese and Sichuanese, and about 60,000 speakers. The data covers wide intra speaker and inter speaker variation in style and emotion.

For emotion and speaking style editing, the team builds synthetic large margin triplets (text, audio neutral, audio emotion or style). Voice actors record about 10 second clips for each emotion and style. StepTTS zero shot cloning then produces neutral and emotional versions for the same text and speaker. A margin scoring model, trained on a small human labeled set, scores pairs on a 1 to 10 scale, and only samples with score at least 6 are kept.

Paralinguistic editing, which covers breathing, laughter, filled pauses and other tags, uses a semi synthetic strategy on top of the NVSpeech dataset. The research team builds quadruplets where the target is the original NVSpeech audio and transcript, and the input is a cloned version with tags removed from the text. This gives time domain editing supervision without a margin model.

Reinforcement learning data uses two preference sources. Human annotators rate 20 candidates per prompt on a 5 point scale for correctness, prosody, and naturalness, and pairs with margin greater than 3 are kept. A comprehension model scores emotion and speaking style on a 1 to 10 scale, and pairs with margin greater than 8 are kept.

Post training, SFT plus PPO on token sequences

Post training has two stages, supervised fine tuning followed by PPO.

In supervised fine tuning, system prompts define zero shot TTS and editing tasks in a unified chat format. For TTS, the prompt waveform is encoded to dual codebook tokens, converted to string form, and inserted into the system prompt as speaker information. The user message is the target text, and the model returns new audio tokens. For editing, the user message includes original audio tokens plus a natural language instruction, and the model outputs edited tokens.

Reinforcement learning then refines instruction following. A 3B reward model is initialized from the SFT checkpoint and trained with Bradley Terry loss on large margin preference pairs. The reward is computed directly on dual codebook token sequences, without decoding to waveform. PPO training uses this reward model, a clip threshold, and a KL penalty to balance quality and deviation from the SFT policy.
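
A minimal sketch of the Bradley Terry objective for such a reward model is shown below; the reward_model call is a hypothetical stand-in for the 3B reward model that scores dual codebook token sequences.

import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen, reward_rejected):
    # Pairwise preference loss: -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical usage:
# r_good = reward_model(tokens_preferred)   # shape (batch,)
# r_bad  = reward_model(tokens_rejected)
# loss = bradley_terry_loss(r_good, r_bad)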

Step-Audio-Edit-Test, iterative editing and generalization

To quantify control, the research team introduced Step-Audio-Edit-Test. It uses Gemini 2.5 Pro as an LLM as a judge to evaluate emotion, speaking style, and paralinguistic accuracy. The benchmark has 8 speakers, drawn from Wenet Speech4TTS, GLOBE V2, and Libri Light, with 4 speakers per language.

The emotion set has 5 categories with 50 Chinese and 50 English prompts per category. The speaking style set has 7 styles with 50 prompts per language per style. The paralinguistic set has 10 labels such as breathing, laughter, surprise oh, and uhm, with 50 prompts per label and language.

Editing is evaluated iteratively. Iteration 0 is the initial zero shot clone. Then the model applies 3 rounds of editing with text instructions. In Chinese, emotion accuracy rises from 57.0 at iteration 0 to 77.7 at iteration 3. Speaking style accuracy rises from 41.6 to 69.2. English shows similar behavior, and a prompt fixed ablation, where the same prompt audio is used for all iterations, still improves accuracy, which supports the large margin learning hypothesis.

https://arxiv.org/pdf/2511.03601

The same editing model is applied to four closed source TTS systems, GPT 4o mini TTS, ElevenLabs v2, Doubao Seed TTS 2.0, and MiniMax speech 2.6 hd. For all of them, one editing iteration with Step-Audio-EditX improves both emotion and style accuracy, and further iterations continue to help.

Paralinguistic editing is scored on a 1 to 3 scale. The average score rises from 1.91 at iteration 0 to 2.89 after a single edit, in both Chinese and English, which is comparable to native paralinguistic synthesis in strong commercial systems.

https://arxiv.org/pdf/2511.03601

Key Takeaways

Step Audio EditX uses a dual codebook tokenizer and a 3B parameter audio LLM so it can treat speech as discrete tokens and edit audio in a text like way.

The model relies on large margin synthetic data for emotion, speaking style, paralinguistic cues, speed, and noise, rather than adding extra disentangling encoders.

Supervised fine tuning plus PPO with a token level reward model aligns the audio LLM to follow natural language editing instructions for both TTS and editing tasks.

The Step Audio Edit Test benchmark with Gemini 2.5 Pro as a judge shows clear accuracy gains over 3 editing iterations for emotion, style, and paralinguistic control in both Chinese and English.

Step Audio EditX can post process and improve speech from closed source TTS systems, and the full stack, including code and checkpoints, is available as open source for developers.

Editorial Comments

Step Audio EditX is a precise step forward in controllable speech synthesis, because it keeps the Step Audio tokenizer, adds a compact 3B audio LLM, and optimizes control through large margin data and PPO. The introduction of Step Audio Edit Test with Gemini 2.5 Pro as a judge makes the evaluation story concrete for emotion, speaking style, and paralinguistic control, and the open release lowers the barrier for practical audio editing research. Overall, this release makes audio editing feel much closer to text editing.


Nested Learning: A New Machine Learning Approach for Continual Learning that Views Models as Nested Optimization Problems to Enhance Long Context Processing

How can we build AI systems that keep learning new information over time without forgetting what they learned before or retraining from scratch? Google researchers have introduced Nested Learning, a machine learning approach that treats a model as a collection of smaller nested optimization problems, instead of a single network trained by one outer loop. The goal is to attack catastrophic forgetting and move large models toward continual learning, closer to how biological brains manage memory and adaptation over time.

https://abehrouz.github.io/files/NL.pdf

What is Nested Learning?

The research paper from Google ‘Nested Learning, The Illusion of Deep Learning Architectures’ models a complex neural network as a set of coherent optimization problems, nested or running in parallel, that are optimized together. Each internal problem has its own context flow, the sequence of inputs, gradients, or states that this component observes, and its own update frequency.

Instead of seeing training as a flat stack of layers plus one optimizer, Nested Learning imposes an ordering by update frequency. Parameters that update often sit at inner levels, while slowly updated parameters form outer levels. This hierarchy defines a Neural Learning Module, where every level compresses its own context flow into its parameters. The research team shows that this view covers standard back-propagation on an MLP, linear attention, and common optimizers, all as instances of associative memory.

In this framework, associative memory is any operator that maps keys to values and is trained with an internal objective. The research team formalizes associative memory and then shows that back-propagation itself can be written as a one step gradient descent update that learns a mapping from inputs to local surprise signals, the gradient of the loss with respect to the output.

https://abehrouz.github.io/files/NL.pdf

Deep Optimizers as Associative Memory

Once optimizers are treated as learning modules, Nested Learning suggests redesigning them with richer internal objectives. Standard momentum can be written as a linear associative memory over past gradients, trained with a dot product similarity objective. This internal objective produces a Hebbian like update rule that does not model dependencies between data samples.

The research team replaced this similarity objective with an L2 regression loss over gradient features, which yields an update rule that better manages limited memory capacity and better memorizes gradient sequences. They then generalize the momentum memory from a linear map to an MLP and define Deep Momentum Gradient Descent, where the momentum state is produced by a neural memory and can pass through a non linear function such as Newton Schulz. This perspective also recovers the Muon optimizer as a special case.

https://abehrouz.github.io/files/NL.pdf

Continuum Memory System

In typical sequence models, attention acts as working memory over the current context window, while feedforward blocks store pre training knowledge as long term memory that is rarely updated after training. The Nested Learning researchers extend this binary view to a Continuum Memory System, or CMS.

CMS is defined as a chain of MLP blocks, MLP(f₁) through MLP(fₖ), where each block has its own update frequency and chunk size. For an input sequence, the output is obtained by sequentially applying these blocks. The parameters of each block are updated only every C^(ℓ) steps, so each block compresses a different time scale of context into its parameters. A standard Transformer with one feedforward block is recovered as the special case with k equal to 1.
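
The following PyTorch sketch illustrates the idea of per-block update frequencies; the class name, block sizes, and the update schedule are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class ContinuumMemorySketch(nn.Module):
    def __init__(self, dim, update_every=(1, 4, 16)):
        super().__init__()
        # One MLP block per time scale, applied sequentially
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
            for _ in update_every
        ])
        self.update_every = update_every

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)
        return x

    def step_optimizers(self, optimizers, step):
        # Block l only steps its optimizer every update_every[l] steps, so outer
        # blocks change more slowly and compress longer time scales of context.
        for freq, opt in zip(self.update_every, optimizers):
            if step % freq == 0:
                opt.step()
                opt.zero_grad()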

This construction turns long term memory into a spectrum of levels across frequency, instead of a single static feedforward layer. The research connects this directly to multi time scale synaptic and system consolidation processes in the brain, where different parts of the system learn at different rates while sharing a common architecture.

HOPE, A Self Modifying Architecture Built On Titans

To show that Nested Learning is practical, the research team designed HOPE, a self referential sequence model that applies the paradigm to a recurrent architecture. HOPE is built as a variant of Titans, a long term memory architecture where a neural memory module learns to memorize surprising events at test time and helps attention attend to long past tokens.

Titans has only 2 levels of parameter update, which yields first order in context learning. HOPE extends Titans in 2 ways. First, it is self modifying, it can optimize its own memory through a self referential process and can in principle support unbounded levels of in context learning. Second, it integrates Continuum Memory System blocks so that memory updates occur at multiple frequencies and scale to longer context windows.

https://abehrouz.github.io/files/NL.pdf

Understanding the Results

The research team evaluates HOPE and baselines on language modeling and common sense reasoning tasks at 3 parameter scales, 340M, 760M, and 1.3B parameters. Benchmarks include Wiki and LMB perplexity for language modeling and PIQA, HellaSwag, WinoGrande, ARC Easy, ARC Challenge, Social IQa, and BoolQ accuracy for reasoning. Table 1 in the paper reports results for HOPE, Transformer++, RetNet, Gated DeltaNet, TTT, Samba, and Titans.

https://abehrouz.github.io/files/NL.pdf

Key Takeaways

Nested Learning treats a model as multiple nested optimization problems with different update frequencies, which directly targets catastrophic forgetting in continual learning.

The framework reinterprets backpropagation, attention, and optimizers as associative memory modules that compress their own context flow, giving a unified view of architecture and optimization.

Deep optimizers in Nested Learning replace simple dot product similarity with richer objectives such as L2 regression and use neural memories, which leads to more expressive and context aware update rules.

The Continuum Memory System models memory as a spectrum of MLP blocks that update at different rates, creating short, medium, and long range memory rather than one static feedforward layer.

The HOPE architecture, a self modifying variant of Titans built using Nested Learning principles, shows improved language modeling, long context reasoning, and continual learning performance compared to strong Transformer and recurrent baselines.

Editorial Comments

Nested Learning is a useful reframing of deep networks as Neural Learning Modules that integrate architecture and optimization into one system. The introduction of Deep Momentum Gradient Descent, Continuum Memory System, and the HOPE architecture gives a concrete path to richer associative memory and better continual learning. Overall, this work turns continual learning from an afterthought into a primary design axis.


Prior Labs Releases TabPFN-2.5: The Latest Version of TabPFN that Unlocks Scale and Speed for Tabular Foundation Models

Tabular data is still where many important models run in production. Finance, healthcare, energy and industry teams work with tables of rows and columns, not images or long text. Prior Labs now extends this space with TabPFN-2.5, a new tabular foundation model that scales in context learning to 50,000 samples and 2,000 features while keeping a training free workflow.

https://priorlabs.ai/technical-reports/tabpfn-2-5-model-report

From TabPFN And TabPFNv2 To TabPFN-2.5

The first TabPFN showed that a transformer can learn a Bayesian like inference procedure on synthetic tabular tasks. It handled up to about 1,000 samples and clean numerical features. TabPFNv2 extended this to messy real world data. It added support for categorical features, missing values and outliers, and was practical up to 10,000 samples and 500 features.

TabPFN-2.5 is the next generation in this line. Prior Labs describes it as best for datasets with up to 50,000 samples and 2,000 features, which is a 5 times increase in rows and a 4 times increase in columns over TabPFNv2. That gives roughly 20 times more data cells in the supported regime. The model is exposed through the tabpfn Python package and also through an API.

Aspect                     | TabPFN (v1)  | TabPFNv2 | TabPFN-2.5
Max rows (recommended)     | 1,000        | 10,000   | 50,000
Max features (recommended) | 100          | 500      | 2,000
Supported data types       | Numeric only | Mixed    | Mixed

In Context Learning For Tables

TabPFN-2.5 follows the same prior data fitted network idea as earlier versions. It is a transformer based foundation model that uses in context learning to solve tabular prediction problems in a forward pass. At training time, the model is meta trained on large synthetic distributions of tabular tasks. At inference time, you pass training rows and labels and the test rows together. The model runs one forward pass and outputs predictions, so there is no dataset specific gradient descent or hyperparameter search.
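
As a usage sketch, the workflow looks roughly like the snippet below, assuming the scikit-learn style TabPFNClassifier interface exposed by earlier releases of the open tabpfn package; the exact TabPFN-2.5 API and defaults may differ, so treat this as illustrative.

from tabpfn import TabPFNClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()              # no dataset-specific hyperparameter search
clf.fit(X_train, y_train)             # "fit" only stores the training rows as context
probs = clf.predict_proba(X_test)     # predictions come from a single forward pass
print(probs[:3])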

https://priorlabs.ai/technical-reports/tabpfn-2-5-model-report

Benchmark Results On TabArena And RealCause

The research team uses the TabArena Lite benchmark to measure medium sized tasks up to 10,000 samples and 500 features. TabPFN-2.5 in a forward pass outperforms any other model in the comparison. When the Real-TabPFN-2.5 variant is fine tuned on real datasets, the lead increases further. AutoGluon 1.4 in extreme mode is the baseline ensemble, tuned for 4 hours and even including TabPFNv2.

On industry standard benchmarks with up to 50,000 data points and 2,000 features, TabPFN-2.5 substantially outperforms tuned tree based models such as XGBoost and CatBoost. On the same benchmarks it matches the accuracy of AutoGluon 1.4, which runs a complex four hour tuned ensemble that includes previous methods.

Model Architecture And Training Setup

The model architecture follows TabPFNv2 with alternating attention and 18 to 24 layers. Alternating attention means that the network attends along the sample axis and along the feature axis in separate stages, which enforces permutation invariance over rows and columns. This design is important for tabular data where the order of rows and the order of columns do not carry information.
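
A minimal sketch of alternating attention is shown below, assuming a table embedded as a (batch, samples, features, dim) tensor with dim divisible by the number of heads; this is a simplified illustration, not the TabPFNv2 implementation.

import torch
import torch.nn as nn

class AlternatingAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.sample_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.feature_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, samples, features, dim)
        b, s, f, d = x.shape
        # Attend along the sample axis, independently for each feature column
        xs = x.permute(0, 2, 1, 3).reshape(b * f, s, d)
        xs, _ = self.sample_attn(xs, xs, xs)
        x = xs.reshape(b, f, s, d).permute(0, 2, 1, 3)
        # Attend along the feature axis, independently for each sample row
        xf = x.reshape(b * s, f, d)
        xf, _ = self.feature_attn(xf, xf, xf)
        return xf.reshape(b, s, f, d)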

The training setup keeps the prior data based learning idea. TabPFN-2.5 uses synthetic tabular tasks with different priors over functions and data distributions as its meta training source. Real-TabPFN-2.5 uses continued pre training on a set of real world tabular datasets from repositories like OpenML and Kaggle, while the team carefully avoids overlap with evaluation benchmarks.

Key Takeaways

TabPFN 2.5 scales prior data fitted tabular transformers to about 50,000 samples and 2,000 features while keeping a one forward pass, no tuning workflow.

The model is trained on synthetic tabular tasks and evaluated on TabArena, internal industry benchmarks and RealCause, where it substantially outperforms tuned tree based baselines and matches AutoGluon 1.4 on benchmarks in this size range.

TabPFN 2.5 keeps the TabPFNv2 style alternating attention transformer for rows and features, which enables permutation invariance over tables and in context learning without task specific training.

A distillation engine turns TabPFN 2.5 into compact MLP or tree ensemble students that preserve most of the accuracy while giving much lower latency and plug in deployment in existing tabular stacks.

Editorial Comments

TabPFN 2.5 is an important release for tabular machine learning because it turns model selection and hyperparameter tuning into a single forward pass workflow on datasets with up to 50,000 samples and 2,000 features. It combines synthetic meta training, Real-TabPFN-2.5 fine tuning and a distillation engine into MLP and TreeEns students, with a clear non commercial license and enterprise path. Overall, this release makes prior data fitted networks practical for real tabular problems.


Anthropic Turns MCP Agents Into Code First Systems With ‘Code Execution With MCP’ Approach

Agents that use the Model Context Protocol MCP have a scaling problem. Every tool definition and every intermediate result is pushed through the context window, which means large workflows burn tokens and hit latency and cost limits fast. Anthropic’s new ‘code execution with MCP’ pattern restructures this pipeline by turning MCP tools into code level APIs and asking the model to write and run code instead of calling tools directly.

The problem, MCP tools as direct model calls

MCP is an open standard that lets AI applications connect to external systems through MCP servers that expose tools. These tools let a model query databases, call APIs, or work with files through a unified interface.

In the default pattern, an agent loads many tool definitions into the model context. Each tool definition contains schema information and metadata. Intermediate results from each tool call are also streamed back into the context so the model can decide the next call.

Anthropic describes a typical case where an agent uses an MCP server for Google Drive to fetch a long sales meeting transcript and then uses another MCP server for Salesforce to update a record with that transcript. The full transcript is first returned through the model, then sent back again when the Salesforce tool is called. For a long meeting this can add tens of thousands of extra tokens that do not change the logic of the task.

When there are many MCP servers and many tools, this pattern does not scale. The model pays to read large tool catalogs and to move large payloads between tools. Latency increases, costs grow, and context limits become a hard cap on system behavior.

The shift, represent MCP servers as code APIs

Anthropic’s proposal is to place MCP inside a code execution loop. Instead of letting the model call tools directly, the MCP client exposes each server as a set of code modules in a filesystem. The model writes TypeScript code that imports and composes those modules, and this code runs in a sandboxed environment.

The pattern has three main steps.

The MCP client generates a directory such as servers that mirrors the available MCP servers and tools.

For each MCP tool, it creates a thin wrapper function implemented in a source file, for example servers/google-drive/getDocument.ts, that internally calls the MCP tool with typed parameters.

The model is instructed to write TypeScript code that imports these functions, runs them, and handles control flow and data movement inside the execution environment.

The earlier Google Drive and Salesforce workflow becomes a short script. The script calls the Google Drive wrapper once, manipulates or inspects the data locally, then calls the Salesforce wrapper. The large transcript does not pass through the model, only the final status and any small samples or summaries do.

Cloudflare’s ‘Code Mode’ work uses the same idea in its Workers platform. It converts MCP tools into TypeScript APIs and runs model generated code inside an isolate with restricted bindings.

Quantitative impact, token usage drops by 98.7 percent

Anthropic reports a concrete example. A workflow that previously consumed about 150,000 tokens when tools and intermediate data were passed directly through the model was reimplemented with code execution and filesystem based MCP APIs. The new pattern used about 2,000 tokens. That is a 98.7 percent reduction in token usage for that scenario, which also reduces cost and latency.

Design benefits for agent builders

Code execution with MCP introduces several practical benefits for engineers who design agents:

Progressive tool discovery: The agent does not need all tool definitions in context. It can explore the generated filesystem, list available servers, and read specific tool modules only when needed. This shifts tool catalogs from the model context into code, so tokens are spent only on relevant interfaces.

Context efficient data handling: Large datasets remain inside the execution environment. For example, TypeScript code can read a large spreadsheet through an MCP tool, filter rows, compute aggregates, and log only small samples and summary statistics back to the model. The model sees a compact view of the data while the heavy lifting happens in code.

Privacy preserving operations: Anthropic describes a pattern where sensitive fields such as email or phone are tokenized inside the execution environment. The model sees placeholders, while the MCP client maintains a secure mapping and restores real values when calling downstream tools. This lets data move between MCP servers without exposing raw identifiers to the model (see the sketch after this list).

State and reusable skills: The filesystem lets agents store intermediate files and reusable scripts. A helper script that transforms a sheet into a report can be saved in a skills directory and imported in later sessions. Anthropic connects this idea to Claude Skills, where collections of scripts and metadata define higher level capabilities.
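
A toy Python sketch of the tokenization pattern from the privacy point above might look like this; the class and method names are hypothetical, and in the real pattern the MCP client manages this mapping inside the execution harness rather than in application code.

import uuid

class PiiTokenizer:
    def __init__(self):
        self._mapping = {}                           # placeholder -> real value (client side only)

    def redact(self, value: str) -> str:
        placeholder = f"<PII:{uuid.uuid4().hex[:8]}>"
        self._mapping[placeholder] = value
        return placeholder                           # this is all the model ever sees

    def restore(self, text: str) -> str:
        for placeholder, value in self._mapping.items():
            text = text.replace(placeholder, value)  # real values re-inserted before the tool call
        return text

# pii = PiiTokenizer()
# safe = pii.redact("alice@example.com")
# payload = pii.restore(f"update contact {safe}")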

Editorial Comments

Anthropic’s ‘code execution with MCP’ approach is a sensible next step for MCP powered agents. It directly attacks the token costs of loading tool definitions and routing large intermediate results through the context, by presenting MCP servers as code APIs and pushing work into a sandboxed TypeScript runtime. This makes agents more efficient, while also forcing teams to take code execution security seriously. This launch turns MCP from a tool list into an executable API surface.

Google AI Releases ADK Go: A New Open-Source Toolkit Designed to Empower Go Developers to Build Powerful AI Agents

How do you build reliable AI agents that plug into your existing Go services without bolting on a separate language stack? Google has just released Agent Development Kit for Go. Go developers can now build AI agents with the same framework that already supports Python and Java, while keeping everything inside a familiar Go toolchain and deployment model.

For AI devs and backend developers who already use Go for services, this closes a gap. You no longer need a separate Python based stack for agents. You can express agent logic, orchestration, and tool use directly in Go code, then move the same agents into Vertex AI Agent Builder and Agent Engine when you are ready for production.

What the Agent Development Kit Provides

Agent Development Kit, or ADK, is an open source framework for developing and deploying AI agents. It is optimized for Gemini and Google Cloud, but the design is model agnostic and deployment agnostic.

In practical terms, ADK gives you:

A code first programming model where agent behavior, tools, and orchestration live in normal source files

Workflow agents for sequential, parallel, and loop style control flow inside an agent system

A rich tool ecosystem with built in tools, custom function tools, OpenAPI tools, Google Cloud tools, and ecosystem tools

Deployment paths that cover local runs, containers, Cloud Run, and Vertex AI Agent Engine

Built in evaluation and safety patterns, integrated with Vertex AI Agent Builder

For a developer, ADK turns an agent into a normal service. You run it locally, inspect traces, and deploy it to a managed runtime, instead of treating it as a one off script that calls an LLM.

What ADK for Go Adds

The Go release keeps the same core feature set as the Python and Java SDKs but exposes it through an idiomatic Go API. The Google AI team describes ADK for Go as an idiomatic and performant way to build agents that use Go concurrency and strong typing.

Here are some key points:

ADK for Go is installed with go get google.golang.org/adk

The project is open source and hosted at github.com/google/adk-go

It supports building, evaluating, and deploying sophisticated AI agents with flexibility and control

It uses the same abstractions for agents, tools, and workflows as the other ADK languages

This means a Go service can embed agent behavior without switching languages. You can build a multi agent architecture where each agent is a Go component that composes with others using the same framework.

A2A Protocol Support in Go

ADK for Go ships with native support for the Agent2Agent protocol, or A2A.

The A2A protocol defines a way for agents to call other agents over a standard interface. In the Go release, Google highlights that a primary agent can orchestrate and delegate tasks to specialized sub agents. Those sub agents can run locally or as remote deployments. A2A keeps these interactions secure and opaque, so an agent does not need to expose internal memory or proprietary logic to participate.

Google also contributed an A2A Go SDK to the main A2A project. That gives Go developers a protocol level entry point if they want agents that interoperate with other runtimes and frameworks that also support A2A.

MCP Toolbox for Databases and Tooling

A key detail in the official Google announcement is native integration with MCP Toolbox for Databases. It states that ADK Go has out of the box support for more than 30 databases through this toolbox.

MCP Toolbox for Databases is an open source MCP server for databases. It handles connection pooling, authentication, and other concerns, and exposes database operations as tools using the Model Context Protocol.

Within ADK, that means:

You register MCP Toolbox for Databases as an MCP tool provider

The agent calls database operations through MCP tools rather than constructing raw SQL

The toolbox enforces a set of safe, predefined actions that the agent can perform

This fits the ADK model for tools in general, where agents use a mix of built in tools, Google Cloud tools, ecosystem tools, and MCP tools, all described in the Vertex AI Agent Builder documentation.

Integration with Vertex AI Agent Builder and Agent Engine

ADK is the primary framework supported in Vertex AI Agent Builder for building multi agent systems.

The latest Agent Builder updates describe a build path where you:

Develop the agent locally using ADK, now including ADK for Go

Use the ADK quickstart and dev UI to test the agent with multiple tools

Deploy the agent to Vertex AI Agent Engine as a managed runtime

For Go teams, this means the language used in services and infrastructure is now available across the full agent lifecycle, from local development to managed production deployment.

Editorial Comments

This launch positions Agent Development Kit for Go as a practical bridge between AI agents and existing Go services, using the same open source, code first toolkit that underpins Python and Java agents. It brings A2A protocol support and MCP Toolbox for Databases into a Go native environment, aligned with Vertex AI Agent Builder and Vertex AI Agent Engine for deployment, evaluation, and observability. Overall, this release makes Go a first class language for building production ready, interoperable AI agents in Google’s ecosystem.


Why Spatial Supersensing is Emerging as the Core Capability for Multimodal AI Systems?

Even strong ‘long-context’ AI models fail badly when they must track objects and counts over long, messy video streams. The next competitive edge will therefore come from models that predict what comes next and selectively remember only surprising, important events, not from buying more compute and bigger context windows. A team of researchers from New York University and Stanford introduces Cambrian-S, a spatially grounded video multimodal large language model family, together with the VSI Super benchmark and the VSI 590K dataset to test and train spatial supersensing in long videos.

https://arxiv.org/pdf/2511.04670

From video question answering to spatial supersensing

The research team frames spatial supersensing as a progression of capabilities beyond linguistic only reasoning. The stages are semantic perception, streaming event cognition, implicit 3D spatial cognition and predictive world modeling.

Most current video MLLMs sample sparse frames and rely on language priors. They often answer benchmark questions using captions or single frames rather than continuous visual evidence. Diagnostic tests show that several popular video benchmarks are solvable with limited or text only input, so they do not strongly test spatial sensing.

Cambrian-S targets the higher stages of this hierarchy, where the model must remember spatial layouts across time, reason about object locations and counts and anticipate changes in a 3D world.

VSI Super, a stress test for continual spatial sensing

To expose the gap between current systems and spatial supersensing, the research team designed VSI Super, a two part benchmark that runs on arbitrarily long indoor videos.

https://arxiv.org/pdf/2511.04670

VSI Super Recall, or VSR, evaluates long horizon spatial observation and recall. Human annotators take indoor walkthrough videos from ScanNet, ScanNet++ and ARKitScenes and use Gemini to insert an unusual object, such as a Teddy Bear, into four frames at different spatial locations. These edited sequences are concatenated into streams up to 240 minutes. The model must report the order of locations where the object appears, which is a visual needle in a haystack task with sequential recall.

https://arxiv.org/pdf/2511.04670

VSI Super Count, or VSC, measures continual counting under changing viewpoints and rooms. The benchmark concatenates room tour clips from VSI Bench and asks for the total number of instances of a target object across all rooms. The model must handle viewpoint changes, revisits and scene transitions and maintain a cumulative count. Evaluation uses mean relative accuracy for durations from 10 to 120 minutes.

When Cambrian-S 7B is evaluated on VSI Super in a streaming setup at 1 frame per second, accuracy on VSR drops from 38.3 percent at 10 minutes to 6.0 percent at 60 minutes and becomes zero beyond 60 minutes. VSC accuracy is near zero across lengths. Gemini 2.5 Flash also degrades on VSI Super despite a long context window, which shows that brute force context scaling is not sufficient for continual spatial sensing.

VSI 590K, spatially focused instruction data

To test whether data scaling can help, the research team constructs VSI 590K, a spatial instruction corpus with 5,963 videos, 44,858 images and 590,667 question answer pairs from 10 sources.

Sources include 3D annotated real indoor scans such as ScanNet, ScanNet++ V2, ARKitScenes, S3DIS and Aria Digital Twin, simulated scenes from ProcTHOR and Hypersim and pseudo annotated web data such as YouTube RoomTour and robot datasets Open X Embodiment and AgiBot World.

The dataset defines 12 spatial question types, such as object count, absolute and relative distance, object size, room size and appearance order. Questions are generated from 3D annotations or reconstructions so that spatial relationships are grounded in geometry rather than text heuristics. Ablations show that annotated real videos contribute the largest gains on VSI Bench, followed by simulated data and then pseudo annotated images and that training on the full mix gives the best spatial performance.

https://arxiv.org/pdf/2511.04670

Cambrian-S model family and spatial performance

Cambrian-S builds on Cambrian-1 and uses Qwen2.5 language backbones at 0.5B, 1.5B, 3B and 7B parameters with a SigLIP2 SO400M vision encoder and a two layer MLP connector.

Training follows a four stage pipeline. Stage 1 performs vision language alignment on image text pairs. Stage 2 applies image instruction tuning, equivalent to the improved Cambrian-1 setup. Stage 3 extends to video with general video instruction tuning on a 3 million sample mixture called Cambrian-S 3M. Stage 4 performs spatial video instruction tuning on a mixture of VSI 590K and a subset of the stage 3 data.

https://arxiv.org/pdf/2511.04670

On VSI Bench, Cambrian-S 7B reaches 67.5 percent accuracy and outperforms open source baselines like InternVL3.5 8B and Qwen VL 2.5 7B as well as proprietary Gemini 2.5 Pro by more than 16 absolute points. The model also maintains strong performance on Perception Test, EgoSchema and other general video benchmarks, so the focus on spatial sensing does not destroy general capabilities.

Predictive sensing with latent frame prediction and surprise

To go beyond static context expansion, the research team proposes predictive sensing. They add a Latent Frame Prediction head, which is a two layer MLP that predicts the latent representation of the next video frame in parallel with next token prediction.

Training modifies stage 4. The model uses mean squared error and cosine distance losses between predicted and ground truth latent features, weighted against the language modeling loss. A subset of 290,000 videos from VSI 590K, sampled at 1 frame per second, is reserved for this objective. During this stage the connector, language model and both output heads are trained jointly, while the SigLIP vision encoder remains frozen.

https://arxiv.org/pdf/2511.04670

At inference time the cosine distance between predicted and actual features becomes a surprise score. Frames with low surprise are compressed before being stored in long term memory and high surprise frames are retained with more detail. A fixed size memory buffer uses surprise to decide which frames to consolidate or drop and queries retrieve frames that are most relevant to the question.
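
The following sketch shows the surprise computation and a toy fixed-size buffer that stores low-surprise frames in compressed form; the threshold, pooling factor, and class names are illustrative assumptions, not the paper's settings.

import torch
import torch.nn.functional as F
from collections import deque

def surprise(predicted_latent, actual_latent):
    # Cosine distance between the predicted next-frame latent and the observed one
    return 1.0 - F.cosine_similarity(predicted_latent, actual_latent, dim=-1)

class SurpriseMemory:
    def __init__(self, capacity=256, threshold=0.3):
        self.buffer = deque(maxlen=capacity)
        self.threshold = threshold

    def add(self, frame_feat, predicted_feat):
        s = surprise(predicted_feat, frame_feat).mean().item()
        if s < self.threshold:
            # Low surprise: keep a 4x downsampled (compressed) version of the feature vector
            frame_feat = F.adaptive_avg_pool1d(frame_feat.view(1, 1, -1), frame_feat.numel() // 4).view(-1)
        self.buffer.append((s, frame_feat))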

For VSR, this surprise driven memory system lets Cambrian-S maintain accuracy as video length increases while keeping GPU memory usage stable. It outperforms Gemini 1.5 Flash and Gemini 2.5 Flash on VSR at all tested durations and avoids the sharp degradation seen in models that only extend context.

For VSC, the research team designed a surprise driven event segmentation scheme. The model accumulates features in an event buffer and when a high surprise frame signals a scene change, it summarizes that buffer into a segment level answer and resets the buffer. Aggregating segment answers gives the final count. In streaming evaluation, Gemini Live and GPT Realtime achieve less than 15 percent mean relative accuracy and drop near zero on 120 minute streams, while Cambrian-S with surprise segmentation reaches about 38 percent at 10 minutes and maintains around 28 percent at 120 minutes.
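
As a rough illustration of this segmentation scheme, the following sketch accumulates frames into an event buffer and emits a per-segment answer whenever surprise crosses a threshold; summarize_segment stands in for a model call and is a hypothetical placeholder, not the paper's code.

def count_by_segments(frame_features, surprises, summarize_segment, threshold=0.5):
    """Surprise-driven event segmentation for counting: a high-surprise frame is treated
    as a scene change, so the current buffer is summarized into a segment-level count
    and reset. Aggregating the segment answers gives the final count."""
    buffer, total = [], 0
    for feature, s in zip(frame_features, surprises):
        if s > threshold and buffer:
            total += summarize_segment(buffer)  # per-segment answer
            buffer = []
        buffer.append(feature)
    if buffer:
        total += summarize_segment(buffer)
    return total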

Key Takeaways

Cambrian-S and VSI 590K show that careful spatial data design and strong video MLLMs can significantly improve spatial cognition on VSI Bench, but they still fail on VSI Super, so scale alone does not solve spatial supersensing.

VSI Super, through VSR and VSC, is intentionally built from arbitrarily long indoor videos to stress continual spatial observation, recall and counting, which makes it resistant to brute force context window expansion and standard sparse frame sampling.

Benchmarking shows that frontier models, including Gemini 2.5 Flash and Cambrian-S, degrade sharply on VSI Super even when video lengths remain within their nominal context limits, revealing a structural weakness in current long context multimodal architectures.

The Latent Frame Prediction based predictive sensing module uses next latent frame prediction error, or surprise, to drive memory compression and event segmentation, which yields substantial gains on VSI Super compared to long context baselines while keeping GPU memory usage stable.

The research work positions spatial supersensing as a hierarchy from semantic perception to predictive world modeling and argues that future video MLLMs must incorporate explicit predictive objectives and surprise driven memory, not only larger models and datasets, to handle unbounded streaming video in real applications.

Editorial Comments

Cambrian-S is a useful stress test of current video MLLMs because it shows that VSI Super is not just a harder benchmark; it exposes a structural failure of long context architectures that still rely on reactive perception. The predictive sensing module, based on Latent Frame Prediction and surprise driven memory, is an important step because it couples spatial sensing with internal world modeling rather than only scaling data and parameters. This research signals a shift from passive video understanding to predictive spatial supersensing as the next design target for multimodal models.

Check out the Paper (https://arxiv.org/pdf/2511.04670).
The post Why Spatial Supersensing is Emerging as the Core Capability for Multimodal AI Systems? appeared first on MarkTechPost.

Comparing the Top 6 Inference Runtimes for LLM Serving in 2025

Large language models are now limited less by training and more by how fast and cheaply we can serve tokens under real traffic. That comes down to three implementation details: how the runtime batches requests, how it overlaps prefill and decode, and how it stores and reuses the KV cache. Different engines make different tradeoffs on these axes, which show up directly as differences in tokens per second, P50/P99 latency, and GPU memory usage.

This article compares six runtimes that show up repeatedly in production stacks:

vLLM

TensorRT LLM

Hugging Face Text Generation Inference (TGI v3)

LMDeploy

SGLang

DeepSpeed Inference / ZeRO Inference

1. vLLM

Design

vLLM is built around PagedAttention. Instead of storing each sequence’s KV cache in a large contiguous buffer, it partitions KV into fixed size blocks and uses an indirection layer so each sequence points to a list of blocks.

This gives:

Very low KV fragmentation (reported <4% waste vs 60–80% in naïve allocators)

High GPU utilization with continuous batching

Native support for prefix sharing and KV reuse at block level

Recent versions add KV quantization (FP8) and integrate FlashAttention style kernels.
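
As a rough mental model of the block-table indirection behind PagedAttention, here is a toy allocator in Python; it is a conceptual sketch of paged KV bookkeeping, not vLLM's actual code.

class PagedKVCache:
    """Toy block-table allocator: each sequence's KV cache is a list of fixed-size
    blocks rather than one contiguous buffer, so freed blocks can be reused by any
    sequence and fragmentation stays low."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_lens = {}                          # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> int:
        """Return the physical block that will hold this token's KV entry."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # block boundary: grab a free block
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return self.block_tables[seq_id][-1]

    def free(self, seq_id: str):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)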

Performance

From the vLLM project's own evaluation:

vLLM achieves 14–24× higher throughput than Hugging Face Transformers and 2.2–3.5× higher than early TGI for LLaMA models on NVIDIA GPUs.

KV and memory behavior

PagedAttention provides a KV layout that is both GPU friendly and fragmentation resistant.

FP8 KV quantization reduces KV size and improves decode throughput when compute is not the bottleneck.

Where it fits

Default high performance engine when you need a general LLM serving backend with good throughput, good TTFT, and hardware flexibility.

2. TensorRT LLM

Design

TensorRT LLM is a compilation based engine on top of NVIDIA TensorRT. It generates fused kernels per model and shape, and exposes an executor API used by frameworks such as Triton.

Its KV subsystem is explicit and feature rich:

Paged KV cache

Quantized KV cache (INT8, FP8, with some combinations still evolving)

Circular buffer KV cache

KV cache reuse, including offloading KV to CPU and reusing it across prompts to reduce TTFT

NVIDIA reports that CPU based KV reuse can reduce time to first token by up to 14× on H100 and even more on GH200 in specific scenarios.

Performance

TensorRT LLM is highly tunable, so results vary. Common patterns from public comparisons and vendor benchmarks:

Very low single request latency on NVIDIA GPUs when engines are compiled for the exact model and configuration.

At moderate concurrency, it can be tuned either for low TTFT or for high throughput; at very high concurrency, throughput optimized profiles push P99 up due to aggressive batching.

KV and memory behavior

Paged KV plus quantized KV gives strong control over memory use and bandwidth.

Executor and memory APIs let you design cache aware routing policies at the application layer.

Where it fits

Latency critical workloads and NVIDIA only environments, where teams can invest in engine builds and per model tuning.

3. Hugging Face TGI v3

Design

Text Generation Inference (TGI) is a server focused stack with:

Rust based HTTP and gRPC server

Continuous batching, streaming, safety hooks

Backends for PyTorch and TensorRT and tight Hugging Face Hub integration

TGI v3 adds a new long context pipeline:

Chunked prefill for long inputs (a conceptual sketch follows this list)

Prefix KV caching so long conversation histories are not recomputed on each request
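
A toy sketch of the chunked prefill idea, assuming a generic forward_fn that extends a per-request KV cache; this is conceptual and not TGI's implementation.

def chunked_prefill(prompt_tokens, forward_fn, chunk_size=2048):
    """Toy chunked prefill: run a long prompt through the model in fixed-size chunks,
    extending the request's KV cache each time, so the scheduler can interleave decode
    steps for other requests between chunks instead of stalling on one huge prefill."""
    kv_cache = None
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        kv_cache = forward_fn(chunk, kv_cache)  # hypothetical model call
        # ... decode steps for other requests can be scheduled here ...
    return kv_cache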

Performance

For conventional prompts, recent third party work shows:

vLLM often edges out TGI on raw tokens per second at high concurrency due to PagedAttention, but the difference is not huge on many setups.

TGI v3 processes around 3× more tokens and is up to 13× faster than vLLM on long prompts, under a setup with very long histories and prefix caching enabled.

Latency profile:

P50 for short and mid length prompts is similar to vLLM when both are tuned with continuous batching.

For long chat histories, prefill dominates in naive pipelines; TGI v3’s reuse of earlier tokens gives a large win in TTFT and P50.

KV and memory behavior

TGI uses KV caching with paged attention style kernels and reduces memory footprint through chunking of prefill and other runtime changes.

It integrates quantization through bitsandbytes and GPTQ and runs across several hardware backends.

Where it fits

Production stacks already on Hugging Face, especially for chat style workloads with long histories where prefix caching gives large real world gains.

4. LMDeploy

Design

LMDeploy is a toolkit for compression and deployment from the InternLM ecosystem. It exposes two engines:

TurboMind: high performance CUDA kernels for NVIDIA GPUs

PyTorch engine: flexible fallback

Key runtime features:

Persistent, continuous batching

Blocked KV cache with a manager for allocation and reuse

Dynamic split and fuse for attention blocks

Tensor parallelism

Weight only and KV quantization (including AWQ and online INT8 / INT4 KV quant)

The LMDeploy team reports up to 1.8× higher request throughput than vLLM, attributing the gain to persistent batching, blocked KV cache and optimized kernels.

Performance

Evaluations show:

For 4 bit Llama style models on A100, LMDeploy can reach higher tokens per second than vLLM under comparable latency constraints, especially at high concurrency.

It also reports that 4 bit inference is about 2.4× faster than FP16 for supported models.

Latency:

Single request TTFT is in the same ballpark as other optimized GPU engines when configured without extreme batch limits.

Under heavy concurrency, persistent batching plus blocked KV let LMDeploy sustain high throughput without TTFT collapse.

KV and memory behavior

Blocked KV cache trades contiguous per sequence buffers for a grid of KV chunks managed by the runtime, similar in spirit to vLLM’s PagedAttention but with a different internal layout.

Support for weight and KV quantization targets large models on constrained GPUs.

Where it fits

NVIDIA centric deployments that want maximum throughput and are comfortable using TurboMind and LMDeploy specific tooling.

5. SGLang

Design

SGLang is both:

A DSL for building structured LLM programs such as agents, RAG workflows and tool pipelines

A runtime that implements RadixAttention, a KV reuse mechanism that shares prefixes using a radix tree structure rather than simple block hashes.

RadixAttention (sketched conceptually after this list):

Stores KV for many requests in a prefix tree keyed by tokens

Enables high KV hit rates when many calls share prefixes, such as few shot prompts, multi turn chat, or tool chains
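
To illustrate the prefix-reuse idea, here is a toy token trie, a simplified stand-in for a compressed radix tree; it only shows how a new request discovers how many of its leading tokens already have cached KV, and it is not SGLang's implementation.

class RadixNode:
    def __init__(self):
        self.children = {}     # token id -> RadixNode
        self.kv_handle = None  # handle to cached KV for the prefix ending at this node

class RadixKVCache:
    """Toy prefix tree over token ids: walking the tree tells a new request how many
    of its leading tokens already have cached KV, so only the suffix needs prefill."""
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens, kv_handles):
        node = self.root
        for tok, handle in zip(tokens, kv_handles):
            node = node.children.setdefault(tok, RadixNode())
            node.kv_handle = handle

    def longest_cached_prefix(self, tokens):
        node, hit = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node, hit = node.children[tok], hit + 1
        return hit  # number of tokens whose KV can be reused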

Performance

Key Insights:

SGLang achieves up to 6.4× higher throughput and up to 3.7× lower latency than baseline systems such as vLLM, LMQL and others on structured workloads.

Improvements are largest when there is heavy prefix reuse, for example multi turn chat or evaluation workloads with repeated context.

Reported KV cache hit rates range from roughly 50% to 99%, and cache aware schedulers get close to the optimal hit rate on the measured benchmarks.

KV and memory behavior

RadixAttention sits on top of paged attention style kernels and focuses on reuse rather than just allocation.

SGLang integrates well with hierarchical context caching systems that move KV between GPU and CPU when sequences are long, although those systems are usually implemented as separate projects.

Where it fits

Agentic systems, tool pipelines, and heavy RAG applications where many calls share large prompt prefixes and KV reuse matters at the application level.

6. DeepSpeed Inference / ZeRO Inference

Design

DeepSpeed provides two pieces relevant for inference:

DeepSpeed Inference: optimized transformer kernels plus tensor and pipeline parallelism

ZeRO Inference / ZeRO Offload: techniques that offload model weights, and in some setups KV cache, to CPU or NVMe to run very large models on limited GPU memory

ZeRO Inference focuses on:

Keeping little or no model weights resident in GPU

Streaming tensors from CPU or NVMe as needed

Targeting throughput and model size rather than low latency

Performance

In the ZeRO Inference OPT 30B example on a single V100 32GB:

Full CPU offload reaches about 43 tokens per second

Full NVMe offload reaches about 30 tokens per second

Both are 1.3–2.4× faster than partial offload configurations, because full offload enables larger batch sizes

These numbers are small compared to GPU resident LLM runtimes on A100 or H100, but they apply to a model that does not fit natively in 32GB.

A recent I/O characterization of DeepSpeed and FlexGen confirms that offload based systems are dominated by small 128 KiB reads and that I/O behavior becomes the main bottleneck.

KV and memory behavior

Model weights and sometimes KV blocks are offloaded to CPU or SSD to fit models beyond GPU capacity.

TTFT and P99 are high compared to pure GPU engines, but the tradeoff is the ability to run very large models that otherwise would not fit.

Where it fits

Offline or batch inference, or low QPS services where model size matters more than latency and GPU count.

Comparison Tables

This table summarizes the main tradeoffs qualitatively:

| Runtime | Main design idea | Relative strength | KV strategy | Typical use case |
| --- | --- | --- | --- | --- |
| vLLM | PagedAttention, continuous batching | High tokens per second at given TTFT | Paged KV blocks, FP8 KV support | General purpose GPU serving, multi hardware |
| TensorRT LLM | Compiled kernels on NVIDIA + KV reuse | Very low latency and high throughput on NVIDIA | Paged, quantized KV, reuse and offload | NVIDIA only, latency sensitive |
| TGI v3 | HF serving layer with long prompt path | Strong long prompt performance, integrated stack | Paged KV, chunked prefill, prefix caching | HF centric APIs, long chat histories |
| LMDeploy | TurboMind kernels, blocked KV, quant | Up to 1.8× vLLM throughput in vendor tests | Blocked KV cache, weight and KV quant | NVIDIA deployments focused on raw throughput |
| SGLang | RadixAttention and structured programs | Up to 6.4× throughput and 3.7× lower latency on structured workloads | Radix tree KV reuse over prefixes | Agents, RAG, high prefix reuse |
| DeepSpeed | GPU CPU NVMe offload for huge models | Enables large models on small GPU; throughput oriented | Offloaded weights and sometimes KV | Very large models, offline or low QPS |

Choosing a runtime in practice

For a production system, the choice tends to collapse to a few simple patterns:

You want a strong default engine with minimal custom work: You can start with vLLM. It gives you good throughput, reasonable TTFT, and solid KV handling on common hardware.

You are committed to NVIDIA and want fine grained control over latency and KV: You can use TensorRT LLM, likely behind Triton or TGI. Plan for model specific engine builds and tuning.

Your stack is already on Hugging Face and you care about long chats: You can use TGI v3. Its long prompt pipeline and prefix caching are very effective for conversation style traffic.

You want maximum throughput per GPU with quantized models: You can use LMDeploy with TurboMind and blocked KV, especially for 4 bit Llama family models.

You are building agents, tool chains or heavy RAG systems: You can use SGLang and design prompts so that KV reuse via RadixAttention is high.

You must run very large models on limited GPUs: You can use DeepSpeed Inference / ZeRO Inference, accept higher latency, and treat the GPU as a throughput engine with SSD in the loop.

Overall, all these engines are converging on the same idea: KV cache is the real bottleneck resource. The winners are the runtimes that treat KV as a first class data structure to be paged, quantized, reused and offloaded, not just a big tensor slapped into GPU memory.
The post Comparing the Top 6 Inference Runtimes for LLM Serving in 2025 appeared first on MarkTechPost.

Connect Amazon Bedrock agents to cross-account knowledge bases

Organizations need seamless access to their structured data repositories to power intelligent AI agents. However, when these resources span multiple AWS accounts, integration challenges can arise. This post explores a practical solution for connecting Amazon Bedrock agents to knowledge bases in Amazon Redshift clusters residing in different AWS accounts.
The challenge
Organizations that build AI agents using Amazon Bedrock can maintain their structured data in Amazon Redshift clusters. When these data repositories exist in separate AWS accounts from their AI agents, they face a significant limitation: Amazon Bedrock Knowledge Bases doesn’t natively support cross-account Redshift integration.
This creates a challenge for enterprises with multi-account architectures who want to:

Leverage existing structured data in Redshift for their AI agents.
Maintain separation of concerns across different AWS accounts.
Avoid duplicating data across accounts.
Ensure proper security and access controls.

Solution overview
Our solution enables cross-account knowledge base integration through a secure, serverless architecture that maintains secure access controls while allowing AI agents to query structured data. The approach uses AWS Lambda as an intermediary to facilitate secure cross-account data access.

The action flow is as follows:

Users enter their natural language question in the Amazon Bedrock agent, which is configured in the agent account.
The Amazon Bedrock agent invokes a Lambda function through an action group, which provides access to the Amazon Bedrock knowledge base configured in the agent-kb account.
The action group Lambda function running in the agent account assumes an IAM role created in the agent-kb account to connect to the knowledge base in that account (see the sketch after this list).
The Amazon Bedrock knowledge base in the agent-kb account uses an IAM role created in the same account to access the Amazon Redshift data warehouse and query its data.
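To make steps 3 and 4 concrete, the following is a minimal, hypothetical sketch of what the action group Lambda could do; the role ARN, knowledge base ID, and model ARN are placeholders, and the actual code in the aws-samples repository may differ.

import boto3

KB_ACCESS_ROLE_ARN = "arn:aws:iam::999999999999:role/bedrock_kb_access_role"  # agent-kb account
KNOWLEDGE_BASE_ID = "XXXXXXXXXX"  # placeholder, see the knowledge base created later
MODEL_ARN = "arn:aws:bedrock:us-west-2::foundation-model/meta.llama3-1-70b-instruct-v1:0"

def query_cross_account_kb(question: str) -> str:
    # Assume the role in the agent-kb account to obtain temporary credentials
    creds = boto3.client("sts").assume_role(
        RoleArn=KB_ACCESS_ROLE_ARN,
        RoleSessionName="bedrock-kb-query",
    )["Credentials"]

    # Use the temporary credentials to query the knowledge base in the other account
    kb_runtime = boto3.client(
        "bedrock-agent-runtime",
        region_name="us-west-2",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    response = kb_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KNOWLEDGE_BASE_ID,
                "modelArn": MODEL_ARN,
            },
        },
    )
    return response["output"]["text"]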

The solution follows these key components:

Amazon Bedrock agent in the agent account that handles user interactions.
Amazon Redshift serverless workgroup in VPC and private subnet in the agent-kb account containing structured data.
Amazon Bedrock Knowledge base using the Amazon Redshift serverless workgroup as structured data source.
Lambda function in the agent account.
Action group configuration to connect the agent in the agent account to the Lambda function.
IAM roles and policies that enable secure cross-account access.

Prerequisites
This solution requires you to have the following:

Two AWS accounts. Create an AWS account if you do not have one. Specific permissions are required for both accounts and will be set up in subsequent steps.
Install the AWS CLI (version 2.24.22 at the time of writing)
Set up authentication using IAM user credentials for the AWS CLI for each account
Make sure you have jq installed; jq is a lightweight command-line JSON processor. For example, on macOS you can install it with brew install jq (jq-1.7.1-apple at the time of writing).
Navigate to the Amazon Bedrock console and make sure you enable access to the meta.llama3-1-70b-instruct-v1:0 model in the agent-kb account and to the us.amazon.nova-pro-v1:0 model in the agent account, both in the us-west-2, US West (Oregon) AWS Region.

Assumption
Let's call the AWS account profile that has the Amazon Bedrock agent the agent profile. Similarly, the AWS account profile that has the Amazon Bedrock knowledge base with Amazon Redshift Serverless and the structured data source is called agent-kb. We will use the us-west-2, US West (Oregon) AWS Region, but feel free to choose another Region as necessary (the prerequisites apply to whichever Region you deploy this solution in). We will use the meta.llama3-1-70b-instruct-v1:0 model for the agent-kb account; this model is available on demand in us-west-2. You are free to choose other models with cross-Region inference, but that would mean changing the roles and policies accordingly and enabling model access in all Regions where they are available. Based on our model choice for this solution, the AWS Region must be us-west-2. For the agent, we will use an Amazon Bedrock agent-optimized model such as us.amazon.nova-pro-v1:0.
Implementation walkthrough
The following is a step-by-step implementation guide. Make sure to perform all steps in the same AWS Region in both accounts.
These steps deploy and test an end-to-end solution from scratch; if you are already running some of these components, you can skip the corresponding steps.

Make a note of the AWS account numbers in the agent and agent-kb accounts. In the implementation steps we will refer to them as follows:

| Profile | AWS account | Description |
| --- | --- | --- |
| agent | 111122223333 | Account for the Bedrock agent |
| agent-kb | 999999999999 | Account for the Bedrock knowledge base |

Note: These steps use example profile names and account numbers; replace them with actual values before running.
Create the Amazon Redshift Serverless workgroup in the agent-kb account:

Log on to the agent-kb account
Follow the workshop link to create the Amazon Redshift Serverless workgroup in private subnet
Make a note of the namespace, workgroup, and other details and follow the rest of the hands-on workshop instructions.

Set up your data warehouse in the agent-kb account.
Create your AI knowledge base in the agent-kb account. Make a note of the knowledge base ID.
Train your AI Assistant in the agent-kb account.
Test natural language queries in the agent-kb account. You can find the code in aws-samples git repository: sample-for-amazon-bedrock-agent-connect-cross-account-kb.
Create the necessary roles and policies in both accounts. Run the script create_bedrock_agent_kb_roles_policies.sh with the following input parameters.

| Input parameter | Value | Description |
| --- | --- | --- |
| --agent-kb-profile | agent-kb | The agent knowledge base profile that you set up with the AWS CLI using aws_access_key_id and aws_secret_access_key, as mentioned in the prerequisites. |
| --lambda-role | lambda_bedrock_kb_query_role | The IAM role in the agent account that the Bedrock agent action group Lambda uses to connect to Redshift cross-account. |
| --kb-access-role | bedrock_kb_access_role | The IAM role in the agent-kb account that lambda_bedrock_kb_query_role in the agent account assumes to connect to Redshift cross-account. |
| --kb-access-policy | bedrock_kb_access_policy | IAM policy attached to the IAM role bedrock_kb_access_role. |
| --lambda-policy | lambda_bedrock_kb_query_policy | IAM policy attached to the IAM role lambda_bedrock_kb_query_role. |
| --knowledge-base-id | XXXXXXXXXX | Replace with the actual knowledge base ID created in Step 4. |
| --agent-account | 111122223333 | Replace with the 12-digit AWS account number where the Bedrock agent is running (agent account). |
| --agent-kb-account | 999999999999 | Replace with the 12-digit AWS account number where the Bedrock knowledge base is running (agent-kb account). |

Download the script (create_bedrock_agent_kb_roles_policies.sh) from the aws-samples GitHub repository.
Open Terminal on macOS or a similar bash shell on other platforms.
Change directory to the download location and give the script executable permissions:

cd /my/location
chmod +x create_bedrock_agent_kb_roles_policies.sh

If you are still not clear on the script usage or inputs, you can run the script with the --help option and it will display the usage: ./create_bedrock_agent_kb_roles_policies.sh --help
Run the script with the right input parameters as described in the previous table.

./create_bedrock_agent_kb_roles_policies.sh --agent-profile agent \
--agent-kb-profile agent-kb \
--lambda-role lambda_bedrock_kb_query_role \
--kb-access-role bedrock_kb_access_role \
--kb-access-policy bedrock_kb_access_policy \
--lambda-policy lambda_bedrock_kb_query_policy \
--knowledge-base-id XXXXXXXXXX \
--agent-account 111122223333 \
--agent-kb-account 999999999999

On successful execution, the script shows a summary of the IAM roles and policies created in both accounts.
Log on to both the agent and agent-kb accounts to verify that the IAM roles and policies were created.

For the agent account: Make a note of the ARN of the lambda_bedrock_kb_query_role as that will be the value of CloudFormation stack parameter AgentLambdaExecutionRoleArn in the next step.
For the agent-kb account: Make a note of the ARN of the bedrock_kb_access_role as that will be the value of CloudFormation stack parameter TargetRoleArn in the next step.

Run the AWS CloudFormation script to create a Bedrock agent:

Download the CloudFormation script: cloudformation_bedrock_agent_kb_query_cross_account.yaml from the aws-samples GitHub repository.
Log on to the agent account, navigate to the CloudFormation console, and verify you are in the us-west-2 (Oregon) Region. Choose Create stack and then With new resources (standard).
In the Specify template section, choose Upload a template file, then Choose file, and select the file you downloaded in the previous step. Then choose Next.
Enter the following stack details and choose Next.

| Parameter | Value | Description |
| --- | --- | --- |
| Stack name | bedrock-agent-connect-kb-cross-account-agent | You can choose any name |
| AgentFoundationModelId | us.amazon.nova-pro-v1:0 | Do not change |
| AgentLambdaExecutionRoleArn | arn:aws:iam::111122223333:role/lambda_bedrock_kb_query_role | Replace with your agent account number |
| BedrockAgentDescription | Agent to query inventory data from Redshift Serverless database | Keep this as default |
| BedrockAgentInstructions | You are an assistant that helps users query inventory data from our Redshift Serverless database using the action group. | Do not change |
| BedrockAgentName | bedrock_kb_query_cross_account | Keep this as default |
| KBFoundationModelId | meta.llama3-1-70b-instruct-v1:0 | Do not change |
| KnowledgeBaseId | XXXXXXXXXX | Knowledge base ID from Step 4 |
| TargetRoleArn | arn:aws:iam::999999999999:role/bedrock_kb_access_role | Replace with your agent-kb account number |

Complete the acknowledgement and choose Next.
Scroll down through the page and choose Submit.
You will see the CloudFormation stack being created, indicated by the status CREATE_IN_PROGRESS.
It will take a few minutes for the status to change to CREATE_COMPLETE, indicating that all resources were created. Choose the Outputs tab and make a note of the resources that were created. In summary, the CloudFormation script does the following in the agent account:

Creates a Bedrock agent
Creates an action group
Also creates a Lambda function which is invoked by the Bedrock action group
Defines the OpenAPI schema
Creates necessary roles and permissions for the Bedrock agent
Finally, it prepares the Bedrock agent so that it is ready to test.

Check for model access in Oregon (us-west-2)

Verify Nova Pro (us.amazon.nova-pro-v1:0) model access in the agent account. Navigate to the Amazon Bedrock console and choose Model access under Configure and learn. Search for the model name Nova Pro to verify access; if access is not enabled, enable it.
Verify access to the meta.llama3-1-70b-instruct-v1:0 model in the agent-kb account. This should already be enabled as we set up the knowledge base earlier.

Run the agent. Log on to the agent account, navigate to the Amazon Bedrock console, and choose Agents under Build.
Choose the name of the agent and choose Test. You can test the following questions, as mentioned on the workshop's Stage 4: Test Natural Language Queries page. For example:

Who are the top 5 customers in Saudi Arabia?
Who are the top parts suppliers in the United States by volume?
What is the total revenue by region for the year 1998?
Which products have the highest profit margins?
Show me orders with the highest priority from the last quarter of 1997.

Choose Show trace to investigate the agent traces.

Some recommended best practices:

Phrase your question to be more specific
Use terminology that matches your table descriptions
Try questions similar to your curated examples
Verify your question relates to data that exists in the TPCH dataset
Use Amazon Bedrock Guardrails to add configurable safeguards to questions and responses.

Clean up resources
It is recommended that you clean up any resources you do not need anymore to avoid any unnecessary charges:

Navigate to the CloudFormation console for the agent and agent-kb accounts, search for the stack and choose Delete.
S3 buckets need to be deleted separately.
For deleting the roles and policies created in both accounts, download the script delete-bedrock-agent-kb-roles-policies.sh from the aws-samples GitHub repository.

Open Terminal on macOS or a similar bash shell on other platforms.
Change directory to the download location and give the script executable permissions:

cd /my/location
chmod +x delete-bedrock-agent-kb-roles-policies.sh

If you are still not clear on the script usage or inputs, you can run the script with the --help option and it will display the usage: ./delete-bedrock-agent-kb-roles-policies.sh --help
Run the script delete-bedrock-agent-kb-roles-policies.sh with the same values for the same input parameters as in Step 7 when running the create_bedrock_agent_kb_roles_policies.sh script. Note: Enter the correct account numbers for agent-account and agent-kb-account before running.

./delete-bedrock-agent-kb-roles-policies.sh --agent-profile agent \
--agent-kb-profile agent-kb \
--lambda-role lambda_bedrock_kb_query_role \
--kb-access-role bedrock_kb_access_role \
--kb-access-policy bedrock_kb_access_policy \
--lambda-policy lambda_bedrock_kb_query_policy \
--agent-account 111122223333 \
--agent-kb-account 999999999999
The script will ask for confirmation; type yes and press Enter.

Summary
This solution demonstrates how the Amazon Bedrock agent in the agent account can query the Amazon Bedrock knowledge base in the agent-kb account.
Conclusion
This solution uses Amazon Bedrock Knowledge Bases for structured data to create a more integrated approach to cross-account data access. The knowledge base in agent-kb account connects directly to Amazon Redshift Serverless in a private VPC. The Amazon Bedrock agent in the agent account invokes an AWS Lambda function as part of its action group to make a cross-account connection to retrieve response from the structured knowledge base.
This architecture offers several advantages:

Uses Amazon Bedrock Knowledge Bases capabilities for structured data
Provides a more seamless integration between the agent and the data source
Maintains proper security boundaries between accounts
Reduces the complexity of direct database access codes

As Amazon Bedrock continues to evolve, you can take advantage of future enhancements to knowledge base functionality while maintaining your multi-account architecture.

About the Authors
Kunal Ghosh is an expert in AWS technologies. He is passionate about building efficient and effective solutions on AWS, especially involving generative AI, analytics, data science, and machine learning. Besides family time, he likes reading, swimming, biking, and watching movies, and he is a foodie.
Arghya Banerjee is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers adopt and use the AWS Cloud. He is focused on big data, data lakes, streaming and batch analytics services, and generative AI technologies.
Indranil Banerjee is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers in the hi-tech and semi-conductor sectors solve complex business problems using the AWS Cloud. His special interests are in the areas of legacy modernization and migration, building analytics platforms and helping customers adopt cutting edge technologies such as generative AI.
Vinayak Datar is Sr. Solutions Manager based in Bay Area, helping enterprise customers accelerate their AWS Cloud journey. He’s focusing on helping customers to convert ideas from concepts to working prototypes to production using AWS generative AI services.

Democratizing AI: How Thomson Reuters Open Arena supports no-code AI f …

This post is cowritten by Laura Skylaki, Vaibhav Goswami, Ramdev Wudali and Sahar El Khoury from Thomson Reuters.
Thomson Reuters (TR) is a leading AI and technology company dedicated to delivering trusted content and workflow automation solutions. With over 150 years of expertise, TR provides essential solutions across legal, tax, accounting, risk, trade, and media sectors in a fast-evolving world.
TR recognized early that AI adoption would fundamentally transform professional work. According to TR’s 2025 Future of Professionals Report, 80% of professionals anticipate AI significantly impacting their work within five years, with projected productivity gains of up to 12 hours per week by 2029. To unlock this immense potential, TR needed a solution to democratize AI creation across its organization.
In this blog post, we explore how TR addressed key business use cases with Open Arena, a highly scalable and flexible no-code AI solution powered by Amazon Bedrock and other AWS services such as Amazon OpenSearch Service, Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and AWS Lambda. We’ll explain how TR used AWS services to build this solution, including how the architecture was designed, the use cases it solves, and the business profiles that use it. The system demonstrates TR’s successful approach of using existing TR services for rapid launches while supporting thousands of users, showcasing how organizations can democratize AI access and support business profiles (for example, AI explorers and SMEs) to create applications without coding expertise.
Introducing Open Arena: No-code AI for all
TR introduced Open Arena to non-technical professionals to create their own customized AI solutions. With Open Arena users can use cutting-edge AI powered by Amazon Bedrock in a no-code environment, exemplifying TR’s commitment to democratizing AI access.
Today, Open Arena supports:

High adoption: ~70% employee adoption, with 19,000 monthly active users.
Custom solutions: Thousands of customized AI solutions created without coding, used for internal workflows or integrated into TR products for customers.
Self-served functionality: 100% self-served functionality, so that users, irrespective of technical background, can develop, evaluate, and deploy generative AI solutions.

The Open Arena journey: From prototype to enterprise solution
Conceived as a rapid prototype, Open Arena was developed in under six weeks at the onset of the generative AI boom in early 2023 by TR Labs, TR's dedicated applied research division focused on the research, development, and application of AI and emerging technologies. The goal was to support internal team exploration of large language models (LLMs) and discover unique use cases by merging LLM capabilities with TR company data.
Open Arena’s introduction significantly increased AI awareness, fostered developer-SME collaboration for groundbreaking concepts, and accelerated AI capability development for TR products. The rapid success and demand for new features quickly highlighted Open Arena’s potential for AI democratization, so TR developed an enterprise version of Open Arena. Built on the TR AI Platform, Open Arena enterprise version offers secure, scalable, and standardized services covering the entire AI development lifecycle, significantly accelerating time to production.
The Open Arena enterprise version uses existing system capabilities for enhanced data access controls, standardized service access, and compliance with TR’s governance and ethical standards. This version introduced self-served capabilities so that every user, irrespective of their technical ability, can create, evaluate, and deploy customized AI solutions in a no-code environment.

“The foundation of the AI Platform has always been about empowerment; in the early days it was about empowering Data Scientists but with the rise of Gen AI, the platform adapted and evolved on empowering users of any background to leverage and create AI Solutions.”
– Maria Apazoglou, Head of AI Engineering, CoCounsel

As of July 2025, the TR Enterprise AI Platform consists of 15 services spanning the entire AI development lifecycle and user personas. Open Arena remains one of its most popular services, serving 19,000 users each month, with usage increasing month over month.
Addressing key enterprise AI challenges across user types
Using the TR Enterprise AI Platform, Open Arena helped thousands of professionals transition into using generative AI. AI-powered innovation is now readily in the hands of everyone, not just AI scientists.
Open Arena successfully addresses four critical enterprise AI challenges:

Enablement: Delivers AI solution building with consistent LLM and service provider experience and support for various user personas, including non-technical.
Security and quality: Streamlines AI solution quality tracking using evaluation and monitoring services, whilst complying with data governance and ethics policies.
Speed and reusability: Automates workflows and uses existing AI solutions and prompts.
Resources and cost management: Tracks and displays generative AI solution resource consumption, supporting transparency and efficiency.

The solution currently supports several AI experiences, including tech support, content creation, coding assistance, data extraction and analysis, proofreading, project management, content summarization, personal development, translation, and problem solving, catering to different user needs across the organization.

Figure 1. Examples of Open Arena use cases.
AI explorers use Open Arena to speed up day-to-day tasks, such as summarizing documents, engaging in LLM chat, building custom workflows, and comparing AI models. AI creators and Subject Matter Experts (SMEs) use Open Arena to build custom AI workflows and experiences and to evaluate solutions without requiring coding knowledge. Meanwhile, developers can develop and deploy new AI solutions at speed, training models, creating new AI skills, and deploying AI capabilities.
Why Thomson Reuters selected AWS for Open Arena
TR strategically chose AWS as a primary cloud provider for Open Arena based on several critical factors:

Comprehensive AI/ML capabilities: Amazon Bedrock offers easy access to a choice of high-performing foundation models from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma AI, Meta, Mistral AI, OpenAI, Qwen, Stability AI, TwelveLabs, Writer, and Amazon. It supports simple chat and complex RAG workflows, and integrates seamlessly with TR’s existing Enterprise AI Platform.
Enterprise-grade security and governance: Advanced security controls, model access using RBAC, data handling with enhanced security features, single sign-on (SSO) enabled, and clear operational and user data separation across AWS accounts.
Scalable infrastructure: Serverless architecture for automatic scaling, pay-per-use pricing for cost optimization, and global availability with low latency.
Existing relationship and expertise: Strong, established relationship between TR and AWS, existing Enterprise AI Platform on AWS, and deep AWS expertise within TR’s technical teams.

“Our long-standing partnership with AWS and their robust, flexible and innovative services made them the natural choice to power Open Arena and accelerate our AI initiatives.”
– Maria Apazoglou, Head of AI Engineering, CoCounsel

Open Arena architecture: Scalability, extensibility, and security
Designed for a broad enterprise audience, Open Arena prioritizes scalability, extensibility and security while maintaining simplicity for non-technical users to create and deploy AI solutions. The following diagram illustrates the architecture of Open Arena.

Figure 2. Architecture design of Open Arena.
The architecture design facilitates enterprise-grade performance with clear separation between capability and usage, aligning with TR’s enterprise cost and usage tracking requirements.
The following are key components of the solution architecture:

No-code interface: Intuitive UI, visual workflow builder, pre-built templates, drag-and-drop functionality.
Enterprise integration: Seamless integration with TR’s Enterprise AI Platform, SSO enabled, data handling with enhanced security, clear data separation.
Solution management: Searchable repository, public/private sharing, version control, usage analytics.

TR developed Open Arena using AWS services such as Amazon Bedrock, Amazon OpenSearch Service, Amazon DynamoDB, Amazon API Gateway, AWS Lambda, and AWS Step Functions. It uses Amazon Bedrock for foundational model interactions, supporting simple chat and complex Retrieval-Augmented Generation (RAG) tasks. Open Arena uses Amazon Bedrock Flows as the custom workflow builder where users can drag and drop components like prompts, agents, knowledge bases and Lambda functions to create sophisticated AI workflows without coding. The system also integrates with Amazon OpenSearch Service for knowledge bases and external APIs for advanced agent capabilities.
For data separation, orchestration is managed using the Enterprise AI Platform AWS account, capturing operational data. Flow instances and user-specific data reside in the user’s dedicated AWS account, stored in a database. Each user’s data and workflow executions are isolated within their respective AWS accounts, which is required for complying with Thomson Reuters data sovereignty and enterprise security policies with strict regional controls. The system integrates with Thomson Reuters SSO solution to automatically identify users and grant secure, private access to foundational models.
The orchestration layer, centrally hosted within the Enterprise AI Platform AWS account, manages AI workflow activities, including scheduling, deployment, resource provisioning, and governance across user environments.
The system features fully automated provisioning of  Amazon Bedrock Flows directly within each user’s AWS account, avoiding manual setup and accelerating time to value. Using AWS Lambda for serverless compute and DynamoDB for scalable, low-latency storage, the system dynamically allocates resources based on real-time demand. This architecture makes sure prompt flows and supporting infrastructure are deployed and scaled to match workload fluctuations, optimizing performance, cost, and user experience.

“Our decision to adopt a cross-account architecture was driven by a commitment to enterprise security and operational excellence. By isolating orchestration from execution, we make sure that each user’s data remains private and secure within their own AWS account, while still delivering a seamless, centrally-managed experience. This design empowers organizations to innovate rapidly without compromising compliance or control.”
– Thomson Reuters’ architecture team

Evolution of Open Arena: From classic to Amazon Bedrock Flows-powered chain builder
Open Arena has evolved to cater to varying levels of user sophistication:

Open Arena v1 (Classic): Features a form-based interface for simple prompt customization and basic AI workflow deployment within a single AWS account. Its simplicity appeals to novice users for straightforward use cases, though with limited advanced capabilities.
Open Arena v2 (Chain Builder): Introduces a robust, visual workflow builder interface, enabling users to design complex, multi-step AI workflows using drag-and-drop components. With support for advanced node types, parallel execution, and seamless cross-account deployment, Chain Builder dramatically expands the system’s capabilities and accessibility for non-technical users.

Thomson Reuters uses Amazon Bedrock Flows as a core feature of Chain Builder. Users can define, customize, and deploy AI-driven workflows using Amazon Bedrock models. Bedrock Flows supports advanced workflows combining multiple prompt nodes, incorporating AWS Lambda functions, and supporting sophisticated RAG pipelines. Operating seamlessly across user AWS accounts, Bedrock Flows facilitates secure, scalable execution of personalized AI solutions, serving as the fundamental engine for the Chain Builder workflows and driving TR’s ability to deliver robust, enterprise-grade automation and innovation.
What’s next?
TR continues to expand Open Arena’s capabilities through the strategic partnership with AWS, focusing on:

Driving further adoption of Open Arena’s DIY capabilities.
Enhancing flexibility for workflow creation in Chain Builder with custom components, such as inline scripts.
Developing new templates to represent common tasks and workflows.
Enhancing collaboration features within Open Arena.
Extending multimodal capabilities and model integration.
Expanding into new use cases across the enterprise.

“From innovating new product ideas to reimagining daily tasks for Thomson Reuters employees, we continue to push the boundaries of what’s possible with Open Arena.”
– Maria Apazoglou, Head of AI Engineering, CoCounsel

Conclusion
In this blog post, we explored how Thomson Reuters’ Open Arena demonstrates the successful democratization of AI across an enterprise by using AWS services, particularly Amazon Bedrock and Bedrock Flows. With 19,000 monthly active users and 70% employee adoption, the system proves that no-code AI solutions can deliver enterprise-scale impact while maintaining security and governance standards.
By combining the robust infrastructure of AWS with innovative architecture design, TR has created a blueprint for AI democratization that empowers professionals across technical skill levels to harness generative AI for their daily work.
As Open Arena continues to evolve, it exemplifies how strategic cloud partnerships can accelerate AI adoption and transform how organizations approach innovation with generative AI.

About the authors
Laura Skylaki, PhD, leads the Enterprise AI Platform at Thomson Reuters, driving the development of GenAI services that accelerate the creation, testing and deployment of AI solutions, enhancing product value. A recognized expert with a doctorate in stem cell bioinformatics, her extensive experience in AI research and practical application spans legal, tax, and biotech domains. Her machine learning work is published in leading academic journals, and she is a frequent speaker on AI and machine learning
Vaibhav Goswami is a Lead Software Engineer on the AI Platform team at Thomson Reuters, where he leads the development of the Generative AI Platform that empowers users to build and deploy generative AI solutions at scale. With expertise in building production-grade AI systems, he focuses on creating tools and infrastructure that democratize access to cutting-edge AI capabilities across the enterprise.
Ramdev Wudali is a Distinguished Engineer, helping architect and build the AI/ML Platform to enable the Enterprise user, data scientists and researchers to develop Generative AI and machine learning solutions by democratizing access to tools and LLMs. In his spare time, he loves to fold paper to create origami tessellations, and wearing irreverent T-shirts
As the director of AI Platform Adoption and Training, Sahar El Khoury guides users to seamlessly onboard and successfully use the platform services, drawing on her experience in AI and data analysis across robotics (PhD), financial markets, and media.
Vu San Ha Huynh is a Solutions Architect at AWS with a PhD in Computer Science. He helps large Enterprise customers drive innovation across different domains with a focus on AI/ML and Generative AI solutions.
Paul Wright is a Senior Technical Account Manager, with over 20 years experience in the IT industry and over 7 years of dedicated cloud focus. Paul has helped some of the largest enterprise customers grow their business and improve their operational excellence. In his spare time Paul is a huge football and NFL fan.
Mike Bezak is a Senior Technical Account Manager in AWS Enterprise Support. He has over 20 years of experience in information technology, primarily disaster recovery and systems administration. Mike’s current focus is helping customers streamline and optimize their AWS Cloud journey. Outside of AWS, Mike enjoys spending time with family & friends.

Introducing structured output for Custom Model Import in Amazon Bedroc …

With Amazon Bedrock Custom Model Import, you can deploy and scale fine-tuned or proprietary foundation models in a fully managed, serverless environment. You can bring your own models into Amazon Bedrock, scale them securely without managing infrastructure, and integrate them with other Amazon Bedrock capabilities.
Today, we are excited to announce the addition of structured output to Custom Model Import. Structured output constrains a model’s generation process in real time so that every token it produces conforms to a schema you define. Rather than relying on prompt-engineering tricks or brittle post-processing scripts, you can now generate structured outputs directly at inference time.
For certain production applications, the predictability of model outputs is more important than their creative flexibility. A customer service chatbot might benefit from varied, natural-sounding responses, but an order processing system needs exact, structured data that conforms to predefined schemas. Structured output bridges this gap by maintaining the intelligence of foundation models while verifying their outputs meet strict formatting requirements.
This represents a shift from free-form text generation to outputs that are consistent, machine-readable, and designed for seamless integration with enterprise systems. While free-form text excels for human consumption, production applications require more precision. Businesses can’t afford the ambiguity of natural language variations when their systems depend on structured outputs to reliably interface with APIs, databases, and automated workflows.
In this post, you will learn how to implement structured output for Custom Model Import in Amazon Bedrock. We will cover what structured output is, how to enable it in your API calls, and how to apply it to real-world scenarios that require structured, predictable outputs.
Understanding structured output
Structured output, also known as constrained decoding, is a method that directs LLM outputs to conform to a predefined schema, such as valid JSON. Rather than allowing the model to freely select tokens based on probability distributions, it introduces constraints during generation that limit choices to only those that maintain structural validity. If a particular token would violate the schema by producing invalid JSON, inserting stray characters, or using an unexpected field name, structured output rejects it and requires the model to select another allowed option. This real-time validation helps keep the final output consistent, machine readable, and immediately usable by downstream applications without the need for additional post-processing.
Without structured output, developers often attempt to enforce structure through prompt instructions like “Respond only in JSON.” While this approach sometimes works, it remains unreliable due to the inherently probabilistic nature of LLMs. These models generate text by sampling from probability distributions, introducing natural variability that makes responses feel human but creates significant challenges for automated systems.
Consider a customer support application that classifies tickets: if responses vary between “This seems like a billing issue,” “I’d classify this as: Billing,” and “Category = BILLING,” downstream code cannot reliably interpret the results. What production systems require instead is predictable, structured output. For example:

{
    "category": "billing",
    "priority": "high",
    "sentiment": "negative"
}

With a response like this, your application can automatically route tickets, trigger workflows, or update databases without human intervention. By providing predictable, schema-aligned responses, structured output transforms LLMs from conversational tools into reliable system components that can be integrated with databases, APIs, and business logic. This capability opens new possibilities for automation while maintaining the intelligent reasoning that underpins the value of these models.
Beyond improving reliability and simplifying post-processing, structured output offers additional benefits that strengthen performance, security and safety in production environments.

Lower token usage and faster responses: By constraining generation to a defined schema, structured output removes unnecessarily verbose, free-form text, resulting in a reduced token count. Because token generation is sequential, shorter outputs directly translate to faster responses and lower latency, improving overall performance and cost efficiency.
Enhanced security against prompt injection: Structured output narrows the model’s expression space and helps prevent it from producing arbitrary or unsafe content. Bad actors cannot inject instructions, code or unexpected text outside the defined structure. Each field must match its expected type and format, making sure outputs remain within safe boundaries.
Safety and policy controls: Structured output enables you to design schemas that inherently help prevent harmful, toxic, or policy-violating content. By limiting fields to approved values, enforcing patterns, and restricting free-form text, schemas make sure outputs align with regulatory requirements.

In the next section, we will explore how structured output works with Custom Model Import in Amazon Bedrock and walk through an example of enabling it in your API calls.
Using structured output with Custom Model Import in Amazon Bedrock
Let’s start by assuming you have already imported a Hugging Face model into Amazon Bedrock using the Custom Model Import feature.
Prerequisites
Before proceeding, make sure you have:

An active AWS account with access to Amazon Bedrock
A custom model created in Amazon Bedrock using the Custom Model Import feature
Appropriate AWS Identity and Access Management (IAM) permissions to invoke models through the Amazon Bedrock Runtime

With these prerequisites in place, let’s explore how to implement structured output with your imported model.
To start using structured output with a Custom Model Import in Amazon Bedrock, begin by configuring your environment. In Python, this involves creating a Bedrock Runtime client and initializing a tokenizer from your imported Hugging Face model.
The Bedrock Runtime client provides access to your imported model using the Bedrock InvokeModel API. The tokenizer applies the correct chat template that aligns with the imported model, which defines how user, system, and assistant messages are combined into a single prompt, how the role markers (for example, <|user|>, <|assistant|>) are inserted, and where the model’s response should begin.
By calling tokenizer.apply_chat_template(messages, tokenize=False) you can generate a prompt that matches the exact input format your model expects, which is essential for consistent and reliable inference, especially when structured output is enabled.

import json

import boto3
from transformers import AutoTokenizer
from botocore.config import Config

# HF model identifier imported into Bedrock
hf_model_id = "<<huggingface_model_id>>"  # Example: "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
model_arn = "arn:aws:bedrock:<<aws-region>>:<<account-id>>:imported-model/your-model-id"
region = "<<aws-region>>"

# Initialize tokenizer aligned with your imported model
tokenizer = AutoTokenizer.from_pretrained(hf_model_id)

# Initialize Bedrock Runtime client
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name=region)

Implementing structured output
When you invoke a custom model on Amazon Bedrock, you have the option to enable structured output by adding a response_format block to the request payload. This block accepts a JSON schema that defines the structure of the model's response. During inference, the model enforces this schema in real time, making sure that each generated token conforms to the defined structure. Below is a walkthrough demonstrating how to implement structured output using a simple address extraction task.
Step 1: Define the data structure
You can define your expected output using a Pydantic model, which serves as a typed contract for the data you want to extract.

from pydantic import BaseModel, Field

class Address(BaseModel):
    street_number: str = Field(description="Street number")
    street_name: str = Field(description="Street name including type (Ave, St, Rd, etc.)")
    city: str = Field(description="City name")
    state: str = Field(description="Two-letter state abbreviation")
    zip_code: str = Field(description="5-digit ZIP code")

Step 2: Generate the JSON schema
Pydantic can automatically convert your data model into a JSON schema:

schema = Address.model_json_schema()
address_schema = {
    "name": "Address",
    "schema": schema
}

This schema defines each field's type, description, and whether it is required, creating a blueprint that the model will follow during generation.
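For reference, the generated schema looks roughly like the following (abridged to two properties; exact titles, ordering, and metadata depend on your Pydantic version):

{
    "title": "Address",
    "type": "object",
    "required": ["street_number", "street_name", "city", "state", "zip_code"],
    "properties": {
        "street_number": {"type": "string", "description": "Street number", "title": "Street Number"},
        "city": {"type": "string", "description": "City name", "title": "City"}
    }
}
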
Step 3: Prepare your input messages
Format your input using the chat format expected by your model:

messages = [{
    "role": "user",
    "content": "Extract the address: 456 Tech Boulevard, San Francisco, CA 94105"
}]

Step 4: Apply the chat template
Use your model’s tokenizer to generate the formatted prompt:

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

Step 5: Build the request payload
Create your request body, including the response_format that references your schema:

request_body = {
    'prompt': prompt,
    'temperature': 0.1,
    'max_gen_len': 1000,
    'top_p': 0.9,
    'response_format': {
        "type": "json_schema",
        "json_schema": address_schema
    }
}

Step 6: Invoke the model
Send the request using the InvokeModel API:

response = bedrock_runtime.invoke_model(
    modelId=model_arn,
    body=json.dumps(request_body),
    accept="application/json",
    contentType="application/json"
)

Step 7: Parse the response
Extract the generated text from the response:

result = json.loads(response['body'].read().decode('utf-8'))
raw_output = result['choices'][0]['text']
print(raw_output)

Because the schema defines required fields, the model’s response will contain them:

{
  "street_number": "456",
  "street_name": "Tech Boulevard",
  "city": "San Francisco",
  "state": "CA",
  "zip_code": "94105"
}

The output is clean, valid JSON that can be consumed directly by your application with no extra parsing, filtering, or cleanup required.
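
Because the output already conforms to the schema, you can optionally load it straight back into the Pydantic model for type-checked access. This is a minimal sketch that reuses the Address class and raw_output from the steps above and relies on Pydantic v2’s model_validate_json.

# Validate and parse the model's JSON output into the typed Address object.
address = Address.model_validate_json(raw_output)
print(address.city, address.zip_code)  # e.g. "San Francisco 94105"
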
Conclusion
Structured output with Custom Model Import in Amazon Bedrock provides an effective way to generate structured, schema-aligned outputs from your models. By shifting validation into model inference itself, structured output reduces the need for complex post-processing workflows and error-handling code.
Structured output produces results that are predictable and straightforward to integrate into your systems, and it supports a variety of use cases: financial applications that require precise data extraction, healthcare systems that need structured clinical documentation, or customer service systems that demand consistent ticket classification.
Start experimenting with structured output on your Custom Model Import today and transform how your AI applications deliver consistent, production-ready results.

About the authors
Manoj Selvakumar is a Generative AI Specialist Solutions Architect at AWS, where he helps organizations design, prototype, and scale AI-powered solutions in the cloud. With expertise in deep learning, scalable cloud-native systems, and multi-agent orchestration, he focuses on turning emerging innovations into production-ready architectures that drive measurable business value. He is passionate about making complex AI concepts practical and enabling customers to innovate responsibly at scale—from early experimentation to enterprise deployment. Before joining AWS, Manoj worked in consulting, delivering data science and AI solutions for enterprise clients, building end-to-end machine learning systems supported by strong MLOps practices for training, deployment, and monitoring in production.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Revendra Kumar is a Senior Software Development Engineer at Amazon Web Services. In his current role, he focuses on model hosting and inference MLOps on Amazon Bedrock. Prior to this, he worked as an engineer on hosting Quantum computers on the cloud and developing infrastructure solutions for on-premises cloud environments. Outside of his professional pursuits, Revendra enjoys staying active by playing tennis and hiking.
Muzart Tuman is a software engineer utilizing his experience in fields like deep learning, machine learning optimization, and AI-driven applications to help solve real-world problems in a scalable, efficient, and accessible manner. His goal is to create impactful tools that not only advance technical capabilities but also inspire meaningful change across industries and communities.

Moonshot AI Releases Kimi K2 Thinking: An Impressive Thinking Model that can Execute up to 200–300 Sequential Tool Calls without Human Interference

How do we design AI systems that can plan, reason, and act over long sequences of decisions without constant human guidance? Moonshot AI has released Kimi K2 Thinking, an open source thinking agent model that exposes the full reasoning stream of the Kimi K2 Mixture of Experts architecture. It targets workloads that need deep reasoning, long horizon tool use, and stable agent behavior across many steps.

https://moonshotai.github.io/Kimi-K2/thinking.html

What is Kimi K2 Thinking?

Kimi K2 Thinking is described as the latest, most capable version of Moonshot’s open source thinking model. It is built as a thinking agent that reasons step by step and dynamically invokes tools during inference. The model is designed to interleave chain of thought with function calls so it can read, think, call a tool, think again, and repeat for hundreds of steps.

The model sets a new state of the art on Humanity’s Last Exam and BrowseComp, while maintaining coherent behavior across about 200 to 300 sequential tool calls without human interference.

At the same time, K2 Thinking is released as an open weights model with a 256K token context window and native INT4 inference, which reduces latency and GPU memory usage while preserving benchmark performance.

K2 Thinking is already live on kimi.com in chat mode and is accessible through the Moonshot platform API, with a dedicated agentic mode planned to expose the full tool using behavior.

Architecture, MoE design, and context length

Kimi K2 Thinking inherits the Kimi K2 Mixture of Experts design. The model uses a MoE architecture with 1T total parameters and 32B activated parameters per token. It has 61 layers including 1 dense layer, 384 experts with 8 experts selected per token, 1 shared expert, 64 attention heads, and an attention hidden dimension of 7168. The MoE hidden dimension is 2048 per expert.

The vocabulary size is 160K tokens and the context length is 256K. The attention mechanism is Multi head Latent Attention, and the activation function is SwiGLU.
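
For convenience, the reported configuration can be collected into a small summary dictionary. This is purely a restatement of the numbers above, not an official configuration file.

# Reported Kimi K2 Thinking configuration (summary of the figures above; not an official config file).
kimi_k2_thinking_config = {
    "total_params": "1T",
    "activated_params_per_token": "32B",
    "num_layers": 61,              # includes 1 dense layer
    "num_experts": 384,
    "experts_per_token": 8,
    "shared_experts": 1,
    "attention_heads": 64,
    "attention_hidden_dim": 7168,
    "moe_hidden_dim_per_expert": 2048,
    "vocab_size": "160K",
    "context_length": "256K",
    "attention": "Multi-head Latent Attention",
    "activation": "SwiGLU",
}
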

Test time scaling and long horizon thinking

Kimi K2 Thinking is explicitly optimized for test time scaling. The model is trained to expand its reasoning length and tool call depth when facing harder tasks, rather than relying on a fixed short chain of thought.


On Humanity’s Last Exam in the no tools setting, K2 Thinking scores 23.9. With tools, the score rises to 44.9, and in the heavy setting it reaches 51.0. On AIME25 with Python, it reports 99.1, and on HMMT25 with Python it reports 95.1. On IMO AnswerBench it scores 78.6, and on GPQA it scores 84.5.

The testing protocol caps thinking token budgets at 96K for HLE, AIME25, HMMT25, and GPQA. It uses 128K thinking tokens for IMO AnswerBench, LiveCodeBench, and OJ Bench, and 32K completion tokens for Longform Writing. On HLE, the maximum step limit is 120 with a 48K reasoning budget per step. On agentic search tasks, the limit is 300 steps with a 24K reasoning budget per step.

Benchmarks in agentic search and coding

On agentic search tasks with tools, K2 Thinking reports 60.2 on BrowseComp, 62.3 on BrowseComp ZH, 56.3 on Seal 0, 47.4 on FinSearchComp T3, and 87.0 on Frames.

On general knowledge benchmarks, it reports 84.6 on MMLU Pro, 94.4 on MMLU Redux, 73.8 on Longform Writing, and 58.0 on HealthBench.

For coding, K2 Thinking achieves 71.3 on SWE bench Verified with tools, 61.1 on SWE bench Multilingual with tools, 41.9 on Multi SWE bench with tools, 44.8 on SciCode, 83.1 on LiveCodeBenchV6, 48.7 on OJ Bench in the C++ setting, and 47.1 on Terminal Bench with simulated tools.

The Moonshot team also defines a Heavy Mode that runs eight trajectories in parallel and then aggregates them to produce a final answer. This mode is used on some reasoning benchmarks to squeeze extra accuracy out of the same base model.

Native INT4 quantization and deployment

K2 Thinking is trained as a native INT4 model. The research team applies Quantization Aware Training during the post training stage and uses INT4 weight only quantization on the MoE components. This supports INT4 inference with roughly a 2x generation speed improvement in low latency mode while maintaining state of the art performance. All reported benchmark scores are obtained under INT4 precision.

The checkpoints are saved in compressed tensors format and can be unpacked to higher precision formats such as FP8 or BF16 using the official compressed tensors tools. Recommended inference engines include vLLM, SGLang, and KTransformers.
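
As a quick illustration, serving the open weights with vLLM’s offline Python API might look like the sketch below. The repository name and generation settings here are assumptions rather than values from the model card, so check the official deployment instructions before running it.

from vllm import LLM, SamplingParams

# Assumed Hugging Face repo id; confirm the exact name on the official model page.
llm = LLM(
    model="moonshotai/Kimi-K2-Thinking",
    trust_remote_code=True,      # custom modeling code for the MoE architecture
    tensor_parallel_size=8,      # adjust to your GPU count
    max_model_len=131072,        # reduced context to fit memory; the model supports up to 256K
)

params = SamplingParams(temperature=0.6, max_tokens=2048)
outputs = llm.generate(["Plan the steps to verify a claim using web search tools."], params)
print(outputs[0].outputs[0].text)
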

Key Takeaways

Kimi K2 Thinking is an open weights thinking agent that extends the Kimi K2 Mixture of Experts architecture with explicit long horizon reasoning and tool use, not just short chat style responses.

The model uses a trillion parameter MoE design with 32B active parameters per token, a 256K context window, and is trained as a native INT4 model with Quantization Aware Training, which gives about 2x faster inference while keeping benchmark performance stable.

K2 Thinking is optimized for test time scaling: it can carry out hundreds of sequential tool calls in a single task and is evaluated under large thinking token budgets and strict step caps, which matters when you try to reproduce its reasoning and agentic results.

On public benchmarks, it leads or is competitive on reasoning, agentic search, and coding tasks such as HLE with tools, BrowseComp, and SWE bench Verified with tools, showing that the thinking oriented variant delivers clear gains over the base non thinking K2 model.

Editorial Comments

Kimi K2 Thinking is a strong signal that test time scaling is now a first class design target for open source reasoning models. Moonshot AI is not only exposing a 1T parameter Mixture of Experts system with 32B active parameters and 256K context window, it is doing so with native INT4 quantization, Quantization Aware Training, and tool orchestration that runs for hundreds of steps in production like settings. Overall, Kimi K2 Thinking shows that open weights reasoning agents with long horizon planning and tool use are becoming practical infrastructure, not just research demos.

Check out the Model Weights and Technical Details.

Build an Autonomous Wet-Lab Protocol Planner and Validator Using Salesforce CodeGen for Agentic Experiment Design and Safety Optimization

In this tutorial, we build a Wet-Lab Protocol Planner & Validator that acts as an intelligent agent for experimental design and execution. We design the system in Python and integrate Salesforce’s CodeGen-350M-mono model for natural language reasoning. We structure the pipeline into modular components: ProtocolParser for extracting structured data, such as steps, durations, and temperatures, from textual protocols; InventoryManager for validating reagent availability and expiry; SchedulePlanner for generating timelines and parallelization opportunities; and SafetyValidator for identifying biosafety or chemical hazards. The LLM is then used to generate optimization suggestions, effectively closing the loop between perception, planning, validation, and refinement.

import re, json, pandas as pd
from datetime import datetime, timedelta
from collections import defaultdict
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_NAME = "Salesforce/codegen-350M-mono"
print("Loading CodeGen model (30 seconds)...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
print("✓ Model loaded!")

We begin by importing essential libraries and loading the Salesforce CodeGen-350M-mono model locally for lightweight, API-free inference. We initialize both the tokenizer and model with float16 precision and automatic device mapping to ensure compatibility and speed on Colab GPUs.

class ProtocolParser:
    def read_protocol(self, text):
        steps = []
        lines = text.split('\n')
        for i, line in enumerate(lines, 1):
            step_match = re.search(r'^(\d+)\.\s+(.+)', line.strip())
            if step_match:
                num, name = step_match.groups()
                context = '\n'.join(lines[i:min(i+4, len(lines))])
                duration = self._extract_duration(context)
                temp = self._extract_temp(context)
                safety = self._check_safety(context)
                steps.append({
                    'step': int(num), 'name': name, 'duration_min': duration,
                    'temp': temp, 'safety': safety, 'line': i, 'details': context[:200]
                })
        return steps

    def _extract_duration(self, text):
        text = text.lower()
        if 'overnight' in text: return 720
        match = re.search(r'(\d+)\s*(?:hour|hr|h)(?:s)?(?!\w)', text)
        if match: return int(match.group(1)) * 60
        match = re.search(r'(\d+)\s*(?:min|minute)(?:s)?', text)
        if match: return int(match.group(1))
        match = re.search(r'(\d+)-(\d+)\s*(?:min|minute)', text)
        if match: return (int(match.group(1)) + int(match.group(2))) // 2
        return 30

    def _extract_temp(self, text):
        text = text.lower()
        if '4°c' in text or '4 °c' in text or '4°' in text: return '4C'
        if '37°c' in text or '37 °c' in text: return '37C'
        if '-20°c' in text or '-80°c' in text: return 'FREEZER'
        if 'room temp' in text or 'rt' in text or 'ambient' in text: return 'RT'
        return 'RT'

    def _check_safety(self, text):
        flags = []
        text_lower = text.lower()
        if re.search(r'bsl-[23]|biosafety', text_lower): flags.append('BSL-2/3')
        if re.search(r'caution|corrosive|hazard|toxic', text_lower): flags.append('HAZARD')
        if 'sharp' in text_lower or 'needle' in text_lower: flags.append('SHARPS')
        if 'dark' in text_lower or 'light-sensitive' in text_lower: flags.append('LIGHT-SENSITIVE')
        if 'flammable' in text_lower: flags.append('FLAMMABLE')
        return flags

class InventoryManager:
    def __init__(self, csv_text):
        from io import StringIO
        self.df = pd.read_csv(StringIO(csv_text))
        self.df['expiry'] = pd.to_datetime(self.df['expiry'])

    def check_availability(self, reagent_list):
        issues = []
        for reagent in reagent_list:
            reagent_clean = reagent.lower().replace('_', ' ').replace('-', ' ')
            matches = self.df[self.df['reagent'].str.lower().str.contains(
                '|'.join(reagent_clean.split()[:2]), na=False, regex=True
            )]
            if matches.empty:
                issues.append(f" {reagent}: NOT IN INVENTORY")
            else:
                row = matches.iloc[0]
                if row['expiry'] < datetime.now():
                    issues.append(f" {reagent}: EXPIRED on {row['expiry'].date()} (lot {row['lot']})")
                elif (row['expiry'] - datetime.now()).days < 30:
                    issues.append(f" {reagent}: Expires soon ({row['expiry'].date()}, lot {row['lot']})")
                if row['quantity'] < 10:
                    issues.append(f" {reagent}: LOW STOCK ({row['quantity']} {row['unit']} remaining)")
        return issues

    def extract_reagents(self, protocol_text):
        reagents = set()
        patterns = [
            r'\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+(?:antibody|buffer|solution)',
            r'\b([A-Z]{2,}(?:-[A-Z0-9]+)?)\b',
            r'(?:add|use|prepare|dilute)\s+([a-z-]+\s*(?:antibody|buffer|substrate|solution))',
        ]
        for pattern in patterns:
            matches = re.findall(pattern, protocol_text, re.IGNORECASE)
            reagents.update(m.strip() for m in matches if len(m) > 2)
        return list(reagents)[:15]

We define the ProtocolParser and InventoryManager classes to extract structured experimental details and verify reagent inventory. We parse each protocol step for duration, temperature, and safety markers, while the inventory manager validates stock levels, expiry dates, and reagent availability through fuzzy matching.
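
As a quick sanity check, assuming the classes above have been run in the same session, you can parse a small hypothetical snippet and confirm the extracted durations, temperatures, and flags:

# Hypothetical mini-protocol used only to exercise the parser.
demo_text = """1. Blocking
- Add 200 uL blocking buffer
- Incubate 1 hour at room temperature
- No special precautions
2. Development
- Add TMB substrate and incubate 10-15 minutes in the dark
- CAUTION: stop solution is corrosive"""

for s in ProtocolParser().read_protocol(demo_text):
    print(s['step'], s['name'], s['duration_min'], s['temp'], s['safety'])
# Should print something like:
# 1 Blocking 60 RT []
# 2 Development 15 RT ['HAZARD', 'LIGHT-SENSITIVE']
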

class SchedulePlanner:
    def make_schedule(self, steps, start_time="09:00"):
        schedule = []
        current = datetime.strptime(f"2025-01-01 {start_time}", "%Y-%m-%d %H:%M")
        day = 1
        for step in steps:
            end = current + timedelta(minutes=step['duration_min'])
            if step['duration_min'] > 480:
                day += 1
                current = datetime.strptime(f"2025-01-0{day} 09:00", "%Y-%m-%d %H:%M")
                end = current
            schedule.append({
                'step': step['step'], 'name': step['name'][:40],
                'start': current.strftime("%H:%M"), 'end': end.strftime("%H:%M"),
                'duration': step['duration_min'], 'temp': step['temp'],
                'day': day, 'can_parallelize': step['duration_min'] > 60,
                'safety': ', '.join(step['safety']) if step['safety'] else 'None'
            })
            if step['duration_min'] <= 480:
                current = end
        return schedule

    def optimize_parallelization(self, schedule):
        parallel_groups = []
        idle_time = 0
        for i, step in enumerate(schedule):
            if step['can_parallelize'] and i + 1 < len(schedule):
                next_step = schedule[i+1]
                if step['temp'] == next_step['temp']:
                    saved = min(step['duration'], next_step['duration'])
                    parallel_groups.append(
                        f" Steps {step['step']} & {next_step['step']} can overlap → Save {saved} min"
                    )
                    idle_time += saved
        return parallel_groups, idle_time

class SafetyValidator:
    RULES = {
        'ph_range': (5.0, 11.0),
        'temp_limits': {'4C': (2, 8), '37C': (35, 39), 'RT': (20, 25)},
        'max_concurrent_instruments': 3,
    }

    def validate(self, steps):
        risks = []
        for step in steps:
            ph_match = re.search(r'ph\s*(\d+\.?\d*)', step['details'].lower())
            if ph_match:
                ph = float(ph_match.group(1))
                if not (self.RULES['ph_range'][0] <= ph <= self.RULES['ph_range'][1]):
                    risks.append(f" Step {step['step']}: pH {ph} OUT OF SAFE RANGE")
            if 'BSL-2/3' in step['safety']:
                risks.append(f" Step {step['step']}: BSL-2 cabinet REQUIRED")
            if 'HAZARD' in step['safety']:
                risks.append(f" Step {step['step']}: Full PPE + chemical hood REQUIRED")
            if 'SHARPS' in step['safety']:
                risks.append(f" Step {step['step']}: Sharps container + needle safety")
            if 'LIGHT-SENSITIVE' in step['safety']:
                risks.append(f" Step {step['step']}: Work in dark/amber tubes")
        return risks

We implement the SchedulePlanner and SafetyValidator to design efficient experiment timelines and enforce lab safety standards. We dynamically generate daily schedules, identify parallelizable steps, and validate potential risks, such as unsafe pH levels, hazardous chemicals, or biosafety-level requirements.
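
To see the scheduler in isolation, you can feed it a couple of hand-written step dictionaries. The values below are hypothetical but use the same keys ProtocolParser emits.

# Two hand-written steps in the format ProtocolParser produces.
toy_steps = [
    {'step': 1, 'name': 'Blocking', 'duration_min': 60, 'temp': 'RT', 'safety': [], 'line': 1, 'details': ''},
    {'step': 2, 'name': 'Sample Incubation', 'duration_min': 120, 'temp': 'RT', 'safety': [], 'line': 5, 'details': ''},
]
planner = SchedulePlanner()
for row in planner.make_schedule(toy_steps, start_time="09:00"):
    print(row['step'], row['start'], row['end'], row['can_parallelize'])
# Step 1 runs 09:00-10:00 (not flagged as parallelizable at exactly 60 min);
# step 2 runs 10:00-12:00 and is flagged as parallelizable (>60 min).
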

def llm_call(prompt, max_tokens=200):
    try:
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(model.device)
        outputs = model.generate(
            **inputs, max_new_tokens=max_tokens, do_sample=True,
            temperature=0.7, top_p=0.9, pad_token_id=tokenizer.eos_token_id
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):].strip()
    except Exception:
        return "Batch similar temperature steps together. Pre-warm instruments."

def agent_loop(protocol_text, inventory_csv, start_time="09:00"):
    print("\n AGENT STARTING PROTOCOL ANALYSIS...\n")
    parser = ProtocolParser()
    steps = parser.read_protocol(protocol_text)
    print(f" Parsed {len(steps)} protocol steps")
    inventory = InventoryManager(inventory_csv)
    reagents = inventory.extract_reagents(protocol_text)
    print(f" Identified {len(reagents)} reagents: {', '.join(reagents[:5])}...")
    inv_issues = inventory.check_availability(reagents)
    validator = SafetyValidator()
    safety_risks = validator.validate(steps)
    planner = SchedulePlanner()
    schedule = planner.make_schedule(steps, start_time)
    parallel_opts, time_saved = planner.optimize_parallelization(schedule)
    total_time = sum(s['duration'] for s in schedule)
    optimized_time = total_time - time_saved
    opt_prompt = f"Protocol has {len(steps)} steps, {total_time} min total. Key bottleneck optimization:"
    optimization = llm_call(opt_prompt, max_tokens=80)
    return {
        'steps': steps, 'schedule': schedule, 'inventory_issues': inv_issues,
        'safety_risks': safety_risks, 'parallelization': parallel_opts,
        'time_saved': time_saved, 'total_time': total_time,
        'optimized_time': optimized_time, 'ai_optimization': optimization,
        'reagents': reagents
    }

We construct the agent loop, integrating perception, planning, validation, and revision into a single, coherent flow. We use CodeGen for reasoning-based optimization to refine step sequencing and propose practical improvements for efficiency and parallel execution.

def generate_checklist(results):
    md = "# WET-LAB PROTOCOL CHECKLIST\n\n"
    md += f"**Total Steps:** {len(results['schedule'])}\n"
    md += f"**Estimated Time:** {results['total_time']} min ({results['total_time']//60}h {results['total_time']%60}m)\n"
    md += f"**Optimized Time:** {results['optimized_time']} min (save {results['time_saved']} min)\n\n"
    md += "## TIMELINE\n"
    current_day = 1
    for item in results['schedule']:
        if item['day'] > current_day:
            md += f"\n### Day {item['day']}\n"
            current_day = item['day']
        parallel = " " if item['can_parallelize'] else ""
        md += f"- [ ] **{item['start']}-{item['end']}** | Step {item['step']}: {item['name']} ({item['temp']}){parallel}\n"
    md += "\n## REAGENT PICK-LIST\n"
    for reagent in results['reagents']:
        md += f"- [ ] {reagent}\n"
    md += "\n## SAFETY & INVENTORY ALERTS\n"
    all_issues = results['safety_risks'] + results['inventory_issues']
    if all_issues:
        for risk in all_issues:
            md += f"- {risk}\n"
    else:
        md += "- No critical issues detected\n"
    md += "\n## OPTIMIZATION TIPS\n"
    for tip in results['parallelization']:
        md += f"- {tip}\n"
    md += f"- AI Suggestion: {results['ai_optimization']}\n"
    return md

def generate_gantt_csv(schedule):
    df = pd.DataFrame(schedule)
    return df.to_csv(index=False)

We create output generators that transform results into human-readable Markdown checklists and Gantt-compatible CSVs. We ensure that every execution produces clear summaries of reagents, time savings, and safety or inventory alerts for streamlined lab operations.

SAMPLE_PROTOCOL = """ELISA Protocol for Cytokine Detection

1. Coating (Day 1, 4°C overnight)
– Dilute capture antibody to 2 μg/mL in coating buffer (pH 9.6)
– Add 100 μL per well to 96-well plate
– Incubate at 4°C overnight (12-16 hours)
– BSL-2 cabinet required

2. Blocking (Day 2)
– Wash plate 3× with PBS-T (200 μL/well)
– Add 200 μL blocking buffer (1% BSA in PBS)
– Incubate 1 hour at room temperature

3. Sample Incubation
– Wash 3× with PBS-T
– Add 100 μL diluted samples/standards
– Incubate 2 hours at room temperature

4. Detection Antibody
– Wash 5× with PBS-T
– Add 100 μL biotinylated detection antibody (0.5 μg/mL)
– Incubate 1 hour at room temperature

5. Streptavidin-HRP
– Wash 5× with PBS-T
– Add 100 μL streptavidin-HRP (1:1000 dilution)
– Incubate 30 minutes at room temperature
– Work in dark

6. Development
– Wash 7× with PBS-T
– Add 100 μL TMB substrate
– Incubate 10-15 minutes (monitor color development)
– Add 50 μL stop solution (2M H2SO4) – CAUTION: corrosive
"""

SAMPLE_INVENTORY = """reagent,quantity,unit,expiry,lot
capture antibody,500,μg,2025-12-31,AB123
blocking buffer,500,mL,2025-11-30,BB456
PBS-T,1000,mL,2026-01-15,PT789
detection antibody,8,μg,2025-10-15,DA321
streptavidin HRP,10,mL,2025-12-01,SH654
TMB substrate,100,mL,2025-11-20,TM987
stop solution,250,mL,2026-03-01,SS147
BSA,100,g,2024-09-30,BS741"""

results = agent_loop(SAMPLE_PROTOCOL, SAMPLE_INVENTORY, start_time="09:00")
print("\n" + "="*70)
print(generate_checklist(results))
print("\n" + "="*70)
print("\n GANTT CSV (first 400 chars):\n")
print(generate_gantt_csv(results['schedule'])[:400])
print("\n Time Savings:", f"{results['time_saved']} minutes via parallelization")

We conduct a comprehensive test run using a sample ELISA protocol and a reagent inventory dataset. We review the agent’s outputs, including the optimized schedule, parallelization gains, and AI-suggested improvements, demonstrating how our planner functions as a self-contained, intelligent lab assistant.

At last, we demonstrated how agentic AI principles can enhance reproducibility and safety in wet-lab workflows. By parsing free-form experimental text into structured, actionable plans, we automated protocol validation, reagent management, and temporal optimization in a single pipeline. The integration of CodeGen enables on-device reasoning about bottlenecks and safety conditions, allowing for self-contained, data-secure operations. We concluded with a fully functional planner that generates Gantt-compatible schedules, Markdown checklists, and AI-driven optimization tips, establishing a robust foundation for autonomous laboratory planning systems.

Check out the FULL CODES here.

Google AI Introduces DS STAR: A Multi Agent Data Science System That Plans, Codes And Verifies End To End Analytics

How do you turn a vague business style question over messy folders of CSV, JSON and text into reliable Python code without a human analyst in the loop? Google researchers introduce DS STAR (Data Science Agent via Iterative Planning and Verification), a multi agent framework that turns open ended data science questions into executable Python scripts over heterogeneous files. Instead of assuming a clean SQL database and a single query, DS STAR treats the problem as Text to Python and operates directly on mixed formats such as CSV, JSON, Markdown and unstructured text.

https://arxiv.org/pdf/2509.21825

From Text To Python Over Heterogeneous Data

Existing data science agents often rely on Text to SQL over relational databases. This constraint limits them to structured tables and simple schemas, which does not match many enterprise environments where data sits across documents, spreadsheets and logs.

DS STAR changes the abstraction. It generates Python code that loads and combines whatever files the benchmark provides. The system first summarizes every file, then uses that context to plan, implement and verify a multi step solution. This design allows DS STAR to work on benchmarks such as DABStep, KramaBench and DA Code, which expect multi step analysis over mixed file types and require answers in strict formats.


Stage 1: Data File Analysis With Aanalyzer

The first stage builds a structured view of the data lake. For each file (Dᵢ), the Aanalyzer agent generates a Python script (sᵢ_desc) that parses the file and prints essential information such as column names, data types, metadata and text summaries. DS STAR executes this script and captures the output as a concise description (dᵢ).

This process works for both structured and unstructured data. CSV files yield column level statistics and samples, while JSON or text files produce structural summaries and key snippets. The collection {dᵢ} becomes shared context for all later agents.
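
A simplified stand-in for what such a per-file description script might compute is shown below. This is an illustrative sketch only, not the Aanalyzer prompt or its generated code; the function name and output format are assumptions.

import json, pandas as pd

def describe_file(path: str) -> str:
    """Illustrative stand-in for a per-file description d_i: summarize one file concisely."""
    if path.endswith(".csv"):
        df = pd.read_csv(path, nrows=100)                      # sample rows for speed
        return f"CSV {path}: columns={list(df.columns)}, dtypes={df.dtypes.astype(str).to_dict()}"
    if path.endswith(".json"):
        with open(path) as f:
            obj = json.load(f)
        keys = list(obj.keys()) if isinstance(obj, dict) else f"list of {len(obj)} items"
        return f"JSON {path}: top-level structure={keys}"
    with open(path, errors="ignore") as f:
        text = f.read(500)                                     # first characters of a text file
    return f"TEXT {path}: snippet={text[:200]!r}"
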


Stage 2: Iterative Planning, Coding And Verification

After file analysis, DS STAR runs an iterative loop that mirrors how a human uses a notebook.

Aplanner creates an initial executable step (p₀) using the query and the file descriptions, for example loading a relevant table.

Acoder turns the current plan (p) into Python code (s). DS STAR executes this code to obtain an observation (r).

Averifier is an LLM based judge. It receives the cumulative plan, the query, the current code and its execution result and returns a binary decision, sufficient or insufficient.

If the plan is insufficient, Arouter decides how to refine it. It either outputs the token Add Step, which appends a new step, or an index of an erroneous step to truncate and regenerate from.

Aplanner is conditioned on the latest execution result (rₖ), so each new step explicitly responds to what went wrong in the previous attempt. The loop of routing, planning, coding, executing and verifying continues until Averifier marks the plan sufficient or the system hits a maximum of 20 refinement rounds.
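
The control flow can be sketched as a simple Python loop. This is an illustrative reconstruction from the description above, not the actual DS STAR code; the callables passed in stand for the LLM-backed agents.

from typing import Callable, Sequence

def ds_star_loop(
    query: str,
    file_descriptions: Sequence[str],
    planner: Callable, coder: Callable, execute: Callable,
    verifier: Callable, router: Callable, finalizer: Callable,
    max_rounds: int = 20,
):
    """Structural sketch of the DS STAR refinement loop (agents supplied as callables)."""
    plan = [planner(query, file_descriptions, [], None)]       # initial executable step p0
    for _ in range(max_rounds):
        code = coder(plan)                                     # cumulative plan -> Python script
        result = execute(code)                                 # run it, capture observation r
        if verifier(plan, query, code, result) == "sufficient":
            break
        decision = router(plan, query, code, result)           # "Add Step" or index of a faulty step
        if decision != "Add Step":
            plan = plan[:decision]                             # truncate back to the faulty step
        plan.append(planner(query, file_descriptions, plan, result))  # refine using the latest result
    return finalizer(plan)                                     # Afinalyzer enforces the answer format
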


To satisfy strict benchmark formats, a separate Afinalyzer agent converts the final plan into solution code that enforces rules such as rounding and CSV output.

Robustness Modules, Adebugger And Retriever

Realistic pipelines fail on schema drift and missing columns. DS STAR adds Adebugger to repair broken scripts. When code fails, Adebugger receives the script, the traceback and the analyzer descriptions {dᵢ}. It generates a corrected script by conditioning on all three signals, which is important because many data centric bugs require knowledge of column headers, sheet names or schema, not only the stack trace.

KramaBench introduces another challenge, thousands of candidate files per domain. DS STAR handles this with a Retriever. The system embeds the user query and each description (dᵢ) using a pre trained embedding model and selects the top 100 most similar files for the agent context, or all files if there are fewer than 100. In the implementation, the research team used Gemini Embedding 001 for similarity search.
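
A generic version of this retrieval step, using placeholder embeddings instead of Gemini Embedding 001, could look like the following sketch; it only illustrates top-k selection by cosine similarity, not the DS STAR implementation.

import numpy as np

def top_k_files(query_vec: np.ndarray, desc_vecs: np.ndarray, k: int = 100) -> np.ndarray:
    """Return indices of the k file descriptions most similar to the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = desc_vecs / np.linalg.norm(desc_vecs, axis=1, keepdims=True)
    sims = d @ q                         # cosine similarity of each description to the query
    k = min(k, len(sims))                # use all files when fewer than k are available
    return np.argsort(-sims)[:k]

# Usage with placeholder 8-dimensional embeddings (a real system would embed text instead):
rng = np.random.default_rng(0)
selected = top_k_files(rng.normal(size=8), rng.normal(size=(250, 8)), k=100)
print(len(selected))                     # 100
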


Benchmark Results On DABStep, KramaBench And DA Code

All main experiments run DS STAR with Gemini 2.5 Pro as the base LLM and allow up to 20 refinement rounds per task.

On DABStep, model only Gemini 2.5 Pro achieves 12.70 percent hard level accuracy. DS STAR with the same model reaches 45.24 percent on hard tasks and 87.50 percent on easy tasks. This is an absolute gain of more than 32 percentage points on the hard split and it outperforms other agents such as ReAct, AutoGen, Data Interpreter, DA Agent and several commercial systems recorded on the public leaderboard.


The Google research team reports that, compared to the best alternative system on each benchmark, DS STAR improves overall accuracy from 41.0 percent to 45.2 percent on DABStep, from 39.8 percent to 44.7 percent on KramaBench and from 37.0 percent to 38.5 percent on DA Code.


For KramaBench, which requires retrieving relevant files from large domain specific data lakes, DS STAR with retrieval and Gemini 2.5 Pro achieves a total normalized score of 44.69. The strongest baseline, DA Agent with the same model, reaches 39.79.


On DA Code, DS STAR again beats DA Agent. On hard tasks, DS STAR reaches 37.1 percent accuracy versus 32.0 percent for DA Agent when both use Gemini 2.5 Pro.

Key Takeaways

DS STAR reframes data science agents as Text to Python over heterogeneous files such as CSV, JSON, Markdown and text, instead of only Text to SQL over clean relational tables.

The system uses a multi agent loop with Aanalyzer, Aplanner, Acoder, Averifier, Arouter and Afinalyzer, which iteratively plans, executes and verifies Python code until the verifier marks the solution as sufficient.

Adebugger and a Retriever module improve robustness, by repairing failing scripts using rich schema descriptions and by selecting the top 100 relevant files from large domain specific data lakes.

With Gemini 2.5 Pro and 20 refinement rounds, DS STAR achieves large gains over prior agents on DABStep, KramaBench and DA Code, for example increasing DABStep hard accuracy from 12.70 percent to 45.24 percent.

Ablations show that analyzer descriptions and routing are critical, and experiments with GPT 5 confirm that the DS STAR architecture is model agnostic, while iterative refinement is essential for solving hard multi step analytics tasks.

Editorial Comments

DS STAR shows that practical data science automation needs explicit structure around large language models, not only better prompts. The combination of Aanalyzer, Averifier, Arouter and Adebugger turns free form data lakes into a controlled Text to Python loop that is measurable on DABStep, KramaBench and DA Code, and portable across Gemini 2.5 Pro and GPT 5. This work moves data agents from table demos toward benchmarked, end to end analytics systems.

Check out the Paper and Technical details.