NVIDIA and Mistral AI Bring 10x Faster Inference for the Mistral 3 Family on GB200 NVL72 GPU Systems

NVIDIA announced today a significant expansion of its strategic collaboration with Mistral AI. This partnership coincides with the release of the new Mistral 3 frontier open model family, marking a pivotal moment where hardware acceleration and open-source model architecture have converged to redefine performance benchmarks.

This collaboration delivers a massive leap in inference speed: the new models run up to 10x faster on NVIDIA GB200 NVL72 systems than on the previous-generation H200 systems. This breakthrough unlocks unprecedented efficiency for enterprise-grade AI, promising to solve the latency and cost bottlenecks that have historically plagued the large-scale deployment of reasoning models.

A Generational Leap: 10x Faster on Blackwell

As enterprise demand shifts from simple chatbots to high-reasoning, long-context agents, inference efficiency has become the critical bottleneck. The collaboration between NVIDIA and Mistral AI addresses this head-on by optimizing the Mistral 3 family specifically for the NVIDIA Blackwell architecture.

For production AI systems that must deliver both strong user experience (UX) and cost-efficient scale, the NVIDIA GB200 NVL72 provides up to 10x higher performance than the previous-generation H200. This is not merely a gain in raw speed; it also translates to significantly higher energy efficiency. The system exceeds 5,000,000 tokens per second per megawatt (MW) at user interactivity rates of 40 tokens per second.

(Chart created by MarkTechPost; data source: NVIDIA)

For data centers grappling with power constraints, this efficiency gain is as critical as the performance boost itself. This generational leap ensures a lower per-token cost while maintaining the high throughput required for real-time applications.

A New Mistral 3 Family

The engine driving this performance is the newly released Mistral 3 family. This suite of models delivers industry-leading accuracy, efficiency, and customization capabilities, covering the spectrum from massive data center workloads to edge device inference.

Mistral Large 3: The Flagship MoE

At the top of the hierarchy sits Mistral Large 3, a state-of-the-art sparse Multimodal and Multilingual Mixture-of-Experts (MoE) model.

Total Parameters: 675 Billion

Active Parameters: 41 Billion

Context Window: 256K tokens

Trained on NVIDIA Hopper GPUs, Mistral Large 3 is designed to handle complex reasoning tasks, offering parity with top-tier closed models while retaining the flexibility of open weights.

Ministral 3: Dense Power at the Edge

Complementing the large model is the Ministral 3 series, a suite of small, dense, high-performance models designed for speed and versatility.

Sizes: 3B, 8B, and 14B parameters.

Variants: Base, Instruct, and Reasoning for each size (nine models total).

Context Window: 256K tokens across the board.

The Ministral 3 series excels on the GPQA Diamond accuracy benchmark, using 100 fewer tokens while delivering higher accuracy.

Significant Engineering Behind the Speed: A Comprehensive Optimization Stack

The “10x” performance claim is driven by a comprehensive stack of optimizations co-developed by Mistral and NVIDIA engineers. The teams adopted an “extreme co-design” approach, merging hardware capabilities with model architecture adjustments.

TensorRT-LLM Wide Expert Parallelism (Wide-EP)

To fully exploit the massive scale of the GB200 NVL72, NVIDIA employed Wide Expert Parallelism within TensorRT-LLM. This technology provides optimized MoE GroupGEMM kernels, expert distribution, and load balancing.

Crucially, Wide-EP exploits the NVL72’s coherent memory domain and NVLink fabric. It is highly resilient to architectural variations across large MoEs. For instance, Mistral Large 3 utilizes roughly 128 experts per layer, about half as many as comparable models like DeepSeek-R1. Despite this difference, Wide-EP enables the model to realize the high-bandwidth, low-latency, non-blocking benefits of the NVLink fabric, ensuring that the model’s massive size does not result in communication bottlenecks.

Native NVFP4 Quantization

One of the most significant technical advancements in this release is the support for NVFP4, a quantization format native to the Blackwell architecture.

For Mistral Large 3, developers can deploy a compute-optimized NVFP4 checkpoint quantized offline using the open-source llm-compressor library.

This approach reduces compute and memory costs while strictly maintaining accuracy. It leverages NVFP4’s higher-precision FP8 scaling factors and finer-grained block scaling to control quantization error. The recipe specifically targets the MoE weights while keeping other components at original precision, allowing the model to deploy seamlessly on the GB200 NVL72 with minimal accuracy loss.
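To make the idea of block scaling concrete, here is a minimal, illustrative PyTorch sketch of FP4-style quantization with one scale factor per small block of weights. It is a toy, not the llm-compressor recipe or NVIDIA's NVFP4 kernels (which additionally use FP8 scale factors and a tensor-level scale), but it shows why finer-grained blocks keep quantization error small.

import torch

# Representable magnitudes of an FP4 (E2M1) grid, used purely for illustration here.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_LEVELS = torch.cat([-FP4_GRID.flip(0)[:-1], FP4_GRID])   # symmetric 15-level grid

def quantize_blockwise(weight: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    """Toy block-scaled FP4-style fake quantization (quantize then dequantize)."""
    flat = weight.flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, block_size)
    # One scale per block, chosen so the largest value in the block maps to the top FP4 level.
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 6.0
    scaled = blocks / scales
    # Snap each scaled value to the nearest representable level, then dequantize.
    idx = (scaled.unsqueeze(-1) - FP4_LEVELS).abs().argmin(dim=-1)
    deq = FP4_LEVELS[idx] * scales
    return deq.flatten()[: weight.numel()].view_as(weight)

w = torch.randn(4096, 128)
for bs in (16, 128):
    err = (w - quantize_blockwise(w, bs)).abs().mean().item()
    print(f"block_size={bs:4d}  mean abs error={err:.4f}")

With the smaller block size, each scale adapts to a narrower range of values, which is the same intuition behind NVFP4's fine-grained block scaling.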

Disaggregated Serving with NVIDIA Dynamo

Mistral Large 3 utilizes NVIDIA Dynamo, a low-latency distributed inference framework, to disaggregate the prefill and decode phases of inference.

In traditional setups, the prefill phase (processing the input prompt) and the decode phase (generating the output) compete for resources. By rate-matching and disaggregating these phases, Dynamo significantly boosts performance for long-context workloads, such as 8K input/1K output configurations. This ensures high throughput even when utilizing the model’s massive 256K context window.

From Cloud to Edge: Ministral 3 Performance

The optimization efforts extend beyond the massive data centers. Recognizing the growing need for local AI, the Ministral 3 series is engineered for edge deployment, offering flexibility for a variety of needs.

RTX and Jetson Acceleration

The dense Ministral models are optimized for platforms like the NVIDIA GeForce RTX AI PC and NVIDIA Jetson robotics modules.

RTX 5090: The Ministral-3B variants can reach blistering inference speeds of 385 tokens per second on the NVIDIA RTX 5090 GPU. This brings workstation-class AI performance to local PCs, enabling fast iteration and greater data privacy.

Jetson Thor: For robotics and edge AI, developers can use the vLLM container on NVIDIA Jetson Thor. The Ministral-3-3B-Instruct model achieves 52 tokens per second for single concurrency, scaling up to 273 tokens per second with a concurrency of 8.
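As a rough illustration of what the vLLM path looks like in code, the snippet below shows a generic offline-generation call. The model identifier is a placeholder assumption, and Jetson-specific container images and flags are omitted; the exact setup should come from the official NVIDIA and Mistral instructions.

# Minimal vLLM offline-inference sketch; the model ID below is a placeholder assumption.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Ministral-3-3B-Instruct")   # hypothetical checkpoint name
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Summarize the robot's maintenance log and flag any anomalies."], params)
for out in outputs:
    print(out.outputs[0].text)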

Broad Framework Support

NVIDIA has collaborated with the open-source community to ensure these models are usable everywhere.

Llama.cpp & Ollama: NVIDIA collaborated with these popular frameworks to ensure faster iteration and lower latency for local development.

SGLang: NVIDIA collaborated with SGLang to create an implementation of Mistral Large 3 that supports both disaggregation and speculative decoding.

vLLM: NVIDIA worked with vLLM to expand support for kernel integrations, including speculative decoding (EAGLE), Blackwell support, and expanded parallelism.

Production-Ready with NVIDIA NIM

To streamline enterprise adoption, the new models will be available through NVIDIA NIM microservices.

Mistral Large 3 and Ministral-14B-Instruct are currently available through the NVIDIA API catalog and preview API. Soon, enterprise developers will be able to use downloadable NVIDIA NIM microservices. This provides a containerized, production-ready solution that allows enterprises to deploy the Mistral 3 family with minimal setup on any GPU-accelerated infrastructure.

This availability ensures that the specific “10x” performance advantage of the GB200 NVL72 can be realized in production environments without complex custom engineering, democratizing access to frontier-class intelligence.

Conclusion: A New Standard for Open Intelligence

The release of the NVIDIA-accelerated Mistral 3 open model family represents a major leap for AI in the open-source community. By offering frontier-level performance under an open source license, and backing it with a robust hardware optimization stack, Mistral and NVIDIA are meeting developers where they are.

From the massive scale of the GB200 NVL72 utilizing Wide-EP and NVFP4, to the edge-friendly density of Ministral on an RTX 5090, this partnership delivers a scalable, efficient path for artificial intelligence. With upcoming optimizations such as speculative decoding with multitoken prediction (MTP) and EAGLE-3 expected to push performance even further, the Mistral 3 family is poised to become a foundational element of the next generation of AI applications.

Available to test!

If you are a developer looking to benchmark these performance gains, you can download the Mistral 3 models directly from Hugging Face or test the deployment-free hosted versions on build.nvidia.com/mistralai to evaluate the latency and throughput for your specific use case.
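For a quick latency or throughput check from a script, the hosted endpoints behave like an OpenAI-compatible API. The sketch below assumes the standard integrate.api.nvidia.com base URL and uses a placeholder model identifier; copy the exact model name and code snippet from the model card on build.nvidia.com.

# Hedged sketch of calling a hosted endpoint through the OpenAI-compatible API.
# The model ID is a placeholder; copy the exact identifier from build.nvidia.com.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-...",   # your NVIDIA API key
)

response = client.chat.completions.create(
    model="mistralai/mistral-large-3-instruct",   # hypothetical identifier
    messages=[{"role": "user", "content": "Explain NVFP4 quantization in two sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)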

Check out the Models on Hugging Face. You can find more details on the Corporate Blog and the Technical/Developer Blog.

Thanks to the NVIDIA AI team for the thought leadership and resources for this article. The NVIDIA AI team has supported this content.
The post NVIDIA and Mistral AI Bring 10x Faster Inference for the Mistral 3 Family on GB200 NVL72 GPU Systems appeared first on MarkTechPost.

How We Learn Step-Level Rewards from Preferences to Solve Sparse-Reward Environments Using Online Process Reward Learning

In this tutorial, we explore Online Process Reward Learning (OPRL) and demonstrate how we can learn dense, step-level reward signals from trajectory preferences to solve sparse-reward reinforcement learning tasks. We walk through each component, from the maze environment and reward-model network to preference generation, training loops, and evaluation, while observing how the agent gradually improves its behaviour through online preference-driven shaping. By running this end-to-end implementation, we gain a practical understanding of how OPRL enables better credit assignment, faster learning, and more stable policy optimization in challenging environments where the agent would otherwise struggle to discover meaningful rewards. Check out the FULL CODE NOTEBOOK.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
import matplotlib.pyplot as plt
from collections import deque
import random

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

class MazeEnv:
    def __init__(self, size=8):
        self.size = size
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        self.obstacles = set([(i, size // 2) for i in range(1, size - 2)])
        self.reset()

    def reset(self):
        self.pos = self.start
        self.steps = 0
        return self._get_state()

    def _get_state(self):
        state = np.zeros(self.size * self.size)
        state[self.pos[0] * self.size + self.pos[1]] = 1
        return state

    def step(self, action):
        moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]
        new_pos = (self.pos[0] + moves[action][0],
                   self.pos[1] + moves[action][1])
        if (0 <= new_pos[0] < self.size and
                0 <= new_pos[1] < self.size and
                new_pos not in self.obstacles):
            self.pos = new_pos
        self.steps += 1
        done = self.pos == self.goal or self.steps >= 60
        reward = 10.0 if self.pos == self.goal else 0.0
        return self._get_state(), reward, done

    def render(self):
        grid = [['.' for _ in range(self.size)] for _ in range(self.size)]
        for obs in self.obstacles:
            grid[obs[0]][obs[1]] = '█'
        grid[self.goal[0]][self.goal[1]] = 'G'
        grid[self.pos[0]][self.pos[1]] = 'A'
        return '\n'.join([''.join(row) for row in grid])

class ProcessRewardModel(nn.Module):
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Tanh()
        )

    def forward(self, states):
        return self.net(states)

    def trajectory_reward(self, states):
        return self.forward(states).sum()

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU()
        )
        self.actor = nn.Linear(hidden, action_dim)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, state):
        features = self.backbone(state)
        return self.actor(features), self.critic(features)
We set up the entire foundation of our OPRL system by importing libraries, defining the maze environment, and building the reward and policy networks. We establish how states are represented, how obstacles block movement, and how the sparse reward structure works. We also design the core neural models that will later learn process rewards and drive the policy’s decisions. Check out the FULL CODE NOTEBOOK.

class OPRLAgent:
    def __init__(self, state_dim, action_dim, lr=3e-4):
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.reward_model = ProcessRewardModel(state_dim)
        self.policy_opt = Adam(self.policy.parameters(), lr=lr)
        self.reward_opt = Adam(self.reward_model.parameters(), lr=lr)
        self.trajectories = deque(maxlen=200)
        self.preferences = deque(maxlen=500)
        self.action_dim = action_dim

    def select_action(self, state, epsilon=0.1):
        if random.random() < epsilon:
            return random.randint(0, self.action_dim - 1)
        state_t = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():
            logits, _ = self.policy(state_t)
            probs = F.softmax(logits, dim=-1)
        return torch.multinomial(probs, 1).item()

    def collect_trajectory(self, env, epsilon=0.1):
        states, actions, rewards = [], [], []
        state = env.reset()
        done = False
        while not done:
            action = self.select_action(state, epsilon)
            next_state, reward, done = env.step(action)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            state = next_state
        traj = {
            'states': torch.FloatTensor(np.array(states)),
            'actions': torch.LongTensor(actions),
            'rewards': torch.FloatTensor(rewards),
            'return': float(sum(rewards))
        }
        self.trajectories.append(traj)
        return traj

We begin constructing the OPRL agent by implementing action selection and trajectory collection. We use an ε-greedy strategy to ensure exploration and gather sequences of states, actions, and returns. As we run the agent through the maze, we store entire trajectories that will later serve as preference data for shaping the reward model. Check out the FULL CODE NOTEBOOK.

# These methods extend the OPRLAgent class defined in the previous cell.
def generate_preference(self):
    if len(self.trajectories) < 2:
        return
    t1, t2 = random.sample(list(self.trajectories), 2)
    label = 1.0 if t1['return'] > t2['return'] else 0.0
    self.preferences.append({'t1': t1, 't2': t2, 'label': label})

def train_reward_model(self, n_updates=5):
    if len(self.preferences) < 32:
        return 0.0
    total_loss = 0.0
    for _ in range(n_updates):
        batch = random.sample(list(self.preferences), 32)
        loss = 0.0
        for item in batch:
            r1 = self.reward_model.trajectory_reward(item['t1']['states'])
            r2 = self.reward_model.trajectory_reward(item['t2']['states'])
            logit = r1 - r2
            pred_prob = torch.sigmoid(logit)
            label = item['label']
            loss += -(label * torch.log(pred_prob + 1e-8) +
                      (1 - label) * torch.log(1 - pred_prob + 1e-8))
        loss = loss / len(batch)
        self.reward_opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.reward_model.parameters(), 1.0)
        self.reward_opt.step()
        total_loss += loss.item()
    return total_loss / n_updates

OPRLAgent.generate_preference = generate_preference
OPRLAgent.train_reward_model = train_reward_model

We generate preference pairs from collected trajectories and train the process reward model using the Bradley–Terry formulation. We compare trajectory-level scores, compute probabilities, and update the reward model to reflect which behaviours appear better. This allows us to learn dense, differentiable, step-level rewards that guide the agent even when the environment itself is sparse. Check out the FULL CODE NOTEBOOK.

# This method also extends the OPRLAgent class defined earlier.
def train_policy(self, n_updates=3, gamma=0.98):
    if len(self.trajectories) < 5:
        return 0.0
    total_loss = 0.0
    for _ in range(n_updates):
        traj = random.choice(list(self.trajectories))
        with torch.no_grad():
            process_rewards = self.reward_model(traj['states']).squeeze()
            shaped_rewards = traj['rewards'] + 0.1 * process_rewards
        returns = []
        G = 0
        for r in reversed(shaped_rewards.tolist()):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.FloatTensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        logits, values = self.policy(traj['states'])
        log_probs = F.log_softmax(logits, dim=-1)
        action_log_probs = log_probs.gather(1, traj['actions'].unsqueeze(1))
        advantages = returns - values.squeeze().detach()
        policy_loss = -(action_log_probs.squeeze() * advantages).mean()
        value_loss = F.mse_loss(values.squeeze(), returns)
        entropy = -(F.softmax(logits, dim=-1) * log_probs).sum(-1).mean()
        loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
        self.policy_opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 1.0)
        self.policy_opt.step()
        total_loss += loss.item()
    return total_loss / n_updates

OPRLAgent.train_policy = train_policy

def train_oprl(episodes=500, render_interval=100):
    env = MazeEnv(size=8)
    agent = OPRLAgent(state_dim=64, action_dim=4, lr=3e-4)
    returns, reward_losses, policy_losses = [], [], []
    success_rate = []
    for ep in range(episodes):
        epsilon = max(0.05, 0.5 - ep / 1000)
        traj = agent.collect_trajectory(env, epsilon)
        returns.append(traj['return'])
        if ep % 2 == 0 and ep > 10:
            agent.generate_preference()
        if ep > 20 and ep % 2 == 0:
            rew_loss = agent.train_reward_model(n_updates=3)
            reward_losses.append(rew_loss)
        if ep > 10:
            pol_loss = agent.train_policy(n_updates=2)
            policy_losses.append(pol_loss)
        success = 1 if traj['return'] > 5 else 0
        success_rate.append(success)
        if ep % render_interval == 0 and ep > 0:
            test_env = MazeEnv(size=8)
            agent.collect_trajectory(test_env, epsilon=0)
            print(test_env.render())
    return returns, reward_losses, policy_losses, success_rate

We train the policy using shaped rewards produced by the learned process reward model. We compute returns, advantages, value estimates, and entropy bonuses, enabling the agent to improve its strategy over time. We then build a full training loop in which exploration decays, preferences accumulate, and both the reward model and the policy are updated continuously. Check out the FULL CODE NOTEBOOK.

print("Training OPRL Agent on Sparse Reward Maze...\n")
returns, rew_losses, pol_losses, success = train_oprl(episodes=500, render_interval=250)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0, 0].plot(returns, alpha=0.3)
axes[0, 0].plot(np.convolve(returns, np.ones(20) / 20, mode='valid'), linewidth=2)
axes[0, 0].set_xlabel('Episode')
axes[0, 0].set_ylabel('Return')
axes[0, 0].set_title('Agent Performance')
axes[0, 0].grid(alpha=0.3)

success_smooth = np.convolve(success, np.ones(20) / 20, mode='valid')
axes[0, 1].plot(success_smooth, linewidth=2, color='green')
axes[0, 1].set_xlabel('Episode')
axes[0, 1].set_ylabel('Success Rate')
axes[0, 1].set_title('Goal Success Rate')
axes[0, 1].grid(alpha=0.3)

axes[1, 0].plot(rew_losses, linewidth=2, color='orange')
axes[1, 0].set_xlabel('Update Step')
axes[1, 0].set_ylabel('Loss')
axes[1, 0].set_title('Reward Model Loss')
axes[1, 0].grid(alpha=0.3)

axes[1, 1].plot(pol_losses, linewidth=2, color='red')
axes[1, 1].set_xlabel('Update Step')
axes[1, 1].set_ylabel('Loss')
axes[1, 1].set_title('Policy Loss')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("OPRL Training Complete!")
print("Process rewards, preference learning, reward shaping, and online updates demonstrated.")

We visualize the learning dynamics by plotting returns, success rates, reward-model loss, and policy loss. We monitor how the agent’s performance evolves as OPRL shapes the reward landscape. By the end of the visualization, we clearly see the impact of process rewards on solving a challenging, sparse-reward maze.

In conclusion, we see how OPRL transforms sparse terminal outcomes into rich online feedback that continuously guides the agent’s behaviour. We watch the process reward model learn preferences, shape the return signal, and accelerate the policy’s ability to reach the goal. With larger mazes, varying shaping strengths, or even real human preference feedback, we appreciate how OPRL provides a flexible and powerful framework for credit assignment in complex decision-making tasks. We finish with a clear, hands-on understanding of how OPRL operates and how we can extend it to more advanced agentic RL settings.

Check out the FULL CODE NOTEBOOK and Paper. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post How We Learn Step-Level Rewards from Preferences to Solve Sparse-Reward Environments Using Online Process Reward Learning appeared first on MarkTechPost.

Google DeepMind Researchers Introduce Evo-Memory Benchmark and ReMem Framework for Experience Reuse in LLM Agents

Large language model agents are starting to store everything they see, but can they actually improve their policies at test time from those experiences rather than just replaying context windows?

Researchers from the University of Illinois Urbana-Champaign and Google DeepMind propose Evo-Memory, a streaming benchmark and agent framework that targets this exact gap. Evo-Memory evaluates test-time learning with self-evolving memory, asking whether agents can accumulate and reuse strategies from continuous task streams instead of relying only on static conversational logs.

https://arxiv.org/pdf/2511.20857

Conversational Recall vs Experience Reuse

Most current agents implement conversational recall. They store dialogue history, tool traces, and retrieved documents, which are then reintegrated into the context window for future queries. This type of memory serves as a passive buffer, capable of recovering facts or recalling previous steps, but it does not actively modify the agent’s approach for related tasks.

Evo-Memory instead focuses on experience reuse. Here each interaction is treated as an experience that encodes not only inputs and outputs, but also whether a task succeeded and which strategies were effective. The benchmark checks if agents can retrieve those experiences in later tasks, apply them as reusable procedures, and refine the memory over time.

Benchmark Design and Task Streams

The research team formalizes a memory-augmented agent as a tuple (F, U, R, C). The base model F generates outputs. The retrieval module R searches a memory store. The context constructor C synthesizes a working prompt from the current input and retrieved items. The update function U writes new experience entries and evolves the memory after every step.

Evo-Memory restructures conventional benchmarks into sequential task streams. Each dataset becomes an ordered sequence of tasks where early items carry strategies that are useful for later ones. The suite covers AIME 24, AIME 25, GPQA Diamond, MMLU-Pro economics, engineering, philosophy, and ToolBench for tool use, along with multi turn environments from AgentBoard including AlfWorld, BabyAI, ScienceWorld, Jericho, and PDDL planning.

Evaluation is done along four axes. Single turn tasks use exact match or answer accuracy. Embodied environments report success rate and progress rate. Step efficiency measures average steps per successful task. Sequence robustness tests whether performance is stable when task order changes.

https://arxiv.org/pdf/2511.20857

ExpRAG, a Minimal Experience Reuse Baseline

To set a lower bound, the research team defines ExpRAG. Each interaction becomes a structured experience text with template ⟨x_i, ŷ_i, f_i⟩, where x_i is the input, ŷ_i is the model output, and f_i is feedback, for example a correctness signal. At a new step t, the agent retrieves similar experiences from memory using a similarity score and concatenates them with the current input as in-context examples. Then it appends the new experience to memory.

ExpRAG does not change the agent control loop. It is still a single shot call to the backbone, but now augmented with explicitly stored prior tasks. The design is intentionally simple so that any gains on Evo-Memory can be attributed to task level experience retrieval, not to new planning or tool abstractions.
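A minimal sketch of this loop, using a placeholder embedding function and treating the backbone LLM and the verifier as injected callables (all assumptions, not the paper's implementation), could look like the following:

import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding function; swap in any real sentence encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

class ExpRAGMemory:
    def __init__(self):
        self.entries = []   # each entry: (embedding, experience_text)

    def add(self, x, y_hat, feedback):
        record = f"Input: {x}\nOutput: {y_hat}\nFeedback: {feedback}"
        self.entries.append((embed(x), record))

    def retrieve(self, x, k=3):
        if not self.entries:
            return []
        q = embed(x)
        def score(entry):
            e, _ = entry
            return float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-8))
        return [rec for _, rec in sorted(self.entries, key=score, reverse=True)[:k]]

def exprag_step(llm, memory, x, verifier):
    """One search-predict-evolve step: retrieve experiences, answer once, then write back."""
    exemplars = "\n\n".join(memory.retrieve(x))
    prompt = f"Past experiences:\n{exemplars}\n\nNew task:\n{x}" if exemplars else x
    y_hat = llm(prompt)              # single call to the backbone model
    feedback = verifier(x, y_hat)    # e.g. a correctness signal
    memory.add(x, y_hat, feedback)   # evolve memory with the new experience
    return y_hat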

ReMem, Action Think Memory Refine

The main contribution on the agent side is ReMem, an action–think–memory refine pipeline built on top of the same backbone models. At each internal step, given the current input, memory state and past reasoning traces, the agent chooses one of three operations:

Think generates intermediate reasoning traces that decompose the task.

Act emits an environment action or final answer visible to the user.

Refine performs meta reasoning on memory by retrieving, pruning and reorganizing experience entries.

This loop induces a Markov decision process where the state includes the query, current memory and ongoing thoughts. Within a step the agent can interleave several Think and Refine operations, and the step terminates when an Act operation is issued. In contrast to standard ReAct style agents, memory is no longer a fixed buffer. It becomes an explicit object that the agent reasons about and edits during inference.
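Conceptually, the inner loop looks roughly like the sketch below. The choose_operation, think, act and refine_memory helpers are placeholders standing in for LLM calls; only the control structure mirrors the description above.

def remem_step(query, memory, choose_operation, think, act, refine_memory, max_ops=8):
    """One ReMem step: interleave Think and Refine until an Act ends the step."""
    thoughts = []
    for _ in range(max_ops):
        op = choose_operation(query, memory, thoughts)       # the model picks Think, Refine, or Act
        if op == "think":
            thoughts.append(think(query, memory, thoughts))  # intermediate reasoning trace
        elif op == "refine":
            memory = refine_memory(query, memory, thoughts)  # retrieve, prune, reorganize entries
        else:                                                # "act" terminates the step
            return act(query, memory, thoughts), memory
    # Fall back to acting if the budget of internal operations is exhausted.
    return act(query, memory, thoughts), memory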

https://arxiv.org/pdf/2511.20857

Results on Reasoning, Tools and Embodied Environments

The research team instantiate all methods on Gemini 2.5 Flash and Claude 3.7 Sonnet under a unified search–predict–evolve protocol. This isolates the effect of memory architecture, since prompting, search and feedback are held constant across baselines.

On single turn benchmarks, evolving memory methods produce consistent but moderate gains. For Gemini 2.5 Flash, ReMem reaches an average exact match of 0.65 across AIME 24, AIME 25, GPQA Diamond and MMLU Pro subsets, and scores 0.85 (API) and 0.71 (accuracy) on ToolBench. ExpRAG also performs strongly, with an average of 0.60, and outperforms several more complex designs such as Agent Workflow Memory and Dynamic Cheatsheet variants.

The impact is larger in multi turn environments. On Claude 3.7 Sonnet, ReMem reaches success and progress 0.92 and 0.96 on AlfWorld, 0.73 and 0.83 on BabyAI, 0.83 and 0.95 on PDDL and 0.62 and 0.89 on ScienceWorld, giving average 0.78 success and 0.91 progress across datasets. On Gemini 2.5 Flash, ReMem achieves average 0.50 success and 0.64 progress, improving over history and ReAct style baselines in all four environments.

Step efficiency is also improved. In AlfWorld, average steps to complete a task drop from 22.6 for a history baseline to 11.5 for ReMem. Lightweight designs such as ExpRecent and ExpRAG reduce steps as well, which indicates that even simple task level experience reuse can make behaviour more efficient without architectural changes to the backbone.

A further analysis links gains to task similarity inside each dataset. Using embeddings from the retriever encoder, the research team compute average distance from tasks to their cluster center. ReMem’s margin over a history baseline correlates strongly with this similarity measure, with reported Pearson correlation about 0.72 on Gemini 2.5 Flash and 0.56 on Claude 3.7 Sonnet. Structured domains such as PDDL and AlfWorld show larger improvements than diverse sets like AIME 25 or GPQA Diamond.

Key Takeaways

Evo-Memory is a comprehensive streaming benchmark that converts standard datasets into ordered task streams, so agents can retrieve, integrate and update memory over time rather than rely on static conversational recall.

The framework formalizes memory augmented agents as a tuple (F, U, R, C) and implements more than 10 representative memory modules, including retrieval based, workflow and hierarchical memories, evaluated on 10 single turn and multi turn datasets across reasoning, question answering, tool use and embodied environments.

ExpRAG provides a minimal experience reuse baseline that stores each task interaction as a structured text record with input, model output and feedback, then retrieves similar experiences as in context exemplars for new tasks, already giving consistent improvements over pure history based baselines.

ReMem extends the standard ReAct style loop with an explicit Think, Act, Refine Memory control cycle, which lets the agent actively retrieve, prune and reorganize its memory during inference, leading to higher accuracy, higher success rate and fewer steps on both single turn reasoning and long horizon interactive environments.

Across Gemini 2.5 Flash and Claude 3.7 Sonnet backbones, self evolving memories such as ExpRAG and especially ReMem make smaller models behave like stronger agents at test time, improving exact match, success and progress metrics without any retraining of base model weights.

Editorial Notes

Evo-Memory is a useful step for evaluating self evolving memory in LLM agents. It forces models to operate on sequential task streams instead of isolated prompts. It compares more than 10 memory architectures under a single framework. Simple methods like ExpRAG already show clear gains. ReMem’s action, think, refine memory loop improves exact match, success and progress without retraining base weights. Overall, this research work makes test time evolution a concrete design target for LLM agent systems.

Check out the Paper. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post Google DeepMind Researchers Introduce Evo-Memory Benchmark and ReMem Framework for Experience Reuse in LLM Agents appeared first on MarkTechPost.

DeepSeek Researchers Introduce DeepSeek-V3.2 and DeepSeek-V3.2-Speciale for Long Context Reasoning and Agentic Workloads

How do you get GPT-5-level reasoning on real long-context, tool-using workloads without paying the quadratic attention and GPU cost that usually makes those systems impractical? DeepSeek research introduces DeepSeek-V3.2 and DeepSeek-V3.2-Speciale, reasoning-first models built for agents that target high quality reasoning, long context and agent workflows, with open weights and production APIs. The models combine DeepSeek Sparse Attention (DSA), a scaled GRPO reinforcement learning stack and an agent native tool protocol, and report performance comparable to GPT-5, with DeepSeek-V3.2-Speciale reaching Gemini 3.0 Pro level reasoning on public benchmarks and competitions.

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Sparse Attention with Near Linear Long Context Cost

Both DeepSeek-V3.2 and DeepSeek-V3.2-Speciale use the DeepSeek-V3 Mixture of Experts transformer with about 671B total parameters and 37B active parameters per token, inherited from V3.1 Terminus. The only structural change is DeepSeek Sparse Attention, introduced through continued pre-training.

DeepSeek Sparse Attention splits attention into 2 components. A lightning indexer runs a small number of low precision heads over all token pairs and produces relevance scores. A fine grained selector keeps the top-k key-value positions per query, and the main attention path runs Multi-Query Attention and Multi-Head Latent Attention on this sparse set.

This changes the dominant complexity from O(L²) to O(kL), where L is the sequence length and k is the number of selected tokens, which is much smaller than L. Based on the benchmarks, DeepSeek-V3.2 matches the dense Terminus baseline on accuracy while reducing long context inference cost by about 50 percent, with faster throughput and lower memory use on H800 class hardware and on vLLM and SGLang backends.
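The toy PyTorch sketch below illustrates the two-stage idea, a cheap indexer scoring all token pairs and the main attention running only over the top-k selected positions. It is a simplified single-head example that ignores MLA and MQA details, batching and the low precision indexer heads, so it conveys the complexity argument rather than DeepSeek's actual kernels.

import torch
import torch.nn.functional as F

def sparse_attention_toy(q, k, v, idx_q, idx_k, top_k=64):
    """q, k, v: [L, d] main-path projections; idx_q, idx_k: [L, d_idx] cheap indexer projections."""
    L = q.shape[0]
    # 1) Lightning-indexer style relevance scores over all token pairs (cheap, low dimensional).
    index_scores = idx_q @ idx_k.T                                  # [L, L]
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    index_scores = index_scores.masked_fill(~causal, float("-inf"))
    # 2) Fine grained selection: keep only the top-k key positions per query.
    k_eff = min(top_k, L)
    top_idx = index_scores.topk(k_eff, dim=-1).indices              # [L, k_eff]
    k_sel, v_sel = k[top_idx], v[top_idx]                           # [L, k_eff, d]
    # 3) Main attention restricted to the selected keys: cost O(kL) instead of O(L^2).
    attn = torch.einsum("ld,lkd->lk", q, k_sel) / q.shape[-1] ** 0.5
    sel_scores = index_scores.gather(1, top_idx)                    # reuse the indexer mask for causality
    attn = attn.masked_fill(torch.isinf(sel_scores), float("-inf"))
    attn = F.softmax(attn, dim=-1)
    return torch.einsum("lk,lkd->ld", attn, v_sel)

L, d, d_idx = 1024, 128, 32
out = sparse_attention_toy(torch.randn(L, d), torch.randn(L, d), torch.randn(L, d),
                           torch.randn(L, d_idx), torch.randn(L, d_idx))
print(out.shape)  # torch.Size([1024, 128])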

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Continued Pre Training for DeepSeek Sparse Attention

DeepSeek Sparse Attention (DSA) is introduced by continued pre-training on top of DeepSeek-V3.1 Terminus. In the dense warm up stage, dense attention remains active, all backbone parameters are frozen and only the lightning indexer is trained with a Kullback-Leibler loss to match the dense attention distribution on 128K context sequences. This stage uses a small number of steps and about 2B tokens, enough for the indexer to learn useful scores.

In the sparse stage, the selector keeps 2048 key-value entries per query, the backbone is unfrozen and the model continues training on about 944B tokens. Gradients for the indexer still come only from the alignment loss with dense attention on the selected positions. This schedule makes DeepSeek Sparse Attention (DSA) behave as a drop-in replacement for dense attention with similar quality and lower long context cost.
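A hedged sketch of the warm-up objective, assuming a single head and per-query attention distributions, is shown below; it only conveys the shape of the KL alignment between indexer scores and the frozen dense attention, not DeepSeek's training code.

import torch
import torch.nn.functional as F

def indexer_alignment_loss(indexer_scores, dense_attn_probs):
    """KL(dense attention || indexer distribution), averaged over queries.

    indexer_scores: [L, L] raw scores from the lightning indexer.
    dense_attn_probs: [L, L] attention probabilities from the frozen dense path.
    """
    log_indexer = F.log_softmax(indexer_scores, dim=-1)
    # Only the indexer receives gradients; the dense distribution acts as the fixed target.
    return F.kl_div(log_indexer, dense_attn_probs, reduction="batchmean")

loss = indexer_alignment_loss(torch.randn(256, 256, requires_grad=True),
                              F.softmax(torch.randn(256, 256), dim=-1))
loss.backward()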

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

GRPO with more than 10 Percent RL Compute

On top of the sparse architecture, DeepSeek-V3.2 uses Group Relative Policy Optimization (GRPO) as the main reinforcement learning method. The research team states that post-training reinforcement learning (RL) compute exceeds 10 percent of pre-training compute.

RL is organized around specialist domains. The research team trains dedicated runs for mathematics, competitive programming, general logical reasoning, browsing and agent tasks and safety, then distills these specialists into the shared 685B parameter base for DeepSeek-V3.2 and DeepSeek-V3.2-Speciale. GRPO is implemented with an unbiased KL estimator, off policy sequence masking and mechanisms that keep Mixture of Experts (MoE) routing and sampling masks consistent between training and sampling.

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Agent Data, Thinking Mode and Tool Protocol

The DeepSeek research team builds a large synthetic agent dataset by generating more than 1,800 environments and more than 85,000 tasks across code agents, search agents, general tools and code interpreter setups. Tasks are constructed to be hard to solve and easy to verify, and are used as RL targets together with real coding and search traces.

At inference time, DeepSeek-V3.2 introduces explicit thinking and non thinking modes. The deepseek-reasoner endpoint exposes thinking mode by default, where the model produces an internal chain of thought before the final answer. The thinking with tools guide describes how reasoning content is kept across tool calls and cleared when a new user message arrives, and how tool calls and tool results stay in the context even when reasoning text is trimmed for budget.

The chat template is updated around this behavior. The DeepSeek-V3.2 Speciale repository ships Python encoder and decoder helpers instead of a Jinja template. Messages can carry a reasoning_content field alongside content, controlled by a thinking parameter. A developer role is reserved for search agents and is not accepted in general chat flows by the official API, which protects this channel from accidental misuse.
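As an illustration of the message shape implied by that description, a request might be assembled roughly as follows. The field names follow the text above, but the exact payload layout and where the thinking flag lives are assumptions; the authoritative format is whatever the shipped encoder and decoder helpers produce.

# Illustrative only: message structure with reasoning_content, per the description above.
# The authoritative encoding is produced by the Python encoder/decoder helpers in the repo.
messages = [
    {"role": "user", "content": "How many prime numbers are below 50?"},
    {
        "role": "assistant",
        "reasoning_content": "List 2, 3, 5, 7, ... 47 and count them step by step.",
        "content": "There are 15 prime numbers below 50.",
    },
    {"role": "user", "content": "And below 100?"},   # new user turn, earlier reasoning is cleared
]
request = {"model": "deepseek-reasoner", "messages": messages, "thinking": True}  # flag placement assumed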

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Benchmarks, Competitions And Open Artifacts

On standard reasoning and coding benchmarks, DeepSeek-V3.2 and especially DeepSeek-V3.2 Speciale are reported as comparable to GPT-5 and close to Gemini-3.0 Pro on suites such as AIME 2025, HMMT 2025, GPQA and LiveCodeBench, with improved cost efficiency on long context workloads.

For formal competitions, the DeepSeek research team states that DeepSeek-V3.2 Speciale achieves gold medal level performance on the International Mathematical Olympiad 2025, the Chinese Mathematical Olympiad 2025 and the International Olympiad in Informatics 2025, and competitive gold medal level performance at the ICPC World Finals 2025.

Key Takeaways

DeepSeek-V3.2 adds DeepSeek Sparse Attention, which brings near linear O(kL) attention cost and delivers around 50% lower long context API cost compared to previous dense DeepSeek models, while keeping quality similar to DeepSeek-V3.1 Terminus.

The model family keeps the 671B parameter MoE backbone with 37B active parameters per token and exposes a full 128K context window in production APIs, which makes long documents, multi step chains and large tool traces practical rather than a lab only feature.

Post training uses Group Relative Policy Optimization (GRPO) with a compute budget that is more than 10 percent of pre-training, focused on math, code, general reasoning, browsing or agent workloads and safety, along with contest style specialists whose cases are released for external verification.

DeepSeek-V3.2 is the first model in the DeepSeek family to integrate thinking directly into tool use, supporting both thinking and non thinking tool modes and a protocol where internal reasoning persists across tool calls and is reset only on new user messages.

Check out the Paper and Model weights. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post DeepSeek Researchers Introduce DeepSeek-V3.2 and DeepSeek-V3.2-Speciale for Long Context Reasoning and Agentic Workloads appeared first on MarkTechPost.

MiniMax-M2: Technical Deep Dive into Interleaved Thinking for Agentic Coding Workflows

The AI coding landscape just got a massive shake-up. If you’ve been relying on Claude 3.5 Sonnet or GPT-4o for your dev workflows, you know the pain: great performance often comes with a bill that makes your wallet weep, or latency that breaks your flow. This article provides a technical overview of MiniMax-M2, focusing on its core design choices and capabilities, and how it changes the price-to-performance baseline for agentic coding workflows.

Branded as ‘Mini Price, Max Performance,’ MiniMax-M2 targets agentic coding workloads with around 2x the speed of leading competitors at roughly 8% of their price. The key change is not only cost efficiency, but a different computational and reasoning pattern in how the model structures and executes its “thinking” during complex tool and code workflows.

The Secret Sauce: Interleaved Thinking

The standout feature of MiniMax-M2 is its native mastery of Interleaved Thinking. 

But what does that actually mean?

Most LLMs operate in a linear “Chain of Thought” (CoT) where they do all their planning upfront and then fire off a series of tool calls (like running code or searching the web). The problem? If the first tool call returns unexpected data, the initial plan becomes stale, leading to “state drift” where the model keeps hallucinating a path that no longer exists.

Interleaved Thinking changes the game by creating a dynamic Plan -> Act -> Reflect loop.

Instead of front-loading all the logic, MiniMax-M2 alternates between explicit reasoning and tool use. It reasons, executes a tool, reads the output, and then reasons again based on that fresh evidence. This allows the model to:

Self-Correct: If a shell command fails, it reads the error and adjusts its next move immediately.

Preserve State: It carries forward hypotheses and constraints between steps, preventing the “memory loss” common in long coding tasks.

Handle Long Horizons: This approach is critical for complex agentic workflows (like building an entire app feature) where the path isn’t clear from step one.
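In pseudocode terms, the difference from a plan-everything-upfront agent is roughly the loop below. The helper functions are placeholders for LLM and tool calls, not MiniMax's API; the point is only that reasoning is re-run after every observation.

def interleaved_agent(task, llm_reason, llm_decide, run_tool, max_steps=20):
    """Plan -> Act -> Reflect loop: reasoning is redone after every tool observation."""
    state = {"task": task, "history": []}
    for _ in range(max_steps):
        thought = llm_reason(state)              # reason over the latest evidence, not a stale plan
        action = llm_decide(state, thought)      # pick the next tool call or the final answer
        if action["type"] == "final_answer":
            return action["content"]
        observation = run_tool(action)           # e.g. shell command, code edit, web search
        # Reflect: failures and results feed directly into the next reasoning pass.
        state["history"].append({"thought": thought, "action": action, "observation": observation})
    return "Step budget exhausted without a final answer"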

Benchmarks show the impact is real: enabling Interleaved Thinking boosted MiniMax-M2’s score on SWE-Bench Verified by over 3% and on BrowseComp by a massive 40%.

Powered by Mixture of Experts (MoE): Speed Meets Smarts

How does MiniMax-M2 achieve low latency while being smart enough to replace a senior dev? The answer lies in its Mixture of Experts (MoE) architecture.

MiniMax-M2 is a massive model with 230 billion total parameters, but it utilizes a “sparse” activation technique. For any given token generation, it only activates 10 billion parameters.

This design delivers the best of both worlds:

Huge Knowledge Base: You get the deep world knowledge and reasoning capacity of a 200B+ model.

Blazing Speed: Inference runs with the lightness of a 10B model, enabling high throughput and low latency.

For interactive agents like Claude Code, Cursor, or Cline, this speed is non-negotiable. You need the model to think, code, and debug in real-time without the “thinking…” spinner of death.

Agent & Code Native

MiniMax-M2 wasn’t just trained on text; it was developed for end-to-end developer workflows. It excels at handling robust toolchains including MCP (Model Context Protocol), shell execution, browser retrieval, and complex codebases.

It is already being integrated into the heavy hitters of the AI coding world:

Claude Code

Cursor

Cline

Kilo Code

Droid

The Economics: 90% Cheaper than the Competition

The pricing structure is perhaps the most aggressive we’ve seen for a model of this caliber. MiniMax is practically giving away “intelligence” compared to the current market leaders.

API Pricing (vs Claude 3.5 Sonnet), with a quick worked cost example after the list:

Input Tokens: $0.3 / Million (10% of Sonnet’s cost)

Cache Hits: $0.03 / Million (10% of Sonnet’s cost)

Output Tokens: $1.2 / Million (8% of Sonnet’s cost)
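To put those rates in perspective, here is a quick back-of-the-envelope cost calculation using the numbers above; the workload mix is an arbitrary example, not a MiniMax benchmark.

# Rates listed above, in dollars per million tokens.
INPUT_RATE, CACHE_RATE, OUTPUT_RATE = 0.3, 0.03, 1.2

def monthly_cost(input_m, cached_m, output_m):
    """Cost in dollars for token volumes expressed in millions of tokens."""
    return input_m * INPUT_RATE + cached_m * CACHE_RATE + output_m * OUTPUT_RATE

# Hypothetical agentic coding month: 500M fresh input, 2,000M cache hits, 150M output tokens.
print(f"${monthly_cost(500, 2000, 150):,.2f}")   # -> $390.00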

For individual developers, they offer tiered Coding Plans that undercut the market significantly:

Starter: $10/month (Includes a $2 first-month promo).

Pro: $20/month.

Max: $50/month (Up to 5x the usage limit of Claude Code Max).

As if that were not enough, MiniMax recently launched a Global Developer Ambassador Program, a global initiative designed to empower independent ML and LLM developers. The program invites builders to collaborate directly with the MiniMax R&D team to shape the future.

The company is seeking developers with proven open-source experience who are already familiar with MiniMax models and active on platforms like GitHub and Hugging Face.

Key Program Highlights:

The Incentives: Ambassadors receive complimentary access to the MiniMax-M2 Max Coding Plan, early access to unreleased video and audio models, direct feedback channels with product leads, and potential full-time career opportunities.

The Role: Participants are expected to build public demos, create open-source tools, and provide critical feedback on APIs before public launches.

You can sign up here.

Editorial Notes

MiniMax-M2 challenges the idea that “smarter” must mean “slower” or “more expensive.” By leveraging MoE efficiency and Interleaved Thinking, it offers a compelling alternative for developers who want to run autonomous agents without bankrupting their API budget.

As we move toward a world where AI agents don’t just write code but architect entire systems, the ability to “think, act, and reflect” continuously, at a price that allows for thousands of iterations, might just make M2 the new standard for AI engineering.

Thanks to the MiniMax AI team for the thought leadership and resources for this article. The MiniMax AI team has supported this content.
The post MiniMax-M2: Technical Deep Dive into Interleaved Thinking for Agentic Coding Workflows appeared first on MarkTechPost.

How to Design an Advanced Multi-Page Interactive Analytics Dashboard with Dynamic Filtering, Live KPIs, and Rich Visual Exploration Using Panel

In this tutorial, we build an advanced multi-page interactive dashboard using Panel. Through each component of implementation, we explore how to generate synthetic data, apply rich filters, visualize dynamic time-series trends, compare segments and regions, and even simulate live KPI updates. We design the system step by step so we can truly understand how each widget, callback, and plotting function comes together to create a smooth, reactive analytics experience. Check out the Full Codes here.

import sys, subprocess

def install_deps():
    pkgs = ["panel", "hvplot", "pandas", "numpy", "bokeh"]
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + pkgs)

try:
    import panel as pn
    import hvplot.pandas
    import pandas as pd
    import numpy as np
except ImportError:
    install_deps()
    import panel as pn
    import hvplot.pandas
    import pandas as pd
    import numpy as np

pn.extension()

rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", periods=365, freq="D")
segments = ["A", "B", "C"]
regions = ["North", "South", "East", "West"]

base = pd.DataFrame(
    {
        "date": np.tile(dates, len(segments) * len(regions)),
        "segment": np.repeat(segments, len(dates) * len(regions)),
        "region": np.repeat(np.tile(regions, len(segments)), len(dates)),
    }
)
base["traffic"] = (
    100
    + 40 * np.sin(2 * np.pi * base["date"].dt.dayofyear / 365)
    + rng.normal(0, 15, len(base))
)
trend = {"A": 1.0, "B": 1.5, "C": 2.0}
base["traffic"] *= base["segment"].map(trend)
base["conversions"] = (base["traffic"] * rng.uniform(0.01, 0.05, len(base))).astype(int)
base["revenue"] = base["conversions"] * rng.uniform(20, 60, len(base))
df = base.reset_index(drop=True)

We install all required dependencies and load Panel, hvPlot, Pandas, and NumPy so the dashboard runs smoothly in Colab. We generate a full year of synthetic time-series data across segments and regions, providing a rich dataset for exploration. By the end of this block, we will have a clean, ready-to-use dataframe for all upcoming visualizations. Check out the Full Codes here.

segment_sel = pn.widgets.CheckBoxGroup(name="Segment", value=segments[:2], options=segments, inline=True)
region_sel = pn.widgets.MultiChoice(name="Region", value=["North"], options=regions)
metric_sel = pn.widgets.Select(name="Metric", value="traffic", options=["traffic", "conversions", "revenue"])
date_range = pn.widgets.DateRangeSlider(
    name="Date Range",
    start=df["date"].min(),
    end=df["date"].max(),
    value=(df["date"].min(), df["date"].max()),
)
smooth_slider = pn.widgets.IntSlider(name="Rolling Window (days)", start=1, end=30, value=7)

def filtered_df(segment, region, drange):
    d1, d2 = drange
    mask = (
        df["segment"].isin(segment)
        & df["region"].isin(region or regions)
        & (df["date"] >= d1)
        & (df["date"] <= d2)
    )
    sub = df[mask].copy()
    if sub.empty:
        return df.iloc[:0]
    return sub

@pn.depends(segment_sel, region_sel, metric_sel, smooth_slider, date_range)
def timeseries_plot(segment, region, metric, window, drange):
    data = filtered_df(segment, region, drange)
    if data.empty:
        return pn.pane.Markdown("### No data for current filters")
    grouped = data.sort_values("date").groupby("date")[metric].sum()
    line = grouped.hvplot.line(title=f"{metric.title()} over time", ylabel=metric.title())
    if window > 1:
        smooth = grouped.rolling(window).mean().hvplot.line(line_width=3, alpha=0.6)
        return (line * smooth).opts(legend_position="top_left")
    return line

We build the interactive widgets and the filtering logic that controls the entire dashboard. We wire the time-series plot to the widgets using reactive @pn.depends, letting us change segments, regions, metrics, date ranges, and smoothing windows instantly. With this setup, we can switch perspectives fluidly and see the effects in real time. Check out the Full Codes here.

@pn.depends(segment_sel, region_sel, metric_sel, date_range)
def segment_bar(segment, region, metric, drange):
    data = filtered_df(segment, region, drange)
    if data.empty:
        return pn.pane.Markdown("### No data to aggregate")
    agg = data.groupby("segment")[metric].sum().sort_values(ascending=False)
    return agg.hvplot.bar(title=f"{metric.title()} by Segment", yaxis=None)

@pn.depends(segment_sel, region_sel, metric_sel, date_range)
def region_heatmap(segment, region, metric, drange):
    data = filtered_df(segment, region, drange)
    if data.empty:
        return pn.pane.Markdown("### No data to aggregate")
    pivot = data.pivot_table(index="segment", columns="region", values=metric, aggfunc="sum")
    return pivot.hvplot.heatmap(title=f"{metric.title()} Heatmap", clabel=metric.title())

We construct additional visual layers: a segment-level bar chart and a region-segment heatmap. We let these charts react to the same global filters, so they update automatically whenever we make a selection. This gives us a deeper breakdown of patterns across categories without writing redundant code. Check out the Full Codes here.

kpi_source = df.copy()
kpi_idx = [0]

def compute_kpi(slice_df):
    if slice_df.empty:
        return 0, 0, 0
    total_rev = slice_df["revenue"].sum()
    avg_conv = slice_df["conversions"].mean()
    cr = (slice_df["conversions"].sum() / slice_df["traffic"].sum()) * 100
    return total_rev, avg_conv, cr

kpi_value = pn.indicators.Number(name="Total Revenue (window)", value=0, format="$0,0")
conv_value = pn.indicators.Number(name="Avg Conversions", value=0, format="0.0")
cr_value = pn.indicators.Number(name="Conversion Rate", value=0, format="0.00%")

def update_kpis():
    step = 200
    start = kpi_idx[0]
    end = start + step
    if start >= len(kpi_source):
        kpi_idx[0] = 0
        start, end = 0, step
    window_df = kpi_source.iloc[start:end]
    kpi_idx[0] = end
    total_rev, avg_conv, cr = compute_kpi(window_df)
    kpi_value.value = total_rev
    conv_value.value = avg_conv
    cr_value.value = cr / 100

pn.state.add_periodic_callback(update_kpis, period=1000, start=True)

We simulate a rolling stream of KPIs that update every second, creating a live-dashboard experience. We compute total revenue, average conversions, and conversion rate inside a sliding window and push the values to Panel’s numeric indicators. This lets us observe how metrics evolve continuously, just like a real monitoring system. Check out the Full Codes here.

controls = pn.WidgetBox(
    "### Global Controls",
    segment_sel,
    region_sel,
    metric_sel,
    date_range,
    smooth_slider,
    sizing_mode="stretch_width",
)

page_overview = pn.Column(
    pn.pane.Markdown("## Overview: Filtered Time Series"),
    controls,
    timeseries_plot,
)

page_insights = pn.Column(
    pn.pane.Markdown("## Segment & Region Insights"),
    pn.Row(segment_bar, region_heatmap),
)

page_live = pn.Column(
    pn.pane.Markdown("## Live KPI Window (simulated streaming)"),
    pn.Row(kpi_value, conv_value, cr_value),
)

dashboard = pn.Tabs(
    ("Overview", page_overview),
    ("Insights", page_insights),
    ("Live KPIs", page_live),
)

dashboard

We assemble all components into a clean multi-page layout using Tabs. We organize the dashboard into an overview page, an insights page, and a live-KPI page, making navigation simple and intuitive. With this structure, we get a polished, interactive analytics application ready to run directly in Google Colab.

In conclusion, we see how seamlessly we can combine Panel widgets, hvPlot visualizations, and periodic callbacks to build a powerful analytics dashboard. We appreciate how every module, from filtering logic to bar charts to the live KPI stream, fits together to produce a cohesive multi-page interface that runs effortlessly. We finish with a complete, interactive system that we can extend into real-world reporting, experimentation, or production-grade dashboards.

Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post How to Design an Advanced Multi-Page Interactive Analytics Dashboard with Dynamic Filtering, Live KPIs, and Rich Visual Exploration Using Panel appeared first on MarkTechPost.

Meta AI Researchers Introduce Matrix: A Ray Native Decentralized Framework for Multi Agent Synthetic Data Generation

How do you keep synthetic data fresh and diverse for modern AI models without turning a single orchestration pipeline into the bottleneck? Meta AI researchers introduce Matrix, a decentralized framework where both control and data flow are serialized into messages that move through distributed queues. As LLM training increasingly relies on synthetic conversations, tool traces and reasoning chains, most existing systems still depend on a central controller or domain specific setups, which wastes GPU capacity, adds coordination overhead and limits data diversity. Matrix instead uses peer to peer agent scheduling on a Ray cluster and delivers 2 to 15 times higher token throughput on real workloads while maintaining comparable quality.

https://arxiv.org/pdf/2511.21686

From Centralized Controllers to Peer to Peer Agents

Traditional agent frameworks keep workflow state and control logic inside a central orchestrator. Every agent call, tool call and retry goes through that controller. This model is easy to reason about, but it does not scale well when you need tens of thousands of concurrent synthetic dialogues or tool trajectories.

Matrix takes a different approach. It serializes both control flow and data flow into a message object called an orchestrator. The orchestrator holds the task state, including conversation history, intermediate results and routing logic. Stateless agents, implemented as Ray actors, pull an orchestrator from a distributed queue, apply their role specific logic, update the state and then send it directly to the next agent selected by the orchestrator. There is no central scheduler in the inner loop. Each task advances independently at row level, rather than waiting for batch level barriers as in Spark or Ray Data.

This design reduces idle time when different trajectories have very different lengths. It also makes fault handling local to a task. If one orchestrator fails it does not stall a batch.
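A heavily simplified sketch of this pattern, using generic Ray actors and a Ray queue rather than the Matrix codebase, is shown below; the orchestrator is just a dictionary that carries its own state and routing decision from agent to agent.

import ray
from ray.util.queue import Queue

ray.init(ignore_reinit_error=True)

@ray.remote
class Agent:
    """Stateless worker: pulls an orchestrator message, applies its role, forwards it."""

    def __init__(self, role, queues):
        self.role, self.queues = role, queues

    def run(self, num_tasks):
        for _ in range(num_tasks):
            orch = self.queues[self.role].get()            # pull task state from my queue
            orch["history"].append(f"{self.role} handled step {orch['step']}")
            orch["step"] += 1
            next_role = orch["route"](orch)                 # routing logic travels with the task
            if next_role is not None:
                self.queues[next_role].put(orch)            # hand off directly to the next agent

def route(orch):
    """Toy control flow: alternate user -> assistant until four steps are done."""
    if orch["step"] >= 4:
        return None
    return "assistant" if orch["step"] % 2 == 1 else "user"

queues = {"user": Queue(), "assistant": Queue()}
agents = [Agent.remote(role, queues) for role in queues]
queues["user"].put({"step": 0, "history": [], "route": route})
ray.get([agent.run.remote(2) for agent in agents])

Because every task carries its own routing function and state, each conversation advances independently and no central scheduler sits in the inner loop.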

https://arxiv.org/pdf/2511.21686

System Stack and Services

Matrix runs on a Ray cluster that is usually launched on SLURM. Ray provides distributed actors and queues. Ray Serve exposes LLM endpoints behind vLLM and SGLang, and can also route to external APIs such as Azure OpenAI or Gemini through proxy servers.

Tool calls and other complex services run inside Apptainer containers. This isolates the agent runtime from code execution sandboxes, HTTP tools or custom evaluators. Hydra manages configuration for agent roles, orchestrator types, resource allocations and I/O schemas. Grafana integrates with Ray metrics to track queue length, pending tasks, token throughput and GPU utilization in real time.

Matrix also introduces message offloading. When conversation history grows beyond a size threshold, large payloads are stored in Ray’s object store and only object identifiers are kept in the orchestrator. This reduces cluster bandwidth while still allowing agents to reconstruct prompts when needed.
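The offloading trick itself relies only on standard Ray object store primitives (ray.put and ray.get), roughly as in this sketch; the size threshold and field names are illustrative, not Matrix's actual configuration, and a running Ray cluster is assumed.

import sys
import ray

PAYLOAD_THRESHOLD_BYTES = 32_000   # illustrative cutoff, not Matrix's actual setting

def maybe_offload(orchestrator: dict) -> dict:
    """Replace a large conversation history with an object store reference."""
    history = orchestrator.get("history")
    if history is not None and sys.getsizeof(str(history)) > PAYLOAD_THRESHOLD_BYTES:
        orchestrator["history_ref"] = ray.put(history)   # store the payload once in the object store
        orchestrator["history"] = None                    # the message keeps only a lightweight reference
    return orchestrator

def load_history(orchestrator: dict):
    """Agents reconstruct the prompt from the reference only when they actually need it."""
    if orchestrator.get("history") is not None:
        return orchestrator["history"]
    return ray.get(orchestrator["history_ref"])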

Case Study 1: Collaborative Reasoner

Collaborative Reasoner, also known as Coral, evaluates multi agent dialogue where two LLM agents discuss a question, disagree when needed and reach a final answer. In the original implementation a central controller manages thousands of self collaboration trajectories. Matrix reimplements the same protocol using peer to peer orchestrators and stateless agents.

On 31 A100 nodes, using LLaMA 3.1 8B Instruct, Matrix configures concurrency as 248 GPUs with 50 queries per GPU, so 12,400 concurrent conversations. The Coral baseline runs at its optimal concurrency of 5,000. Under identical hardware, Matrix generates about 2 billion tokens in roughly 4 hours, while Coral produces about 0.62 billion tokens in about 9 hours. That is a 6.8 times increase in token throughput with almost identical agreement correctness around 0.47.

https://arxiv.org/pdf/2511.21686

Case Study 2: NaturalReasoning Web Data Curation

NaturalReasoning constructs a reasoning dataset from large web corpora. Matrix models the pipeline with three agents. A Filter agent uses a smaller classifier model to select English passages that likely contain reasoning. A Score agent uses a larger instruction tuned model to assign quality scores. A Question agent extracts questions, answers and reasoning chains.

On 25 million DCLM web documents, only about 5.45 percent survive all filters, yielding around 1.19 million question answer pairs with associated reasoning steps. Matrix then compares different parallelism strategies on a 500 thousand document subset. The best configuration combines data parallelism and task parallelism, with 20 data partitions and 700 concurrent tasks per partition. This achieves about 1.61 times higher throughput than a setting that only scales task concurrency.

Over the full 25 million document run, Matrix reaches 5,853 tokens per second, compared to 2,778 tokens per second for a Ray Data batch baseline with 14,000 concurrent tasks. That corresponds to a 2.1 times throughput gain that comes purely from peer to peer row level scheduling, not from different models.

https://arxiv.org/pdf/2511.21686

Case Study 3: Tau2-Bench Tool Use Trajectories

Tau2-Bench evaluates conversational agents that must use tools and a database in a customer support setting. Matrix represents this environment with four agents, a user simulator, an assistant, a tool executor and a reward calculator, plus a sink that collects metrics. Tool APIs and reward logic are reused from the Tau2 reference implementation and are wrapped in containers.

On a cluster with 13 H100 nodes and dozens of LLM replicas, Matrix generates 22,800 trajectories in about 1.25 hours. That corresponds to roughly 41,000 tokens per second. The baseline Tau2-agent implementation on a single node, configured with 500 concurrent threads, reaches about 2,654 tokens per second and 1,519 trajectories. Average reward stays almost unchanged across both systems, which confirms that the speedup does not come from cutting corners in the environment. Overall, Matrix delivers about 15.4 times higher token throughput on this benchmark.

https://arxiv.org/pdf/2511.21686

Key Takeaways

Matrix replaces centralized orchestrators with a peer to peer, message driven agent architecture that treats each task as an independent state machine moving through stateless agents.

The framework is built entirely on an open source stack, SLURM, Ray, vLLM, SGLang and Apptainer, and scales to tens of thousands of concurrent multi agent workflows for synthetic data generation, benchmarking and data processing.

Across three case studies, Collaborative Reasoner, NaturalReasoning and Tau2-Bench, Matrix delivers about 2 to 15.4 times higher token throughput than specialized baselines under identical hardware, while maintaining comparable output quality and rewards.

Matrix offloads large conversation histories to Ray’s object store and keeps only lightweight references in messages, which reduces peak network bandwidth and supports high throughput LLM serving with gRPC based model backends.

Editorial Notes

Matrix is a pragmatic systems contribution that takes multi agent synthetic data generation from bespoke scripts to an operational runtime. By encoding control flow and data flow into orchestrators, then pushing execution into stateless P2P agents on Ray, it cleanly separates scheduling, LLM inference and tools. The case studies on Collaborative Reasoner, NaturalReasoning and Tau2-Bench show that careful systems design, not new model architectures, is now the main lever for scaling synthetic data pipelines.

Check out the Paper and Repo.
The post Meta AI Researchers Introduce Matrix: A Ray Native Decentralized Framework for Multi Agent Synthetic Data Generation appeared first on MarkTechPost.

StepFun AI Releases Step-Audio-R1: A New Audio LLM that Finally Benefi …

Why do current audio AI models often perform worse when they generate longer reasoning instead of grounding their decisions in the actual sound? The StepFun research team releases Step-Audio-R1, a new audio LLM designed for test time compute scaling, which addresses this failure mode by showing that the accuracy drop with chain of thought is not an audio limitation but a training and modality grounding problem.

https://arxiv.org/pdf/2511.15848

The Core Problem, Audio Models Reason over Text Surrogates

Most current audio models inherit their reasoning behavior from text training. They learn to reason as if they read transcripts, not as if they listen. The StepFun team calls this Textual Surrogate Reasoning. The model uses imagined words and descriptions instead of acoustic cues such as pitch contour, rhythm, timbre or background noise patterns.

This mismatch explains why longer chain of thought often hurts performance in audio. The model spends more tokens elaborating wrong or modality irrelevant assumptions. Step-Audio-R1 attacks this by forcing the model to justify answers using acoustic evidence. The training pipeline is organized around Modality Grounded Reasoning Distillation, MGRD, which selects and distills reasoning traces that explicitly reference audio features.

Architecture

The architecture stays close to the previous Step Audio systems:

A Qwen2 based audio encoder processes raw waveforms at 25 Hz.

An audio adaptor downsamples the encoder output by a factor of 2, to 12.5 Hz, and aligns frames to the language token stream.

A Qwen2.5 32B decoder consumes the audio features and generates text.

The decoder always produces an explicit reasoning block inside <think> and </think> tags, followed by the final answer. This separation lets training objectives shape the structure and content of reasoning without losing focus on task accuracy. The model is released as a 33B parameter audio text to text model on Hugging Face under Apache 2.0.
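
Because the reasoning block is always delimited, downstream code can separate deliberation from the final answer with simple string handling. The snippet below is a small illustrative parser for this output convention, not code from the Step-Audio-R1 repository, and the sample string is synthetic.

# Small illustrative parser for the <think> ... </think> output convention.
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a decoder output that uses <think> tags."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()
    return reasoning, answer

sample = "<think>The pitch rises sharply and the voice trembles.</think> The speaker sounds anxious."
reasoning, answer = split_reasoning(sample)
print(reasoning)  # acoustic justification
print(answer)     # final answer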

https://arxiv.org/pdf/2511.15848

Training Pipeline, from Cold Start to Audio Grounded RL

The pipeline has a supervised cold start stage and a reinforcement learning stage that both mix text and audio tasks.

Cold start uses about 5 million examples, covering 1 billion tokens of text only data and 4 billion tokens from audio paired data. Audio tasks include automatic speech recognition, paralinguistic understanding and audio question text answer style dialogs. A fraction of the audio data carries audio chain of thought traces generated by an earlier model. Text data covers multi turn dialog, knowledge question answering, math and code reasoning. All samples share a format where reasoning is wrapped in <think> tags, even when the reasoning block is initially empty.

Supervised learning trains Step-Audio-R1 to follow this format and to generate useful reasoning for both audio and text. This gives a baseline chain of thought behavior, but it is still biased toward text based reasoning.

Modality Grounded Reasoning Distillation, MGRD

MGRD is applied in several iterations. For each round, the research team samples audio questions where the label depends on real acoustic properties. For example, questions about speaker emotion, background events in sound scenes or musical structure. The current model produces multiple reasoning and answer candidates per question. A filter keeps only chains that meet three constraints:

They reference acoustic cues, not just textual descriptions or imagined transcripts.

They are logically coherent as short step by step explanations.

Their final answers are correct according to labels or programmatic checks.

These accepted traces form a distilled audio chain of thought dataset. The model is fine tuned on this dataset together with the original text reasoning data. This is followed by Reinforcement Learning with Verified Rewards, RLVR. For text questions, rewards are based on answer correctness. For audio questions, the reward mixes answer correctness and reasoning format, with a typical weighting of 0.8 for accuracy and 0.2 for reasoning. Training uses PPO with about 16 responses sampled per prompt and supports sequences up to around 10,240 tokens to allow long deliberation.
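
A minimal sketch of how such a mixed reward could be computed for one audio sample is shown below. The 0.8 and 0.2 weights follow the split reported above, while the two checking functions are simplified placeholders rather than the paper's actual verifiers.

# Sketch of the RLVR reward mix for audio questions: 0.8 on answer correctness,
# 0.2 on reasoning format. Both check functions are placeholders.
def answer_reward(predicted: str, reference: str) -> float:
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0

def format_reward(output: str) -> float:
    # Reward a well formed reasoning block followed by an answer.
    return 1.0 if "<think>" in output and "</think>" in output else 0.0

def audio_rlvr_reward(output: str, predicted_answer: str, reference: str,
                      w_acc: float = 0.8, w_fmt: float = 0.2) -> float:
    return w_acc * answer_reward(predicted_answer, reference) + w_fmt * format_reward(output)

print(audio_rlvr_reward("<think>rising pitch</think> anxious", "anxious", "Anxious"))  # 1.0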

https://arxiv.org/pdf/2511.15848

Benchmarks, closing the gap to Gemini 3 Pro

On a combined speech to text benchmark suite that includes Big Bench Audio, Spoken MQA, MMSU, MMAU and Wild Speech, Step-Audio-R1 reaches an average score of about 83.6 percent. Gemini 2.5 Pro reports about 81.5 percent and Gemini 3 Pro reaches about 85.1 percent. On Big Bench Audio alone, Step-Audio-R1 reaches about 98.7 percent, which is higher than both Gemini versions.

For speech to speech reasoning, the Step-Audio-R1 Realtime variant adopts listen while thinking and think while speaking style streaming. On Big Bench Audio speech to speech, it reaches about 96.1 percent reasoning accuracy with first packet latency around 0.92 seconds. This score surpasses GPT based realtime baselines and Gemini 2.5 Flash style native audio dialogs while keeping sub second interaction.

https://arxiv.org/pdf/2511.15848

Ablations, what matters for audio reasoning

The ablation section provides several design signals for engineers:

A reasoning format reward is necessary. Without it, reinforcement learning tends to shorten or remove chain of thought, which lowers audio benchmark scores.

RL data should target medium difficulty problems. Selecting questions where pass at 8 lies in a middle band gives more stable rewards and maintains long reasoning.

Scaling RL audio data without such selection does not help. Quality of prompts and labels matters more than raw size.

The researchers also describe a self cognition correction pipeline that reduces the frequency of answers such as ‘I can only read text and cannot hear audio’ in a model that is trained to process sound. This uses Direct Preference Optimization on curated preference pairs where correct behavior is to acknowledge and use audio input.

Key Takeaways

Step-Audio-R1 is one of the first audio language models that turns longer chain of thought into a consistent accuracy gain for audio tasks, solving the inverted scaling failure seen in previous audio LLMs.

The model explicitly targets Textual Surrogate Reasoning by using Modality Grounded Reasoning Distillation, which filters and distills only those reasoning traces that rely on acoustic cues such as pitch, timbre and rhythm instead of imagined transcripts.

Architecturally, Step-Audio-R1 combines a Qwen2 based audio encoder with an adaptor and a Qwen2.5 32B decoder that always generates <think> reasoning segments before answers, and is released as a 33B audio text to text model under Apache 2.0.

Across comprehensive audio understanding and reasoning benchmarks covering speech, environmental sounds and music, Step-Audio-R1 surpasses Gemini 2.5 Pro and reaches performance comparable to Gemini 3 Pro, while also supporting a realtime variant for low latency speech to speech interaction.

The training recipe combines large scale supervised chain of thought, modality grounded distillation and Reinforcement Learning with Verified Rewards, providing a concrete and reproducible blueprint for building future audio reasoning models that actually benefit from test time compute scaling.

Editorial Notes

Step-Audio-R1 is an important release because it converts chain of thought from a liability into a useful tool for audio reasoning by directly addressing Textual Surrogate Reasoning with Modality Grounded Reasoning Distillation and Reinforcement Learning with Verified Rewards. It shows that test time compute scaling can benefit audio models when reasoning is anchored in acoustic features and delivers benchmark results comparable to Gemini 3 Pro while remaining open and practically usable for engineers. Overall this research work turns extended deliberation in audio LLMs from a consistent failure mode into a controllable and reproducible design pattern.

Check out the Paper, Repo, Project Page and Model Weights.
The post StepFun AI Releases Step-Audio-R1: A New Audio LLM that Finally Benefits from Test Time Compute Scaling appeared first on MarkTechPost.

NVIDIA AI Releases Orchestrator-8B: A Reinforcement Learning Trained C …

How can an AI system learn to pick the right model or tool for each step of a task instead of always relying on one large model for everything? NVIDIA researchers release ToolOrchestra, a novel method for training a small language model to act as the orchestrator, the ‘brain’ of a heterogeneous tool-use agent.

https://arxiv.org/pdf/2511.21689

From Single Model Agents to an Orchestration Policy

Most current agents follow a simple pattern. A single large model such as GPT-5 receives a prompt that describes available tools, then decides when to call web search or a code interpreter. All high level reasoning still stays inside the same model. ToolOrchestra changes this setup. It trains a dedicated controller model, called Orchestrator-8B, that treats both classic tools and other LLMs as callable components.

A pilot study in the same research shows why naive prompting is not enough. When Qwen3-8B is prompted to route between GPT-5, GPT-5 mini, Qwen3-32B and Qwen2.5-Coder-32B, it delegates 73 percent of cases to GPT-5. When GPT-5 acts as its own orchestrator, it calls GPT-5 or GPT-5 mini in 98 percent of cases. The research team call these self enhancement and other enhancement biases. The routing policy over uses strong models and ignores cost instructions.

ToolOrchestra instead trains a small orchestrator explicitly for this routing problem, using reinforcement learning over full multi turn trajectories.

What is Orchestrator-8B?

Orchestrator-8B is an 8B parameter decoder only Transformer. It is built by fine tuning Qwen3-8B as an orchestration model and released on Hugging Face.

At inference time, the system runs a multi turn loop that alternates reasoning and tool calls. The rollout has three main steps. First, Orchestrator 8B reads the user instruction and an optional natural language preference description, for example a request to prioritize low latency or to avoid web search. Second, it generates internal chain of thought style reasoning and plans an action. Third, it chooses a tool from the available set and emits a structured tool call in a unified JSON format. The environment executes that call, appends the result as an observation and feeds it back into the next step. The process stops when a termination signal is produced or a maximum of 50 turns is reached.

Tools cover three main groups. Basic tools include Tavily web search, a Python sandbox code interpreter and a local Faiss index built with Qwen3-Embedding-8B. Specialized LLMs include Qwen2.5-Math-72B, Qwen2.5-Math-7B and Qwen2.5-Coder-32B. Generalist LLM tools include GPT-5, GPT-5 mini, Llama 3.3-70B-Instruct and Qwen3-32B. All tools share the same schema with names, natural language descriptions and typed parameter specs.
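
The unified schema means every callable component, whether a web search or another LLM, is described and invoked in the same way. The structures below are an illustrative sketch of that idea; the field names are assumptions, not the exact format used by Orchestrator-8B.

# Illustrative sketch of a unified tool schema and one structured tool call step.
# Field names are assumptions for illustration only.
import json

TOOLS = [
    {
        "name": "web_search",
        "description": "Search the web and return top results.",
        "parameters": {"query": {"type": "string"}},
    },
    {
        "name": "gpt_5_mini",
        "description": "Delegate a subproblem to a cheaper generalist LLM.",
        "parameters": {"prompt": {"type": "string"}},
    },
]

# A single step emitted by the orchestrator: reasoning text plus one tool call.
step = {
    "reasoning": "The question needs a current fact, so a web search is cheaper than GPT-5.",
    "tool_call": {"name": "web_search", "arguments": {"query": "latest IMO 2025 results"}},
}
print(json.dumps(step, indent=2))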

End to End Reinforcement Learning with Multi Objective Rewards

ToolOrchestra formulates the whole workflow as a Markov Decision Process. The state contains the conversation history, past tool calls and observations, and user preferences. Actions are the next text step, including both reasoning tokens and a tool call schema. After up to 50 steps, the environment computes a scalar reward for the full trajectory.

The reward has three components. Outcome reward is binary and depends on whether the trajectory solves the task. For open-ended answers, GPT-5 is used as a judge to compare the model output with the reference. Efficiency rewards penalize both monetary cost and wall clock latency. Token usage for proprietary and open source tools is mapped to monetary cost using public API and Together AI pricing. Preference reward measures how well tool usage matches a user preference vector that can increase or decrease the weight on cost, latency or specific tools. These components are combined into a single scalar using the preference vector.
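
To make the shape of this objective concrete, here is a small sketch that mixes the three terms with a preference vector. The weights and the cost and latency scaling are illustrative assumptions, not the paper's exact formula.

# Sketch of a multi objective trajectory reward: outcome, efficiency (cost and
# latency), and preference matching, mixed by a preference vector.
def trajectory_reward(solved: bool, cost_usd: float, latency_min: float,
                      preference_match: float, prefs: dict) -> float:
    outcome = 1.0 if solved else 0.0
    efficiency = -(prefs["cost"] * cost_usd + prefs["latency"] * latency_min)
    preference = prefs["preference"] * preference_match
    return prefs["outcome"] * outcome + efficiency + preference

# Illustrative preference vector that cares about accuracy first, then cost.
prefs = {"outcome": 1.0, "cost": 0.5, "latency": 0.02, "preference": 0.3}
print(trajectory_reward(True, cost_usd=0.092, latency_min=8.2, preference_match=0.9, prefs=prefs))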

The policy is optimized with Group Relative Policy Optimization GRPO, a variant of policy gradient reinforcement learning that normalizes rewards within groups of trajectories for the same task. The training process includes filters that drop trajectories with invalid tool call format or weak reward variance to stabilize optimization.

https://arxiv.org/pdf/2511.21689

To make this training possible at scale, the research team plans to introduce ToolScale, a synthetic dataset of multi step tool calling tasks. For each domain, an LLM generates a database schema, database entries, domain specific APIs and then diverse user tasks with ground truth sequences of function calls and required intermediate information.

Benchmark results and cost profile

NVIDIA research team evaluates Orchestrator-8B on three challenging benchmarks, Humanity’s Last Exam, FRAMES and τ² Bench. These benchmarks target long horizon reasoning, factuality under retrieval and function calling in a dual control environment.

On Humanity’s Last Exam text only questions, Orchestrator-8B reaches 37.1 percent accuracy. GPT-5 with basic tools reaches 35.1 percent in the same setting. On FRAMES, Orchestrator-8B achieves 76.3 percent versus 74.0 percent for GPT-5 with tools. On τ² Bench, Orchestrator-8B scores 80.2 percent versus 77.7 percent for GPT-5 with basic tools.

https://arxiv.org/pdf/2511.21689

The efficiency gap is larger. In the configuration that uses basic tools plus specialized and generalist LLM tools, Orchestrator-8B has average cost 9.2 cents and latency 8.2 minutes per query, averaged over Humanity’s Last Exam and FRAMES. In the same configuration, GPT-5 costs 30.2 cents and takes 19.8 minutes on average. The model card summarizes this as about 30 percent of the monetary cost and 2.5 times faster for Orchestrator-8B compared to GPT-5.

Tool use analysis supports this picture. Claude Opus 4.1 used as an orchestrator calls GPT-5 most of the time. GPT-5 used as an orchestrator prefers GPT-5 mini. Orchestrator-8B spreads calls more evenly across strong models, cheaper models, search, local retrieval and the code interpreter, and reaches higher accuracy at lower cost for the same turn budget.

https://arxiv.org/pdf/2511.21689

Generalization experiments replace the training time tools with unseen models such as OpenMath Llama-2-70B, DeepSeek-Math-7B-Instruct, Codestral-22B-v0.1, Claude Sonnet-4.1 and Gemma-3-27B. Orchestrator-8B still achieves the best trade off between accuracy, cost and latency among all baselines in this setting. A separate preference aware test set shows that Orchestrator-8B also tracks user tool usage preferences more closely than GPT-5, Claude Opus-4.1 and Qwen3-235B-A22B under the same reward metric.

Key Takeaways

ToolOrchestra trains an 8B parameter orchestration model, Orchestrator-8B, that selects and sequences tools and LLMs to solve multi step agentic tasks using reinforcement learning with outcome, efficiency and preference aware rewards.

Orchestrator-8B is released as an open weight model on Hugging Face. It is designed to coordinate diverse tools such as web search, code execution, retrieval and specialist LLMs through a unified schema.

On Humanity’s Last Exam, Orchestrator-8B reaches 37.1 percent accuracy, surpassing GPT-5 at 35.1 percent, while being about 2.5 times more efficient, and on τ² Bench and FRAMES it outperforms GPT-5 while using roughly 30 percent of the cost.

The framework shows that naive prompting of a frontier LLM as its own router leads to self enhancement bias where it overuses itself or a small set of strong models, while a trained orchestrator learns a more balanced, cost aware routing policy over multiple tools.

Editorial Notes

NVIDIA’s ToolOrchestra is a practical step toward compound AI systems where an 8B orchestration model, Orchestrator-8B, learns an explicit routing policy over tools and LLMs instead of relying on a single frontier model. It shows clear gains on Humanity’s Last Exam, FRAMES and τ² Bench with about 30 percent of the cost and around 2.5 times better efficiency than GPT-5 based baselines, which makes it directly relevant for teams that care about accuracy, latency and budget. This launch makes orchestration policy a first class optimization target in AI systems.

Check out the Paper, Repo, Project Page and Model Weights.
The post NVIDIA AI Releases Orchestrator-8B: A Reinforcement Learning Trained Controller for Efficient Tool and Model Selection appeared first on MarkTechPost.

A Coding Guide to Design an Agentic AI System Using a Control-Plane Ar …

In this tutorial, we build an advanced Agentic AI using the control-plane design pattern, and we walk through each component step by step as we implement it. We treat the control plane as the central orchestrator that coordinates tools, manages safety rules, and structures the reasoning loop. We also set up a miniature retrieval system, define modular tools, and integrate an agentic reasoning layer that dynamically plans and executes actions. Finally, we observe how the entire system behaves like a disciplined, tool-aware AI capable of retrieving knowledge, assessing understanding, updating learner profiles, and logging all interactions through a unified, scalable architecture. Check out the FULL CODES here.

import subprocess
import sys

def install_deps():
    deps = ['anthropic', 'numpy', 'scikit-learn']
    for dep in deps:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', dep])

try:
    import anthropic
except ImportError:
    install_deps()
    import anthropic

import json
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from dataclasses import dataclass, asdict
from typing import List, Dict, Any, Optional
from datetime import datetime

@dataclass
class Document:
    id: str
    content: str
    metadata: Dict[str, Any]
    embedding: Optional[np.ndarray] = None

class SimpleRAGRetriever:
    def __init__(self):
        self.documents = self._init_knowledge_base()

    def _init_knowledge_base(self) -> List[Document]:
        docs = [
            Document("cs101", "Python basics: Variables store data. Use x=5 for integers, name='Alice' for strings. Print with print().", {"topic": "python", "level": "beginner"}),
            Document("cs102", "Functions encapsulate reusable code. Define with def func_name(params): and call with func_name(args).", {"topic": "python", "level": "intermediate"}),
            Document("cs103", "Object-oriented programming uses classes. class MyClass: defines structure, __init__ initializes instances.", {"topic": "python", "level": "advanced"}),
            Document("math101", "Linear algebra: Vectors are ordered lists of numbers. Matrix multiplication combines transformations.", {"topic": "math", "level": "intermediate"}),
            Document("ml101", "Machine learning trains models on data to make predictions. Supervised learning uses labeled examples.", {"topic": "ml", "level": "beginner"}),
            Document("ml102", "Neural networks are composed of layers. Each layer applies weights and activation functions to transform inputs.", {"topic": "ml", "level": "advanced"}),
        ]
        # Mock embeddings: random vectors with a per-document bump so retrieval is non-trivial.
        for i, doc in enumerate(docs):
            doc.embedding = np.random.rand(128)
            doc.embedding[i*20:(i+1)*20] += 2
        return docs

    def retrieve(self, query: str, top_k: int = 2) -> List[Document]:
        # Mock query embedding; a real system would embed the query text.
        query_embedding = np.random.rand(128)
        scores = [cosine_similarity([query_embedding], [doc.embedding])[0][0] for doc in self.documents]
        top_indices = np.argsort(scores)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]

We set up all dependencies, import the libraries we rely on, and initialize the data structures for our knowledge base. We define a simple retriever and generate mock embeddings to simulate similarity search in a lightweight way. As we run this block, we prepare everything needed for retrieval-driven reasoning in the later components. Check out the FULL CODES here.

class ToolRegistry:
    def __init__(self, retriever: SimpleRAGRetriever):
        self.retriever = retriever
        self.interaction_log = []
        self.user_state = {"level": "beginner", "topics_covered": []}

    def search_knowledge(self, query: str, filters: Optional[Dict] = None) -> Dict:
        docs = self.retriever.retrieve(query, top_k=2)
        if filters:
            docs = [d for d in docs if all(d.metadata.get(k) == v for k, v in filters.items())]
        return {
            "tool": "search_knowledge",
            "results": [{"content": d.content, "metadata": d.metadata} for d in docs],
            "count": len(docs)
        }

    def assess_understanding(self, topic: str) -> Dict:
        questions = {
            "python": ["What keyword defines a function?", "How do you create a variable?"],
            "ml": ["What is supervised learning?", "Name two types of ML algorithms."],
            "math": ["What is a vector?", "Explain matrix multiplication."]
        }
        return {
            "tool": "assess_understanding",
            "topic": topic,
            "questions": questions.get(topic, ["General comprehension check."])
        }

    def update_learner_profile(self, topic: str, level: str) -> Dict:
        if topic not in self.user_state["topics_covered"]:
            self.user_state["topics_covered"].append(topic)
        self.user_state["level"] = level
        return {
            "tool": "update_learner_profile",
            "status": "updated",
            "profile": self.user_state.copy()
        }

    def log_interaction(self, event: str, details: Dict) -> Dict:
        log_entry = {
            "timestamp": datetime.now().isoformat(),
            "event": event,
            "details": details
        }
        self.interaction_log.append(log_entry)
        return {"tool": "log_interaction", "status": "logged", "entry_id": len(self.interaction_log)}

We build the tool registry that our agent uses while interacting with the system. We define tools such as knowledge search, assessments, profile updates, and logging, and we maintain a persistent user-state dictionary. As we use this layer, we see how each tool becomes a modular capability that the control plane can route to. Check out the FULL CODES here.

class ControlPlane:
    def __init__(self, tool_registry: ToolRegistry):
        self.tools = tool_registry
        self.safety_rules = {
            "max_tools_per_request": 4,
            "allowed_tools": ["search_knowledge", "assess_understanding",
                              "update_learner_profile", "log_interaction"]
        }
        self.execution_log = []

    def execute(self, plan: Dict[str, Any]) -> Dict[str, Any]:
        if not self._validate_request(plan):
            return {"error": "Safety validation failed", "plan": plan}

        action = plan.get("action")
        params = plan.get("parameters", {})
        result = self._route_and_execute(action, params)

        self.execution_log.append({
            "timestamp": datetime.now().isoformat(),
            "plan": plan,
            "result": result
        })

        return {
            "success": True,
            "action": action,
            "result": result,
            "metadata": {
                "execution_count": len(self.execution_log),
                "safety_checks_passed": True
            }
        }

    def _validate_request(self, plan: Dict) -> bool:
        action = plan.get("action")
        if action not in self.safety_rules["allowed_tools"]:
            return False
        if len(self.execution_log) >= 100:
            return False
        return True

    def _route_and_execute(self, action: str, params: Dict) -> Any:
        tool_map = {
            "search_knowledge": self.tools.search_knowledge,
            "assess_understanding": self.tools.assess_understanding,
            "update_learner_profile": self.tools.update_learner_profile,
            "log_interaction": self.tools.log_interaction
        }
        tool_func = tool_map.get(action)
        if tool_func:
            return tool_func(**params)
        return {"error": f"Unknown action: {action}"}

We implement the control plane that orchestrates tool execution, checks safety rules, and manages permissions. We validate every request, route actions to the right tool, and keep an execution log for transparency. As we run this snippet, we observe how the control plane becomes the governing system that ensures predictable and safe agentic behavior. Check out the FULL CODES here.

class TutorAgent:
    def __init__(self, control_plane: ControlPlane, api_key: str):
        self.control_plane = control_plane
        self.client = anthropic.Anthropic(api_key=api_key)
        self.conversation_history = []

    def teach(self, student_query: str) -> str:
        plan = self._plan_actions(student_query)
        results = []
        for action_plan in plan:
            result = self.control_plane.execute(action_plan)
            results.append(result)

        response = self._synthesize_response(student_query, results)

        self.conversation_history.append({
            "query": student_query,
            "plan": plan,
            "results": results,
            "response": response
        })
        return response

    def _plan_actions(self, query: str) -> List[Dict]:
        plan = []
        query_lower = query.lower()

        if any(kw in query_lower for kw in ["what", "how", "explain", "teach"]):
            plan.append({
                "action": "search_knowledge",
                "parameters": {"query": query},
                "context": {"intent": "knowledge_retrieval"}
            })

        if any(kw in query_lower for kw in ["test", "quiz", "assess", "check"]):
            topic = "python" if "python" in query_lower else "ml"
            plan.append({
                "action": "assess_understanding",
                "parameters": {"topic": topic},
                "context": {"intent": "assessment"}
            })

        plan.append({
            "action": "log_interaction",
            "parameters": {"event": "query_processed", "details": {"query": query}},
            "context": {"intent": "logging"}
        })

        return plan

    def _synthesize_response(self, query: str, results: List[Dict]) -> str:
        response_parts = [f"Student Query: {query}\n"]

        for result in results:
            if result.get("success") and "result" in result:
                tool_result = result["result"]

                if result["action"] == "search_knowledge":
                    response_parts.append("\n Retrieved Knowledge:")
                    for doc in tool_result.get("results", []):
                        response_parts.append(f" • {doc['content']}")

                elif result["action"] == "assess_understanding":
                    response_parts.append("\n Assessment Questions:")
                    for q in tool_result.get("questions", []):
                        response_parts.append(f" • {q}")

        return "\n".join(response_parts)

We implement the TutorAgent, which plans actions, communicates with the control plane, and synthesizes final responses. We analyze queries, generate multi-step plans, and combine tool outputs into meaningful answers for learners. As we execute this snippet, we see the agent behaving intelligently by coordinating retrieval, assessment, and logging. Check out the FULL CODES here.

def run_demo():
    print("=" * 70)
    print("Control Plane as a Tool: RAG AI Tutor Demo")
    print("=" * 70)

    API_KEY = "your-api-key-here"

    retriever = SimpleRAGRetriever()
    tool_registry = ToolRegistry(retriever)
    control_plane = ControlPlane(tool_registry)

    print("System initialized")
    print(f"Tools: {len(control_plane.safety_rules['allowed_tools'])}")
    print(f"Knowledge base: {len(retriever.documents)} documents")

    try:
        tutor = TutorAgent(control_plane, API_KEY)
    except Exception:
        print("Mock mode enabled")
        tutor = None

    demo_queries = [
        "Explain Python functions to me",
        "I want to learn about machine learning",
        "Test my understanding of Python basics"
    ]

    for query in demo_queries:
        print("\n— Query —")
        if tutor:
            print(tutor.teach(query))
        else:
            # Fallback plan when no API key is available: exercise the control plane directly.
            plan = [
                {"action": "search_knowledge", "parameters": {"query": query}},
                {"action": "log_interaction", "parameters": {"event": "query", "details": {}}}
            ]
            print(query)
            for action in plan:
                result = control_plane.execute(action)
                print(f"{action['action']}: {result.get('success', False)}")

    print("Summary")
    print(f"Executions: {len(control_plane.execution_log)}")
    print(f"Logs: {len(tool_registry.interaction_log)}")
    print(f"Profile: {tool_registry.user_state}")

if __name__ == "__main__":
    run_demo()

We run a complete demo that initializes all components, processes sample student queries, and prints system state summaries. We watch the agent step through retrieval and logging while the control plane enforces rules and tracks execution history. As we finish this block, we get a clear picture of how the entire architecture works together in a realistic teaching loop.

In conclusion, we gain a clear understanding of how the control-plane pattern simplifies orchestration, strengthens safety, and creates a clean separation between reasoning and tool execution. We now see how a retrieval system, tool registry, and agentic planning layer come together to form a coherent AI tutor that responds intelligently to student queries. As we experiment with the demo, we observe how the system routes tasks, applies rules, and synthesizes useful insights from tool outputs, all while remaining modular and extensible.

Check out the FULL CODES here.
The post A Coding Guide to Design an Agentic AI System Using a Control-Plane Architecture for Safe, Modular, and Scalable Tool-Driven Reasoning Workflows appeared first on MarkTechPost.

DeepSeek AI Releases DeepSeekMath-V2: The Open Weights Maths Model Tha …

How can an AI system prove complex olympiad level math problems in clear natural language while also checking that its own reasoning is actually correct? DeepSeek AI has released DeepSeekMath-V2, an open weights large language model that is optimized for natural language theorem proving with self verification. The model is built on DeepSeek-V3.2-Exp-Base, runs as a 685B parameter mixture of experts, and is available on Hugging Face under an Apache 2.0 license.

In evaluations, DeepSeekMath-V2 reaches gold level scores on IMO 2025 and CMO 2024, and achieves 118 of 120 points on Putnam 2024 when used with scaled test time compute.

Why Are Final Answer Rewards Not Enough?

Most recent math reasoning models use reinforcement learning that rewards only the final answer on benchmarks such as AIME and HMMT. This approach pushed models from weak baselines to near saturation on short answer contests in about one year. (Hugging Face)

However, the DeepSeek research team points out two structural problems:

A correct numeric answer does not guarantee correct reasoning. The model may reach the right number through algebraic mistakes that cancel out.

Many tasks, such as olympiad proofs and theorem proving, require a complete argument in natural language. These tasks do not have a single final numeric answer, so standard answer based rewards do not apply.

DeepSeekMath-V2 therefore optimizes proof quality instead of pure answer accuracy. The system evaluates whether a proof is complete and logically sound, and uses that evaluation as the main learning signal.

Training a Verifier before the Generator

The core design is verifier first. DeepSeek research team trains an LLM based verifier that can read a problem and a candidate proof, then output both a natural language analysis and a discrete quality score in the set {0, 0.5, 1}.

The initial reinforcement learning data comes from Art of Problem Solving contests. The research team crawl 17,503 proof style problems from olympiads, team selection tests, and post 2010 problems that explicitly require proofs. These problems form the base set for cold start RL. Candidate proofs come from a DeepSeek-V3.2 reasoning model that is prompted to iteratively refine its own solutions, which increases detail but also creates many imperfect proofs. Human experts label these proofs using the 0, 0.5, 1 rubric, based on rigor and completeness.

The verifier is trained with Group Relative Policy Optimization (GRPO). The reward has two components:

A format reward, which checks that the verifier output follows a fixed template, including an analysis section and a final score in a box.

A score reward, which penalizes the absolute difference between the predicted score and the expert score.

This stage produces a verifier that can grade olympiad style proofs in a consistent way.

https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf

Meta Verification to Control Hallucinated Critiques

A verifier can still game the reward. It can output the correct final score while inventing fake issues in the analysis. This would satisfy the numeric objective but make the explanations unreliable.

To address this, the research team introduce a meta verifier. The meta verifier reads the original problem, the proof, and the verifier analysis, and then evaluates whether the analysis is faithful. It scores aspects such as restatement of steps, identification of real defects, and consistency between the narrative and the final score.

The meta verifier is also trained with GRPO, with its own format and score rewards. Its output, a meta quality score, is then used as an extra reward term for the base verifier. Analyses that hallucinate problems get low meta scores, even if the final proof score is correct. In experiments, this raises the average meta evaluated quality of analyses from around 0.85 to 0.96 on a validation split, while keeping proof score accuracy stable.

Self Verifying Proof Generator and Sequential Refinement

Once the verifier is strong, DeepSeek research team trains the proof generator. The generator takes a problem and outputs both a solution and a self analysis that follows the same rubric as the verifier.

The reward for the generator combines three signals:

The verifier score on the generated proof.

The agreement between the self reported score and the verifier score.

The meta verification score of the self analysis.

Formally, the main reward uses weights α = 0.76 for the proof score and β = 0.24 for the self analysis component, multiplied by a format term that enforces the output structure. This pushes the generator to write proofs that the verifier accepts, and to be honest about remaining issues. If it claims that a flawed proof is perfect, it loses reward through disagreement and low meta scores.
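
As a small sketch, the combined reward can be written as a format gate times a weighted mix of the verifier's proof score and a self analysis term. The 0.76 and 0.24 weights follow the paper, while the way the self analysis term blends agreement and meta faithfulness below is a simplifying assumption.

# Sketch: reward = format_term * (alpha * proof_score + beta * self_analysis).
# alpha and beta follow the paper; the self_analysis blend is an assumption.
ALPHA, BETA = 0.76, 0.24

def generator_reward(proof_score: float, self_score: float,
                     meta_score: float, format_ok: bool) -> float:
    format_term = 1.0 if format_ok else 0.0
    # Self analysis component: agreement with the verifier plus analysis faithfulness.
    self_analysis = 0.5 * (1.0 - abs(self_score - proof_score)) + 0.5 * meta_score
    return format_term * (ALPHA * proof_score + BETA * self_analysis)

# A rigorous proof with an honest self analysis and a faithful meta check scores 1.0.
print(generator_reward(proof_score=1.0, self_score=1.0, meta_score=1.0, format_ok=True))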

DeepSeek also exploits the 128K token context limit of the base model. For hard problems, the generator often cannot repair all issues in a single pass, because the refined proof plus analysis would exceed context. In that case, the system runs sequential refinement. It generates a proof and self analysis, feeds them back as context, and asks the model to produce a new proof that fixes the previously detected issues. This loop can repeat several times, subject to the context budget.

https://github.com/deepseek-ai/DeepSeek-Math-V2/tree/main

Scaling Verification and Auto Labeling

As the generator improves, it produces harder proofs, which are costly to label by hand. To keep training data fresh, the research team introduces an automatic labeling pipeline based on scaled verification.

For each candidate proof, the system samples multiple independent verifier analyses, then evaluates each analysis using the meta verifier. If several high quality analyses converge on the same serious issues, the proof is labeled as incorrect. If no valid issues survive meta checking, the proof is labeled as correct. In the final training iterations this pipeline replaces human labels, with spot checks confirming good agreement with experts.
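
A minimal sketch of this consensus style auto labeling is shown below. The scoring inputs stand in for the verifier and meta verifier outputs, and the thresholds are assumptions, not values from the paper.

# Sketch of auto labeling by scaled verification: keep only analyses the meta
# verifier rates highly, then label the proof incorrect if surviving analyses
# agree on at least one serious issue. Thresholds are illustrative assumptions.
from collections import Counter

def auto_label(analyses: list[dict], meta_threshold: float = 0.8, agreement: int = 2) -> str:
    valid = [a for a in analyses if a["meta_score"] >= meta_threshold]
    issue_counts = Counter(issue for a in valid for issue in a["issues"])
    if any(count >= agreement for count in issue_counts.values()):
        return "incorrect"
    return "correct"

analyses = [
    {"meta_score": 0.95, "issues": ["step 3 divides by a possibly zero quantity"]},
    {"meta_score": 0.90, "issues": ["step 3 divides by a possibly zero quantity"]},
    {"meta_score": 0.40, "issues": ["claims the inequality is strict"]},  # filtered out
]
print(auto_label(analyses))  # incorrect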

Competition and Benchmark Results

The research team evaluated DeepSeekMath-V2 on several fronts:

On an internal set of 91 CNML level problems covering algebra, geometry, number theory, combinatorics, and inequalities, DeepSeekMath-V2 achieves the highest mean proof score in every category when compared with Gemini 2.5 Pro and GPT-5 Thinking High, as measured by their verifier.

https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf

On IMO Shortlist 2024, sequential refinement with self verification improves both pass at 1 and best of 32 quality metrics as the maximum number of refinement iterations increases.

https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf

On IMO ProofBench, the expert evaluation shown in the figure above indicates that DeepSeekMath-V2 outperforms DeepMind DeepThink IMO Gold on the Basic subset and remains competitive on the Advanced subset, while clearly beating other large models.

https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf

For full competitions, it reports:

IMO 2025: 5 of 6 problems solved, gold medal level.

CMO 2024: 4 problems fully solved plus partial credit on 1 more, gold medal level.

Putnam 2024: 11 of 12 problems solved completely and the remaining problem with minor errors, for 118 of 120 points, above the best human score of 90.

Key Takeaways

DeepSeekMath V2 is a 685B parameter model built on DeepSeek V3.2 Exp Base, designed for natural language theorem proving with self verification, and released as open weights under the Apache 2.0 license.

The main innovation is a verifier first training pipeline with a GRPO trained verifier and meta verifier that score proofs on rigor, not only final answers, which directly addresses the gap between correct answers and correct reasoning.

A proof generator is then trained against this verifier and meta verifier, using rewards that combine proof quality, agreement with self evaluation, and analysis faithfulness, plus sequential refinement under 128K context to iteratively repair proofs.

With scaled test time compute and large verification budgets, DeepSeekMath V2 reaches gold level performance on IMO 2025 and CMO 2024 and scores 118 of 120 on Putnam 2024, surpassing the best human score that year.

Editorial Notes

DeepSeekMath-V2 is an important step toward self verifiable mathematical reasoning, because it directly tackles the gap between correct final answers and correct reasoning, using a verifier, meta verifier and proof generator trained with GRPO on olympiad style proofs and deployed at 685B scale to reach gold level performance on IMO 2025, CMO 2024 and a near perfect 118 of 120 score on Putnam 2024. Overall, this release shows that self verifiable mathematical reasoning with open weights is now practically achievable for competition level problems.

Check out the Full Paper, Model Weights on HF and Repo.
The post DeepSeek AI Releases DeepSeekMath-V2: The Open Weights Maths Model That Scored 118/120 on Putnam 2024 appeared first on MarkTechPost.

A Coding Implementation for an Agentic AI Framework that Performs Lite …

In this tutorial, we build a complete scientific discovery agent step by step and experience how each component works together to form a coherent research workflow. We begin by loading our literature corpus, constructing retrieval and LLM modules, and then assembling agents that search papers, generate hypotheses, design experiments, and produce structured reports. Through the snippets below, we see how an agentic pipeline emerges naturally, allowing us to explore a scientific question from initial curiosity to a full analysis within a single, integrated system. Check out the FULL CODES here.

import sys, subprocess

def install_deps():
    pkgs = ["transformers", "scikit-learn", "numpy"]
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + pkgs)

try:
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np
except ImportError:
    install_deps()
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np

from dataclasses import dataclass
from typing import List, Dict, Any

np.random.seed(42)

LITERATURE = [
    {"id": "P1", "title": "Self-Supervised Protein Language Models for Structure Prediction", "field": "computational biology",
     "abstract": "We explore transformer-based protein language models trained on millions of sequences. The models learn residue-level embeddings that improve secondary structure prediction and stability estimation."},
    {"id": "P2", "title": "CRISPR Off-Target Detection Using Deep Learning", "field": "genome editing",
     "abstract": "We propose a convolutional neural network architecture for predicting CRISPR-Cas9 off-target effects directly from genomic sequences, achieving state-of-the-art accuracy on GUIDE-seq datasets."},
    {"id": "P3", "title": "Foundation Models for Scientific Equation Discovery", "field": "scientific ML",
     "abstract": "Large language models are combined with symbolic regression to recover governing equations from noisy experimental observations in physics and fluid dynamics."},
    {"id": "P4", "title": "Active Learning for Materials Property Optimization", "field": "materials science",
     "abstract": "We integrate Bayesian optimization with graph neural networks to actively select candidate materials that maximize target properties while reducing experimental cost."},
    {"id": "P5", "title": "Graph-Based Retrieval for Cross-Domain Literature Review", "field": "NLP for science",
     "abstract": "We construct a heterogeneous citation and concept graph over multi-domain scientific papers and show that graph-aware retrieval improves cross-domain literature exploration."},
]

corpus_texts = [p["abstract"] + " " + p["title"] for p in LITERATURE]
vectorizer = TfidfVectorizer(stop_words="english")
corpus_matrix = vectorizer.fit_transform(corpus_texts)

MODEL_NAME = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate_text(prompt: str, max_new_tokens: int = 256) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

We lay the foundation for our scientific agent by loading libraries, preparing the literature corpus, and initializing our language model. We build the TF-IDF vectorizer and embed all abstracts so that we can later retrieve relevant papers. With the model loaded and the data structured, we create the computational backbone for everything that follows. Check out the FULL CODES here.

@dataclass
class PaperHit:
    paper: Dict[str, Any]
    score: float

class LiteratureAgent:
    def __init__(self, vectorizer, corpus_matrix, papers: List[Dict[str, Any]]):
        self.vectorizer = vectorizer
        self.corpus_matrix = corpus_matrix
        self.papers = papers

    def search(self, query: str, k: int = 3) -> List[PaperHit]:
        q_vec = self.vectorizer.transform([query])
        sims = cosine_similarity(q_vec, self.corpus_matrix)[0]
        idxs = np.argsort(-sims)[:k]
        hits = [PaperHit(self.papers[i], float(sims[i])) for i in idxs]
        return hits

We implement the literature-search component of our agent. We convert user queries into a vector space and identify the most relevant scientific papers using cosine similarity. Through this, we give our system the ability to ground its reasoning in the closest-matching prior work. Check out the FULL CODES here.

@dataclass
class ExperimentPlan:
    system: str
    hypothesis: str
    variables: Dict[str, Any]
    protocol: List[str]

@dataclass
class ExperimentResult:
    plan: ExperimentPlan
    metrics: Dict[str, float]

class ExperimentAgent:
    def design_experiment(self, question: str, hypothesis: str, hits: List[PaperHit]) -> ExperimentPlan:
        top_field = hits[0].paper["field"] if hits else "computational science"
        protocol = [
            f"Construct dataset combining ideas from: {', '.join(h.paper['id'] for h in hits)}.",
            "Split data into train/validation/test.",
            "Compare baseline model vs. augmented model implementing the hypothesis.",
            "Evaluate using appropriate metrics and perform ablation analysis.",
        ]
        variables = {
            "baseline_model": "sequence CNN",
            "augmented_model": "protein language model + CNN",
            "n_train_samples": 5000,
            "n_validation_samples": 1000,
            "metric": "AUROC",
        }
        system = f"{top_field} system related to: {question}"
        return ExperimentPlan(system=system, hypothesis=hypothesis, variables=variables, protocol=protocol)

    def run_experiment(self, plan: ExperimentPlan) -> ExperimentResult:
        # Simulate results: a noisy baseline score plus a small positive gain.
        base = 0.78 + 0.02 * np.random.randn()
        gain = abs(0.05 + 0.01 * np.random.randn())
        metrics = {
            "baseline_AUROC": round(base, 3),
            "augmented_AUROC": round(base + gain, 3),
            "estimated_gain": round(gain, 3),
        }
        return ExperimentResult(plan=plan, metrics=metrics)

We design and simulate experiments based on the retrieved literature and the generated hypothesis. We automatically define variables, build a protocol, and generate synthetic metrics that imitate the dynamics of a real scientific evaluation. This lets us move from theoretical ideas to an actionable experimental plan. Check out the FULL CODES here.

class ReportAgent:
    def write_report(self, question: str, hits: List[PaperHit], plan: ExperimentPlan, result: ExperimentResult) -> str:
        related_work = "\n".join(f"- {h.paper['title']} ({h.paper['field']})" for h in hits)
        protocol_str = "\n".join(f"- {step}" for step in plan.protocol)
        prompt = f"""
You are an AI research assistant writing a concise research-style report.

Research question:
{question}

Hypothesis:
{plan.hypothesis}

Relevant prior work:
{related_work}

Planned experiment:
System: {plan.system}
Variables: {plan.variables}
Protocol:
{protocol_str}

Simulated results:
{result.metrics}

Write a clear report with the following sections:
1. Background
2. Proposed Approach
3. Experimental Setup
4. Results and Discussion
5. Limitations and Future Work
"""
        return generate_text(prompt.strip(), max_new_tokens=320)

We generate a full research-style report using the LLM. We assemble the hypothesis, protocol, results, and related work into a structured document with clearly defined sections. This allows us to turn the pipeline’s raw outputs into polished scientific communication. Check out the FULL CODES here.

class ScientificAgent:
    def __init__(self):
        self.lit_agent = LiteratureAgent(vectorizer, corpus_matrix, LITERATURE)
        self.exp_agent = ExperimentAgent()
        self.report_agent = ReportAgent()

    def propose_hypothesis(self, question: str, hits: List[PaperHit]) -> str:
        context = " ".join(h.paper["abstract"] for h in hits)
        prompt = f"""
You are an AI scientist. Given a research question and related abstracts,
propose a single, testable hypothesis in 2-3 sentences.

Research question:
{question}

Related abstracts:
{context}
"""
        return generate_text(prompt.strip(), max_new_tokens=96)

    def run_pipeline(self, question: str) -> str:
        hits = self.lit_agent.search(question, k=3)
        hypothesis = self.propose_hypothesis(question, hits)
        plan = self.exp_agent.design_experiment(question, hypothesis, hits)
        result = self.exp_agent.run_experiment(plan)
        report = self.report_agent.write_report(question, hits, plan, result)
        return report

if __name__ == "__main__":
    research_question = (
        "How can protein language model embeddings improve CRISPR off-target "
        "prediction compared to sequence-only CNN baselines?"
    )
    agent = ScientificAgent()
    final_report = agent.run_pipeline(research_question)
    print(final_report)

We orchestrate the entire pipeline, searching the literature, generating a hypothesis, designing the experiment, running the simulation, and writing the report. We then execute the system on a real research question and observe the complete workflow in action. This step brings all the modules together into a unified scientific agent.

In conclusion, we see how a compact codebase can evolve into a functioning AI co-researcher capable of searching, reasoning, simulating, and summarizing. We understand how each snippet contributes to the full pipeline and how agentic components amplify one another when combined. Also, we place ourselves in a strong position to extend the agent with richer literature sources, more realistic models, and more sophisticated experimental logic, pushing our scientific exploration further with every iteration.

Check out the FULL CODES here.
The post A Coding Implementation for an Agentic AI Framework that Performs Literature Analysis, Hypothesis Generation, Experimental Planning, Simulation, and Scientific Reporting appeared first on MarkTechPost.

OceanBase Releases seekdb: An Open Source AI Native Hybrid Search Data …

AI applications rarely deal with one clean table. They mix user profiles, chat logs, JSON metadata, embeddings, and sometimes spatial data. Most teams answer this with a patchwork of an OLTP database, a vector store, and a search engine. OceanBase released seekdb, an open source AI focused database (under the Apache 2.0 license). seekdb is described as an AI native search database that unifies relational data, vector data, text, JSON, and GIS in one engine and exposes hybrid search and in database AI workflows. 

What is seekdb?

seekdb is positioned as the lightweight, embedded version of the OceanBase engine, aimed at AI applications rather than general purpose distributed deployments. It runs as a single node database, supports embedded mode and client or server mode, and remains compatible with MySQL drivers and SQL syntax.

In the capability matrix, seekdb is marked as:

Embedded database supported

Standalone database supported

Distributed database not supported

while the full OceanBase product covers the distributed case.

From a data model perspective, seekdb supports:

Relational data with standard SQL

Vector search

Full text search

JSON data

Spatial GIS data

all inside one storage and indexing layer.

Hybrid search as the core feature

The main feature OceanBase pushes is hybrid search. This is search that combines vector based semantic retrieval, full text keyword retrieval, and scalar filters in a single query and a single ranking step.

seekdb implements hybrid search through a system package named DBMS_HYBRID_SEARCH with two entry points:

DBMS_HYBRID_SEARCH.SEARCH which returns results as JSON, sorted by relevance

DBMS_HYBRID_SEARCH.GET_SQL which returns the concrete SQL string used for execution

The hybrid search path can run:

pure vector search

pure full text search

combined hybrid search

and can push relational filters and joins down into storage. It also supports query reranking strategies like weighted scores and reciprocal rank fusion and can plug in large language model based re-rankers.

For retrieval augmented generation (RAG) and agent memory, this means you can write a single SQL query that does semantic matching on embeddings, exact matching on product codes or proper nouns, and relational filtering on user or tenant scopes.
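
As a rough sketch of what that looks like from an application, the snippet below connects over the MySQL protocol with PyMySQL and invokes the hybrid search package. The connection settings, and especially the argument layout passed to DBMS_HYBRID_SEARCH.SEARCH, are illustrative assumptions rather than the documented signature; consult the seekdb documentation for the real parameter format.

# Rough sketch of calling seekdb's hybrid search from Python over the MySQL protocol.
# Host, port, and the arguments to DBMS_HYBRID_SEARCH.SEARCH are illustrative assumptions.
import json
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=2881, user="root", database="demo")
try:
    with conn.cursor() as cur:
        # Hypothetical call: target table, natural language query, and a JSON scalar filter.
        cur.execute(
            "SELECT DBMS_HYBRID_SEARCH.SEARCH(%s, %s, %s)",
            ("docs", "battery safety requirements for drones", '{"tenant_id": 42}'),
        )
        (payload,) = cur.fetchone()
        for hit in json.loads(payload):  # results come back as JSON, ranked by relevance
            print(hit)
finally:
    conn.close()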

Vector and full text engine details

At its core, seekdb exposes a modern vector and full text stack.

For vectors, seekdb:

supports dense vectors and sparse vectors

supports Manhattan, Euclidean, inner product, and cosine distance metrics

provides in memory index types such as HNSW, HNSW SQ, HNSW BQ

provides disk based index types including IVF and IVF PQ

Hybrid vector indexes show how you can store raw text, let seekdb call an embedding model automatically, and have the system maintain the corresponding vector index without a separate preprocessing pipeline.

For text, seekdb offers full text search with:

keyword, phrase, and Boolean queries

BM25 ranking for relevance

multiple tokenizer modes

The key point is that full text and vector indexes are first class and are integrated in the same query planner as scalar indexes and GIS indexes, so hybrid search does not need external orchestration.

AI functions inside the database

seekdb includes built in AI function expressions that let you call models directly from SQL, without a separate application service mediating every call. The main functions are:

AI_EMBED to convert text into embeddings

AI_COMPLETE for text generation using a chat or completion model

AI_RERANK to rerank a list of candidates

AI_PROMPT to assemble prompt templates and dynamic values into a JSON object for AI_COMPLETE

Model metadata and endpoints are managed by the DBMS_AI_SERVICE package, which lets you register external providers, set URLs, and configure keys, all on the database side. 

Multimodal data and workloads

seekdb is built to handle multiple data modalities in one node. It has a multimodal data and indexing layer that covers vectors, text, JSON, and GIS, and a multi-model compute layer for hybrid workloads across vector, full text, and scalar conditions.

It also provides JSON indexes for metadata queries and GIS indexes for spatial conditions. This allows queries like:

find semantically similar documents

filter by JSON metadata like tenant, region, or category

constrain by spatial range or polygon

without leaving the same engine.

Because seekdb is derived from the OceanBase engine, it inherits ACID transactions, row and column hybrid storage, and vectorized execution, although high scale distributed deployments remain a job for the full OceanBase database.


Key Takeaways

AI native hybrid search: seekdb unifies vector search, full text search and relational filtering in a single SQL and DBMS_HYBRID_SEARCH interface, so RAG and agent workloads can run multi signal retrieval in one query instead of stitching together multiple engines.

Multimodal data in one engine: seekdb stores and indexes relational data, vectors, text, JSON and GIS in the same engine, which lets AI applications keep documents, embeddings and metadata consistent without maintaining separate databases.

In database AI functions for RAG: With AI_EMBED, AI_COMPLETE, AI_RERANK and AI_PROMPT, seekdb can call embedding models, LLMs and rerankers directly from SQL, which simplifies RAG pipelines and moves more orchestration logic into the database layer.

Single node, embedded friendly design: seekdb is a single node, MySQL compatible engine that supports embedded and standalone modes, while distributed, large scale deployments remain the role of full OceanBase, which makes seekdb suitable for local, edge and service embedded AI workloads.

Open source and tool ecosystem: seekdb is open sourced under Apache 2.0 and integrates with a growing ecosystem of AI tools and frameworks, with Python support via pyseekdb and MCP based integration for code assistants and agents, so it can act as a unified data plane for AI applications.

Check out the Repo and Project.
The post OceanBase Releases seekdb: An Open Source AI Native Hybrid Search Database for Multi-model RAG and AI Agents appeared first on MarkTechPost.

Tencent Hunyuan Releases HunyuanOCR: a 1B Parameter End to End OCR Expert VLM

Tencent Hunyuan has released HunyuanOCR, a 1B parameter vision language model that is specialized for OCR and document understanding. The model is built on Hunyuan’s native multimodal architecture and runs spotting, parsing, information extraction, visual question answering, and text image translation through a single end to end pipeline.

HunyuanOCR is a lightweight alternative to general VLMs such as Gemini 2.5 and Qwen3 VL that still matches or surpasses them on OCR centric tasks. It targets production use cases like document parsing, card and receipt extraction, video subtitle extraction, and multilingual document translation.

https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/HunyuanOCR_Technical_Report.pdf

Architecture, Native Resolution ViT plus Lightweight LLM

HunyuanOCR uses 3 main modules: a Native Resolution Visual Encoder called Hunyuan ViT, an Adaptive MLP Connector, and a Lightweight Language Model. The encoder is based on SigLIP-v2-400M and is extended to support arbitrary input resolutions through adaptive patching that preserves the original aspect ratio. Images are split into patches according to their native proportions and processed with global attention, which improves recognition on long text lines, long documents, and low quality scans.

The Adaptive MLP Connector performs learnable pooling on the spatial dimension. It compresses the dense visual tokens into a shorter sequence, while keeping information from text dense regions. This reduces sequence length passed to the language model and lowers compute, while preserving OCR relevant details.
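
A toy sketch of this idea is shown below. It is not the HunyuanOCR code; the token counts, pooling factor, hidden sizes, and choice of activation are assumptions. The only point is that a learnable pooling step shortens the visual token sequence before an MLP projects it into the language model's embedding space.

# Toy NumPy sketch of learnable pooling plus an MLP projection, not the actual HunyuanOCR connector.
# All sizes below are assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)

num_tokens, vis_dim, llm_dim, pool = 1024, 768, 512, 2   # assumed sizes
visual_tokens = rng.standard_normal((num_tokens, vis_dim))

# "Learnable" pooling weights: each output token is a weighted mix of `pool` neighboring tokens.
pool_w = rng.standard_normal((pool,))
pooled = (visual_tokens.reshape(num_tokens // pool, pool, vis_dim)
          * pool_w[None, :, None]).sum(axis=1)            # (512, 768)

# Two layer MLP projecting the compressed tokens into the language model's embedding space.
w1 = rng.standard_normal((vis_dim, vis_dim)) * 0.02
w2 = rng.standard_normal((vis_dim, llm_dim)) * 0.02
connector_out = np.maximum(pooled @ w1, 0.0) @ w2         # activation choice is a guess

print(visual_tokens.shape, "->", connector_out.shape)      # (1024, 768) -> (512, 512)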

The language model is based on the densely architected Hunyuan 0.5B model and uses XD RoPE. XD RoPE splits rotary position embeddings into 4 subspaces for text, height, width, and time. This gives the model a native way to align 1D token order with 2D layout and 3D spatiotemporal structure. As a result, the same stack can handle multi column pages, cross page flows, and sequences of video frames.
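
The snippet below is an illustrative sketch of the XD RoPE idea rather than Hunyuan's implementation: it splits a head's dimensions into four equal subspaces and rotates each one with a standard 1D RoPE driven by a different coordinate (token index, patch row, patch column, frame index). The dimension split, base frequency, and coordinate conventions are assumptions.

# Illustrative sketch of the XD RoPE idea described above, not Hunyuan's implementation.
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard RoPE applied to the last dim of x using scalar position `pos`."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def xd_rope(q, text_pos, h, w, t):
    """Split the head dim into 4 equal subspaces and rotate each by a different position axis."""
    assert q.shape[-1] % 4 == 0
    parts = np.split(q, 4, axis=-1)
    coords = [text_pos, h, w, t]
    return np.concatenate([rope_1d(p, c) for p, c in zip(parts, coords)], axis=-1)

q = np.random.default_rng(0).standard_normal(64)   # one query head with dim 64 (assumed)
q_rot = xd_rope(q, text_pos=17, h=3, w=9, t=0)     # token 17 at patch row 3, column 9, frame 0
print(q_rot.shape)  # (64,)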

Training and inference follow a fully end to end paradigm. There is no external layout analysis or post processing model in the loop. All tasks are expressed as natural language prompts and handled in a single forward pass. This design removes error propagation across pipeline stages and simplifies deployment.

Data and Pre Training Recipe

The data pipeline builds more than 200M image text pairs, across 9 real world scenarios, including street views, documents, advertisements, handwritten text, screenshots, cards and certificates and invoices, game interfaces, video frames, and artistic typography. The corpus covers more than 130 languages.

Synthetic data comes from a multilingual generator that supports right to left scripts and paragraph level rendering. The pipeline controls font, language, rotation, and RGB values, and applies warping, blur, and local lighting changes to simulate mobile captures and other hard conditions.

https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/HunyuanOCR_Technical_Report.pdf

Pre training follows 4 stages. Stage-1 performs vision language alignment with pure text, synthetic parsing and recognition data, and general caption data, using 50B tokens and 8k context. Stage-2 runs multimodal pre training on 300B tokens that mix pure text with synthetic spotting, parsing, translation, and VQA samples. Stage-3 extends context length to 32k with 80B tokens focused on long documents and long text. Stage-4 is application oriented supervised fine tuning on 24B tokens of human annotated and hard negative data, keeping 32k context and unified instruction templates.
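
For quick reference, the four stages can be summarized in a small table. The numbers are taken from the description above, while the field names and the "unspecified" entry for the stage 2 context length are our own summary rather than an official configuration format.

# Summary of the 4 stage pre-training recipe described above; field names are ours, values are from the report summary.
PRETRAIN_STAGES = [
    {"stage": 1, "goal": "vision-language alignment",
     "data": "pure text + synthetic parsing/recognition + general captions",
     "tokens": "50B", "context": "8k"},
    {"stage": 2, "goal": "multimodal pre-training",
     "data": "pure text + synthetic spotting/parsing/translation/VQA",
     "tokens": "300B", "context": "unspecified above"},
    {"stage": 3, "goal": "long-context extension",
     "data": "long documents and long text",
     "tokens": "80B", "context": "32k"},
    {"stage": 4, "goal": "application-oriented supervised fine tuning",
     "data": "human annotated and hard negative data, unified instruction templates",
     "tokens": "24B", "context": "32k"},
]

for s in PRETRAIN_STAGES:
    print(f"Stage {s['stage']}: {s['goal']} | {s['tokens']} tokens | {s['context']} context")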

Reinforcement Learning with Verifiable Rewards

After supervised training, HunyuanOCR is further optimized with reinforcement learning. The research team uses Group Relative Policy Optimization (GRPO) and a Reinforcement Learning with Verifiable Rewards setup for structured tasks. For text spotting, the reward is based on intersection over union matching of boxes combined with normalized edit distance over the text. For document parsing, the reward uses normalized edit distance between the generated structure and the reference.

For VQA and translation, the system uses an LLM as a judge. VQA uses a binary reward that checks semantic match. Translation uses a COMET style scoring LLM with scores in [0, 5], normalized to [0, 1]. The training framework enforces length limits and strict formats, and assigns zero reward when outputs overflow or break schema, which stabilizes optimization and encourages valid JSON or structured outputs.
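
The sketch below illustrates what verifiable rewards of this kind can look like in code. It is not the official HunyuanOCR reward implementation: the greedy box matching, the 0.5 IoU threshold, and the exact normalizations are assumptions, but the structure mirrors the description above, with IoU matched boxes combined with normalized edit distance for spotting, and a judge score in [0, 5] scaled to [0, 1] for translation.

# Sketch of verifiable rewards in the spirit described above; thresholds and matching strategy are assumptions.
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def norm_edit_sim(a, b):
    """1 minus normalized Levenshtein distance, in [0, 1]."""
    m, n = len(a), len(b)
    dp = np.arange(n + 1, dtype=np.int32)
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return 1.0 - dp[n] / max(m, n, 1)

def spotting_reward(pred, ref, iou_thresh=0.5):
    """Greedy box matching by IoU, then text similarity for matched pairs."""
    if not ref:
        return 1.0 if not pred else 0.0
    scores, used = [], set()
    for p_box, p_text in pred:
        best, best_j = 0.0, None
        for j, (r_box, r_text) in enumerate(ref):
            if j in used:
                continue
            ov = iou(p_box, r_box)
            if ov >= iou_thresh and ov > best:
                best, best_j = ov, j
        if best_j is not None:
            used.add(best_j)
            scores.append(norm_edit_sim(p_text, ref[best_j][1]))
    return sum(scores) / max(len(ref), len(pred))

def translation_reward(judge_score_0_to_5):
    """Normalize an LLM judge score in [0, 5] to [0, 1]."""
    return max(0.0, min(judge_score_0_to_5 / 5.0, 1.0))

pred = [([0, 0, 10, 5], "Hello"), ([0, 6, 10, 11], "Wor1d")]
ref  = [([0, 0, 10, 5], "Hello"), ([0, 6, 10, 11], "World")]
print(spotting_reward(pred, ref), translation_reward(4.2))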

Benchmark Results, a 1B Model Competing with Larger VLMs

On the internal text spotting benchmark of 900 images across 9 categories, HunyuanOCR reaches an overall score of 70.92. It outperforms traditional pipeline methods like PaddleOCR and BaiduOCR and also general VLMs such as Gemini 2.5 Pro, Qwen3 VL 2B, Qwen3 VL 235B, and Seed 1.6 Vision, despite using far fewer parameters.

On OmniDocBench, HunyuanOCR achieves 94.10 overall, with 94.73 on formulas and 91.81 on tables. On the Wild OmniDocBench variant, which prints and recaptures documents under folds and lighting changes, it scores 85.21 overall. On DocML, a multilingual parsing benchmark across 14 non Chinese and non English languages, it reaches 91.03, and the paper reports state of the art results across all 14 languages.

For information extraction and VQA, HunyuanOCR reaches 92.29 accuracy on cards, 92.53 on receipts, and 92.87 on video subtitles. On OCRBench, it scores 860, higher than DeepSeek OCR at similar scale and close to larger general VLMs like Qwen3 VL 2B Instruct and Gemini 2.5 Pro.

In text image translation, HunyuanOCR uses the DoTA benchmark and a DocML based internal set. It achieves a strong COMET score on DoTA for English to Chinese document translation, and the model wins first place in Track 2.2 OCR free Small Model of the ICDAR 2025 DIMT competition.

https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/HunyuanOCR_Technical_Report.pdf

Key Takeaways

Compact end to end OCR VLM: HunyuanOCR is a 1B parameter OCR focused vision language model that connects a 0.4B native resolution ViT to a 0.5B Hunyuan language model through an MLP adapter, and runs spotting, parsing, information extraction, VQA and translation in one end to end instruction driven pipeline without external layout or detection modules.

Unified support for diverse OCR scenarios: The model is trained on more than 200M image text pairs across 9 scenarios, including documents, street views, advertisements, handwritten content, screenshots, cards and invoices, game interfaces and video frames, with coverage of over 130 languages in training and support for more than 100 languages in deployment.

Data pipeline plus reinforcement learning: Training uses a 4 stage recipe, vision language alignment, multimodal pre training, long context pre training and application oriented supervised fine tuning, followed by reinforcement learning with group relative policy optimization and verifiable rewards for spotting, parsing, VQA and translation.

Strong benchmark results for sub 3B models: HunyuanOCR reaches 94.1 on OmniDocBench for document understanding, and achieves 860 on OCRBench, which is reported as state of the art among vision language models with fewer than 3B parameters, while also outperforming several commercial OCR APIs and larger open models such as Qwen3 VL 4B on core OCR benchmarks.

Editorial Notes

HunyuanOCR is a strong signal that OCR specific VLMs are maturing into practical infrastructure, not just benchmarks. Tencent combines a 1B parameter end to end architecture, a native resolution ViT, an Adaptive MLP Connector, and RL with verifiable rewards to deliver a single model that covers spotting, parsing, information extraction, VQA, and translation across more than 100 languages, while reaching leading scores on OCRBench for sub 3B models and 94.1 on OmniDocBench. Overall, HunyuanOCR marks an important shift toward compact, instruction driven OCR engines that are realistic for production deployment.

Check out the Paper, Model weights and Repo.
The post Tencent Hunyuan Releases HunyuanOCR: a 1B Parameter End to End OCR Expert VLM appeared first on MarkTechPost.

How to Implement Functional Components of Transformer and Mini-GPT Model from Scratch Using Tinygrad to Understand Deep Learning Internals

In this tutorial, we explore how to build neural networks from scratch using Tinygrad while remaining fully hands-on with tensors, autograd, attention mechanisms, and transformer architectures. We progressively build every component ourselves, from basic tensor operations to multi-head attention, transformer blocks, and, finally, a working mini-GPT model. Through each stage, we observe how Tinygrad’s simplicity helps us understand what happens under the hood when models train, optimize, and fuse kernels for performance. Check out the FULL CODES here.

import subprocess, sys, os
print("Installing dependencies...")
subprocess.check_call(["apt-get", "install", "-qq", "clang"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "git+https://github.com/tinygrad/tinygrad.git"])

import numpy as np
from tinygrad import Tensor, nn, Device
from tinygrad.nn import optim
import time

print(f"Using device: {Device.DEFAULT}")
print("=" * 60)

print("\n PART 1: Tensor Operations & Autograd")
print("-" * 60)

x = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = Tensor([[2.0, 0.0], [1.0, 2.0]], requires_grad=True)

z = (x @ y).sum() + (x ** 2).mean()
z.backward()

print(f"x:\n{x.numpy()}")
print(f"y:\n{y.numpy()}")
print(f"z (scalar): {z.numpy()}")
print(f"∂z/∂x:\n{x.grad.numpy()}")
print(f"∂z/∂y:\n{y.grad.numpy()}")

We set up Tinygrad in our Colab environment and immediately begin experimenting with tensors and automatic differentiation. We create a small computation graph and observe how gradients flow through matrix operations. As we print the outputs, we gain an intuitive understanding of how Tinygrad handles backpropagation under the hood. Check out the FULL CODES here.

print("\n\n PART 2: Building Custom Layers")
print("-" * 60)

class MultiHeadAttention:
    def __init__(self, dim, num_heads):
        self.num_heads = num_heads
        self.dim = dim
        self.head_dim = dim // num_heads
        self.qkv = Tensor.glorot_uniform(dim, 3 * dim)
        self.out = Tensor.glorot_uniform(dim, dim)

    def __call__(self, x):
        B, T, C = x.shape[0], x.shape[1], x.shape[2]
        qkv = x.reshape(B * T, C).dot(self.qkv).reshape(B, T, 3, self.num_heads, self.head_dim)
        # (B, T, H, D) -> (B, H, T, D) so attention runs over the sequence axis, not the head axis
        q = qkv[:, :, 0].transpose(1, 2)
        k = qkv[:, :, 1].transpose(1, 2)
        v = qkv[:, :, 2].transpose(1, 2)
        scale = (self.head_dim ** -0.5)
        attn = (q @ k.transpose(-2, -1)) * scale
        attn = attn.softmax(axis=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return out.reshape(B * T, C).dot(self.out).reshape(B, T, C)

class TransformerBlock:
    def __init__(self, dim, num_heads):
        self.attn = MultiHeadAttention(dim, num_heads)
        self.ff1 = Tensor.glorot_uniform(dim, 4 * dim)
        self.ff2 = Tensor.glorot_uniform(4 * dim, dim)
        self.ln1_w = Tensor.ones(dim)
        self.ln2_w = Tensor.ones(dim)

    def __call__(self, x):
        x = x + self.attn(self._layernorm(x, self.ln1_w))
        ff = x.reshape(-1, x.shape[-1])
        ff = ff.dot(self.ff1).gelu().dot(self.ff2)
        x = x + ff.reshape(x.shape)
        return self._layernorm(x, self.ln2_w)

    def _layernorm(self, x, w):
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        return w * (x - mean) / (var + 1e-5).sqrt()

We design our own multi-head attention module and a transformer block entirely from scratch. We implement the projections, attention scores, softmax, feedforward layers, and layer normalization manually. As we run this code, we see how each component contributes to a transformer layer’s overall behavior. Check out the FULL CODES here.

print("\n PART 3: Mini-GPT Architecture")
print("-" * 60)

class MiniGPT:
    def __init__(self, vocab_size=256, dim=128, num_heads=4, num_layers=2, max_len=32):
        self.vocab_size = vocab_size
        self.dim = dim
        self.tok_emb = Tensor.glorot_uniform(vocab_size, dim)
        self.pos_emb = Tensor.glorot_uniform(max_len, dim)
        self.blocks = [TransformerBlock(dim, num_heads) for _ in range(num_layers)]
        self.ln_f = Tensor.ones(dim)
        self.head = Tensor.glorot_uniform(dim, vocab_size)

    def __call__(self, idx):
        B, T = idx.shape[0], idx.shape[1]
        tok_emb = self.tok_emb[idx.flatten()].reshape(B, T, self.dim)
        pos_emb = self.pos_emb[:T].reshape(1, T, self.dim)
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        x = self.ln_f * (x - mean) / (var + 1e-5).sqrt()
        return x.reshape(B * T, self.dim).dot(self.head).reshape(B, T, self.vocab_size)

    def get_params(self):
        params = [self.tok_emb, self.pos_emb, self.ln_f, self.head]
        for block in self.blocks:
            params.extend([block.attn.qkv, block.attn.out, block.ff1, block.ff2, block.ln1_w, block.ln2_w])
        return params

model = MiniGPT(vocab_size=256, dim=64, num_heads=4, num_layers=2, max_len=16)
params = model.get_params()
total_params = sum(p.numel() for p in params)
print(f"Model initialized with {total_params:,} parameters")

We assemble the full MiniGPT architecture using the components built earlier. We embed tokens, add positional information, stack multiple transformer blocks, and project the final outputs back to vocab logits. As we initialize the model, we begin to appreciate how a compact transformer can be built with surprisingly few moving parts. Check out the FULL CODES here.

print("\n\n PART 4: Training Loop")
print("-" * 60)

def gen_data(batch_size, seq_len):
    x = np.random.randint(0, 256, (batch_size, seq_len))
    y = np.roll(x, 1, axis=1)
    y[:, 0] = x[:, 0]
    return Tensor(x, dtype='int32'), Tensor(y, dtype='int32')

optimizer = optim.Adam(params, lr=0.001)
losses = []

print("Training to predict previous token in sequence...")
with Tensor.train():
    for step in range(20):
        start = time.time()
        x_batch, y_batch = gen_data(batch_size=16, seq_len=16)
        logits = model(x_batch)
        B, T, V = logits.shape[0], logits.shape[1], logits.shape[2]
        loss = logits.reshape(B * T, V).sparse_categorical_crossentropy(y_batch.reshape(B * T))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.numpy())
        elapsed = time.time() - start
        if step % 5 == 0:
            print(f"Step {step:3d} | Loss: {loss.numpy():.4f} | Time: {elapsed*1000:.1f}ms")

print("\n\n PART 5: Lazy Evaluation & Kernel Fusion")
print("-" * 60)

N = 512
a = Tensor.randn(N, N)
b = Tensor.randn(N, N)

print("Creating computation: (A @ B.T + A).sum()")
lazy_result = (a @ b.T + a).sum()
print("→ No computation done yet (lazy evaluation)")

print("\nCalling .realize() to execute...")
start = time.time()
realized = lazy_result.realize()
elapsed = time.time() - start

print(f"✓ Computed in {elapsed*1000:.2f}ms")
print(f"Result: {realized.numpy():.4f}")
print("\nNote: Operations were fused into optimized kernels!")

We train the MiniGPT model on simple synthetic data and observe the loss decreasing across steps. We also explore Tinygrad’s lazy execution model by creating a fused kernel that executes only when it is realized. As we monitor timings, we understand how kernel fusion improves performance. Check out the FULL CODES here.

print("\n\n PART 6: Custom Operations")
print("-" * 60)

def custom_activation(x):
    return x * x.sigmoid()

x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]], requires_grad=True)
y = custom_activation(x)
loss = y.sum()
loss.backward()

print(f"Input: {x.numpy()}")
print(f"Swish(x): {y.numpy()}")
print(f"Gradient: {x.grad.numpy()}")

print("\n\n" + "=" * 60)
print(" Tutorial Complete!")
print("=" * 60)
print("""
Key Concepts Covered:
1. Tensor operations with automatic differentiation
2. Custom neural network layers (Attention, Transformer)
3. Building a mini-GPT language model from scratch
4. Training loop with Adam optimizer
5. Lazy evaluation and kernel fusion
6. Custom activation functions
""")

We implement a custom activation function and verify that gradients propagate correctly through it. We then print a summary of all major concepts covered in the tutorial. As we finish, we reflect on how each section builds our ability to understand, modify, and extend deep learning internals using Tinygrad.

In conclusion, we reinforce our understanding of how neural networks truly operate beneath modern abstractions, and we experience firsthand how Tinygrad empowers us to tinker with every internal detail. We have built a transformer, trained it on synthetic data, experimented with lazy evaluation and kernel fusion, and even created custom operations, all within a minimal, transparent framework. At last, we recognize how this workflow prepares us for deeper experimentation, whether we extend the model, integrate real datasets, or continue exploring Tinygrad’s low-level capabilities.

Check out the FULL CODES here.
The post How to Implement Functional Components of Transformer and Mini-GPT Model from Scratch Using Tinygrad to Understand Deep Learning Internals appeared first on MarkTechPost.