OpenAGI Foundation Launches Lux: A Foundation Computer Use Model that Tops Online Mind2Web with OSGym At Scale

How do you turn slow, manual click work across browsers and desktops into a reliable, automated system that can actually use a computer for you at scale? Lux is the latest example of computer use agents moving from research demo to infrastructure. The OpenAGI Foundation team has released Lux, a foundation model that operates real desktops and browsers and reports a score of 83.6 on the Online Mind2Web benchmark, which covers more than 300 real world computer use tasks. This puts it ahead of Google Gemini CUA at 69.0, OpenAI Operator at 61.3 and Anthropic Claude Sonnet 4 at 61.0.

Source: https://agiopen.org/blog

What Lux Actually Does

Lux is a computer use model, not a chat model with a browser plugin. It takes a natural language goal, views the screen, and outputs low level actions such as clicks, key presses and scroll events. It can drive browsers, editors, spreadsheets, email clients and other desktop applications because it works on rendered UI, not on application specific APIs.

From a developer point of view, Lux is available through the OpenAGI SDK and API console. The research team describes target workloads that include software QA flows, deep research runs, social media management, online store operations and bulk data entry. In all of these settings the agent needs to sequence dozens or hundreds of UI actions while staying aligned with a natural language task description.


Three Execution Modes For Different Control Levels

Lux ships with three execution modes that expose different tradeoffs between speed, autonomy and control.

Actor mode is the fast path. It runs around 1 second per step and is aimed at clearly specified tasks such as filling a form, pulling a report from a dashboard or extracting a small set of fields from a page. Think of it as a low latency macro engine that still understands natural language.

Thinker mode handles vague or multi step goals. It decomposes the high level instruction into smaller sub tasks and then executes them. Example workloads include multi page research, triage of long email queues or navigation of analytics interfaces where the exact click path is not specified in advance.

Tasker mode gives maximum determinism. The caller supplies an explicit Python list of steps that Lux executes one by one and it retries until the sequence completes or hits a hard failure. This allows teams to keep task graphs, guardrails and failure policies in their own code while delegating UI control to the model.

Actor, Thinker and Tasker are the three primary modes, covering fast execution, complex goal solving and deterministic procedural workflows respectively.
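To make the Tasker contract concrete, here is a minimal, hypothetical Python sketch of the pattern described above: the caller owns an explicit step list and the failure policy, while the model executes each UI step and retries until the sequence completes or hits a hard failure. The LuxClient object and its run_step method are illustrative placeholders, not the actual OpenAGI SDK API.

# Hypothetical sketch of a Tasker-style workflow. `client.run_step` is a
# placeholder for whatever the real OpenAGI SDK exposes; only the control
# flow (explicit step list, caller-owned retries and guardrails) mirrors
# the behavior described in the announcement.
STEPS = [
    "Open the orders dashboard in the browser",
    "Filter orders by status 'pending'",
    "Export the filtered table as CSV",
]

def run_tasker_workflow(client, steps, max_retries=3):
    results = []
    for step in steps:
        for _attempt in range(max_retries):
            outcome = client.run_step(step)          # model performs the UI actions
            if outcome.get("status") == "success":
                results.append(outcome)
                break
            if outcome.get("status") == "hard_failure":
                raise RuntimeError(f"Step failed permanently: {step}")
        else:
            raise RuntimeError(f"Step exhausted retries: {step}")
    return results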

Benchmarks, Latency And Cost

On Online Mind2Web, Lux reaches a success rate of 83.6 percent. The same benchmark reports 69.0 percent for Gemini CUA, 61.3 percent for OpenAI Operator and 61.0 percent for Claude Sonnet 4. The benchmark contains more than 300 web based tasks collected from real services, so it is a useful proxy for practical agents that drive browsers and web apps.

Latency and cost are where the numbers become important for engineering teams. The OpenAGI team reports that Lux completes each step in about 1 second, while OpenAI Operator is around 3 seconds per step in the same evaluation setting. The research team also states that Lux is about 10 times cheaper per token than Operator. For any agent that can easily run hundreds of steps in a session, these constant factors determine whether a workload is viable in production.
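A quick back-of-the-envelope calculation shows why these constant factors matter. The per-step latencies below are the figures reported above; the 300-step session length is an assumption chosen only for illustration.

# Rough wall-clock comparison for a long agent session.
# Per-step latencies are the reported figures; the 300-step session length
# is an illustrative assumption.
steps_per_session = 300
lux_seconds_per_step = 1.0
operator_seconds_per_step = 3.0

lux_minutes = steps_per_session * lux_seconds_per_step / 60           # 5.0 minutes
operator_minutes = steps_per_session * operator_seconds_per_step / 60  # 15.0 minutes

print(f"Lux:      {lux_minutes:.1f} min per session")
print(f"Operator: {operator_minutes:.1f} min per session")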

Agentic Active Pre-training and Why OSGym Matters

Lux is trained with a method that the OpenAGI research team calls Agentic Active Pre-training. The team contrasts this with standard language model pre-training that passively ingests text from the internet. The idea is that Lux learns by acting in digital environments and refining its behavior through large scale interaction, rather than only minimizing token prediction loss on static logs. The optimization objective differs from classical reinforcement learning, and is set up to favor self driven exploration and understanding instead of a manually shaped reward.

This training setup depends on a data engine that can expose many operating system environments in parallel. The OpenAGI team has already open sourced that engine as OSGym, under an MIT license that allows both research and commercial use. OSGym runs full operating system replicas, not only browser sandboxes, and supports tasks that span office software, browsers, development tools and multi application workflows.

Key Takeaways

Lux is a foundation computer use model that operates full desktops and browsers and reaches 83.6 percent success on the Online Mind2Web benchmark, ahead of Gemini CUA, OpenAI Operator and Claude Sonnet 4.

Lux exposes 3 modes, Actor, Thinker and Tasker, which cover low latency UI macros, multi step goal decomposition and deterministic scripted execution for production workflows.

Lux is reported to run around 1 second per step and to be about 10 times cheaper per token than OpenAI Operator, which matters for long horizon agents that run hundreds of actions per task.

Lux is trained with Agentic Active Pre-training, where the model learns by acting in environments, rather than only consuming static web text, which targets robust screen to action behavior instead of pure language modeling.

OSGym, the open source data engine behind Lux, can run more than 1,000 OS replicas and generate more than 1,400 multi turn trajectories per minute at low per replica cost, which gives teams a practical way to train and evaluate their own computer use agents.

Check out the Official Announcement, Project and Repo.
The post OpenAGI Foundation Launches Lux: A Foundation Computer Use Model that Tops Online Mind2Web with OSGym At Scale appeared first on MarkTechPost.

Kernel Principal Component Analysis (PCA): Explained with an Example

Dimensionality reduction techniques like PCA work wonderfully when datasets are linearly separable—but they break down the moment nonlinear patterns appear. That’s exactly what happens with datasets such as two moons: PCA flattens the structure and mixes the classes together. 

Kernel PCA fixes this limitation by mapping the data into a higher-dimensional feature space where nonlinear patterns become linearly separable. In this article, we’ll walk through how Kernel PCA works and use a simple example to visually compare PCA vs. Kernel PCA, showing how a nonlinear dataset that PCA fails to separate becomes perfectly separable after applying Kernel PCA.

What is PCA and how is it different from Kernel PCA?

Principal Component Analysis (PCA) is a linear dimensionality-reduction technique that identifies the directions (principal components) along which the data varies the most. It works by computing orthogonal linear combinations of the original features and projecting the dataset onto the directions of maximum variance. 

These components are uncorrelated and ordered so that the first few capture most of the information in the data. PCA is powerful, but it comes with one important limitation: it can only uncover linear relationships in the data. When applied to nonlinear datasets—like the “two moons” example—it often fails to separate the underlying structure.

Kernel PCA extends PCA to handle nonlinear relationships. Instead of directly applying PCA in the original feature space, Kernel PCA first uses a kernel function (such as RBF, polynomial, or sigmoid) to implicitly project the data into a higher-dimensional feature space where the nonlinear structure becomes linearly separable. 

PCA is then performed in this transformed space using a kernel matrix, without explicitly computing the higher-dimensional projection. This “kernel trick” allows Kernel PCA to capture complex patterns that standard PCA cannot.
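To make the kernel trick concrete, here is a minimal NumPy sketch of RBF Kernel PCA computed directly from the kernel matrix: we build the pairwise RBF kernel, center it in the implicit feature space, and take the leading eigenvectors, all without ever materializing the high-dimensional mapping. The gamma value is an arbitrary choice for illustration.

import numpy as np

def rbf_kernel_pca(X, gamma=15.0, n_components=2):
    # Pairwise squared Euclidean distances -> RBF kernel matrix K (n x n)
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * sq_dists)

    # Center the kernel matrix in the implicit feature space
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Eigendecomposition; projections are eigenvectors scaled by sqrt(eigenvalue)
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    idx = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))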

We will now create a dataset that is nonlinear and then apply PCA to the dataset.

Code Implementation

Generating the dataset

We generate a nonlinear “two moons” dataset using make_moons, which is ideal for demonstrating why PCA fails and Kernel PCA succeeds.

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.02, random_state=123)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Applying PCA on the dataset

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.title("PCA")
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()

The PCA visualization shows that the two moon-shaped clusters remain intertwined even after dimensionality reduction. This happens because PCA is a strictly linear technique—it can only rotate, scale, or flatten the data along straight directions of maximum variance. 

Since the “two moons” dataset has a nonlinear structure, PCA is unable to separate the classes or untangle the curved shapes. As a result, the transformed data still looks almost identical to the original pattern, and the two classes remain overlapped in the projected space.

Applying Kernel PCA on the dataset

We now apply Kernel PCA with an RBF kernel. The kernel function implicitly projects the dataset into a higher-dimensional space in which the two classes become linearly separable, and PCA is then performed on that representation.

from sklearn.decomposition import KernelPCA

kpca = KernelPCA(kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)

plt.title("Kernel PCA")
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y)
plt.show()

The goal of PCA (and dimensionality reduction in general) is not just to compress the data—it’s to reveal the underlying structure in a way that preserves meaningful variation. In nonlinear datasets like the two-moons example, traditional PCA cannot “unfold” the curved shapes because it only applies linear transformations.

Kernel PCA, however, performs a nonlinear mapping before applying PCA, allowing the algorithm to untangle the moons into two clearly separated clusters. This separation is valuable because it makes downstream tasks like visualization, clustering, and even classification far more effective. When the data becomes linearly separable after transformation, simple models—such as linear classifiers—can successfully distinguish between the classes, something that would be impossible in the original or PCA-transformed space.
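A quick way to verify this claim is to fit the same simple linear classifier on the PCA and Kernel PCA projections and compare accuracies. This is a minimal sketch reusing the X_pca and X_kpca arrays from above; exact scores will vary with the noise level and gamma, but the Kernel PCA features should be close to perfectly separable while the PCA features are not.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_accuracy(features, labels):
    # Hold out 30% of the points and score a plain logistic regression
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=123
    )
    clf = LogisticRegression().fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

print("Linear classifier on PCA features:       ", linear_accuracy(X_pca, y))
print("Linear classifier on Kernel PCA features:", linear_accuracy(X_kpca, y))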

Challenges involved with Kernel PCA

While Kernel PCA is powerful for handling nonlinear datasets, it comes with several practical challenges. The biggest drawback is computational cost—because it relies on computing pairwise similarities between all data points, the algorithm has O(n²) time and memory complexity, making it slow and memory-heavy for large datasets. 

Another challenge is model selection: choosing the right kernel (RBF, polynomial, etc.) and tuning parameters like gamma can be tricky and often requires experimentation or domain expertise. 

Kernel PCA can also be harder to interpret, since the transformed components no longer correspond to intuitive directions in the original feature space. Finally, it is sensitive to missing values and outliers, which can distort the kernel matrix and degrade performance.
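Because there is no direct reconstruction objective to optimize, a common workaround for the tuning problem is to select the kernel and gamma through a downstream task. The sketch below wraps Kernel PCA and a linear classifier in a pipeline and grid-searches gamma; the parameter grid is an arbitrary illustration, not a recommended default.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("kpca", KernelPCA(kernel="rbf", n_components=2)),
    ("clf", LogisticRegression()),
])

param_grid = {"kpca__gamma": [0.1, 1, 5, 15, 50]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best gamma:", search.best_params_["kpca__gamma"])
print("Cross-validated accuracy:", search.best_score_)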

Check out the FULL CODES here.
The post Kernel Principal Component Analysis (PCA): Explained with an Example appeared first on MarkTechPost.

How to Design a Fully Local Multi-Agent Orchestration System Using TinyLlama for Intelligent Task Decomposition and Autonomous Collaboration

In this tutorial, we explore how we can orchestrate a team of specialized AI agents locally using an efficient manager-agent architecture powered by TinyLlama. We walk through how we build structured task decomposition, inter-agent collaboration, and autonomous reasoning loops without relying on any external APIs. By running everything directly through the transformers library, we create a fully offline, lightweight, and transparent multi-agent system that we can customize, inspect, and extend. Through the snippets, we observe how each component, from task structures to agent prompts to result synthesis, comes together to form a coherent human-AI workflow that we control end-to-end. Check out the FULL CODES here.

!pip install transformers torch accelerate bitsandbytes -q

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import json
import re
from typing import List, Dict, Any
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class Task:
    id: str
    description: str
    assigned_to: str = None
    status: str = "pending"
    result: Any = None
    dependencies: List[str] = None

    def __post_init__(self):
        if self.dependencies is None:
            self.dependencies = []

@dataclass
class Agent:
    name: str
    role: str
    expertise: str
    system_prompt: str

We set up all the core imports and define the fundamental data structures needed to manage tasks and agents. We define Task and Agent as structured entities to cleanly orchestrate work. By doing this, we ensure that every part of the system has a consistent and reliable foundation. Check out the FULL CODES here.

AGENT_REGISTRY = {
    "researcher": Agent(
        name="researcher",
        role="Research Specialist",
        expertise="Information gathering, analysis, and synthesis",
        system_prompt="You are a research specialist. Provide thorough research on topics."
    ),
    "coder": Agent(
        name="coder",
        role="Software Engineer",
        expertise="Writing clean, efficient code with best practices",
        system_prompt="You are an expert programmer. Write clean, well-documented code."
    ),
    "writer": Agent(
        name="writer",
        role="Content Writer",
        expertise="Clear communication and documentation",
        system_prompt="You are a professional writer. Create clear, engaging content."
    ),
    "analyst": Agent(
        name="analyst",
        role="Data Analyst",
        expertise="Data interpretation and insights",
        system_prompt="You are a data analyst. Provide clear insights from data."
    )
}

class LocalLLM:
    def __init__(self, model_name: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16
        ) if torch.cuda.is_available() else None
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quantization_config,
            device_map="auto",
            low_cpu_mem_usage=True
        )
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def generate(self, prompt: str, max_tokens: int = 300) -> str:
        formatted_prompt = f"<|system|>\nYou are a helpful AI assistant.</s>\n<|user|>\n{prompt}</s>\n<|assistant|>\n"
        inputs = self.tokenizer(
            formatted_prompt,
            return_tensors="pt",
            truncation=True,
            max_length=1024,
            padding=True
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=0.7,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.tokenizer.pad_token_id,
                eos_token_id=self.tokenizer.eos_token_id,
                use_cache=True
            )
        full_response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        if "<|assistant|>" in full_response:
            return full_response.split("<|assistant|>")[-1].strip()
        return full_response[len(formatted_prompt):].strip()

We register all our specialized agents and implement the local LLM wrapper that powers the system. We load TinyLlama in an efficient 4-bit mode so we can run everything smoothly on Colab or local hardware. With this, we give ourselves a flexible and fully local way to generate responses for each agent. Check out the FULL CODES here.

class ManagerAgent:
    def __init__(self, model_name: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
        self.llm = LocalLLM(model_name)
        self.agents = AGENT_REGISTRY
        self.tasks: Dict[str, Task] = {}
        self.execution_log = []

    def log(self, message: str):
        timestamp = datetime.now().strftime("%H:%M:%S")
        log_entry = f"[{timestamp}] {message}"
        self.execution_log.append(log_entry)
        print(log_entry)

    def decompose_goal(self, goal: str) -> List[Task]:
        self.log(f"Decomposing goal: {goal}")
        agent_info = "\n".join([f"- {name}: {agent.expertise}" for name, agent in self.agents.items()])
        prompt = f"""Break down this goal into 3 specific subtasks. Assign each to the best agent.

Goal: {goal}

Available agents:
{agent_info}

Respond ONLY with a JSON array."""
        response = self.llm.generate(prompt, max_tokens=250)
        try:
            json_match = re.search(r'\[\s*\{.*?\}\s*\]', response, re.DOTALL)
            if json_match:
                tasks_data = json.loads(json_match.group())
            else:
                raise ValueError("No JSON found")
        except:
            tasks_data = self._create_default_tasks(goal)

        tasks = []
        for i, task_data in enumerate(tasks_data[:3]):
            task = Task(
                id=task_data.get('id', f'task_{i+1}'),
                description=task_data.get('description', f'Work on: {goal}'),
                assigned_to=task_data.get('assigned_to', list(self.agents.keys())[i % len(self.agents)]),
                dependencies=task_data.get('dependencies', [] if i == 0 else [f'task_{i}'])
            )
            self.tasks[task.id] = task
            tasks.append(task)
            self.log(f"  ✓ {task.id}: {task.description[:50]}... → {task.assigned_to}")

        return tasks

We begin constructing the ManagerAgent class and focus on how we decompose a high-level goal into well-defined subtasks. We generate structured JSON-based tasks and automatically assign them to the right agent. By doing this, we allow the system to think step by step and organize work just like a human project manager. Check out the FULL CODES here.

    # These methods continue the ManagerAgent class defined above.
    def _create_default_tasks(self, goal: str) -> List[Dict]:
        if any(word in goal.lower() for word in ['code', 'program', 'implement', 'algorithm']):
            return [
                {"id": "task_1", "description": f"Research and explain the concept: {goal}", "assigned_to": "researcher", "dependencies": []},
                {"id": "task_2", "description": f"Write code implementation for: {goal}", "assigned_to": "coder", "dependencies": ["task_1"]},
                {"id": "task_3", "description": "Create documentation and examples", "assigned_to": "writer", "dependencies": ["task_2"]}
            ]
        return [
            {"id": "task_1", "description": f"Research: {goal}", "assigned_to": "researcher", "dependencies": []},
            {"id": "task_2", "description": "Analyze findings and structure content", "assigned_to": "analyst", "dependencies": ["task_1"]},
            {"id": "task_3", "description": "Write comprehensive response", "assigned_to": "writer", "dependencies": ["task_2"]}
        ]

    def execute_task(self, task: Task, context: Dict[str, Any] = None) -> str:
        self.log(f"Executing {task.id} with {task.assigned_to}")
        task.status = "in_progress"
        agent = self.agents[task.assigned_to]
        context_str = ""
        if context and task.dependencies:
            context_str = "\n\nContext from previous tasks:\n"
            for dep_id in task.dependencies:
                if dep_id in context:
                    context_str += f"- {context[dep_id][:150]}...\n"

        prompt = f"""{agent.system_prompt}

Task: {task.description}{context_str}

Provide a clear, concise response:"""
        result = self.llm.generate(prompt, max_tokens=250)
        task.result = result
        task.status = "completed"
        self.log(f"  ✓ Completed {task.id}")
        return result

We define fallback task logic and the full execution flow for each task. We guide each agent with its own system prompt and provide contextual information to keep results coherent. This allows us to execute tasks intelligently while respecting dependency order. Check out the FULL CODES here.

    # These methods also belong to the ManagerAgent class.
    def synthesize_results(self, goal: str, results: Dict[str, str]) -> str:
        self.log("Synthesizing final results")
        results_text = "\n\n".join([f"Task {tid}:\n{res[:200]}" for tid, res in results.items()])
        prompt = f"""Combine these task results into one final coherent answer.

Original Goal: {goal}

Task Results:
{results_text}

Final comprehensive answer:"""
        return self.llm.generate(prompt, max_tokens=350)

    def execute_goal(self, goal: str) -> Dict[str, Any]:
        self.log(f"\n{'='*60}\nStarting Manager Agent\n{'='*60}")
        tasks = self.decompose_goal(goal)
        results = {}
        completed = set()
        max_iterations = len(tasks) * 2
        iteration = 0

        while len(completed) < len(tasks) and iteration < max_iterations:
            iteration += 1
            for task in tasks:
                if task.id in completed:
                    continue
                deps_met = all(dep in completed for dep in task.dependencies)
                if deps_met:
                    result = self.execute_task(task, results)
                    results[task.id] = result
                    completed.add(task.id)

        final_output = self.synthesize_results(goal, results)
        self.log(f"\n{'='*60}\nExecution Complete!\n{'='*60}\n")

        return {
            "goal": goal,
            "tasks": [asdict(task) for task in tasks],
            "final_output": final_output,
            "execution_log": self.execution_log
        }

We synthesize the outputs from all subtasks and convert them into one unified final answer. We also implement an orchestration loop that ensures each task runs only after its dependencies are complete. This snippet shows how we bring everything together into a smooth multi-step reasoning pipeline. Check out the FULL CODES here.

def demo_basic():
    manager = ManagerAgent()
    goal = "Explain binary search algorithm with a simple example"
    result = manager.execute_goal(goal)
    print("\n" + "="*60)
    print("FINAL OUTPUT")
    print("="*60)
    print(result["final_output"])
    return result

def demo_coding():
    manager = ManagerAgent()
    goal = "Implement a function to find the maximum element in a list"
    result = manager.execute_goal(goal)
    print("\n" + "="*60)
    print("FINAL OUTPUT")
    print("="*60)
    print(result["final_output"])
    return result

def demo_custom(custom_goal: str):
    manager = ManagerAgent()
    result = manager.execute_goal(custom_goal)
    print("\n" + "="*60)
    print("FINAL OUTPUT")
    print("="*60)
    print(result["final_output"])
    return result

if __name__ == "__main__":
    print("Manager Agent Tutorial - APIless Local Version")
    print("="*60)
    print("Using TinyLlama (1.1B) - Fast & efficient!\n")
    result = demo_basic()
    print("\n\nTry more:")
    print(" - demo_coding()")
    print(" - demo_custom('your goal here')")

We provide demonstration functions to easily test our system with different goals. We run sample tasks to observe how the manager decomposes, executes, and synthesizes work in real time. This gives us an interactive way to understand the entire workflow and refine it further.

In conclusion, we demonstrate how to design and operate a complete multi-agent orchestration system locally with minimal dependencies. We now understand how the manager breaks down goals, routes tasks to the right expert agents, collects their outputs, resolves dependencies, and synthesizes the final result. This implementation allows us to appreciate how modular, predictable, and powerful local agentic patterns can be when built from scratch.

Check out the FULL CODES here.
The post How to Design a Fully Local Multi-Agent Orchestration System Using TinyLlama for Intelligent Task Decomposition and Autonomous Collaboration appeared first on MarkTechPost.

Apple Researchers Release CLaRa: A Continuous Latent Reasoning Framework for Compression‑Native RAG with 16x–128x Semantic Document Compression

How do you keep RAG systems accurate and efficient when every query tries to stuff thousands of tokens into the context window and the retriever and generator are still optimized as 2 separate, disconnected systems? A team of researchers from Apple and the University of Edinburgh has released CLaRa (Continuous Latent Reasoning), a retrieval augmented generation framework, along with three models (CLaRa-7B-Base, CLaRa-7B-Instruct and CLaRa-7B-E2E), that compresses documents into continuous memory tokens and then performs both retrieval and generation in that shared latent space. The goal is simple: shorten context, avoid double encoding, and let the generator teach the retriever what actually matters for downstream answers.

Source: https://arxiv.org/pdf/2511.18659

From raw documents to continuous memory tokens

CLaRa starts with a semantic compressor that attaches a small number of learned memory tokens to each document. During Salient Compressor Pretraining, SCP, the base model is a Mistral 7B style transformer with LoRA adapters that switch between a compressor role and a generator role. The final layer hidden states of the memory tokens become the compressed representation for that document.

SCP is trained on about 2M passages from Wikipedia 2021. A local Qwen-32B model generates 3 supervision signals for each passage. Simple QA pairs cover atomic facts. Complex QA pairs connect several facts in one question to enforce multi hop reasoning. Paraphrases reorder and compress the text while preserving semantics. A verification loop checks factual consistency and coverage and can regenerate missing questions or paraphrases for up to 10 rounds before accepting a sample.

Training uses 2 losses. A cross entropy term trains the generator to answer questions or produce paraphrases conditioned only on the memory tokens and an instruction prefix. A mean squared error term aligns the average hidden state of document tokens with the average hidden state of the memory tokens. The MSE loss gives modest but consistent gains of about 0.3 to 0.6 F1 points at compression ratios 32 and 128 and keeps compressed and original representations in the same semantic region.


Joint retrieval and generation in a shared space

After offline compression, each document is represented only by its memory tokens. CLaRa then trains a query reasoner and an answer generator on top of the same backbone. The query reasoner is another LoRA adapter that maps an input question into the same number of memory tokens used for documents. Retrieval becomes pure embedding search. The system computes cosine similarity between the query embedding and each candidate document embedding.

The best compressed document embeddings for a query are concatenated with the query tokens and fed into the generator adapter. Training uses only a standard next token prediction loss on the final answer. There are no explicit relevance labels. The key trick is a differentiable top k selector implemented with a Straight Through estimator. During the forward pass the model uses hard top k selection. During the backward pass a softmax distribution over document scores allows gradients from the generator to flow into the query reasoner parameters.
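The straight-through top-k trick can be sketched in a few lines of PyTorch. This is a generic illustration of the estimator described above, not the CLaRa training code: the forward pass keeps a hard top-k document mask, while the backward pass routes gradients through a softmax over the retrieval scores.

import torch
import torch.nn.functional as F

def straight_through_topk(scores: torch.Tensor, k: int, temperature: float = 1.0):
    # scores: (num_docs,) cosine similarities between the query embedding and
    # each compressed document embedding.
    soft = F.softmax(scores / temperature, dim=-1)               # differentiable surrogate
    topk_idx = scores.topk(k).indices
    hard = torch.zeros_like(scores).scatter_(0, topk_idx, 1.0)   # hard 0/1 selection
    # Straight-through: forward value equals `hard`, gradient flows through `soft`
    return hard + soft - soft.detach()

# Toy usage: gradients from a downstream loss reach the scoring parameters
scores = torch.randn(20, requires_grad=True)
mask = straight_through_topk(scores, k=5)
loss = (mask * torch.arange(20.0)).sum()   # stand-in for the language modeling loss
loss.backward()
print(scores.grad.shape)  # torch.Size([20])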

The research team shows 2 effects in the gradient analysis. First, the retriever is encouraged to assign higher probability to documents that increase answer likelihood. Second, because retrieval and generation share the same compressed representations, generator gradients reshape the latent document space to make it easier to reason over. Logit lens analysis of the query embeddings recovers topic tokens such as “NFL” and “Oklahoma” for a question about the nephew of Ivory Lee Brown, even though those tokens are not in the raw query but are present in the supporting articles.


Compression quality and QA accuracy

The compressor is evaluated on 4 QA datasets: Natural Questions, HotpotQA, MuSiQue and 2WikiMultihopQA. Under the Normal setting, where the system retrieves the top 5 Wikipedia 2021 documents per query, SCP-Mistral-7B at 4 times compression reaches an average F1 of 39.86. This is 5.37 points better than the hard compression baseline LLMLingua 2 and 1.13 points better than the best soft compression baseline PISCO.

Under the Oracle setting, where the gold document is guaranteed to be in the candidate set, SCP-Mistral-7B at 4 times compression reaches an average F1 of 66.76. That is 17.31 points above LLMLingua-2 and 5.35 points above PISCO. Even more interesting, the compressed representations outperform a BGE based text retriever plus full document Mistral-7B generator by about 2.36 average F1 points for Mistral and about 6.36 points for Phi 4 mini. Well trained soft compression can exceed full text RAG while cutting context length by factors from 4 to 128.


Performance at very high compression ratios, above 32 in the Oracle setting, does drop, but the decline remains moderate under Normal retrieval conditions. The key explanation, according to the research team, is that weak document relevance bottlenecks the system before compression quality does.

End to end QA and retrieval behavior

For end to end QA, CLaRa uses 20 candidate documents per query with compression ratios 4, 16 and 32. On the Normal setting, CLaRa-Mistral-7B with instruction initialized weights and 16 times compression reaches F1 equal to 50.89 on Natural Questions and 44.66 on 2WikiMultihopQA. This is comparable to DRO-Mistral-7B, which reads full uncompressed text, while using 16 times shorter document representations. On some datasets, CLaRa at 16 times compression slightly improves F1 over DRO, for example from 43.65 to 47.18 on 2Wiki.

In the Oracle setting, CLaRa-Mistral-7B exceeds 75 F1 on both Natural Questions and HotpotQA at 4 times compression. This shows that the generator can fully exploit accurate retrieval even when all evidence is stored only in compressed memory tokens. Instruction initialized CLaRa generally wins over pre-training initialized CLaRa in the Normal setting, while the gap narrows in Oracle, where retrieval noise is limited.

On the retrieval side, CLaRa used as a reranker under Oracle conditions delivers strong Recall at 5. With pretraining initialization at compression 4 on HotpotQA, CLaRa-Mistral-7B reaches Recall at 5 equal to 96.21. This beats the supervised BGE Reranker baseline at 85.93 by 10.28 points and even outperforms a fully supervised Sup Instruct retriever trained with contrastive relevance labels.


What Apple Has Released

Apple’s research team released 3 models on Hugging Face: CLaRa-7B-Base, CLaRa-7B-Instruct and CLaRa-7B-E2E. CLaRa-7B-Instruct is described as an instruction tuned unified RAG model with built in document compression at 16 and 128 times. It answers instruction style questions directly from compressed representations and uses Mistral-7B-Instruct v0.2 as the base model.

Key Takeaways

CLaRa replaces raw documents with a small set of continuous memory tokens learned via QA guided and paraphrase guided semantic compression, which preserves key reasoning signals even at 16 times and 128 times compression.

Retrieval and generation are trained in a single shared latent space, the query encoder and generator share the same compressed representations and are optimized together with one language modeling loss.

A differentiable top-k estimator lets gradients flow from answer tokens back into the retriever, which aligns document relevance with answer quality and removes the usual disjoint tuning loop for RAG systems.

On QA benchmarks like Natural Questions, HotpotQA, MuSiQue and 2WikiMultihopQA, CLaRa’s SCP compressor at 4 times compression outperforms strong text based baselines such as LLMLingua 2 and PISCO and can even beat full text BGE/Mistral pipelines on average F1.

Apple has released 3 practical models, CLaRa-7B-Base, CLaRa-7B-Instruct and CLaRa-7B-E2E, along with the full training pipeline on GitHub.

Editorial Notes

CLaRa is an important step for retrieval augmented generation because it treats semantic document compression and joint optimization in a shared continuous space as first class citizens, not afterthoughts bolted onto a text only pipeline. It shows that embedding based compression with SCP, combined with end to end training via a differentiable top-k estimator and a single language modeling loss, can match or surpass text based RAG baselines while using far shorter contexts and simpler retrieval stacks. Overall, CLaRa demonstrates that unified continuous latent reasoning is a credible alternative to classic chunk and retrieve RAG for real world QA workloads.

Check out the Paper, Model Weights on HF and Repo.
The post Apple Researchers Release CLaRa: A Continuous Latent Reasoning Framework for Compression‑Native RAG with 16x–128x Semantic Document Compression appeared first on MarkTechPost.

AI Interview Series #4: Transformers vs Mixture of Experts (MoE)

Question:

MoE models contain far more parameters than Transformers, yet they can run faster at inference. How is that possible?

Difference between Transformers & Mixture of Experts (MoE)

Transformers and Mixture of Experts (MoE) models share the same backbone architecture—self-attention layers followed by feed-forward layers—but they differ fundamentally in how they use parameters and compute.

Feed-Forward Network vs Experts

Transformer: Each block contains a single large feed-forward network (FFN). Every token passes through this FFN, activating all parameters during inference.

MoE: Replaces the FFN with multiple smaller feed-forward networks, called experts. A routing network selects only a few experts (Top-K) per token, so only a small fraction of total parameters is active.

Parameter Usage

Transformer: All parameters across all layers are used for every token → dense compute.

MoE: Has more total parameters, but activates only a small portion per token → sparse compute. Example: Mixtral 8×7B has 46.7B total parameters, but uses only ~13B per token.

Inference Cost

Transformer: High inference cost due to full parameter activation. Scaling to models like GPT-4 or Llama 2 70B requires powerful hardware.

MoE: Lower inference cost because only K experts per layer are active. This makes MoE models faster and cheaper to run, especially at large scales.

Token Routing

Transformer: No routing. Every token follows the exact same path through all layers.

MoE: A learned router assigns tokens to experts based on softmax scores. Different tokens select different experts, and different layers may activate different experts, which increases specialization and model capacity (see the routing sketch below).
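Here is a minimal PyTorch sketch of the routing step described above: a linear router scores the experts, only the top-k expert FFNs run for each token, and their outputs are combined with the softmax weights. The layer sizes and k=2 are arbitrary illustration choices, not any particular model's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)   # learned routing scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (num_tokens, d_model)
        logits = self.router(x)                  # (num_tokens, num_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):               # only k experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)
print(TopKMoELayer()(tokens).shape)   # torch.Size([10, 64])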

Model Capacity

Transformer: To scale capacity, the only option is adding more layers or widening the FFN—both increase FLOPs heavily.

MoE: Can scale total parameters massively without increasing per-token compute. This enables “bigger brains at lower runtime cost.”

While MoE architectures offer massive capacity with lower inference cost, they introduce several training challenges. The most common issue is expert collapse, where the router repeatedly selects the same experts, leaving others under-trained. 

Load imbalance is another challenge—some experts may receive far more tokens than others, leading to uneven learning. To address this, MoE models rely on techniques like noise injection in routing, Top-K masking, and expert capacity limits. 

These mechanisms ensure all experts stay active and balanced, but they also make MoE systems more complex to train compared to standard Transformers.
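One common remedy, an auxiliary load-balancing loss in the style popularized by Switch Transformer-like models, can be sketched as follows. It penalizes the router when the fraction of tokens dispatched to each expert and the average routing probability drift away from a uniform split; the exact coefficient and formulation vary between papers, so treat this as an illustrative sketch rather than any specific model's recipe.

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts) raw routing scores
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)                       # soft routing probabilities
    topk_idx = router_logits.topk(k, dim=-1).indices
    # Average hard dispatch count per expert (how many of each token's k slots it receives)
    dispatch = F.one_hot(topk_idx, num_experts).float().sum(dim=1).mean(dim=0)
    # Average soft routing probability per expert
    importance = probs.mean(dim=0)
    # The penalty is smallest when both quantities are uniform across experts
    return num_experts * torch.sum(dispatch * importance)

logits = torch.randn(32, 8)
print(load_balancing_loss(logits))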

AI Interview Series #3: Explain Federated Learning

The post AI Interview Series #4: Transformers vs Mixture of Experts (MoE) appeared first on MarkTechPost.

How to Build a Meta-Cognitive AI Agent That Dynamically Adjusts Its Own Reasoning Depth for Efficient Problem Solving

In this tutorial, we build an advanced meta-cognitive control agent that learns how to regulate its own depth of thinking. We treat reasoning as a spectrum, ranging from fast heuristics to deep chain-of-thought to precise tool-like solving, and we train a neural meta-controller to decide which mode to use for each task. By optimizing the trade-off between accuracy, computation cost, and a limited reasoning budget, we explore how an agent can monitor its internal state and adapt its reasoning strategy in real time. Through each snippet, we experiment, observe patterns, and understand how meta-cognition emerges when an agent learns to think about its own thinking. Check out the FULL CODE NOTEBOOK.

import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Device used for the policy network and state tensors in the later snippets
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

OPS = ['+', '*']

def make_task():
    op = random.choice(OPS)
    if op == '+':
        a, b = random.randint(1, 99), random.randint(1, 99)
    else:
        a, b = random.randint(2, 19), random.randint(2, 19)
    return a, b, op

def true_answer(a, b, op):
    return a + b if op == '+' else a * b

def true_difficulty(a, b, op):
    if op == '+' and a <= 30 and b <= 30:
        return 0
    if op == '*' and a <= 10 and b <= 10:
        return 1
    return 2

def heuristic_difficulty(a, b, op):
    score = 0
    if op == '*':
        score += 0.6
    score += max(a, b) / 100.0
    return min(score, 1.0)

def fast_heuristic(a, b, op):
    if op == '+':
        base = a + b
        noise = random.choice([-2, -1, 0, 0, 0, 1, 2, 3])
    else:
        base = int(0.8 * a * b)
        noise = random.choice([-5, -3, 0, 0, 2, 5, 8])
    return base + noise, 0.5

def deep_chain_of_thought(a, b, op, verbose=False):
    if op == '+':
        x, y = a, b
        carry = 0
        pos = 1
        result = 0
        step = 0
        while x > 0 or y > 0 or carry:
            dx, dy = x % 10, y % 10
            s = dx + dy + carry
            carry, digit = divmod(s, 10)
            result += digit * pos
            x //= 10; y //= 10; pos *= 10
            step += 1
    else:
        result = 0
        step = 0
        for i, d in enumerate(reversed(str(b))):
            row = a * int(d) * (10 ** i)
            result += row
            step += 1
    return result, max(2.0, 0.4 * step)

def tool_solver(a, b, op):
    return eval(f"{a}{op}{b}"), 1.2

ACTION_NAMES = ["fast", "deep", "tool"]

We set up the world our meta-agent operates in. We generate arithmetic tasks, define ground-truth answers, estimate difficulty, and implement three different reasoning modes. As we run it, we observe how each solver behaves differently in terms of accuracy and computational cost, which form the foundation of the agent’s decision space. Check out the FULL CODE NOTEBOOK.

def encode_state(a, b, op, rem_budget, error_ema, last_action):
    a_n = a / 100.0
    b_n = b / 100.0
    op_plus = 1.0 if op == '+' else 0.0
    op_mul = 1.0 - op_plus
    diff_hat = heuristic_difficulty(a, b, op)
    rem_n = rem_budget / MAX_BUDGET
    last_onehot = [0.0, 0.0, 0.0]
    if last_action is not None:
        last_onehot[last_action] = 1.0
    feats = [
        a_n, b_n, op_plus, op_mul,
        diff_hat, rem_n, error_ema
    ] + last_onehot
    return torch.tensor(feats, dtype=torch.float32, device=device)

STATE_DIM = 10
N_ACTIONS = 3

class PolicyNet(nn.Module):
    def __init__(self, state_dim, hidden=48, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions)
        )

    def forward(self, x):
        return self.net(x)

policy = PolicyNet(STATE_DIM, hidden=48, n_actions=N_ACTIONS).to(device)
optimizer = optim.Adam(policy.parameters(), lr=3e-3)

We encode each task into a structured state that captures operands, operation type, predicted difficulty, remaining budget, and recent performance. We then define a neural policy network that maps this state to a probability distribution over actions. As we work through it, we see how the policy becomes the core mechanism through which the agent learns to regulate its thinking. Check out the FULL CODE NOTEBOOK.

GAMMA = 0.98
COST_PENALTY = 0.25
MAX_BUDGET = 25.0
EPISODES = 600
STEPS_PER_EP = 20
ERROR_EMA_DECAY = 0.9

def run_episode(train=True):
    log_probs = []
    rewards = []
    info = []
    rem_budget = MAX_BUDGET
    error_ema = 0.0
    last_action = None

    for _ in range(STEPS_PER_EP):
        a, b, op = make_task()
        state = encode_state(a, b, op, rem_budget, error_ema, last_action)
        logits = policy(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample() if train else torch.argmax(logits)
        act_idx = int(action.item())

        if act_idx == 0:
            pred, cost = fast_heuristic(a, b, op)
        elif act_idx == 1:
            pred, cost = deep_chain_of_thought(a, b, op, verbose=False)
        else:
            pred, cost = tool_solver(a, b, op)

        correct = (pred == true_answer(a, b, op))
        acc_reward = 1.0 if correct else 0.0
        budget_penalty = 0.0

        rem_budget -= cost
        if rem_budget < 0:
            budget_penalty = -1.5 * (abs(rem_budget) / MAX_BUDGET)

        step_reward = acc_reward - COST_PENALTY * cost + budget_penalty
        rewards.append(step_reward)

        if train:
            log_probs.append(dist.log_prob(action))

        err = 0.0 if correct else 1.0
        error_ema = ERROR_EMA_DECAY * error_ema + (1 - ERROR_EMA_DECAY) * err
        last_action = act_idx

        info.append({
            "correct": correct,
            "cost": cost,
            "difficulty": true_difficulty(a, b, op),
            "action": act_idx
        })

    if train:
        returns = []
        G = 0.0
        for r in reversed(rewards):
            G = r + GAMMA * G
            returns.append(G)
        returns = list(reversed(returns))
        returns_t = torch.tensor(returns, dtype=torch.float32, device=device)
        baseline = returns_t.mean()
        adv = returns_t - baseline
        loss = -(torch.stack(log_probs) * adv).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return rewards, info

We implement the heart of learning using the REINFORCE policy gradient algorithm. We run multi-step episodes, collect log-probabilities, accumulate rewards, and compute returns. As we execute this part, we watch the meta-controller adjust its strategy by reinforcing decisions that balance accuracy with cost. Check out the FULL CODE NOTEBOOK.

print("Training meta-cognitive controller...")
for ep in range(EPISODES):
    rewards, _ = run_episode(train=True)
    if (ep + 1) % 100 == 0:
        print(f"  episode {ep+1:4d} | avg reward {np.mean(rewards):.3f}")

def evaluate(n_episodes=50):
    all_actions = {0: [0, 0, 0], 1: [0, 0, 0], 2: [0, 0, 0]}
    stats = {0: {"n": 0, "acc": 0, "cost": 0},
             1: {"n": 0, "acc": 0, "cost": 0},
             2: {"n": 0, "acc": 0, "cost": 0}}

    for _ in range(n_episodes):
        _, info = run_episode(train=False)
        for step in info:
            d = step["difficulty"]
            a_idx = step["action"]
            all_actions[d][a_idx] += 1
            stats[d]["n"] += 1
            stats[d]["acc"] += 1 if step["correct"] else 0
            stats[d]["cost"] += step["cost"]

    for d in [0, 1, 2]:
        if stats[d]["n"] == 0:
            continue
        n = stats[d]["n"]
        print(f"Difficulty {d}:")
        print("  action counts [fast, deep, tool]:", all_actions[d])
        print("  accuracy:", stats[d]["acc"] / n)
        print("  avg cost:", stats[d]["cost"] / n)
        print()

print("Policy behavior by difficulty:")
evaluate()

We train the meta-cognitive agent over hundreds of episodes and evaluate its behavior across difficulty levels. We observe how the policy evolves, using fast heuristics for simple tasks while resorting to deeper reasoning for harder ones. As we analyze the outputs, we understand how training shapes the agent’s reasoning choices. Check out the FULL CODE NOTEBOOK.

print("\nExample hard task with meta-selected thinking mode:")
a, b, op = 47, 18, '*'
state = encode_state(a, b, op, MAX_BUDGET, 0.3, None)
with torch.no_grad():
    logits = policy(state)
    act = int(torch.argmax(logits).item())

print(f"Task: {a} {op} {b}")
print("Chosen mode:", ACTION_NAMES[act])

if act == 1:
    pred, cost = deep_chain_of_thought(a, b, op, verbose=True)
elif act == 0:
    pred, cost = fast_heuristic(a, b, op)
    print("Fast heuristic:", pred)
else:
    pred, cost = tool_solver(a, b, op)
    print("Tool solver:", pred)

print("True:", true_answer(a, b, op), "| cost:", cost)

We inspect a detailed reasoning trace for a hard example chosen by the trained policy. We see the agent confidently pick a mode and walk through the reasoning steps, allowing us to witness its meta-cognitive behavior in action. As we test different tasks, we appreciate how the model adapts its thinking based on context.

In conclusion, we have seen how a neural controller can learn to dynamically choose the most effective reasoning pathway based on the task’s difficulty and the constraints of the moment. We observe how the agent gradually discovers when quick heuristics are sufficient, when deeper reasoning is necessary, and when calling a precise solver is worth the cost. Through this process, we experience how metacognitive control transforms decision-making, leading to more efficient and adaptable reasoning systems.

Check out the FULL CODE NOTEBOOK.
The post How to Build a Meta-Cognitive AI Agent That Dynamically Adjusts Its Own Reasoning Depth for Efficient Problem Solving appeared first on MarkTechPost.

NVIDIA and Mistral AI Bring 10x Faster Inference for the Mistral 3 Family on GB200 NVL72 GPU Systems

NVIDIA announced today a significant expansion of its strategic collaboration with Mistral AI. This partnership coincides with the release of the new Mistral 3 frontier open model family, marking a pivotal moment where hardware acceleration and open-source model architecture have converged to redefine performance benchmarks.

The headline result of this collaboration is a massive leap in inference speed: the new models now run up to 10x faster on NVIDIA GB200 NVL72 systems compared to the previous generation H200 systems. This breakthrough unlocks unprecedented efficiency for enterprise-grade AI, promising to solve the latency and cost bottlenecks that have historically plagued the large-scale deployment of reasoning models.

A Generational Leap: 10x Faster on Blackwell

As enterprise demand shifts from simple chatbots to high-reasoning, long-context agents, inference efficiency has become the critical bottleneck. The collaboration between NVIDIA and Mistral AI addresses this head-on by optimizing the Mistral 3 family specifically for the NVIDIA Blackwell architecture.

Where production AI systems must deliver both strong user experience (UX) and cost-efficient scale, the NVIDIA GB200 NVL72 provides up to 10x higher performance than the previous-generation H200. This is not merely a gain in raw speed; it translates to significantly higher energy efficiency. The system exceeds 5,000,000 tokens per second per megawatt (MW) at user interactivity rates of 40 tokens per second.

Chart created by MarkTechPost.com; source: NVIDIA

For data centers grappling with power constraints, this efficiency gain is as critical as the performance boost itself. This generational leap ensures a lower per-token cost while maintaining the high throughput required for real-time applications.

A New Mistral 3 Family

The engine driving this performance is the newly released Mistral 3 family. This suite of models delivers industry-leading accuracy, efficiency, and customization capabilities, covering the spectrum from massive data center workloads to edge device inference.

Mistral Large 3: The Flagship MoE

At the top of the hierarchy sits Mistral Large 3, a state-of-the-art sparse Multimodal and Multilingual Mixture-of-Experts (MoE) model.

Total Parameters: 675 Billion

Active Parameters: 41 Billion

Context Window: 256K tokens

Trained on NVIDIA Hopper GPUs, Mistral Large 3 is designed to handle complex reasoning tasks, offering parity with top-tier closed models while retaining the flexibility of open weights.

Ministral 3: Dense Power at the Edge

Complementing the large model is the Ministral 3 series, a suite of small, dense, high-performance models designed for speed and versatility.

Sizes: 3B, 8B, and 14B parameters.

Variants: Base, Instruct, and Reasoning for each size (nine models total).

Context Window: 256K tokens across the board.

The Ministral 3 series excels on the GPQA Diamond accuracy benchmark, delivering higher accuracy while using far fewer tokens.

Significant Engineering Behind the Speed: A Comprehensive Optimization Stack

The “10x” performance claim is driven by a comprehensive stack of optimizations co-developed by Mistral and NVIDIA engineers. The teams adopted an “extreme co-design” approach, merging hardware capabilities with model architecture adjustments.

TensorRT-LLM Wide Expert Parallelism (Wide-EP)

To fully exploit the massive scale of the GB200 NVL72, NVIDIA employed Wide Expert Parallelism within TensorRT-LLM. This technology provides optimized MoE GroupGEMM kernels, expert distribution, and load balancing.

Crucially, Wide-EP exploits the NVL72’s coherent memory domain and NVLink fabric. It is highly resilient to architectural variations across large MoEs. For instance, Mistral Large 3 utilizes roughly 128 experts per layer, about half as many as comparable models like DeepSeek-R1. Despite this difference, Wide-EP enables the model to realize the high-bandwidth, low-latency, non-blocking benefits of the NVLink fabric, ensuring that the model’s massive size does not result in communication bottlenecks.

Native NVFP4 Quantization

One of the most significant technical advancements in this release is the support for NVFP4, a quantization format native to the Blackwell architecture.

For Mistral Large 3, developers can deploy a compute-optimized NVFP4 checkpoint quantized offline using the open-source llm-compressor library.

This approach reduces compute and memory costs while strictly maintaining accuracy. It leverages NVFP4’s higher-precision FP8 scaling factors and finer-grained block scaling to control quantization error. The recipe specifically targets the MoE weights while keeping other components at original precision, allowing the model to deploy seamlessly on the GB200 NVL72 with minimal accuracy loss.

Disaggregated Serving with NVIDIA Dynamo

Mistral Large 3 utilizes NVIDIA Dynamo, a low-latency distributed inference framework, to disaggregate the prefill and decode phases of inference.

In traditional setups, the prefill phase (processing the input prompt) and the decode phase (generating the output) compete for resources. By rate-matching and disaggregating these phases, Dynamo significantly boosts performance for long-context workloads, such as 8K input/1K output configurations. This ensures high throughput even when utilizing the model’s massive 256K context window.

From Cloud to Edge: Ministral 3 Performance

The optimization efforts extend beyond the massive data centers. Recognizing the growing need for local AI, the Ministral 3 series is engineered for edge deployment, offering flexibility for a variety of needs.

RTX and Jetson Acceleration

The dense Ministral models are optimized for platforms like the NVIDIA GeForce RTX AI PC and NVIDIA Jetson robotics modules.

RTX 5090: The Ministral-3B variants can reach blistering inference speeds of 385 tokens per second on the NVIDIA RTX 5090 GPU. This brings workstation-class AI performance to local PCs, enabling fast iteration and greater data privacy.

Jetson Thor: For robotics and edge AI, developers can use the vLLM container on NVIDIA Jetson Thor. The Ministral-3-3B-Instruct model achieves 52 tokens per second for single concurrency, scaling up to 273 tokens per second with a concurrency of 8.

Broad Framework Support

NVIDIA has collaborated with the open-source community to ensure these models are usable everywhere.

Llama.cpp & Ollama: NVIDIA collaborated with these popular frameworks to ensure faster iteration and lower latency for local development.

SGLang: NVIDIA collaborated with SGLang to create an implementation of Mistral Large 3 that supports both disaggregation and speculative decoding.

vLLM: NVIDIA worked with vLLM to expand support for kernel integrations, including speculative decoding (EAGLE), Blackwell support, and expanded parallelism.

Production-Ready with NVIDIA NIM

To streamline enterprise adoption, the new models will be available through NVIDIA NIM microservices.

Mistral Large 3 and Ministral-14B-Instruct are currently available through the NVIDIA API catalog and preview API. Soon, enterprise developers will be able to use downloadable NVIDIA NIM microservices. This provides a containerized, production-ready solution that allows enterprises to deploy the Mistral 3 family with minimal setup on any GPU-accelerated infrastructure.

This availability ensures that the specific “10x” performance advantage of the GB200 NVL72 can be realized in production environments without complex custom engineering, democratizing access to frontier-class intelligence.

Conclusion: A New Standard for Open Intelligence

The release of the NVIDIA-accelerated Mistral 3 open model family represents a major leap for AI in the open-source community. By offering frontier-level performance under an open source license, and backing it with a robust hardware optimization stack, Mistral and NVIDIA are meeting developers where they are.

From the massive scale of the GB200 NVL72 utilizing Wide-EP and NVFP4, to the edge-friendly density of Ministral on an RTX 5090, this partnership delivers a scalable, efficient path for artificial intelligence. With upcoming optimizations such as speculative decoding with multi-token prediction (MTP) and EAGLE-3 expected to push performance even further, the Mistral 3 family is poised to become a foundational element of the next generation of AI applications.

Available to test!

If you are a developer looking to benchmark these performance gains, you can download the Mistral 3 models directly from Hugging Face or test the deployment-free hosted versions on build.nvidia.com/mistralai to evaluate the latency and throughput for your specific use case.

Check out the Models on Hugging Face. You can find details on Corporate Blog and Technical/Developer Blog.

Thanks to the NVIDIA AI team for the thought leadership and resources for this article. The NVIDIA AI team has supported this content.
The post NVIDIA and Mistral AI Bring 10x Faster Inference for the Mistral 3 Family on GB200 NVL72 GPU Systems appeared first on MarkTechPost.

How We Learn Step-Level Rewards from Preferences to Solve Sparse-Reward Reinforcement Learning Tasks

In this tutorial, we explore Online Process Reward Learning (OPRL) and demonstrate how we can learn dense, step-level reward signals from trajectory preferences to solve sparse-reward reinforcement learning tasks. We walk through each component, from the maze environment and reward-model network to preference generation, training loops, and evaluation, while observing how the agent gradually improves its behaviour through online preference-driven shaping. By running this end-to-end implementation, we gain a practical understanding of how OPRL enables better credit assignment, faster learning, and more stable policy optimization in challenging environments where the agent would otherwise struggle to discover meaningful rewards. Check out the FULL CODE NOTEBOOK.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
import matplotlib.pyplot as plt
from collections import deque
import random

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

class MazeEnv:
    def __init__(self, size=8):
        self.size = size
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        self.obstacles = set([(i, size // 2) for i in range(1, size - 2)])
        self.reset()

    def reset(self):
        self.pos = self.start
        self.steps = 0
        return self._get_state()

    def _get_state(self):
        state = np.zeros(self.size * self.size)
        state[self.pos[0] * self.size + self.pos[1]] = 1
        return state

    def step(self, action):
        moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]
        new_pos = (self.pos[0] + moves[action][0],
                   self.pos[1] + moves[action][1])
        if (0 <= new_pos[0] < self.size and
                0 <= new_pos[1] < self.size and
                new_pos not in self.obstacles):
            self.pos = new_pos
        self.steps += 1
        done = self.pos == self.goal or self.steps >= 60
        reward = 10.0 if self.pos == self.goal else 0.0
        return self._get_state(), reward, done

    def render(self):
        grid = [['.' for _ in range(self.size)] for _ in range(self.size)]
        for obs in self.obstacles:
            grid[obs[0]][obs[1]] = '█'
        grid[self.goal[0]][self.goal[1]] = 'G'
        grid[self.pos[0]][self.pos[1]] = 'A'
        return '\n'.join([''.join(row) for row in grid])

class ProcessRewardModel(nn.Module):
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Tanh()
        )

    def forward(self, states):
        return self.net(states)

    def trajectory_reward(self, states):
        return self.forward(states).sum()

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU()
        )
        self.actor = nn.Linear(hidden, action_dim)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, state):
        features = self.backbone(state)
        return self.actor(features), self.critic(features)

We set up the entire foundation of our OPRL system by importing libraries, defining the maze environment, and building the reward and policy networks. We establish how states are represented, how obstacles block movement, and how the sparse reward structure works. We also design the core neural models that will later learn process rewards and drive the policy’s decisions. Check out the FULL CODE NOTEBOOK.

class OPRLAgent:
    def __init__(self, state_dim, action_dim, lr=3e-4):
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.reward_model = ProcessRewardModel(state_dim)
        self.policy_opt = Adam(self.policy.parameters(), lr=lr)
        self.reward_opt = Adam(self.reward_model.parameters(), lr=lr)
        self.trajectories = deque(maxlen=200)
        self.preferences = deque(maxlen=500)
        self.action_dim = action_dim

    def select_action(self, state, epsilon=0.1):
        if random.random() < epsilon:
            return random.randint(0, self.action_dim - 1)
        state_t = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():
            logits, _ = self.policy(state_t)
            probs = F.softmax(logits, dim=-1)
        return torch.multinomial(probs, 1).item()

    def collect_trajectory(self, env, epsilon=0.1):
        states, actions, rewards = [], [], []
        state = env.reset()
        done = False
        while not done:
            action = self.select_action(state, epsilon)
            next_state, reward, done = env.step(action)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            state = next_state
        traj = {
            'states': torch.FloatTensor(np.array(states)),
            'actions': torch.LongTensor(actions),
            'rewards': torch.FloatTensor(rewards),
            'return': float(sum(rewards))
        }
        self.trajectories.append(traj)
        return traj

We begin constructing the OPRL agent by implementing action selection and trajectory collection. We use an ε-greedy strategy to ensure exploration and gather sequences of states, actions, and returns. As we run the agent through the maze, we store entire trajectories that will later serve as preference data for shaping the reward model. Check out the FULL CODE NOTEBOOK.

    def generate_preference(self):
        if len(self.trajectories) < 2:
            return
        t1, t2 = random.sample(list(self.trajectories), 2)
        label = 1.0 if t1['return'] > t2['return'] else 0.0
        self.preferences.append({'t1': t1, 't2': t2, 'label': label})

    def train_reward_model(self, n_updates=5):
        if len(self.preferences) < 32:
            return 0.0
        total_loss = 0.0
        for _ in range(n_updates):
            batch = random.sample(list(self.preferences), 32)
            loss = 0.0
            for item in batch:
                r1 = self.reward_model.trajectory_reward(item['t1']['states'])
                r2 = self.reward_model.trajectory_reward(item['t2']['states'])
                logit = r1 - r2
                pred_prob = torch.sigmoid(logit)
                label = item['label']
                loss += -(label * torch.log(pred_prob + 1e-8) +
                          (1 - label) * torch.log(1 - pred_prob + 1e-8))
            loss = loss / len(batch)
            self.reward_opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.reward_model.parameters(), 1.0)
            self.reward_opt.step()
            total_loss += loss.item()
        return total_loss / n_updates

We generate preference pairs from collected trajectories and train the process reward model using the Bradley–Terry formulation. We compare trajectory-level scores, compute probabilities, and update the reward model to reflect which behaviours appear better. This allows us to learn dense, differentiable, step-level rewards that guide the agent even when the environment itself is sparse. Check out the FULL CODE NOTEBOOK.

    def train_policy(self, n_updates=3, gamma=0.98):
        if len(self.trajectories) < 5:
            return 0.0
        total_loss = 0.0
        for _ in range(n_updates):
            traj = random.choice(list(self.trajectories))
            with torch.no_grad():
                process_rewards = self.reward_model(traj['states']).squeeze()
            shaped_rewards = traj['rewards'] + 0.1 * process_rewards
            returns = []
            G = 0
            for r in reversed(shaped_rewards.tolist()):
                G = r + gamma * G
                returns.insert(0, G)
            returns = torch.FloatTensor(returns)
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
            logits, values = self.policy(traj['states'])
            log_probs = F.log_softmax(logits, dim=-1)
            action_log_probs = log_probs.gather(1, traj['actions'].unsqueeze(1))
            advantages = returns - values.squeeze().detach()
            policy_loss = -(action_log_probs.squeeze() * advantages).mean()
            value_loss = F.mse_loss(values.squeeze(), returns)
            entropy = -(F.softmax(logits, dim=-1) * log_probs).sum(-1).mean()
            loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
            self.policy_opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 1.0)
            self.policy_opt.step()
            total_loss += loss.item()
        return total_loss / n_updates

def train_oprl(episodes=500, render_interval=100):
    env = MazeEnv(size=8)
    agent = OPRLAgent(state_dim=64, action_dim=4, lr=3e-4)
    returns, reward_losses, policy_losses = [], [], []
    success_rate = []
    for ep in range(episodes):
        epsilon = max(0.05, 0.5 - ep / 1000)
        traj = agent.collect_trajectory(env, epsilon)
        returns.append(traj['return'])
        if ep % 2 == 0 and ep > 10:
            agent.generate_preference()
        if ep > 20 and ep % 2 == 0:
            rew_loss = agent.train_reward_model(n_updates=3)
            reward_losses.append(rew_loss)
        if ep > 10:
            pol_loss = agent.train_policy(n_updates=2)
            policy_losses.append(pol_loss)
        success = 1 if traj['return'] > 5 else 0
        success_rate.append(success)
        if ep % render_interval == 0 and ep > 0:
            test_env = MazeEnv(size=8)
            agent.collect_trajectory(test_env, epsilon=0)
            print(test_env.render())
    return returns, reward_losses, policy_losses, success_rate

We train the policy using shaped rewards produced by the learned process reward model. We compute returns, advantages, value estimates, and entropy bonuses, enabling the agent to improve its strategy over time. We then build a full training loop in which exploration decays, preferences accumulate, and both the reward model and the policy are updated continuously. Check out the FULL CODE NOTEBOOK.

print("Training OPRL Agent on Sparse Reward Maze...\n")
returns, rew_losses, pol_losses, success = train_oprl(episodes=500, render_interval=250)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0,0].plot(returns, alpha=0.3)
axes[0,0].plot(np.convolve(returns, np.ones(20)/20, mode='valid'), linewidth=2)
axes[0,0].set_xlabel('Episode')
axes[0,0].set_ylabel('Return')
axes[0,0].set_title('Agent Performance')
axes[0,0].grid(alpha=0.3)

success_smooth = np.convolve(success, np.ones(20)/20, mode='valid')
axes[0,1].plot(success_smooth, linewidth=2, color='green')
axes[0,1].set_xlabel('Episode')
axes[0,1].set_ylabel('Success Rate')
axes[0,1].set_title('Goal Success Rate')
axes[0,1].grid(alpha=0.3)

axes[1,0].plot(rew_losses, linewidth=2, color='orange')
axes[1,0].set_xlabel('Update Step')
axes[1,0].set_ylabel('Loss')
axes[1,0].set_title('Reward Model Loss')
axes[1,0].grid(alpha=0.3)

axes[1,1].plot(pol_losses, linewidth=2, color='red')
axes[1,1].set_xlabel('Update Step')
axes[1,1].set_ylabel('Loss')
axes[1,1].set_title('Policy Loss')
axes[1,1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("OPRL Training Complete!")
print("Process rewards, preference learning, reward shaping, and online updates demonstrated.")

We visualize the learning dynamics by plotting returns, success rates, reward-model loss, and policy loss. We monitor how the agent’s performance evolves as OPRL shapes the reward landscape. By the end of the visualization, we clearly see the impact of process rewards on solving a challenging, sparse-reward maze.

In conclusion, we see how OPRL transforms sparse terminal outcomes into rich online feedback that continuously guides the agent’s behaviour. We watch the process reward model learn preferences, shape the return signal, and accelerate the policy’s ability to reach the goal. With larger mazes, varying shaping strengths, or even real human preference feedback, we appreciate how OPRL provides a flexible and powerful framework for credit assignment in complex decision-making tasks. We finish with a clear, hands-on understanding of how OPRL operates and how we can extend it to more advanced agentic RL settings.

Check out the FULL CODE NOTEBOOK and Paper.
The post How We Learn Step-Level Rewards from Preferences to Solve Sparse-Reward Environments Using Online Process Reward Learning appeared first on MarkTechPost.

Google DeepMind Researchers Introduce Evo-Memory Benchmark and ReMem F …

Large language model agents are starting to store everything they see, but can they actually improve their policies at test time from those experiences rather than just replaying context windows?

Researchers from University of Illinois Urbana Champaign and Google DeepMind propose Evo-Memory, a streaming benchmark and agent framework that targets this exact gap. Evo-Memory evaluates test-time learning with self-evolving memory, asking whether agents can accumulate and reuse strategies from continuous task streams instead of relying only on static conversational logs.

https://arxiv.org/pdf/2511.20857

Conversational Recall vs Experience Reuse

Most current agents implement conversational recall. They store dialogue history, tool traces, and retrieved documents, which are then reintegrated into the context window for future queries. This type of memory serves as a passive buffer, capable of recovering facts or recalling previous steps, but it does not actively modify the agent’s approach for related tasks.

Evo-Memory instead focuses on experience reuse. Here each interaction is treated as an experience that encodes not only inputs and outputs, but also whether a task succeeded and which strategies were effective. The benchmark checks if agents can retrieve those experiences in later tasks, apply them as reusable procedures, and refine the memory over time.

Benchmark Design and Task Streams

The research team formalizes a memory augmented agent as a tuple (F, U, R, C). The base model F generates outputs. The retrieval module R searches a memory store. The context constructor C synthesizes a working prompt from the current input and retrieved items. The update function U writes new experience entries and evolves the memory after every step.

Evo-Memory restructures conventional benchmarks into sequential task streams. Each dataset becomes an ordered sequence of tasks where early items carry strategies that are useful for later ones. The suite covers AIME 24, AIME 25, GPQA Diamond, MMLU-Pro economics, engineering, philosophy, and ToolBench for tool use, along with multi turn environments from AgentBoard including AlfWorld, BabyAI, ScienceWorld, Jericho, and PDDL planning.

Evaluation is done along four axes. Single turn tasks use exact match or answer accuracy. Embodied environments report success rate and progress rate. Step efficiency measures average steps per successful task. Sequence robustness tests whether performance is stable when task order changes.

https://arxiv.org/pdf/2511.20857

ExpRAG, a Minimal Experience Reuse Baseline

To set a lower bound, the research team defines ExpRAG. Each interaction becomes a structured experience text with the template ⟨x_i, ŷ_i, f_i⟩, where x_i is the input, ŷ_i is the model output and f_i is feedback, for example a correctness signal. At a new step t, the agent retrieves similar experiences from memory using a similarity score and concatenates them with the current input as in-context examples. Then it appends the new experience into memory.

ExpRAG does not change the agent control loop. It is still a single shot call to the backbone, but now augmented with explicitly stored prior tasks. The design is intentionally simple so that any gains on Evo-Memory can be attributed to task level experience retrieval, not to new planning or tool abstractions.
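The loop can be pictured with a short sketch. This is a simplified illustration of the ExpRAG idea, not the paper's code; llm, embed and grade stand in for a caller-supplied backbone call, embedding function and feedback signal.

import numpy as np
from dataclasses import dataclass

@dataclass
class Experience:
    x: str       # task input
    y_hat: str   # model output
    f: str       # feedback, e.g. "correct" or "incorrect"

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def exprag_step(x_t: str, memory: list, llm, embed, grade, top_k: int = 3) -> str:
    # retrieve the most similar stored experiences as in-context exemplars
    scored = sorted(memory, key=lambda e: -cosine(embed(e.x), embed(x_t)))[:top_k]
    exemplars = "\n\n".join(f"Task: {e.x}\nAnswer: {e.y_hat}\nFeedback: {e.f}" for e in scored)
    # single shot call to the backbone, augmented with prior tasks
    y_hat = llm(f"{exemplars}\n\nTask: {x_t}\nAnswer:")
    # write the new experience back into memory
    memory.append(Experience(x_t, y_hat, grade(x_t, y_hat)))
    return y_hat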

ReMem, Action Think Memory Refine

The main contribution on the agent side is ReMem, an action–think–memory refine pipeline built on top of the same backbone models. At each internal step, given the current input, memory state and past reasoning traces, the agent chooses one of three operations:

Think generates intermediate reasoning traces that decompose the task.

Act emits an environment action or final answer visible to the user.

Refine performs meta reasoning on memory by retrieving, pruning and reorganizing experience entries.

This loop induces a Markov decision process where the state includes the query, current memory and ongoing thoughts. Within a step the agent can interleave several Think and Refine operations, and the step terminates when an Act operation is issued. In contrast to standard ReAct style agents, memory is no longer a fixed buffer. It becomes an explicit object that the agent reasons about and edits during inference.
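A minimal sketch of that control cycle, assuming the backbone is wrapped in a policy callable that returns an operation name and a payload; this is an illustration of the loop structure, not the released ReMem code.

def remem_step(query, memory, policy, act_fn, max_inner=8):
    # One ReMem step: interleave Think and Refine until an Act operation is issued.
    thoughts = []
    for _ in range(max_inner):
        op, payload = policy(query, memory, thoughts)
        if op == "think":            # intermediate reasoning trace that decomposes the task
            thoughts.append(payload)
        elif op == "refine":         # meta reasoning on memory: retrieve, prune, reorganize entries
            memory = payload(memory)
        elif op == "act":            # environment action or final answer ends the step
            return act_fn(payload), memory
    return None, memory              # inner step budget exhausted without an Act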

https://arxiv.org/pdf/2511.20857

Results on Reasoning, Tools and Embodied Environments

The research team instantiates all methods on Gemini 2.5 Flash and Claude 3.7 Sonnet under a unified search-predict-evolve protocol. This isolates the effect of memory architecture, since prompting, search and feedback are held constant across baselines.

On single turn benchmarks, evolving memory methods produce consistent but moderate gains. For Gemini 2.5 Flash, ReMem reaches an average exact match of 0.65 across AIME 24, AIME 25, GPQA Diamond and MMLU Pro subsets, and 0.85 API score and 0.71 accuracy on ToolBench. ExpRAG also performs strongly, with an average of 0.60, and outperforms several more complex designs such as Agent Workflow Memory and Dynamic Cheatsheet variants.

The impact is larger in multi turn environments. On Claude 3.7 Sonnet, ReMem reaches success and progress 0.92 and 0.96 on AlfWorld, 0.73 and 0.83 on BabyAI, 0.83 and 0.95 on PDDL and 0.62 and 0.89 on ScienceWorld, giving average 0.78 success and 0.91 progress across datasets. On Gemini 2.5 Flash, ReMem achieves average 0.50 success and 0.64 progress, improving over history and ReAct style baselines in all four environments.

Step efficiency is also improved. In AlfWorld, average steps to complete a task drop from 22.6 for a history baseline to 11.5 for ReMem. Lightweight designs such as ExpRecent and ExpRAG reduce steps as well, which indicates that even simple task level experience reuse can make behaviour more efficient without architectural changes to the backbone.

A further analysis links gains to task similarity inside each dataset. Using embeddings from the retriever encoder, the research team compute average distance from tasks to their cluster center. ReMem’s margin over a history baseline correlates strongly with this similarity measure, with reported Pearson correlation about 0.72 on Gemini 2.5 Flash and 0.56 on Claude 3.7 Sonnet. Structured domains such as PDDL and AlfWorld show larger improvements than diverse sets like AIME 25 or GPQA Diamond.

Key Takeaways

Evo-Memory is a comprehensive streaming benchmark that converts standard datasets into ordered task streams, so agents can retrieve, integrate and update memory over time rather than rely on static conversational recall.

The framework formalizes memory augmented agents as a tuple (F, U, R, C) and implements more than 10 representative memory modules, including retrieval based, workflow and hierarchical memories, evaluated on 10 single turn and multi turn datasets across reasoning, question answering, tool use and embodied environments.

ExpRAG provides a minimal experience reuse baseline that stores each task interaction as a structured text record with input, model output and feedback, then retrieves similar experiences as in context exemplars for new tasks, already giving consistent improvements over pure history based baselines.

ReMem extends the standard ReAct style loop with an explicit Think, Act, Refine Memory control cycle, which lets the agent actively retrieve, prune and reorganize its memory during inference, leading to higher accuracy, higher success rate and fewer steps on both single turn reasoning and long horizon interactive environments.

Across Gemini 2.5 Flash and Claude 3.7 Sonnet backbones, self evolving memories such as ExpRAG and especially ReMem make smaller models behave like stronger agents at test time, improving exact match, success and progress metrics without any retraining of base model weights.

Editorial Notes

Evo-Memory is a useful step for evaluating self evolving memory in LLM agents. It forces models to operate on sequential task streams instead of isolated prompts. It compares more than 10 memory architectures under a single framework. Simple methods like ExpRAG already show clear gains. ReMem's action, think, refine memory loop improves exact match, success and progress without retraining base weights. Overall, this research work makes test time evolution a concrete design target for LLM agent systems.

Check out the Paper.
The post Google DeepMind Researchers Introduce Evo-Memory Benchmark and ReMem Framework for Experience Reuse in LLM Agents appeared first on MarkTechPost.

DeepSeek Researchers Introduce DeepSeek-V3.2 and DeepSeek-V3.2-Special …

How do you get GPT-5-level reasoning on real long-context, tool-using workloads without paying the quadratic attention and GPU cost that usually makes those systems impractical? DeepSeek research introduces DeepSeek-V3.2 and DeepSeek-V3.2-Speciale. They are reasoning-first models built for agents that target high quality reasoning, long context and agent workflows, with open weights and production APIs. The models combine DeepSeek Sparse Attention (DSA), a scaled GRPO reinforcement learning stack and an agent native tool protocol, and report performance comparable to GPT-5, with DeepSeek-V3.2-Speciale reaching Gemini 3.0 Pro level reasoning on public benchmarks and competitions.

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Sparse Attention with Near Linear Long Context Cost

Both DeepSeek-V3.2 and DeepSeek-V3.2-Speciale use the DeepSeek-V3 Mixture of Experts transformer with about 671B total parameters and 37B active parameters per token, inherited from V3.1 Terminus. The only structural change is DeepSeek Sparse Attention, introduced through continued pre-training.

DeepSeek Sparse Attention splits attention into two components. A lightning indexer runs a small number of low precision heads over all token pairs and produces relevance scores. A fine grained selector keeps the top-k key-value positions per query, and the main attention path runs Multi-Query Attention and Multi-Head Latent Attention on this sparse set.

This changes the dominant complexity from O(L²) to O(kL), where L is the sequence length and k is the number of selected tokens, with k much smaller than L. Based on the benchmarks, DeepSeek-V3.2 matches the dense Terminus baseline on accuracy while reducing long context inference cost by about 50 percent, with faster throughput and lower memory use on H800 class hardware and on vLLM and SGLang backends.
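The selection step can be sketched in a few lines. This simplified single-head version ignores causal masking and the MQA and MLA details, and assumes the lightning indexer scores are already available.

import torch
import torch.nn.functional as F

def sparse_attention_topk(q, k, v, indexer_scores, top_k=2048):
    # q, k, v: (L, d); indexer_scores: (L, L) cheap relevance scores from the lightning indexer.
    # Keeping only top_k key/value positions per query makes the cost scale as O(k * L).
    L, d = q.shape
    top_k = min(top_k, L)
    idx = indexer_scores.topk(top_k, dim=-1).indices             # (L, k) selected key positions per query
    k_sel, v_sel = k[idx], v[idx]                                # (L, k, d) gathered keys and values
    attn = torch.einsum("ld,lkd->lk", q, k_sel) / d ** 0.5       # scores computed on selected keys only
    w = F.softmax(attn, dim=-1)
    return torch.einsum("lk,lkd->ld", w, v_sel)                  # (L, d) sparse attention output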

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Continued Pre Training for DeepSeek Sparse Attention

DeepSeek Sparse Attention (DSA) is introduced by continued pre-training on top of DeepSeek-V3.1 Terminus. In the dense warm up stage, dense attention remains active, all backbone parameters are frozen and only the lightning indexer is trained with a Kullback Leibler loss to match the dense attention distribution on 128K context sequences. This stage uses a small number of steps and about 2B tokens, enough for the indexer to learn useful scores.

In the sparse stage, the selector keeps 2048 key-value entries per query, the backbone is unfrozen and the model continues training on about 944B tokens. Gradients for the indexer still come only from the alignment loss with dense attention on the selected positions. This schedule makes DeepSeek Sparse Attention (DSA) behave as a drop in replacement for dense attention with similar quality and lower long context cost.
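A minimal sketch of the warm up objective, assuming the dense attention probabilities from the frozen backbone are available as a target distribution:

import torch
import torch.nn.functional as F

def indexer_alignment_loss(indexer_scores: torch.Tensor, dense_attn_probs: torch.Tensor) -> torch.Tensor:
    # indexer_scores: (L, L) raw relevance scores from the lightning indexer
    # dense_attn_probs: (L, L) attention distribution from the frozen dense model
    log_q = F.log_softmax(indexer_scores, dim=-1)
    # KL(dense || indexer): only the indexer receives gradients during the warm up stage
    return F.kl_div(log_q, dense_attn_probs, reduction="batchmean")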

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

GRPO with more than 10 Percent RL Compute

On top of the sparse architecture, DeepSeek-V3.2 uses Group Relative Policy Optimization (GRPO) as the main reinforcement learning method. The research team states that post-training reinforcement learning (RL) compute exceeds 10 percent of pre-training compute.

RL is organized around specialist domains. The research team trains dedicated runs for mathematics, competitive programming, general logical reasoning, browsing and agent tasks and safety, then distills these specialists into the shared 685B parameter base for DeepSeek-V3.2 and DeepSeek-V3.2-Speciale. GRPO is implemented with an unbiased KL estimator, off policy sequence masking and mechanisms that keep Mixture of Experts (MoE) routing and sampling masks consistent between training and sampling.

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Agent Data, Thinking Mode and Tool Protocol

DeepSeek research team builds a large synthetic agent dataset by generating more than 1,800 environments and more than 85,000 tasks across code agents, search agents, general tools and code interpreter setups. Tasks are constructed to be hard to solve and easy to verify, and are used as RL targets together with real coding and search traces.

At inference time, DeepSeek-V3.2 introduces explicit thinking and non thinking modes. The deepseek-reasoner endpoint exposes thinking mode by default, where the model produces an internal chain of thought before the final answer. The thinking with tools guide describes how reasoning content is kept across tool calls and cleared when a new user message arrives, and how tool calls and tool results stay in the context even when reasoning text is trimmed for budget.

The chat template is updated around this behavior. The DeepSeek-V3.2 Speciale repository ships Python encoder and decoder helpers instead of a Jinja template. Messages can carry a reasoning_content field alongside content, controlled by a thinking parameter. A developer role is reserved for search agents and is not accepted in general chat flows by the official API, which protects this channel from accidental misuse.
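A hypothetical request payload that illustrates the fields described above; the message contents, tool name and exact endpoint arguments are placeholders, so check the official API reference before relying on them.

messages = [
    {"role": "user", "content": "Summarize this log file and flag anomalies."},
    {
        "role": "assistant",
        "reasoning_content": "The log shows repeated restarts, inspect timestamps first...",  # internal reasoning
        "content": None,
        "tool_calls": [{"name": "read_file", "arguments": {"path": "app.log"}}],              # hypothetical tool
    },
    {"role": "tool", "content": "2025-01-01 03:12 service restarted ..."},  # tool results stay in context
]
request = {"model": "deepseek-reasoner", "messages": messages, "thinking": True}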

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Benchmarks, Competitions And Open Artifacts

On standard reasoning and coding benchmarks, DeepSeek-V3.2 and especially DeepSeek-V3.2 Speciale are reported as comparable to GPT-5 and close to Gemini-3.0 Pro on suites such as AIME 2025, HMMT 2025, GPQA and LiveCodeBench, with improved cost efficiency on long context workloads.

For formal competitions, DeepSeek research team states that DeepSeek-V3.2 Speciale achieves gold medal level performance on the International Mathematical Olympiad 2025, the Chinese Mathematical Olympiad 2025 and the International Olympiad in Informatics 2025, and competitive gold medal level performance at the ICPC World Finals 2025.

Key Takeaways

DeepSeek-V3.2 adds DeepSeek Sparse Attention, which brings near linear O(kL) attention cost and delivers around 50% lower long context API cost compared to previous dense DeepSeek models, while keeping quality similar to DeepSeek-V3.1 Terminus.

The model family keeps the 671B parameter MoE backbone with 37B active parameters per token and exposes a full 128K context window in production APIs, which makes long documents, multi step chains and large tool traces practical rather than a lab only feature.

Post training uses Group Relative Policy Optimization (GRPO) with a compute budget that is more than 10 percent of pre-training, focused on math, code, general reasoning, browsing or agent workloads and safety, along with contest style specialists whose cases are released for external verification.

DeepSeek-V3.2 is the first model in the DeepSeek family to integrate thinking directly into tool use, supporting both thinking and non thinking tool modes and a protocol where internal reasoning persists across tool calls and is reset only on new user messages.

Check out the Paper and Model weights.
The post DeepSeek Researchers Introduce DeepSeek-V3.2 and DeepSeek-V3.2-Speciale for Long Context Reasoning and Agentic Workloads appeared first on MarkTechPost.

MiniMax-M2: Technical Deep Dive into Interleaved Thinking for Agentic …

The AI coding landscape just got a massive shake-up. If you’ve been relying on Claude 3.5 Sonnet or GPT-4o for your dev workflows, you know the pain: great performance often comes with a bill that makes your wallet weep, or latency that breaks your flow. This article provides a technical overview of MiniMax-M2, focusing on its core design choices and capabilities, and how it changes the price to performance baseline for agentic coding workflows.

Branded as ‘Mini Price, Max Performance,’ MiniMax-M2 targets agentic coding workloads with around 2x the speed of leading competitors at roughly 8% of their price. The key change is not only cost efficiency, but a different computational and reasoning pattern in how the model structures and executes its “thinking” during complex tool and code workflows.

The Secret Sauce: Interleaved Thinking

The standout feature of MiniMax-M2 is its native mastery of Interleaved Thinking. 

But what does that actually mean?

Most LLMs operate in a linear “Chain of Thought” (CoT) where they do all their planning upfront and then fire off a series of tool calls (like running code or searching the web). The problem? If the first tool call returns unexpected data, the initial plan becomes stale, leading to “state drift” where the model keeps hallucinating a path that no longer exists.

Interleaved Thinking changes the game by creating a dynamic Plan -> Act -> Reflect loop.

Instead of front-loading all the logic, MiniMax-M2 alternates between explicit reasoning and tool use. It reasons, executes a tool, reads the output, and then reasons again based on that fresh evidence. This allows the model to:

Self-Correct: If a shell command fails, it reads the error and adjusts its next move immediately.

Preserve State: It carries forward hypotheses and constraints between steps, preventing the “memory loss” common in long coding tasks.

Handle Long Horizons: This approach is critical for complex agentic workflows (like building an entire app feature) where the path isn’t clear from step one.

Benchmarks show the impact is real: enabling Interleaved Thinking boosted MiniMax-M2’s score on SWE-Bench Verified by over 3% and on BrowseComp by a massive 40%.
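The pattern itself is easy to sketch. The loop below is a generic illustration of interleaved reasoning and tool use, with llm and tools as caller-supplied stand-ins rather than the MiniMax-M2 API.

def plan_act_reflect(task: str, llm, tools: dict, max_steps: int = 20):
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = llm(context)  # assumed to return a dict with "thought" plus either "final_answer" or a tool call
        context.append({"role": "assistant", "content": step["thought"]})
        if step.get("final_answer") is not None:
            return step["final_answer"]
        result = tools[step["tool"]](**step.get("args", {}))       # execute one tool, e.g. shell or search
        context.append({"role": "tool", "content": str(result)})   # reflect on the real output next turn
    return None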

Powered by Mixture of Experts MoE: Speed Meets Smarts

How does MiniMax-M2 achieve low latency while being smart enough to replace a senior dev? The answer lies in its Mixture of Experts (MoE) architecture.

MiniMax-M2 is a massive model with 230 billion total parameters, but it utilizes a “sparse” activation technique. For any given token generation, it only activates 10 billion parameters.

This design delivers the best of both worlds:

Huge Knowledge Base: You get the deep world knowledge and reasoning capacity of a 200B+ model.

Blazing Speed: Inference runs with the lightness of a 10B model, enabling high throughput and low latency.

For interactive agents like Claude Code, Cursor, or Cline, this speed is non-negotiable. You need the model to think, code, and debug in real-time without the “thinking…” spinner of death.

Agent & Code Native

MiniMax-M2 wasn’t just trained on text; it was developed for end-to-end developer workflows. It excels at handling robust toolchains including MCP (Model Context Protocol), shell execution, browser retrieval, and complex codebases.

It is already being integrated into the heavy hitters of the AI coding world:

Claude Code

Cursor

Cline

Kilo Code

Droid

The Economics: 90% Cheaper than the Competition

The pricing structure is perhaps the most aggressive we’ve seen for a model of this caliber. MiniMax is practically giving away “intelligence” compared to the current market leaders.

API Pricing (vs Claude 3.5 Sonnet):

Input Tokens: $0.3 / Million (10% of Sonnet’s cost)

Cache Hits: $0.03 / Million (10% of Sonnet’s cost)

Output Tokens: $1.2 / Million (8% of Sonnet’s cost)

For individual developers, they offer tiered Coding Plans that undercut the market significantly:

Starter: $10/month (Includes a $2 first-month promo).

Pro: $20/month.

Max: $50/month (Up to 5x the usage limit of Claude Code Max).

As if that were not enough, MiniMax recently launched a Global Developer Ambassador Program, a global initiative designed to empower independent ML and LLM developers. The program invites builders to collaborate directly with the MiniMax R&D team to shape the future.

The company is seeking developers with proven open-source experience who are already familiar with MiniMax models and active on platforms like GitHub and Hugging Face.

Key Program Highlights:

The Incentives: Ambassadors receive complimentary access to the MiniMax-M2 Max Coding Plan, early access to unreleased video and audio models, direct feedback channels with product leads, and potential full-time career opportunities.

The Role: Participants are expected to build public demos, create open-source tools, and provide critical feedback on APIs before public launches.

You can sign up here.

Editorial Notes

MiniMax-M2 challenges the idea that “smarter” must mean “slower” or “more expensive.” By leveraging MOE efficiency and Interleaved Thinking, it offers a compelling alternative for developers who want to run autonomous agents without bankrupting their API budget.

As we move toward a world where AI agents don’t just write code but architect entire systems, the ability to “think, act, and reflect” continuously, at a price that allows for thousands of iterations, might just make M2 the new standard for AI engineering.

Thanks to the MiniMax AI team for the thought leadership and resources for this article. The MiniMax AI team has supported this content.
The post MiniMax-M2: Technical Deep Dive into Interleaved Thinking for Agentic Coding Workflows appeared first on MarkTechPost.

How to Design an Advanced Multi-Page Interactive Analytics Dashboard w …

In this tutorial, we build an advanced multi-page interactive dashboard using Panel. Through each component of implementation, we explore how to generate synthetic data, apply rich filters, visualize dynamic time-series trends, compare segments and regions, and even simulate live KPI updates. We design the system step by step so we can truly understand how each widget, callback, and plotting function comes together to create a smooth, reactive analytics experience. Check out the Full Codes here.

import sys, subprocess

def install_deps():
    pkgs = ["panel", "hvplot", "pandas", "numpy", "bokeh"]
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + pkgs)

try:
    import panel as pn
    import hvplot.pandas
    import pandas as pd
    import numpy as np
except ImportError:
    install_deps()
    import panel as pn
    import hvplot.pandas
    import pandas as pd
    import numpy as np

pn.extension()

rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", periods=365, freq="D")
segments = ["A", "B", "C"]
regions = ["North", "South", "East", "West"]

base = pd.DataFrame(
    {
        "date": np.tile(dates, len(segments) * len(regions)),
        "segment": np.repeat(segments, len(dates) * len(regions)),
        "region": np.repeat(np.tile(regions, len(segments)), len(dates)),
    }
)
base["traffic"] = (
    100
    + 40 * np.sin(2 * np.pi * base["date"].dt.dayofyear / 365)
    + rng.normal(0, 15, len(base))
)
trend = {"A": 1.0, "B": 1.5, "C": 2.0}
base["traffic"] *= base["segment"].map(trend)
base["conversions"] = (base["traffic"] * rng.uniform(0.01, 0.05, len(base))).astype(int)
base["revenue"] = base["conversions"] * rng.uniform(20, 60, len(base))
df = base.reset_index(drop=True)

We install all required dependencies and load Panel, hvPlot, Pandas, and NumPy so the dashboard runs smoothly in Colab. We generate a full year of synthetic time-series data across segments and regions, providing a rich dataset for exploration. By the end of this block, we will have a clean, ready-to-use dataframe for all upcoming visualizations. Check out the Full Codes here.

segment_sel = pn.widgets.CheckBoxGroup(name="Segment", value=segments[:2], options=segments, inline=True)
region_sel = pn.widgets.MultiChoice(name="Region", value=["North"], options=regions)
metric_sel = pn.widgets.Select(name="Metric", value="traffic", options=["traffic", "conversions", "revenue"])
date_range = pn.widgets.DateRangeSlider(
    name="Date Range",
    start=df["date"].min(),
    end=df["date"].max(),
    value=(df["date"].min(), df["date"].max()),
)
smooth_slider = pn.widgets.IntSlider(name="Rolling Window (days)", start=1, end=30, value=7)

def filtered_df(segment, region, drange):
    d1, d2 = drange
    mask = (
        df["segment"].isin(segment)
        & df["region"].isin(region or regions)
        & (df["date"] >= d1)
        & (df["date"] <= d2)
    )
    sub = df[mask].copy()
    if sub.empty:
        return df.iloc[:0]
    return sub

@pn.depends(segment_sel, region_sel, metric_sel, smooth_slider, date_range)
def timeseries_plot(segment, region, metric, window, drange):
    data = filtered_df(segment, region, drange)
    if data.empty:
        return pn.pane.Markdown("### No data for current filters")
    grouped = data.sort_values("date").groupby("date")[metric].sum()
    line = grouped.hvplot.line(title=f"{metric.title()} over time", ylabel=metric.title())
    if window > 1:
        smooth = grouped.rolling(window).mean().hvplot.line(line_width=3, alpha=0.6)
        return (line * smooth).opts(legend_position="top_left")
    return line

We build the interactive widgets and the filtering logic that controls the entire dashboard. We wire the time-series plot to the widgets using reactive @pn.depends, letting us change segments, regions, metrics, date ranges, and smoothing windows instantly. With this setup, we can switch perspectives fluidly and see the effects in real time. Check out the Full Codes here.

@pn.depends(segment_sel, region_sel, metric_sel, date_range)
def segment_bar(segment, region, metric, drange):
    data = filtered_df(segment, region, drange)
    if data.empty:
        return pn.pane.Markdown("### No data to aggregate")
    agg = data.groupby("segment")[metric].sum().sort_values(ascending=False)
    return agg.hvplot.bar(title=f"{metric.title()} by Segment", yaxis=None)

@pn.depends(segment_sel, region_sel, metric_sel, date_range)
def region_heatmap(segment, region, metric, drange):
    data = filtered_df(segment, region, drange)
    if data.empty:
        return pn.pane.Markdown("### No data to aggregate")
    pivot = data.pivot_table(index="segment", columns="region", values=metric, aggfunc="sum")
    return pivot.hvplot.heatmap(title=f"{metric.title()} Heatmap", clabel=metric.title())

We construct additional visual layers: a segment-level bar chart and a region-segment heatmap. We let these charts react to the same global filters, so they update automatically whenever we make a selection. This gives us a deeper breakdown of patterns across categories without writing redundant code. Check out the Full Codes here.

kpi_source = df.copy()
kpi_idx = [0]

def compute_kpi(slice_df):
    if slice_df.empty:
        return 0, 0, 0
    total_rev = slice_df["revenue"].sum()
    avg_conv = slice_df["conversions"].mean()
    cr = (slice_df["conversions"].sum() / slice_df["traffic"].sum()) * 100
    return total_rev, avg_conv, cr

kpi_value = pn.indicators.Number(name="Total Revenue (window)", value=0, format="$0,0")
conv_value = pn.indicators.Number(name="Avg Conversions", value=0, format="0.0")
cr_value = pn.indicators.Number(name="Conversion Rate", value=0, format="0.00%")

def update_kpis():
    step = 200
    start = kpi_idx[0]
    end = start + step
    if start >= len(kpi_source):
        kpi_idx[0] = 0
        start, end = 0, step
    window_df = kpi_source.iloc[start:end]
    kpi_idx[0] = end
    total_rev, avg_conv, cr = compute_kpi(window_df)
    kpi_value.value = total_rev
    conv_value.value = avg_conv
    cr_value.value = cr / 100

pn.state.add_periodic_callback(update_kpis, period=1000, start=True)

We simulate a rolling stream of KPIs that update every second, creating a live-dashboard experience. We compute total revenue, average conversions, and conversion rate inside a sliding window and push the values to Panel’s numeric indicators. This lets us observe how metrics evolve continuously, just like a real monitoring system. Check out the Full Codes here.

controls = pn.WidgetBox(
    "### Global Controls",
    segment_sel,
    region_sel,
    metric_sel,
    date_range,
    smooth_slider,
    sizing_mode="stretch_width",
)

page_overview = pn.Column(
    pn.pane.Markdown("## Overview: Filtered Time Series"),
    controls,
    timeseries_plot,
)

page_insights = pn.Column(
    pn.pane.Markdown("## Segment & Region Insights"),
    pn.Row(segment_bar, region_heatmap),
)

page_live = pn.Column(
    pn.pane.Markdown("## Live KPI Window (simulated streaming)"),
    pn.Row(kpi_value, conv_value, cr_value),
)

dashboard = pn.Tabs(
    ("Overview", page_overview),
    ("Insights", page_insights),
    ("Live KPIs", page_live),
)

dashboard

We assemble all components into a clean multi-page layout using Tabs. We organize the dashboard into an overview page, an insights page, and a live-KPI page, making navigation simple and intuitive. With this structure, we get a polished, interactive analytics application ready to run directly in Google Colab.

In conclusion, we see how seamlessly we can combine Panel widgets, hvPlot visualizations, and periodic callbacks to build a powerful analytics dashboard. We appreciate how every module, from filtering logic to bar charts to the live KPI stream, fits together to produce a cohesive multi-page interface that runs effortlessly. We finish with a complete, interactive system that we can extend into real-world reporting, experimentation, or production-grade dashboards.

Check out the Full Codes here.
The post How to Design an Advanced Multi-Page Interactive Analytics Dashboard with Dynamic Filtering, Live KPIs, and Rich Visual Exploration Using Panel appeared first on MarkTechPost.

Meta AI Researchers Introduce Matrix: A Ray Native a Decentralized Fra …

How do you keep synthetic data fresh and diverse for modern AI models without turning a single orchestration pipeline into the bottleneck? Meta AI researchers introduce Matrix, a decentralized framework where both control and data flow are serialized into messages that move through distributed queues. As LLM training increasingly relies on synthetic conversations, tool traces and reasoning chains, most existing systems still depend on a central controller or domain specific setups, which wastes GPU capacity, adds coordination overhead and limits data diversity. Matrix instead uses peer to peer agent scheduling on a Ray cluster and delivers 2 to 15 times higher token throughput on real workloads while maintaining comparable quality.

https://arxiv.org/pdf/2511.21686

From Centralized Controllers to Peer to Peer Agents

Traditional agent frameworks keep workflow state and control logic inside a central orchestrator. Every agent call, tool call and retry goes through that controller. This model is easy to reason about, but it does not scale well when you need tens of thousands of concurrent synthetic dialogues or tool trajectories.

Matrix takes a different approach. It serializes both control flow and data flow into a message object called an orchestrator. The orchestrator holds the task state, including conversation history, intermediate results and routing logic. Stateless agents, implemented as Ray actors, pull an orchestrator from a distributed queue, apply their role specific logic, update the state and then send it directly to the next agent selected by the orchestrator. There is no central scheduler in the inner loop. Each task advances independently at row level, rather than waiting for batch level barriers as in Spark or Ray Data.

This design reduces idle time when different trajectories have very different lengths. It also makes fault handling local to a task. If one orchestrator fails it does not stall a batch.
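The pattern is easy to mock up in a single process. The sketch below illustrates the orchestrator-as-message idea with plain Python queues and role functions standing in for Matrix's distributed Ray queues and stateless Ray actors; it is a conceptual illustration, not the Matrix implementation.

import queue
from dataclasses import dataclass, field

ROLES = ["draft_agent", "review_agent", "sink"]
inboxes = {name: queue.Queue() for name in ROLES}   # stand-ins for distributed per-role queues

@dataclass
class Orchestrator:
    # Serialized task state: control flow and data flow travel together as one message.
    task: str
    history: list = field(default_factory=list)
    step: int = 0

def advance(orch, handlers):
    # Each task advances row by row: pull, apply role logic, hand off to the next agent.
    while orch.step < len(ROLES):
        role = ROLES[orch.step]
        inboxes[role].put(orch)       # enqueue for that role
        orch = inboxes[role].get()    # a stateless worker for this role would pull here
        handlers[role](orch)          # update the serialized state
        orch.step += 1                # the orchestrator, not a central scheduler, picks the next hop
    return orch

handlers = {
    "draft_agent": lambda o: o.history.append(f"draft for: {o.task}"),
    "review_agent": lambda o: o.history.append("review: approved"),
    "sink": lambda o: o.history.append("metrics collected"),
}
done = advance(Orchestrator(task="write a dialogue"), handlers)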

https://arxiv.org/pdf/2511.21686

System Stack and Services

Matrix runs on a Ray cluster that is usually launched on SLURM. Ray provides distributed actors and queues. Ray Serve exposes LLM endpoints behind vLLM and SGLang, and can also route to external APIs such as Azure OpenAI or Gemini through proxy servers.

Tool calls and other complex services run inside Apptainer containers. This isolates the agent runtime from code execution sandboxes, HTTP tools or custom evaluators. Hydra manages configuration for agent roles, orchestrator types, resource allocations and I/O schemas. Grafana integrates with Ray metrics to track queue length, pending tasks, token throughput and GPU utilization in real time.

Matrix also introduces message offloading. When conversation history grows beyond a size threshold, large payloads are stored in Ray’s object store and only object identifiers are kept in the orchestrator. This reduces cluster bandwidth while still allowing agents to reconstruct prompts when needed.
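A minimal sketch of the offloading idea using Ray's object store; the size threshold and field names are assumptions for illustration only.

import ray

ray.init(ignore_reinit_error=True)
SIZE_THRESHOLD = 32_000  # bytes, assumed cutoff for illustration

def maybe_offload(message: dict) -> dict:
    # Replace a large conversation history with an object store reference.
    history = message.get("history") or ""
    if len(history.encode("utf-8")) > SIZE_THRESHOLD:
        message["history_ref"] = ray.put(history)   # payload lives in Ray's object store
        message["history"] = None                   # only a lightweight ObjectRef travels in the message
    return message

def reconstruct_history(message: dict) -> str:
    # An agent dereferences the stored payload only when it needs to rebuild the prompt.
    if message.get("history_ref") is not None:
        return ray.get(message["history_ref"])
    return message.get("history") or ""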

Case Study 1: Collaborative Reasoner

Collaborative Reasoner, also known as Coral, evaluates multi agent dialogue where two LLM agents discuss a question, disagree when needed and reach a final answer. In the original implementation a central controller manages thousands of self collaboration trajectories. Matrix reimplements the same protocol using peer to peer orchestrators and stateless agents.

On 31 A100 nodes, using LLaMA 3.1 8B Instruct, Matrix configures concurrency as 248 GPUs with 50 queries per GPU, so 12,400 concurrent conversations. The Coral baseline runs at its optimal concurrency of 5,000. Under identical hardware, Matrix generates about 2 billion tokens in roughly 4 hours, while Coral produces about 0.62 billion tokens in about 9 hours. That is a 6.8 times increase in token throughput with almost identical agreement correctness around 0.47.

https://arxiv.org/pdf/2511.21686

Case Study 2: NaturalReasoning Web Data Curation

NaturalReasoning constructs a reasoning dataset from large web corpora. Matrix models the pipeline with three agents. A Filter agent uses a smaller classifier model to select English passages that likely contain reasoning. A Score agent uses a larger instruction tuned model to assign quality scores. A Question agent extracts questions, answers and reasoning chains.

On 25 million DCLM web documents, only about 5.45 percent survive all filters, yielding around 1.19 million question answer pairs with associated reasoning steps. Matrix then compares different parallelism strategies on a 500 thousand document subset. The best configuration combines data parallelism and task parallelism, with 20 data partitions and 700 concurrent tasks per partition. This achieves about 1.61 times higher throughput than a setting that only scales task concurrency.

Over the full 25 million document run, Matrix reaches 5,853 tokens per second, compared to 2,778 tokens per second for a Ray Data batch baseline with 14,000 concurrent tasks. That corresponds to a 2.1 times throughput gain that comes purely from peer to peer row level scheduling, not from different models.

https://arxiv.org/pdf/2511.21686

Case Study 3: Tau2-Bench Tool Use Trajectories

Tau2-Bench evaluates conversational agents that must use tools and a database in a customer support setting. Matrix represents this environment with four agents, a user simulator, an assistant, a tool executor and a reward calculator, plus a sink that collects metrics. Tool APIs and reward logic are reused from the Tau2 reference implementation and are wrapped in containers.

On a cluster with 13 H100 nodes and dozens of LLM replicas, Matrix generates 22,800 trajectories in about 1.25 hours. That corresponds to roughly 41,000 tokens per second. The baseline Tau2-agent implementation on a single node, configured with 500 concurrent threads, reaches about 2,654 tokens per second and 1,519 trajectories. Average reward stays almost unchanged across both systems, which confirms that the speedup does not come from cutting corners in the environment. Overall, Matrix delivers about 15.4 times higher token throughput on this benchmark.

https://arxiv.org/pdf/2511.21686

Key Takeaways

Matrix replaces centralized orchestrators with a peer to peer, message driven agent architecture that treats each task as an independent state machine moving through stateless agents.

The framework is built entirely on an open source stack, SLURM, Ray, vLLM, SGLang and Apptainer, and scales to tens of thousands of concurrent multi agent workflows for synthetic data generation, benchmarking and data processing.

Across three case studies, Collaborative Reasoner, NaturalReasoning and Tau2-Bench, Matrix delivers about 2 to 15.4 times higher token throughput than specialized baselines under identical hardware, while maintaining comparable output quality and rewards.

Matrix offloads large conversation histories to Ray’s object store and keeps only lightweight references in messages, which reduces peak network bandwidth and supports high throughput LLM serving with gRPC based model backends.

Editorial Notes

Matrix is a pragmatic systems contribution that takes multi agent synthetic data generation from bespoke scripts to an operational runtime. By encoding control flow and data flow into orchestrators, then pushing execution into stateless P2P agents on Ray, it cleanly separates scheduling, LLM inference and tools. The case studies on Collaborative Reasoner, NaturalReasoning and Tau2-Bench show that careful systems design, not new model architectures, is now the main lever for scaling synthetic data pipelines.

Check out the Paper and Repo.
The post Meta AI Researchers Introduce Matrix: A Ray Native Decentralized Framework for Multi Agent Synthetic Data Generation appeared first on MarkTechPost.

StepFun AI Releases Step-Audio-R1: A New Audio LLM that Finally Benefi …

Why do current audio AI models often perform worse when they generate longer reasoning instead of grounding their decisions in the actual sound? The StepFun research team releases Step-Audio-R1, a new audio LLM designed for test time compute scaling, which addresses this failure mode by showing that the accuracy drop with chain of thought is not an audio limitation but a training and modality grounding problem.

https://arxiv.org/pdf/2511.15848

The Core Problem, Audio Models Reason over Text Surrogates

Most current audio models inherit their reasoning behavior from text training. They learn to reason as if they read transcripts, not as if they listen. The StepFun team calls this Textual Surrogate Reasoning. The model uses imagined words and descriptions instead of acoustic cues such as pitch contour, rhythm, timbre or background noise patterns.

This mismatch explains why longer chain of thought often hurts performance in audio. The model spends more tokens elaborating wrong or modality irrelevant assumptions. Step-Audio-R1 attacks this by forcing the model to justify answers using acoustic evidence. The training pipeline is organized around Modality Grounded Reasoning Distillation, MGRD, which selects and distills reasoning traces that explicitly reference audio features.

Architecture

The architecture stays close to the previous Step Audio systems:

A Qwen2 based audio encoder processes raw waveforms at 25 Hz.

An audio adaptor downsamples the encoder output by a factor of 2, to 12.5 Hz, and aligns frames to the language token stream.

A Qwen2.5 32B decoder consumes the audio features and generates text.

The decoder always produces an explicit reasoning block inside <think> and </think> tags, followed by the final answer. This separation lets training objectives shape the structure and content of reasoning without losing focus on task accuracy. The model is released as a 33B parameter audio text to text model on Hugging Face under Apache 2.0.
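Because the reasoning block is always wrapped in <think> tags, downstream code can separate deliberation from the final answer with a simple parser; the helper below is illustrative and not part of the official release.

import re

def split_reasoning(output: str):
    # Separate the explicit reasoning block from the final answer text.
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = output[match.end():].strip() if match else output.strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>The rising pitch contour and fast tempo suggest excitement.</think> The speaker sounds excited."
)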

https://arxiv.org/pdf/2511.15848

Training Pipeline, from Cold Start to Audio Grounded RL

The pipeline has a supervised cold start stage and a reinforcement learning stage that both mix text and audio tasks.

Cold start uses about 5 million examples, covering 1 billion tokens of text only data and 4 billion tokens from audio paired data. Audio tasks include automatic speech recognition, paralinguistic understanding and audio question text answer style dialogs. A fraction of the audio data carries audio chain of thought traces generated by an earlier model. Text data covers multi turn dialog, knowledge question answering, math and code reasoning. All samples share a format where reasoning is wrapped in <think> tags, even when the reasoning block is initially empty.

Supervised learning trains Step-Audio-R1 to follow this format and to generate useful reasoning for both audio and text. This gives a baseline chain of thought behavior, but it is still biased toward text based reasoning.

Modality Grounded Reasoning Distillation MGRD

MGRD is applied in several iterations. For each round, the research team samples audio questions where the label depends on real acoustic properties. For example, questions about speaker emotion, background events in sound scenes or musical structure. The current model produces multiple reasoning and answer candidates per question. A filter keeps only chains that meet three constraints:

They reference acoustic cues, not just textual descriptions or imagined transcripts.

They are logically coherent as short step by step explanations.

Their final answers are correct according to labels or programmatic checks.

These accepted traces form a distilled audio chain of thought dataset. The model is fine tuned on this dataset together with the original text reasoning data. This is followed by Reinforcement Learning with Verified Rewards (RLVR). For text questions, rewards are based on answer correctness. For audio questions, the reward mixes answer correctness and reasoning format, with a typical weighting of 0.8 for accuracy and 0.2 for reasoning. Training uses PPO with about 16 responses sampled per prompt and supports sequences up to around 10,240 tokens to allow long deliberation.
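The reward mixing for RLVR can be written down directly from the description above; this is a sketch, and the correctness and format checks are placeholders for the verifiers used in training.

def rlvr_reward(is_audio_task: bool, answer_correct: bool, reasoning_well_formed: bool) -> float:
    # Text questions are rewarded on correctness only; audio questions mix
    # answer accuracy (weight 0.8) with reasoning format (weight 0.2).
    accuracy = 1.0 if answer_correct else 0.0
    if not is_audio_task:
        return accuracy
    format_score = 1.0 if reasoning_well_formed else 0.0
    return 0.8 * accuracy + 0.2 * format_score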

https://arxiv.org/pdf/2511.15848

Benchmarks, closing the gap to Gemini 3 Pro

On a combined speech to text benchmark suite that includes Big Bench Audio, Spoken MQA, MMSU, MMAU and Wild Speech, Step-Audio-R1 reaches an average score of about 83.6 percent. Gemini 2.5 Pro reports about 81.5 percent and Gemini 3 Pro reaches about 85.1 percent. On Big Bench Audio alone, Step-Audio-R1 reaches about 98.7 percent, which is higher than both Gemini versions.

For speech to speech reasoning, the Step-Audio-R1 Realtime variant adopts listen while thinking and think while speaking style streaming. On Big Bench Audio speech to speech, it reaches about 96.1 percent reasoning accuracy with first packet latency around 0.92 seconds. This score surpasses GPT based realtime baselines and Gemini 2.5 Flash style native audio dialogs while keeping sub second interaction.

https://arxiv.org/pdf/2511.15848

Ablations, what matters for audio reasoning

The ablation section provides several design signals for engineers:

A reasoning format reward is necessary. Without it, reinforcement learning tends to shorten or remove chain of thought, which lowers audio benchmark scores.

RL data should target medium difficulty problems. Selecting questions where pass at 8 lies in a middle band gives more stable rewards and maintains long reasoning.

Scaling RL audio data without such selection does not help. Quality of prompts and labels matters more than raw size.

The researchers also describe a self cognition correction pipeline that reduces the frequency of answers such as ‘I can only read text and cannot hear audio’ in a model that is trained to process sound. This uses Direct Preference Optimization on curated preference pairs where correct behavior is to acknowledge and use audio input.

Key Takeaways

Step-Audio-R1 is one of the first audio language models that turn longer chain of thought into a consistent accuracy gain for audio tasks, solving the inverted scaling failure seen in previous audio LLMs.

The model explicitly targets Textual Surrogate Reasoning by using Modality Grounded Reasoning Distillation, which filters and distills only those reasoning traces that rely on acoustic cues such as pitch, timbre and rhythm instead of imagined transcripts.

Architecturally, Step-Audio-R1 combines a Qwen2 based audio encoder with an adaptor and a Qwen2.5 32B decoder that always generates <think> reasoning segments before answers, and is released as a 33B audio text to text model under Apache 2.0.

Across comprehensive audio understanding and reasoning benchmarks covering speech, environmental sounds and music, Step-Audio-R1 surpasses Gemini 2.5 Pro and reaches performance comparable to Gemini 3 Pro, while also supporting a realtime variant for low latency speech to speech interaction.

The training recipe combines large scale supervised chain of thought, modality grounded distillation and Reinforcement Learning with Verified Rewards, providing a concrete and reproducible blueprint for building future audio reasoning models that actually benefit from test time compute scaling.

Editorial Notes

Step-Audio-R1 is an important release because it converts chain of thought from a liability into a useful tool for audio reasoning by directly addressing Textual Surrogate Reasoning with Modality Grounded Reasoning Distillation and Reinforcement Learning with Verified Rewards. It shows that test time compute scaling can benefit audio models when reasoning is anchored in acoustic features and delivers benchmark results comparable to Gemini 3 Pro while remaining open and practically usable for engineers. Overall this research work turns extended deliberation in audio LLMs from a consistent failure mode into a controllable and reproducible design pattern.

Check out the Paper, Repo, Project Page and Model Weights.
The post StepFun AI Releases Step-Audio-R1: A New Audio LLM that Finally Benefits from Test Time Compute Scaling appeared first on MarkTechPost.

NVIDIA AI Releases Orchestrator-8B: A Reinforcement Learning Trained C …

How can an AI system learn to pick the right model or tool for each step of a task instead of always relying on one large model for everything? NVIDIA researchers release ToolOrchestra, a novel method for training a small language model to act as the orchestrator, the 'brain' of a heterogeneous tool-use agent.

https://arxiv.org/pdf/2511.21689

From Single Model Agents to an Orchestration Policy

Most current agents follow a simple pattern. A single large model such as GPT-5 receives a prompt that describes available tools, then decides when to call web search or a code interpreter. All high level reasoning still stays inside the same model. ToolOrchestra changes this setup. It trains a dedicated controller model, called Orchestrator-8B, that treats both classic tools and other LLMs as callable components.

A pilot study in the same research shows why naive prompting is not enough. When Qwen3-8B is prompted to route between GPT-5, GPT-5 mini, Qwen3-32B and Qwen2.5-Coder-32B, it delegates 73 percent of cases to GPT-5. When GPT-5 acts as its own orchestrator, it calls GPT-5 or GPT-5 mini in 98 percent of cases. The research team calls these self enhancement and other enhancement biases: the routing policy overuses strong models and ignores cost instructions.

ToolOrchestra instead trains a small orchestrator explicitly for this routing problem, using reinforcement learning over full multi turn trajectories.

What is Orchestrator-8B?

Orchestrator-8B is an 8B parameter decoder only Transformer. It is built by fine tuning Qwen3-8B as an orchestration model and released on Hugging Face.

At inference time, the system runs a multi turn loop that alternates reasoning and tool calls. The rollout has three main steps. First, Orchestrator-8B reads the user instruction and an optional natural language preference description, for example a request to prioritize low latency or to avoid web search. Second, it generates internal chain of thought style reasoning and plans an action. Third, it chooses a tool from the available set and emits a structured tool call in a unified JSON format. The environment executes that call, appends the result as an observation and feeds it back into the next step. The process stops when a termination signal is produced or a maximum of 50 turns is reached.
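
A simplified sketch of this rollout loop is shown below; the `orchestrator.generate` call, the tool registry and the JSON field names are assumptions for illustration, not the released API:

```python
import json

MAX_TURNS = 50  # the rollout stops on a termination signal or after 50 turns

def run_episode(orchestrator, tools, instruction, preference=None):
    """Alternate reasoning and structured tool calls until the task terminates."""
    history = [{"role": "user", "instruction": instruction,
                "preference": preference}]  # optional natural language preference
    for _ in range(MAX_TURNS):
        # Steps 1 and 2: read the history, reason internally, plan the next action.
        step = orchestrator.generate(history)  # hypothetical API
        if step.get("terminate"):
            return step.get("answer"), history
        # Step 3: emit a structured tool call in a unified JSON format.
        call = step["tool_call"]  # e.g. {"name": "web_search", "arguments": {...}}
        observation = tools[call["name"]](**call["arguments"])
        # The environment appends the result as an observation for the next turn.
        history.append({"role": "assistant",
                        "content": step.get("reasoning", ""),
                        "tool_call": json.dumps(call)})
        history.append({"role": "tool", "name": call["name"],
                        "observation": observation})
    return None, history  # turn budget exhausted without a final answer
```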

Tools cover three main groups. Basic tools include Tavily web search, a Python sandbox code interpreter and a local Faiss index built with Qwen3-Embedding-8B. Specialized LLMs include Qwen2.5-Math-72B, Qwen2.5-Math-7B and Qwen2.5-Coder-32B. Generalist LLM tools include GPT-5, GPT-5 mini, Llama 3.3-70B-Instruct and Qwen3-32B. All tools share the same schema with names, natural language descriptions and typed parameter specs.
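
Because classic tools and LLM tools share one schema, adding or swapping a tool does not change the orchestration loop. A sketch of what such schema entries could look like, assuming a JSON function-calling style; the exact field layout used by ToolOrchestra may differ:

```python
# Illustrative unified tool schema: every tool, whether a web search, a code
# interpreter or another LLM, exposes a name, a description and typed parameters.
TOOLS = [
    {
        "name": "tavily_web_search",
        "description": "Search the web and return the top results.",
        "parameters": {"query": {"type": "string"},
                       "max_results": {"type": "integer", "default": 5}},
    },
    {
        "name": "python_sandbox",
        "description": "Execute Python code and return stdout.",
        "parameters": {"code": {"type": "string"}},
    },
    {
        "name": "qwen2_5_math_72b",
        "description": "Specialized math LLM for hard symbolic problems.",
        "parameters": {"prompt": {"type": "string"}},
    },
    {
        "name": "gpt_5",
        "description": "Generalist frontier LLM, highest cost per call.",
        "parameters": {"prompt": {"type": "string"}},
    },
]
```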

End to End Reinforcement Learning with Multi Objective Rewards

ToolOrchestra formulates the whole workflow as a Markov Decision Process. The state contains the conversation history, past tool calls and observations, and user preferences. Actions are the next generated text step, including both reasoning tokens and a structured tool call in the unified schema. After up to 50 steps, the environment computes a scalar reward for the full trajectory.

The reward has three components. Outcome reward is binary and depends on whether the trajectory solves the task. For open-ended answers, GPT-5 is used as a judge to compare the model output with the reference. Efficiency rewards penalize both monetary cost and wall clock latency. Token usage for proprietary and open source tools is mapped to monetary cost using public API and Together AI pricing. Preference reward measures how well tool usage matches a user preference vector that can increase or decrease the weight on cost, latency or specific tools. These components are combined into a single scalar using the preference vector.
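
As a rough sketch, the scalar trajectory reward could be assembled as follows; the weights, signs and scales are placeholders rather than the paper's exact formulation:

```python
def trajectory_reward(solved, cost_usd, latency_min, pref):
    """Combine outcome, efficiency and preference terms into one scalar.

    `pref` is a user preference vector, for example
    {"outcome": 1.0, "cost": 0.5, "latency": 0.2}; the values and scales here
    are illustrative placeholders, not the paper's coefficients.
    """
    outcome = 1.0 if solved else 0.0        # binary; GPT-5 judges open-ended answers
    cost_term = pref["cost"] * cost_usd     # tokens mapped to dollars via API pricing
    latency_term = pref["latency"] * latency_min
    return pref["outcome"] * outcome - cost_term - latency_term
```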

The policy is optimized with Group Relative Policy Optimization (GRPO), a variant of policy gradient reinforcement learning that normalizes rewards within groups of trajectories for the same task. The training process includes filters that drop trajectories with invalid tool call format or weak reward variance to stabilize optimization.
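
The group-relative part of GRPO can be illustrated in a few lines: rewards for a group of rollouts sampled for the same task are turned into advantages by normalizing against the group mean and standard deviation. This is a minimal illustration that omits the clipping and KL terms of the full objective:

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same task."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    # Groups with near-zero reward variance carry little learning signal;
    # during training such trajectories are filtered out before optimization.
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: four rollouts for one task. Above-average rollouts get positive
# advantages, which increases the likelihood of the actions they took.
print(grpo_advantages([0.9, 0.2, 0.4, 0.7]))
```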

https://arxiv.org/pdf/2511.21689

To make this training possible at scale, the research team introduces ToolScale, a synthetic dataset of multi step tool calling tasks. For each domain, an LLM generates a database schema, database entries, domain specific APIs and then diverse user tasks with ground truth sequences of function calls and required intermediate information.

Benchmark results and cost profile

NVIDIA research team evaluates Orchestrator-8B on three challenging benchmarks, Humanity’s Last Exam, FRAMES and τ² Bench. These benchmarks target long horizon reasoning, factuality under retrieval and function calling in a dual control environment.

On Humanity’s Last Exam text only questions, Orchestrator-8B reaches 37.1 percent accuracy. GPT-5 with basic tools reaches 35.1 percent in the same setting. On FRAMES, Orchestrator-8B achieves 76.3 percent versus 74.0 percent for GPT-5 with tools. On τ² Bench, Orchestrator-8B scores 80.2 percent versus 77.7 percent for GPT-5 with basic tools.

https://arxiv.org/pdf/2511.21689

The efficiency gap is larger. In the configuration that uses basic tools plus specialized and generalist LLM tools, Orchestrator-8B has average cost 9.2 cents and latency 8.2 minutes per query, averaged over Humanity’s Last Exam and FRAMES. In the same configuration, GPT-5 costs 30.2 cents and takes 19.8 minutes on average. The model card summarizes this as about 30 percent of the monetary cost and 2.5 times faster for Orchestrator-8B compared to GPT-5.
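
For reference, 9.2 / 30.2 ≈ 0.30 and 19.8 / 8.2 ≈ 2.4, which is the arithmetic behind the roughly 30 percent cost figure and is consistent with the model card's rounded 2.5 times speed claim.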

Tool use analysis supports this picture. Claude Opus 4.1 used as an orchestrator calls GPT-5 most of the time. GPT-5 used as an orchestrator prefers GPT-5 mini. Orchestrator-8B spreads calls more evenly across strong models, cheaper models, search, local retrieval and the code interpreter, and reaches higher accuracy at lower cost for the same turn budget.

https://arxiv.org/pdf/2511.21689

Generalization experiments replace the training time tools with unseen models such as OpenMath Llama-2-70B, DeepSeek-Math-7B-Instruct, Codestral-22B-v0.1, Claude Sonnet-4.1 and Gemma-3-27B. Orchestrator-8B still achieves the best trade off between accuracy, cost and latency among all baselines in this setting. A separate preference aware test set shows that Orchestrator-8B also tracks user tool usage preferences more closely than GPT-5, Claude Opus-4.1 and Qwen3-235B-A22B under the same reward metric.

Key Takeaways

ToolOrchestra trains an 8B parameter orchestration model, Orchestrator-8B, that selects and sequences tools and LLMs to solve multi step agentic tasks using reinforcement learning with outcome, efficiency and preference aware rewards.

Orchestrator-8B is released as an open weight model on Hugging Face. It is designed to coordinate diverse tools such as web search, code execution, retrieval and specialist LLMs through a unified schema.

On Humanity’s Last Exam, Orchestrator-8B reaches 37.1 percent accuracy, surpassing GPT-5 at 35.1 percent, while being about 2.5 times more efficient, and on τ² Bench and FRAMES it outperforms GPT-5 while using roughly 30 percent of the cost.

The framework shows that naive prompting of a frontier LLM as its own router leads to self enhancement bias where it overuses itself or a small set of strong models, while a trained orchestrator learns a more balanced, cost aware routing policy over multiple tools.

Editorial Notes

NVIDIA’s ToolOrchestra is a practical step toward compound AI systems where an 8B orchestration model, Orchestrator-8B, learns an explicit routing policy over tools and LLMs instead of relying on a single frontier model. It shows clear gains on Humanity’s Last Exam, FRAMES and τ² Bench with about 30 percent of the cost and around 2.5 times better efficiency than GPT-5 based baselines, which makes it directly relevant for teams that care about accuracy, latency and budget. This launch makes orchestration policy a first class optimization target in AI systems.

Check out the Paper, Repo, Project Page and Model Weights.
The post NVIDIA AI Releases Orchestrator-8B: A Reinforcement Learning Trained Controller for Efficient Tool and Model Selection appeared first on MarkTechPost.