OpenAGI Foundation Launches Lux: A Foundation Computer Use Model that Tops Online Mind2Web with OSGym At Scale

How do you turn slow, manual click work across browsers and desktops into a reliable, automated system that can actually use a computer for you at scale? Lux is the latest example of computer use agents moving from research demo to infrastructure. The OpenAGI Foundation team has released Lux, a foundation model that operates real desktops and browsers and reports a score of 83.6 on the Online Mind2Web benchmark, which covers more than 300 real world computer use tasks. This puts it ahead of Google Gemini CUA at 69.0, OpenAI Operator at 61.3 and Anthropic Claude Sonnet 4 at 61.0.

Source: https://agiopen.org/blog

What Lux Actually Does

Lux is a computer use model, not a chat model with a browser plugin. It takes a natural language goal, views the screen, and outputs low level actions such as clicks, key presses and scroll events. It can drive browsers, editors, spreadsheets, email clients and other desktop applications because it works on rendered UI, not on application specific APIs.

From a developer point of view, Lux is available through the OpenAGI SDK and API console. The research team describes target workloads that include software QA flows, deep research runs, social media management, online store operations and bulk data entry. In all of these settings the agent needs to sequence dozens or hundreds of UI actions while staying aligned with a natural language task description.


Three Execution Modes For Different Control Levels

Lux ships with three execution modes that expose different tradeoffs between speed, autonomy and control.

Actor mode is the fast path. It runs around 1 second per step and is aimed at clearly specified tasks such as filling a form, pulling a report from a dashboard or extracting a small set of fields from a page. Think of it as a low latency macro engine that still understands natural language.

Thinker mode handles vague or multi step goals. It decomposes the high level instruction into smaller sub tasks and then executes them. Example workloads include multi page research, triage of long email queues or navigation of analytics interfaces where the exact click path is not specified in advance.

Tasker mode gives maximum determinism. The caller supplies an explicit Python list of steps that Lux executes one by one and it retries until the sequence completes or hits a hard failure. This allows teams to keep task graphs, guardrails and failure policies in their own code while delegating UI control to the model.

Actor, Thinker and Tasker are the three primary modes, covering fast execution, complex goal solving and deterministic procedural workflows respectively.
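To make the Tasker contract concrete, here is a minimal, hypothetical Python sketch of the pattern described above: the caller owns an explicit step list and the failure policy, while the model executes each UI step and retries until the sequence completes or hits a hard failure. The LuxClient object and its run_step method are illustrative placeholders, not the actual OpenAGI SDK API.

# Hypothetical sketch of a Tasker-style workflow. `client.run_step` is a
# placeholder for whatever the real OpenAGI SDK exposes; only the control
# flow (explicit step list, caller-owned retries and guardrails) mirrors
# the behavior described in the announcement.
STEPS = [
    "Open the orders dashboard in the browser",
    "Filter orders by status 'pending'",
    "Export the filtered table as CSV",
]

def run_tasker_workflow(client, steps, max_retries=3):
    results = []
    for step in steps:
        for _attempt in range(max_retries):
            outcome = client.run_step(step)          # model performs the UI actions
            if outcome.get("status") == "success":
                results.append(outcome)
                break
            if outcome.get("status") == "hard_failure":
                raise RuntimeError(f"Step failed permanently: {step}")
        else:
            raise RuntimeError(f"Step exhausted retries: {step}")
    return results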

Benchmarks, Latency And Cost

On Online Mind2Web, Lux reaches a success rate of 83.6 percent. The same benchmark reports 69.0 percent for Gemini CUA, 61.3 percent for OpenAI Operator and 61.0 percent for Claude Sonnet 4. The benchmark contains more than 300 web based tasks collected from real services, so it is a useful proxy for practical agents that drive browsers and web apps.

Latency and cost are where the numbers become important for engineering teams. The OpenAGI team reports that Lux completes each step in about 1 second, while OpenAI Operator is around 3 seconds per step in the same evaluation setting. The research team also states that Lux is about 10 times cheaper per token than Operator. For any agent that can easily run hundreds of steps in a session, these constant factors determine whether a workload is viable in production.
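A quick back-of-the-envelope calculation shows why these constant factors matter. The per-step latencies below are the figures reported above; the 300-step session length is an assumption chosen only for illustration.

# Rough wall-clock comparison for a long agent session.
# Per-step latencies are the reported figures; the 300-step session length
# is an illustrative assumption.
steps_per_session = 300
lux_seconds_per_step = 1.0
operator_seconds_per_step = 3.0

lux_minutes = steps_per_session * lux_seconds_per_step / 60           # 5.0 minutes
operator_minutes = steps_per_session * operator_seconds_per_step / 60  # 15.0 minutes

print(f"Lux:      {lux_minutes:.1f} min per session")
print(f"Operator: {operator_minutes:.1f} min per session")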

Agentic Active Pre-training and Why OSGym Matters

Lux is trained with a method that the OpenAGI research team calls Agentic Active Pre-training. The team contrasts this with standard language model pre-training that passively ingests text from the internet. The idea is that Lux learns by acting in digital environments and refining its behavior through large scale interaction, rather than only minimizing token prediction loss on static logs. The optimization objective differs from classical reinforcement learning, and is set up to favor self driven exploration and understanding instead of a manually shaped reward.

This training setup depends on a data engine that can expose many operating system environments in parallel. The OpenAGI team has already open sourced that engine as OSGym, under an MIT license that allows both research and commercial use. OSGym runs full operating system replicas, not only browser sandboxes, and supports tasks that span office software, browsers, development tools and multi application workflows.

Key Takeaways

Lux is a foundation computer use model that operates full desktops and browsers and reaches 83.6 percent success on the Online Mind2Web benchmark, ahead of Gemini CUA, OpenAI Operator and Claude Sonnet 4.

Lux exposes 3 modes, Actor, Thinker and Tasker, which cover low latency UI macros, multi step goal decomposition and deterministic scripted execution for production workflows.

Lux is reported to run around 1 second per step and to be about 10 times cheaper per token than OpenAI Operator, which matters for long horizon agents that run hundreds of actions per task.

Lux is trained with Agentic Active Pre-training, where the model learns by acting in environments, rather than only consuming static web text, which targets robust screen to action behavior instead of pure language modeling.

OSGym, the open source data engine behind Lux, can run more than 1,000 OS replicas and generate more than 1,400 multi turn trajectories per minute at low per replica cost, which gives teams a practical way to train and evaluate their own computer use agents.

Check out the Official Announcement, Project and Repo.
The post OpenAGI Foundation Launches Lux: A Foundation Computer Use Model that Tops Online Mind2Web with OSGym At Scale appeared first on MarkTechPost.

Kernel Principal Component Analysis (PCA): Explained with an Example

Dimensionality reduction techniques like PCA work wonderfully when datasets are linearly separable—but they break down the moment nonlinear patterns appear. That’s exactly what happens with datasets such as two moons: PCA flattens the structure and mixes the classes together. 

Kernel PCA fixes this limitation by mapping the data into a higher-dimensional feature space where nonlinear patterns become linearly separable. In this article, we’ll walk through how Kernel PCA works and use a simple example to visually compare PCA vs. Kernel PCA, showing how a nonlinear dataset that PCA fails to separate becomes perfectly separable after applying Kernel PCA.

What is PCA and how is it different from Kernel PCA?

Principal Component Analysis (PCA) is a linear dimensionality-reduction technique that identifies the directions (principal components) along which the data varies the most. It works by computing orthogonal linear combinations of the original features and projecting the dataset onto the directions of maximum variance. 

These components are uncorrelated and ordered so that the first few capture most of the information in the data. PCA is powerful, but it comes with one important limitation: it can only uncover linear relationships in the data. When applied to nonlinear datasets—like the “two moons” example—it often fails to separate the underlying structure.

Kernel PCA extends PCA to handle nonlinear relationships. Instead of directly applying PCA in the original feature space, Kernel PCA first uses a kernel function (such as RBF, polynomial, or sigmoid) to implicitly project the data into a higher-dimensional feature space where the nonlinear structure becomes linearly separable. 

PCA is then performed in this transformed space using a kernel matrix, without explicitly computing the higher-dimensional projection. This “kernel trick” allows Kernel PCA to capture complex patterns that standard PCA cannot.
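To make the kernel trick concrete, here is a minimal NumPy sketch of RBF Kernel PCA computed directly from the kernel matrix: we build the pairwise RBF kernel, center it in the implicit feature space, and take the leading eigenvectors, all without ever materializing the high-dimensional mapping. The gamma value is an arbitrary choice for illustration.

import numpy as np

def rbf_kernel_pca(X, gamma=15.0, n_components=2):
    # Pairwise squared Euclidean distances -> RBF kernel matrix K (n x n)
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * sq_dists)

    # Center the kernel matrix in the implicit feature space
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Eigendecomposition; projections are eigenvectors scaled by sqrt(eigenvalue)
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    idx = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))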

We will now create a dataset that is nonlinear and then apply PCA to the dataset.

Code Implementation

Generating the dataset

We generate a nonlinear “two moons” dataset using make_moons, which is ideal for demonstrating why PCA fails and Kernel PCA succeeds.

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.02, random_state=123)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Applying PCA on the dataset

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.title("PCA")
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()

The PCA visualization shows that the two moon-shaped clusters remain intertwined even after dimensionality reduction. This happens because PCA is a strictly linear technique—it can only rotate, scale, or flatten the data along straight directions of maximum variance. 

Since the “two moons” dataset has a nonlinear structure, PCA is unable to separate the classes or untangle the curved shapes. As a result, the transformed data still looks almost identical to the original pattern, and the two classes remain overlapped in the projected space.

Applying Kernel PCA on the dataset

We now apply Kernel PCA with an RBF kernel. The kernel function implicitly projects the dataset into a higher-dimensional space in which the two classes become linearly separable, and PCA is then performed on that representation.

from sklearn.decomposition import KernelPCA

kpca = KernelPCA(kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)

plt.title("Kernel PCA")
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y)
plt.show()

The goal of PCA (and dimensionality reduction in general) is not just to compress the data—it’s to reveal the underlying structure in a way that preserves meaningful variation. In nonlinear datasets like the two-moons example, traditional PCA cannot “unfold” the curved shapes because it only applies linear transformations.

Kernel PCA, however, performs a nonlinear mapping before applying PCA, allowing the algorithm to untangle the moons into two clearly separated clusters. This separation is valuable because it makes downstream tasks like visualization, clustering, and even classification far more effective. When the data becomes linearly separable after transformation, simple models—such as linear classifiers—can successfully distinguish between the classes, something that would be impossible in the original or PCA-transformed space.
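A quick way to verify this claim is to fit the same simple linear classifier on the PCA and Kernel PCA projections and compare accuracies. This is a minimal sketch reusing the X_pca and X_kpca arrays from above; exact scores will vary with the noise level and gamma, but the Kernel PCA features should be close to perfectly separable while the PCA features are not.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_accuracy(features, labels):
    # Hold out 30% of the points and score a plain logistic regression
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=123
    )
    clf = LogisticRegression().fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

print("Linear classifier on PCA features:       ", linear_accuracy(X_pca, y))
print("Linear classifier on Kernel PCA features:", linear_accuracy(X_kpca, y))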

Challenges involved with Kernel PCA

While Kernel PCA is powerful for handling nonlinear datasets, it comes with several practical challenges. The biggest drawback is computational cost—because it relies on computing pairwise similarities between all data points, the algorithm has O(n²) time and memory complexity, making it slow and memory-heavy for large datasets. 

Another challenge is model selection: choosing the right kernel (RBF, polynomial, etc.) and tuning parameters like gamma can be tricky and often requires experimentation or domain expertise. 

Kernel PCA can also be harder to interpret, since the transformed components no longer correspond to intuitive directions in the original feature space. Finally, it is sensitive to missing values and outliers, which can distort the kernel matrix and degrade performance.
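Because there is no direct reconstruction objective to optimize, a common workaround for the tuning problem is to select the kernel and gamma through a downstream task. The sketch below wraps Kernel PCA and a linear classifier in a pipeline and grid-searches gamma; the parameter grid is an arbitrary illustration, not a recommended default.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("kpca", KernelPCA(kernel="rbf", n_components=2)),
    ("clf", LogisticRegression()),
])

param_grid = {"kpca__gamma": [0.1, 1, 5, 15, 50]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best gamma:", search.best_params_["kpca__gamma"])
print("Cross-validated accuracy:", search.best_score_)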

Check out the FULL CODES here.
The post Kernel Principal Component Analysis (PCA): Explained with an Example appeared first on MarkTechPost.

How to Design a Fully Local Multi-Agent Orchestration System Using TinyLlama for Intelligent Task Decomposition and Autonomous Collaboration

In this tutorial, we explore how we can orchestrate a team of specialized AI agents locally using an efficient manager-agent architecture powered by TinyLlama. We walk through how we build structured task decomposition, inter-agent collaboration, and autonomous reasoning loops without relying on any external APIs. By running everything directly through the transformers library, we create a fully offline, lightweight, and transparent multi-agent system that we can customize, inspect, and extend. Through the snippets, we observe how each component, from task structures to agent prompts to result synthesis, comes together to form a coherent human-AI workflow that we control end-to-end. Check out the FULL CODES here.

!pip install transformers torch accelerate bitsandbytes -q

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import json
import re
from typing import List, Dict, Any
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class Task:
    id: str
    description: str
    assigned_to: str = None
    status: str = "pending"
    result: Any = None
    dependencies: List[str] = None

    def __post_init__(self):
        if self.dependencies is None:
            self.dependencies = []

@dataclass
class Agent:
    name: str
    role: str
    expertise: str
    system_prompt: str

We set up all the core imports and define the fundamental data structures needed to manage tasks and agents. We define Task and Agent as structured entities to cleanly orchestrate work. By doing this, we ensure that every part of the system has a consistent and reliable foundation. Check out the FULL CODES here.

AGENT_REGISTRY = {
    "researcher": Agent(
        name="researcher",
        role="Research Specialist",
        expertise="Information gathering, analysis, and synthesis",
        system_prompt="You are a research specialist. Provide thorough research on topics."
    ),
    "coder": Agent(
        name="coder",
        role="Software Engineer",
        expertise="Writing clean, efficient code with best practices",
        system_prompt="You are an expert programmer. Write clean, well-documented code."
    ),
    "writer": Agent(
        name="writer",
        role="Content Writer",
        expertise="Clear communication and documentation",
        system_prompt="You are a professional writer. Create clear, engaging content."
    ),
    "analyst": Agent(
        name="analyst",
        role="Data Analyst",
        expertise="Data interpretation and insights",
        system_prompt="You are a data analyst. Provide clear insights from data."
    )
}

class LocalLLM:
    def __init__(self, model_name: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16
        ) if torch.cuda.is_available() else None
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quantization_config,
            device_map="auto",
            low_cpu_mem_usage=True
        )
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def generate(self, prompt: str, max_tokens: int = 300) -> str:
        formatted_prompt = f"<|system|>\nYou are a helpful AI assistant.</s>\n<|user|>\n{prompt}</s>\n<|assistant|>\n"
        inputs = self.tokenizer(
            formatted_prompt,
            return_tensors="pt",
            truncation=True,
            max_length=1024,
            padding=True
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=0.7,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.tokenizer.pad_token_id,
                eos_token_id=self.tokenizer.eos_token_id,
                use_cache=True
            )
        full_response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        if "<|assistant|>" in full_response:
            return full_response.split("<|assistant|>")[-1].strip()
        return full_response[len(formatted_prompt):].strip()

We register all our specialized agents and implement the local LLM wrapper that powers the system. We load TinyLlama in an efficient 4-bit mode so we can run everything smoothly on Colab or local hardware. With this, we give ourselves a flexible and fully local way to generate responses for each agent. Check out the FULL CODES here.

class ManagerAgent:
    def __init__(self, model_name: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
        self.llm = LocalLLM(model_name)
        self.agents = AGENT_REGISTRY
        self.tasks: Dict[str, Task] = {}
        self.execution_log = []

    def log(self, message: str):
        timestamp = datetime.now().strftime("%H:%M:%S")
        log_entry = f"[{timestamp}] {message}"
        self.execution_log.append(log_entry)
        print(log_entry)

    def decompose_goal(self, goal: str) -> List[Task]:
        self.log(f"Decomposing goal: {goal}")
        agent_info = "\n".join([f"- {name}: {agent.expertise}" for name, agent in self.agents.items()])
        prompt = f"""Break down this goal into 3 specific subtasks. Assign each to the best agent.

Goal: {goal}

Available agents:
{agent_info}

Respond ONLY with a JSON array."""
        response = self.llm.generate(prompt, max_tokens=250)
        try:
            json_match = re.search(r'\[\s*\{.*?\}\s*\]', response, re.DOTALL)
            if json_match:
                tasks_data = json.loads(json_match.group())
            else:
                raise ValueError("No JSON found")
        except:
            tasks_data = self._create_default_tasks(goal)

        tasks = []
        for i, task_data in enumerate(tasks_data[:3]):
            task = Task(
                id=task_data.get('id', f'task_{i+1}'),
                description=task_data.get('description', f'Work on: {goal}'),
                assigned_to=task_data.get('assigned_to', list(self.agents.keys())[i % len(self.agents)]),
                dependencies=task_data.get('dependencies', [] if i == 0 else [f'task_{i}'])
            )
            self.tasks[task.id] = task
            tasks.append(task)
            self.log(f"  ✓ {task.id}: {task.description[:50]}... → {task.assigned_to}")

        return tasks

We begin constructing the ManagerAgent class and focus on how we decompose a high-level goal into well-defined subtasks. We generate structured JSON-based tasks and automatically assign them to the right agent. By doing this, we allow the system to think step by step and organize work just like a human project manager. Check out the FULL CODES here.

    # These methods continue the ManagerAgent class defined above.
    def _create_default_tasks(self, goal: str) -> List[Dict]:
        if any(word in goal.lower() for word in ['code', 'program', 'implement', 'algorithm']):
            return [
                {"id": "task_1", "description": f"Research and explain the concept: {goal}", "assigned_to": "researcher", "dependencies": []},
                {"id": "task_2", "description": f"Write code implementation for: {goal}", "assigned_to": "coder", "dependencies": ["task_1"]},
                {"id": "task_3", "description": "Create documentation and examples", "assigned_to": "writer", "dependencies": ["task_2"]}
            ]
        return [
            {"id": "task_1", "description": f"Research: {goal}", "assigned_to": "researcher", "dependencies": []},
            {"id": "task_2", "description": "Analyze findings and structure content", "assigned_to": "analyst", "dependencies": ["task_1"]},
            {"id": "task_3", "description": "Write comprehensive response", "assigned_to": "writer", "dependencies": ["task_2"]}
        ]

    def execute_task(self, task: Task, context: Dict[str, Any] = None) -> str:
        self.log(f"Executing {task.id} with {task.assigned_to}")
        task.status = "in_progress"
        agent = self.agents[task.assigned_to]
        context_str = ""
        if context and task.dependencies:
            context_str = "\n\nContext from previous tasks:\n"
            for dep_id in task.dependencies:
                if dep_id in context:
                    context_str += f"- {context[dep_id][:150]}...\n"

        prompt = f"""{agent.system_prompt}

Task: {task.description}{context_str}

Provide a clear, concise response:"""
        result = self.llm.generate(prompt, max_tokens=250)
        task.result = result
        task.status = "completed"
        self.log(f"  ✓ Completed {task.id}")
        return result

We define fallback task logic and the full execution flow for each task. We guide each agent with its own system prompt and provide contextual information to keep results coherent. This allows us to execute tasks intelligently while respecting dependency order. Check out the FULL CODES here.

    # These methods also belong to the ManagerAgent class.
    def synthesize_results(self, goal: str, results: Dict[str, str]) -> str:
        self.log("Synthesizing final results")
        results_text = "\n\n".join([f"Task {tid}:\n{res[:200]}" for tid, res in results.items()])
        prompt = f"""Combine these task results into one final coherent answer.

Original Goal: {goal}

Task Results:
{results_text}

Final comprehensive answer:"""
        return self.llm.generate(prompt, max_tokens=350)

    def execute_goal(self, goal: str) -> Dict[str, Any]:
        self.log(f"\n{'='*60}\nStarting Manager Agent\n{'='*60}")
        tasks = self.decompose_goal(goal)
        results = {}
        completed = set()
        max_iterations = len(tasks) * 2
        iteration = 0

        while len(completed) < len(tasks) and iteration < max_iterations:
            iteration += 1
            for task in tasks:
                if task.id in completed:
                    continue
                deps_met = all(dep in completed for dep in task.dependencies)
                if deps_met:
                    result = self.execute_task(task, results)
                    results[task.id] = result
                    completed.add(task.id)

        final_output = self.synthesize_results(goal, results)
        self.log(f"\n{'='*60}\nExecution Complete!\n{'='*60}\n")

        return {
            "goal": goal,
            "tasks": [asdict(task) for task in tasks],
            "final_output": final_output,
            "execution_log": self.execution_log
        }

We synthesize the outputs from all subtasks and convert them into one unified final answer. We also implement an orchestration loop that ensures each task runs only after its dependencies are complete. This snippet shows how we bring everything together into a smooth multi-step reasoning pipeline. Check out the FULL CODES here.

def demo_basic():
    manager = ManagerAgent()
    goal = "Explain binary search algorithm with a simple example"
    result = manager.execute_goal(goal)
    print("\n" + "="*60)
    print("FINAL OUTPUT")
    print("="*60)
    print(result["final_output"])
    return result

def demo_coding():
    manager = ManagerAgent()
    goal = "Implement a function to find the maximum element in a list"
    result = manager.execute_goal(goal)
    print("\n" + "="*60)
    print("FINAL OUTPUT")
    print("="*60)
    print(result["final_output"])
    return result

def demo_custom(custom_goal: str):
    manager = ManagerAgent()
    result = manager.execute_goal(custom_goal)
    print("\n" + "="*60)
    print("FINAL OUTPUT")
    print("="*60)
    print(result["final_output"])
    return result

if __name__ == "__main__":
    print("Manager Agent Tutorial - APIless Local Version")
    print("="*60)
    print("Using TinyLlama (1.1B) - Fast & efficient!\n")
    result = demo_basic()
    print("\n\nTry more:")
    print(" - demo_coding()")
    print(" - demo_custom('your goal here')")

We provide demonstration functions to easily test our system with different goals. We run sample tasks to observe how the manager decomposes, executes, and synthesizes work in real time. This gives us an interactive way to understand the entire workflow and refine it further.

In conclusion, we demonstrate how to design and operate a complete multi-agent orchestration system locally with minimal dependencies. We now understand how the manager breaks down goals, routes tasks to the right expert agents, collects their outputs, resolves dependencies, and synthesizes the final result. This implementation allows us to appreciate how modular, predictable, and powerful local agentic patterns can be when built from scratch.

Check out the FULL CODES here.
The post How to Design a Fully Local Multi-Agent Orchestration System Using TinyLlama for Intelligent Task Decomposition and Autonomous Collaboration appeared first on MarkTechPost.

Apple Researchers Release CLaRa: A Continuous Latent Reasoning Framework for Compression‑Native RAG with 16x–128x Semantic Document Compression

How do you keep RAG systems accurate and efficient when every query tries to stuff thousands of tokens into the context window and the retriever and generator are still optimized as 2 separate, disconnected systems? A team of researchers from Apple and the University of Edinburgh has released CLaRa (Continuous Latent Reasoning), a retrieval augmented generation framework, along with three models (CLaRa-7B-Base, CLaRa-7B-Instruct and CLaRa-7B-E2E), that compresses documents into continuous memory tokens and then performs both retrieval and generation in that shared latent space. The goal is simple: shorten context, avoid double encoding, and let the generator teach the retriever what actually matters for downstream answers.

Source: https://arxiv.org/pdf/2511.18659

From raw documents to continuous memory tokens

CLaRa starts with a semantic compressor that attaches a small number of learned memory tokens to each document. During Salient Compressor Pretraining, SCP, the base model is a Mistral 7B style transformer with LoRA adapters that switch between a compressor role and a generator role. The final layer hidden states of the memory tokens become the compressed representation for that document.

SCP is trained on about 2M passages from Wikipedia 2021. A local Qwen-32B model generates 3 supervision signals for each passage. Simple QA pairs cover atomic facts. Complex QA pairs connect several facts in one question to enforce multi hop reasoning. Paraphrases reorder and compress the text while preserving semantics. A verification loop checks factual consistency and coverage and can regenerate missing questions or paraphrases for up to 10 rounds before accepting a sample.

Training uses 2 losses. A cross entropy term trains the generator to answer questions or produce paraphrases conditioned only on the memory tokens and an instruction prefix. A mean squared error term aligns the average hidden state of document tokens with the average hidden state of the memory tokens. The MSE loss gives modest but consistent gains of about 0.3 to 0.6 F1 points at compression ratios 32 and 128 and keeps compressed and original representations in the same semantic region.


Joint retrieval and generation in a shared space

After offline compression, each document is represented only by its memory tokens. CLaRa then trains a query reasoner and an answer generator on top of the same backbone. The query reasoner is another LoRA adapter that maps an input question into the same number of memory tokens used for documents. Retrieval becomes pure embedding search. The system computes cosine similarity between the query embedding and each candidate document embedding.

The best compressed document embeddings for a query are concatenated with the query tokens and fed into the generator adapter. Training uses only a standard next token prediction loss on the final answer. There are no explicit relevance labels. The key trick is a differentiable top k selector implemented with a Straight Through estimator. During the forward pass the model uses hard top k selection. During the backward pass a softmax distribution over document scores allows gradients from the generator to flow into the query reasoner parameters.
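The straight-through top-k trick can be sketched in a few lines of PyTorch. This is a generic illustration of the estimator described above, not the CLaRa training code: the forward pass keeps a hard top-k document mask, while the backward pass routes gradients through a softmax over the retrieval scores.

import torch
import torch.nn.functional as F

def straight_through_topk(scores: torch.Tensor, k: int, temperature: float = 1.0):
    # scores: (num_docs,) cosine similarities between the query embedding and
    # each compressed document embedding.
    soft = F.softmax(scores / temperature, dim=-1)               # differentiable surrogate
    topk_idx = scores.topk(k).indices
    hard = torch.zeros_like(scores).scatter_(0, topk_idx, 1.0)   # hard 0/1 selection
    # Straight-through: forward value equals `hard`, gradient flows through `soft`
    return hard + soft - soft.detach()

# Toy usage: gradients from a downstream loss reach the scoring parameters
scores = torch.randn(20, requires_grad=True)
mask = straight_through_topk(scores, k=5)
loss = (mask * torch.arange(20.0)).sum()   # stand-in for the language modeling loss
loss.backward()
print(scores.grad.shape)  # torch.Size([20])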

The research team shows 2 effects in the gradient analysis. First, the retriever is encouraged to assign higher probability to documents that increase answer likelihood. Second, because retrieval and generation share the same compressed representations, generator gradients reshape the latent document space to make it easier to reason over. Logit lens analysis of the query embeddings recovers topic tokens such as “NFL” and “Oklahoma” for a question about the nephew of Ivory Lee Brown, even though those tokens are not in the raw query but are present in the supporting articles.


Compression quality and QA accuracy

The compressor is evaluated on 4 QA datasets: Natural Questions, HotpotQA, MuSiQue and 2WikiMultihopQA. Under the Normal setting, where the system retrieves the top 5 Wikipedia 2021 documents per query, SCP-Mistral-7B at 4 times compression reaches an average F1 of 39.86. This is 5.37 points better than the hard compression baseline LLMLingua 2 and 1.13 points better than the best soft compression baseline PISCO.

Under the Oracle setting, where the gold document is guaranteed to be in the candidate set, SCP-Mistral-7B at 4 times compression reaches an average F1 of 66.76. That is 17.31 points above LLMLingua-2 and 5.35 points above PISCO. Even more interesting, the compressed representations outperform a BGE based text retriever plus full document Mistral-7B generator by about 2.36 average F1 points for Mistral and about 6.36 points for Phi 4 mini. Well trained soft compression can exceed full text RAG while cutting context length by factors from 4 to 128.


Performance at very high compression ratios, above 32 in the Oracle setting, does drop, but the decline remains moderate under Normal retrieval conditions. The key explanation, according to the research team, is that weak document relevance bottlenecks the system before compression quality does.

End to end QA and retrieval behavior

For end to end QA, CLaRa uses 20 candidate documents per query with compression ratios 4, 16 and 32. On the Normal setting, CLaRa-Mistral-7B with instruction initialized weights and 16 times compression reaches F1 equal to 50.89 on Natural Questions and 44.66 on 2WikiMultihopQA. This is comparable to DRO-Mistral-7B, which reads full uncompressed text, while using 16 times shorter document representations. On some datasets, CLaRa at 16 times compression slightly improves F1 over DRO, for example from 43.65 to 47.18 on 2Wiki.

In the Oracle setting, CLaRa-Mistral-7B exceeds 75 F1 on both Natural Questions and HotpotQA at 4 times compression. This shows that the generator can fully exploit accurate retrieval even when all evidence is stored only in compressed memory tokens. Instruction initialized CLaRa generally wins over pre-training initialized CLaRa in the Normal setting, while the gap narrows in Oracle, where retrieval noise is limited.

On the retrieval side, CLaRa used as a reranker under Oracle conditions delivers strong Recall at 5. With pretraining initialization at compression 4 on HotpotQA, CLaRa-Mistral-7B reaches Recall at 5 equal to 96.21. This beats the supervised BGE Reranker baseline at 85.93 by 10.28 points and even outperforms a fully supervised Sup Instruct retriever trained with contrastive relevance labels.


What Apple Has Released

Apple’s research team released 3 models on Hugging Face: CLaRa-7B-Base, CLaRa-7B-Instruct and CLaRa-7B-E2E. CLaRa-7B-Instruct is described as an instruction tuned unified RAG model with built in document compression at 16 and 128 times. It answers instruction style questions directly from compressed representations and uses Mistral-7B-Instruct v0.2 as the base model.

Key Takeaways

CLaRa replaces raw documents with a small set of continuous memory tokens learned via QA guided and paraphrase guided semantic compression, which preserves key reasoning signals even at 16 times and 128 times compression.

Retrieval and generation are trained in a single shared latent space, the query encoder and generator share the same compressed representations and are optimized together with one language modeling loss.

A differentiable top-k estimator lets gradients flow from answer tokens back into the retriever, which aligns document relevance with answer quality and removes the usual disjoint tuning loop for RAG systems.

On QA benchmarks like Natural Questions, HotpotQA, MuSiQue and 2WikiMultihopQA, CLaRa’s SCP compressor at 4 times compression outperforms strong text based baselines such as LLMLingua 2 and PISCO and can even beat full text BGE/Mistral pipelines on average F1.

Apple has released 3 practical models, CLaRa-7B-Base, CLaRa-7B-Instruct and CLaRa-7B-E2E, along with the full training pipeline on GitHub.

Editorial Notes

CLaRa is an important step for retrieval augmented generation because it treats semantic document compression and joint optimization in a shared continuous space as first class citizens, not afterthoughts bolted onto a text only pipeline. It shows that embedding based compression with SCP, combined with end to end training via a differentiable top-k estimator and a single language modeling loss, can match or surpass text based RAG baselines while using far shorter contexts and simpler retrieval stacks. Overall, CLaRa demonstrates that unified continuous latent reasoning is a credible alternative to classic chunk and retrieve RAG for real world QA workloads.

Check out the Paper, Model Weights on HF and Repo.
The post Apple Researchers Release CLaRa: A Continuous Latent Reasoning Framework for Compression‑Native RAG with 16x–128x Semantic Document Compression appeared first on MarkTechPost.

AI Interview Series #4: Transformers vs Mixture of Experts (MoE)

Question:

MoE models contain far more parameters than Transformers, yet they can run faster at inference. How is that possible?

Difference between Transformers & Mixture of Experts (MoE)

Transformers and Mixture of Experts (MoE) models share the same backbone architecture—self-attention layers followed by feed-forward layers—but they differ fundamentally in how they use parameters and compute.

Feed-Forward Network vs Experts

Transformer: Each block contains a single large feed-forward network (FFN). Every token passes through this FFN, activating all parameters during inference.

MoE: Replaces the FFN with multiple smaller feed-forward networks, called experts. A routing network selects only a few experts (Top-K) per token, so only a small fraction of total parameters is active.

Parameter Usage

Transformer: All parameters across all layers are used for every token → dense compute.

MoE: Has more total parameters, but activates only a small portion per token → sparse compute. Example: Mixtral 8×7B has 46.7B total parameters, but uses only ~13B per token.

Inference Cost

Transformer: High inference cost due to full parameter activation. Scaling to models like GPT-4 or Llama 2 70B requires powerful hardware.

MoE: Lower inference cost because only K experts per layer are active. This makes MoE models faster and cheaper to run, especially at large scales.

Token Routing

Transformer: No routing. Every token follows the exact same path through all layers.

MoE: A learned router assigns tokens to experts based on softmax scores. Different tokens select different experts, and different layers may activate different experts, which increases specialization and model capacity (see the routing sketch below).
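Here is a minimal PyTorch sketch of the routing step described above: a linear router scores the experts, only the top-k expert FFNs run for each token, and their outputs are combined with the softmax weights. The layer sizes and k=2 are arbitrary illustration choices, not any particular model's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)   # learned routing scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (num_tokens, d_model)
        logits = self.router(x)                  # (num_tokens, num_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):               # only k experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)
print(TopKMoELayer()(tokens).shape)   # torch.Size([10, 64])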

Model Capacity

Transformer: To scale capacity, the only option is adding more layers or widening the FFN—both increase FLOPs heavily.

MoE: Can scale total parameters massively without increasing per-token compute. This enables “bigger brains at lower runtime cost.”

While MoE architectures offer massive capacity with lower inference cost, they introduce several training challenges. The most common issue is expert collapse, where the router repeatedly selects the same experts, leaving others under-trained. 

Load imbalance is another challenge—some experts may receive far more tokens than others, leading to uneven learning. To address this, MoE models rely on techniques like noise injection in routing, Top-K masking, and expert capacity limits. 

These mechanisms ensure all experts stay active and balanced, but they also make MoE systems more complex to train compared to standard Transformers.
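One common remedy, an auxiliary load-balancing loss in the style popularized by Switch Transformer-like models, can be sketched as follows. It penalizes the router when the fraction of tokens dispatched to each expert and the average routing probability drift away from a uniform split; the exact coefficient and formulation vary between papers, so treat this as an illustrative sketch rather than any specific model's recipe.

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts) raw routing scores
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)                       # soft routing probabilities
    topk_idx = router_logits.topk(k, dim=-1).indices
    # Average hard dispatch count per expert (how many of each token's k slots it receives)
    dispatch = F.one_hot(topk_idx, num_experts).float().sum(dim=1).mean(dim=0)
    # Average soft routing probability per expert
    importance = probs.mean(dim=0)
    # The penalty is smallest when both quantities are uniform across experts
    return num_experts * torch.sum(dispatch * importance)

logits = torch.randn(32, 8)
print(load_balancing_loss(logits))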

AI Interview Series #3: Explain Federated Learning

The post AI Interview Series #4: Transformers vs Mixture of Experts (MoE) appeared first on MarkTechPost.

How to Build a Meta-Cognitive AI Agent That Dynamically Adjusts Its Own Reasoning Depth for Efficient Problem Solving

In this tutorial, we build an advanced meta-cognitive control agent that learns how to regulate its own depth of thinking. We treat reasoning as a spectrum, ranging from fast heuristics to deep chain-of-thought to precise tool-like solving, and we train a neural meta-controller to decide which mode to use for each task. By optimizing the trade-off between accuracy, computation cost, and a limited reasoning budget, we explore how an agent can monitor its internal state and adapt its reasoning strategy in real time. Through each snippet, we experiment, observe patterns, and understand how meta-cognition emerges when an agent learns to think about its own thinking. Check out the FULL CODE NOTEBOOK.

import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Device used for the policy network and state tensors in the later snippets
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

OPS = ['+', '*']

def make_task():
    op = random.choice(OPS)
    if op == '+':
        a, b = random.randint(1, 99), random.randint(1, 99)
    else:
        a, b = random.randint(2, 19), random.randint(2, 19)
    return a, b, op

def true_answer(a, b, op):
    return a + b if op == '+' else a * b

def true_difficulty(a, b, op):
    if op == '+' and a <= 30 and b <= 30:
        return 0
    if op == '*' and a <= 10 and b <= 10:
        return 1
    return 2

def heuristic_difficulty(a, b, op):
    score = 0
    if op == '*':
        score += 0.6
    score += max(a, b) / 100.0
    return min(score, 1.0)

def fast_heuristic(a, b, op):
    if op == '+':
        base = a + b
        noise = random.choice([-2, -1, 0, 0, 0, 1, 2, 3])
    else:
        base = int(0.8 * a * b)
        noise = random.choice([-5, -3, 0, 0, 2, 5, 8])
    return base + noise, 0.5

def deep_chain_of_thought(a, b, op, verbose=False):
    if op == '+':
        x, y = a, b
        carry = 0
        pos = 1
        result = 0
        step = 0
        while x > 0 or y > 0 or carry:
            dx, dy = x % 10, y % 10
            s = dx + dy + carry
            carry, digit = divmod(s, 10)
            result += digit * pos
            x //= 10; y //= 10; pos *= 10
            step += 1
    else:
        result = 0
        step = 0
        for i, d in enumerate(reversed(str(b))):
            row = a * int(d) * (10 ** i)
            result += row
            step += 1
    return result, max(2.0, 0.4 * step)

def tool_solver(a, b, op):
    return eval(f"{a}{op}{b}"), 1.2

ACTION_NAMES = ["fast", "deep", "tool"]

We set up the world our meta-agent operates in. We generate arithmetic tasks, define ground-truth answers, estimate difficulty, and implement three different reasoning modes. As we run it, we observe how each solver behaves differently in terms of accuracy and computational cost, which form the foundation of the agent’s decision space. Check out the FULL CODE NOTEBOOK.

def encode_state(a, b, op, rem_budget, error_ema, last_action):
    a_n = a / 100.0
    b_n = b / 100.0
    op_plus = 1.0 if op == '+' else 0.0
    op_mul = 1.0 - op_plus
    diff_hat = heuristic_difficulty(a, b, op)
    rem_n = rem_budget / MAX_BUDGET
    last_onehot = [0.0, 0.0, 0.0]
    if last_action is not None:
        last_onehot[last_action] = 1.0
    feats = [
        a_n, b_n, op_plus, op_mul,
        diff_hat, rem_n, error_ema
    ] + last_onehot
    return torch.tensor(feats, dtype=torch.float32, device=device)

STATE_DIM = 10
N_ACTIONS = 3

class PolicyNet(nn.Module):
    def __init__(self, state_dim, hidden=48, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions)
        )

    def forward(self, x):
        return self.net(x)

policy = PolicyNet(STATE_DIM, hidden=48, n_actions=N_ACTIONS).to(device)
optimizer = optim.Adam(policy.parameters(), lr=3e-3)

We encode each task into a structured state that captures operands, operation type, predicted difficulty, remaining budget, and recent performance. We then define a neural policy network that maps this state to a probability distribution over actions. As we work through it, we see how the policy becomes the core mechanism through which the agent learns to regulate its thinking. Check out the FULL CODE NOTEBOOK.

GAMMA = 0.98
COST_PENALTY = 0.25
MAX_BUDGET = 25.0
EPISODES = 600
STEPS_PER_EP = 20
ERROR_EMA_DECAY = 0.9

def run_episode(train=True):
    log_probs = []
    rewards = []
    info = []
    rem_budget = MAX_BUDGET
    error_ema = 0.0
    last_action = None

    for _ in range(STEPS_PER_EP):
        a, b, op = make_task()
        state = encode_state(a, b, op, rem_budget, error_ema, last_action)
        logits = policy(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample() if train else torch.argmax(logits)
        act_idx = int(action.item())

        if act_idx == 0:
            pred, cost = fast_heuristic(a, b, op)
        elif act_idx == 1:
            pred, cost = deep_chain_of_thought(a, b, op, verbose=False)
        else:
            pred, cost = tool_solver(a, b, op)

        correct = (pred == true_answer(a, b, op))
        acc_reward = 1.0 if correct else 0.0
        budget_penalty = 0.0

        rem_budget -= cost
        if rem_budget < 0:
            budget_penalty = -1.5 * (abs(rem_budget) / MAX_BUDGET)

        step_reward = acc_reward - COST_PENALTY * cost + budget_penalty
        rewards.append(step_reward)

        if train:
            log_probs.append(dist.log_prob(action))

        err = 0.0 if correct else 1.0
        error_ema = ERROR_EMA_DECAY * error_ema + (1 - ERROR_EMA_DECAY) * err
        last_action = act_idx

        info.append({
            "correct": correct,
            "cost": cost,
            "difficulty": true_difficulty(a, b, op),
            "action": act_idx
        })

    if train:
        returns = []
        G = 0.0
        for r in reversed(rewards):
            G = r + GAMMA * G
            returns.append(G)
        returns = list(reversed(returns))
        returns_t = torch.tensor(returns, dtype=torch.float32, device=device)
        baseline = returns_t.mean()
        adv = returns_t - baseline
        loss = -(torch.stack(log_probs) * adv).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return rewards, info

We implement the heart of learning using the REINFORCE policy gradient algorithm. We run multi-step episodes, collect log-probabilities, accumulate rewards, and compute returns. As we execute this part, we watch the meta-controller adjust its strategy by reinforcing decisions that balance accuracy with cost. Check out the FULL CODE NOTEBOOK.

print("Training meta-cognitive controller...")
for ep in range(EPISODES):
    rewards, _ = run_episode(train=True)
    if (ep + 1) % 100 == 0:
        print(f"  episode {ep+1:4d} | avg reward {np.mean(rewards):.3f}")

def evaluate(n_episodes=50):
    all_actions = {0: [0, 0, 0], 1: [0, 0, 0], 2: [0, 0, 0]}
    stats = {0: {"n": 0, "acc": 0, "cost": 0},
             1: {"n": 0, "acc": 0, "cost": 0},
             2: {"n": 0, "acc": 0, "cost": 0}}

    for _ in range(n_episodes):
        _, info = run_episode(train=False)
        for step in info:
            d = step["difficulty"]
            a_idx = step["action"]
            all_actions[d][a_idx] += 1
            stats[d]["n"] += 1
            stats[d]["acc"] += 1 if step["correct"] else 0
            stats[d]["cost"] += step["cost"]

    for d in [0, 1, 2]:
        if stats[d]["n"] == 0:
            continue
        n = stats[d]["n"]
        print(f"Difficulty {d}:")
        print("  action counts [fast, deep, tool]:", all_actions[d])
        print("  accuracy:", stats[d]["acc"] / n)
        print("  avg cost:", stats[d]["cost"] / n)
        print()

print("Policy behavior by difficulty:")
evaluate()

We train the meta-cognitive agent over hundreds of episodes and evaluate its behavior across difficulty levels. We observe how the policy evolves, using fast heuristics for simple tasks while resorting to deeper reasoning for harder ones. As we analyze the outputs, we understand how training shapes the agent’s reasoning choices. Check out the FULL CODE NOTEBOOK.

print("\nExample hard task with meta-selected thinking mode:")
a, b, op = 47, 18, '*'
state = encode_state(a, b, op, MAX_BUDGET, 0.3, None)
with torch.no_grad():
    logits = policy(state)
    act = int(torch.argmax(logits).item())

print(f"Task: {a} {op} {b}")
print("Chosen mode:", ACTION_NAMES[act])

if act == 1:
    pred, cost = deep_chain_of_thought(a, b, op, verbose=True)
elif act == 0:
    pred, cost = fast_heuristic(a, b, op)
    print("Fast heuristic:", pred)
else:
    pred, cost = tool_solver(a, b, op)
    print("Tool solver:", pred)

print("True:", true_answer(a, b, op), "| cost:", cost)

We inspect a detailed reasoning trace for a hard example chosen by the trained policy. We see the agent confidently pick a mode and walk through the reasoning steps, allowing us to witness its meta-cognitive behavior in action. As we test different tasks, we appreciate how the model adapts its thinking based on context.

In conclusion, we have seen how a neural controller can learn to dynamically choose the most effective reasoning pathway based on the task’s difficulty and the constraints of the moment. We observe how the agent gradually discovers when quick heuristics are sufficient, when deeper reasoning is necessary, and when calling a precise solver is worth the cost. Through this process, we experience how metacognitive control transforms decision-making, leading to more efficient and adaptable reasoning systems.

Check out the FULL CODE NOTEBOOK.
The post How to Build a Meta-Cognitive AI Agent That Dynamically Adjusts Its Own Reasoning Depth for Efficient Problem Solving appeared first on MarkTechPost.

NVIDIA and Mistral AI Bring 10x Faster Inference for the Mistral 3 Family on GB200 NVL72 GPU Systems

NVIDIA announced today a significant expansion of its strategic collaboration with Mistral AI. This partnership coincides with the release of the new Mistral 3 frontier open model family, marking a pivotal moment where hardware acceleration and open-source model architecture have converged to redefine performance benchmarks.

The headline result of this collaboration is a massive leap in inference speed: the new models now run up to 10x faster on NVIDIA GB200 NVL72 systems compared to the previous generation H200 systems. This breakthrough unlocks unprecedented efficiency for enterprise-grade AI, promising to solve the latency and cost bottlenecks that have historically plagued the large-scale deployment of reasoning models.

A Generational Leap: 10x Faster on Blackwell

As enterprise demand shifts from simple chatbots to high-reasoning, long-context agents, inference efficiency has become the critical bottleneck. The collaboration between NVIDIA and Mistral AI addresses this head-on by optimizing the Mistral 3 family specifically for the NVIDIA Blackwell architecture.

Where production AI systems must deliver both strong user experience (UX) and cost-efficient scale, the NVIDIA GB200 NVL72 provides up to 10x higher performance than the previous-generation H200. This is not merely a gain in raw speed; it translates to significantly higher energy efficiency. The system exceeds 5,000,000 tokens per second per megawatt (MW) at user interactivity rates of 40 tokens per second.

Chart created by MarkTechPost.com; source: NVIDIA

For data centers grappling with power constraints, this efficiency gain is as critical as the performance boost itself. This generational leap ensures a lower per-token cost while maintaining the high throughput required for real-time applications.

A New Mistral 3 Family

The engine driving this performance is the newly released Mistral 3 family. This suite of models delivers industry-leading accuracy, efficiency, and customization capabilities, covering the spectrum from massive data center workloads to edge device inference.

Mistral Large 3: The Flagship MoE

At the top of the hierarchy sits Mistral Large 3, a state-of-the-art sparse Multimodal and Multilingual Mixture-of-Experts (MoE) model.

Total Parameters: 675 Billion

Active Parameters: 41 Billion

Context Window: 256K tokens

Trained on NVIDIA Hopper GPUs, Mistral Large 3 is designed to handle complex reasoning tasks, offering parity with top-tier closed models while retaining the flexibility of open weights.

Ministral 3: Dense Power at the Edge

Complementing the large model is the Ministral 3 series, a suite of small, dense, high-performance models designed for speed and versatility.

Sizes: 3B, 8B, and 14B parameters.

Variants: Base, Instruct, and Reasoning for each size (nine models total).

Context Window: 256K tokens across the board.

The Ministral 3 series excels on the GPQA Diamond accuracy benchmark, delivering higher accuracy while using far fewer tokens.

Significant Engineering Behind the Speed: A Comprehensive Optimization Stack

The “10x” performance claim is driven by a comprehensive stack of optimizations co-developed by Mistral and NVIDIA engineers. The teams adopted an “extreme co-design” approach, merging hardware capabilities with model architecture adjustments.

TensorRT-LLM Wide Expert Parallelism (Wide-EP)

To fully exploit the massive scale of the GB200 NVL72, NVIDIA employed Wide Expert Parallelism within TensorRT-LLM. This technology provides optimized MoE GroupGEMM kernels, expert distribution, and load balancing.

Crucially, Wide-EP exploits the NVL72’s coherent memory domain and NVLink fabric. It is highly resilient to architectural variations across large MoEs. For instance, Mistral Large 3 utilizes roughly 128 experts per layer, about half as many as comparable models like DeepSeek-R1. Despite this difference, Wide-EP enables the model to realize the high-bandwidth, low-latency, non-blocking benefits of the NVLink fabric, ensuring that the model’s massive size does not result in communication bottlenecks.

Native NVFP4 Quantization

One of the most significant technical advancements in this release is the support for NVFP4, a quantization format native to the Blackwell architecture.

For Mistral Large 3, developers can deploy a compute-optimized NVFP4 checkpoint quantized offline using the open-source llm-compressor library.

This approach reduces compute and memory costs while strictly maintaining accuracy. It leverages NVFP4’s higher-precision FP8 scaling factors and finer-grained block scaling to control quantization error. The recipe specifically targets the MoE weights while keeping other components at original precision, allowing the model to deploy seamlessly on the GB200 NVL72 with minimal accuracy loss.

Disaggregated Serving with NVIDIA Dynamo

Mistral Large 3 utilizes NVIDIA Dynamo, a low-latency distributed inference framework, to disaggregate the prefill and decode phases of inference.

In traditional setups, the prefill phase (processing the input prompt) and the decode phase (generating the output) compete for resources. By rate-matching and disaggregating these phases, Dynamo significantly boosts performance for long-context workloads, such as 8K input/1K output configurations. This ensures high throughput even when utilizing the model’s massive 256K context window.

From Cloud to Edge: Ministral 3 Performance

The optimization efforts extend beyond the massive data centers. Recognizing the growing need for local AI, the Ministral 3 series is engineered for edge deployment, offering flexibility for a variety of needs.

RTX and Jetson Acceleration

The dense Ministral models are optimized for platforms like the NVIDIA GeForce RTX AI PC and NVIDIA Jetson robotics modules.

RTX 5090: The Ministral-3B variants can reach blistering inference speeds of 385 tokens per second on the NVIDIA RTX 5090 GPU. This brings workstation-class AI performance to local PCs, enabling fast iteration and greater data privacy.

Jetson Thor: For robotics and edge AI, developers can use the vLLM container on NVIDIA Jetson Thor. The Ministral-3-3B-Instruct model achieves 52 tokens per second for single concurrency, scaling up to 273 tokens per second with a concurrency of 8.

Broad Framework Support

NVIDIA has collaborated with the open-source community to ensure these models are usable everywhere.

Llama.cpp & Ollama: NVIDIA collaborated with these popular frameworks to ensure faster iteration and lower latency for local development.

SGLang: NVIDIA collaborated with SGLang to create an implementation of Mistral Large 3 that supports both disaggregation and speculative decoding.

vLLM: NVIDIA worked with vLLM to expand support for kernel integrations, including speculative decoding (EAGLE), Blackwell support, and expanded parallelism.

Production-Ready with NVIDIA NIM

To streamline enterprise adoption, the new models will be available through NVIDIA NIM microservices.

Mistral Large 3 and Ministral-14B-Instruct are currently available through the NVIDIA API catalog and preview API. Soon, enterprise developers will be able to use downloadable NVIDIA NIM microservices. This provides a containerized, production-ready solution that allows enterprises to deploy the Mistral 3 family with minimal setup on any GPU-accelerated infrastructure.

This availability ensures that the specific “10x” performance advantage of the GB200 NVL72 can be realized in production environments without complex custom engineering, democratizing access to frontier-class intelligence.

Conclusion: A New Standard for Open Intelligence

The release of the NVIDIA-accelerated Mistral 3 open model family represents a major leap for AI in the open-source community. By offering frontier-level performance under an open source license, and backing it with a robust hardware optimization stack, Mistral and NVIDIA are meeting developers where they are.

From the massive scale of the GB200 NVL72 utilizing Wide-EP and NVFP4, to the edge-friendly density of Ministral on an RTX 5090, this partnership delivers a scalable, efficient path for artificial intelligence. With upcoming optimizations such as speculative decoding with multi-token prediction (MTP) and EAGLE-3 expected to push performance even further, the Mistral 3 family is poised to become a foundational element of the next generation of AI applications.

Available to test!

If you are a developer looking to benchmark these performance gains, you can download the Mistral 3 models directly from Hugging Face or test the deployment-free hosted versions on build.nvidia.com/mistralai to evaluate the latency and throughput for your specific use case.

Check out the Models on Hugging Face. You can find details on Corporate Blog and Technical/Developer Blog.

Thanks to the NVIDIA AI team for the thought leadership and resources for this article. The NVIDIA AI team has supported this content.
The post NVIDIA and Mistral AI Bring 10x Faster Inference for the Mistral 3 Family on GB200 NVL72 GPU Systems appeared first on MarkTechPost.

How We Learn Step-Level Rewards from Preferences to Solve Sparse-Reward Reinforcement Learning Tasks

In this tutorial, we explore Online Process Reward Learning (OPRL) and demonstrate how we can learn dense, step-level reward signals from trajectory preferences to solve sparse-reward reinforcement learning tasks. We walk through each component, from the maze environment and reward-model network to preference generation, training loops, and evaluation, while observing how the agent gradually improves its behaviour through online preference-driven shaping. By running this end-to-end implementation, we gain a practical understanding of how OPRL enables better credit assignment, faster learning, and more stable policy optimization in challenging environments where the agent would otherwise struggle to discover meaningful rewards. Check out the FULL CODE NOTEBOOK.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
import matplotlib.pyplot as plt
from collections import deque
import random

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

class MazeEnv:
    def __init__(self, size=8):
        self.size = size
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        self.obstacles = set([(i, size // 2) for i in range(1, size - 2)])
        self.reset()

    def reset(self):
        self.pos = self.start
        self.steps = 0
        return self._get_state()

    def _get_state(self):
        state = np.zeros(self.size * self.size)
        state[self.pos[0] * self.size + self.pos[1]] = 1
        return state

    def step(self, action):
        moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]
        new_pos = (self.pos[0] + moves[action][0],
                   self.pos[1] + moves[action][1])
        if (0 <= new_pos[0] < self.size and
                0 <= new_pos[1] < self.size and
                new_pos not in self.obstacles):
            self.pos = new_pos
        self.steps += 1
        done = self.pos == self.goal or self.steps >= 60
        reward = 10.0 if self.pos == self.goal else 0.0
        return self._get_state(), reward, done

    def render(self):
        grid = [['.' for _ in range(self.size)] for _ in range(self.size)]
        for obs in self.obstacles:
            grid[obs[0]][obs[1]] = '█'
        grid[self.goal[0]][self.goal[1]] = 'G'
        grid[self.pos[0]][self.pos[1]] = 'A'
        return '\n'.join([''.join(row) for row in grid])

class ProcessRewardModel(nn.Module):
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Tanh()
        )

    def forward(self, states):
        return self.net(states)

    def trajectory_reward(self, states):
        return self.forward(states).sum()

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU()
        )
        self.actor = nn.Linear(hidden, action_dim)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, state):
        features = self.backbone(state)
        return self.actor(features), self.critic(features)

We set up the entire foundation of our OPRL system by importing libraries, defining the maze environment, and building the reward and policy networks. We establish how states are represented, how obstacles block movement, and how the sparse reward structure works. We also design the core neural models that will later learn process rewards and drive the policy’s decisions. Check out the FULL CODE NOTEBOOK.

class OPRLAgent:
    def __init__(self, state_dim, action_dim, lr=3e-4):
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.reward_model = ProcessRewardModel(state_dim)
        self.policy_opt = Adam(self.policy.parameters(), lr=lr)
        self.reward_opt = Adam(self.reward_model.parameters(), lr=lr)
        self.trajectories = deque(maxlen=200)
        self.preferences = deque(maxlen=500)
        self.action_dim = action_dim

    def select_action(self, state, epsilon=0.1):
        if random.random() < epsilon:
            return random.randint(0, self.action_dim - 1)
        state_t = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():
            logits, _ = self.policy(state_t)
            probs = F.softmax(logits, dim=-1)
        return torch.multinomial(probs, 1).item()

    def collect_trajectory(self, env, epsilon=0.1):
        states, actions, rewards = [], [], []
        state = env.reset()
        done = False
        while not done:
            action = self.select_action(state, epsilon)
            next_state, reward, done = env.step(action)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            state = next_state
        traj = {
            'states': torch.FloatTensor(np.array(states)),
            'actions': torch.LongTensor(actions),
            'rewards': torch.FloatTensor(rewards),
            'return': float(sum(rewards))
        }
        self.trajectories.append(traj)
        return traj

We begin constructing the OPRL agent by implementing action selection and trajectory collection. We use an ε-greedy strategy to ensure exploration and gather sequences of states, actions, and returns. As we run the agent through the maze, we store entire trajectories that will later serve as preference data for shaping the reward model. Check out the FULL CODE NOTEBOOK.

    def generate_preference(self):
        if len(self.trajectories) < 2:
            return
        t1, t2 = random.sample(list(self.trajectories), 2)
        label = 1.0 if t1['return'] > t2['return'] else 0.0
        self.preferences.append({'t1': t1, 't2': t2, 'label': label})

    def train_reward_model(self, n_updates=5):
        if len(self.preferences) < 32:
            return 0.0
        total_loss = 0.0
        for _ in range(n_updates):
            batch = random.sample(list(self.preferences), 32)
            loss = 0.0
            for item in batch:
                r1 = self.reward_model.trajectory_reward(item['t1']['states'])
                r2 = self.reward_model.trajectory_reward(item['t2']['states'])
                logit = r1 - r2
                pred_prob = torch.sigmoid(logit)
                label = item['label']
                loss += -(label * torch.log(pred_prob + 1e-8) +
                          (1 - label) * torch.log(1 - pred_prob + 1e-8))
            loss = loss / len(batch)
            self.reward_opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.reward_model.parameters(), 1.0)
            self.reward_opt.step()
            total_loss += loss.item()
        return total_loss / n_updates

We generate preference pairs from collected trajectories and train the process reward model using the Bradley–Terry formulation. We compare trajectory-level scores, compute probabilities, and update the reward model to reflect which behaviours appear better. This allows us to learn dense, differentiable, step-level rewards that guide the agent even when the environment itself is sparse. Check out the FULL CODE NOTEBOOK.

    def train_policy(self, n_updates=3, gamma=0.98):
        if len(self.trajectories) < 5:
            return 0.0
        total_loss = 0.0
        for _ in range(n_updates):
            traj = random.choice(list(self.trajectories))
            with torch.no_grad():
                process_rewards = self.reward_model(traj['states']).squeeze()
            shaped_rewards = traj['rewards'] + 0.1 * process_rewards
            returns = []
            G = 0
            for r in reversed(shaped_rewards.tolist()):
                G = r + gamma * G
                returns.insert(0, G)
            returns = torch.FloatTensor(returns)
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
            logits, values = self.policy(traj['states'])
            log_probs = F.log_softmax(logits, dim=-1)
            action_log_probs = log_probs.gather(1, traj['actions'].unsqueeze(1))
            advantages = returns - values.squeeze().detach()
            policy_loss = -(action_log_probs.squeeze() * advantages).mean()
            value_loss = F.mse_loss(values.squeeze(), returns)
            entropy = -(F.softmax(logits, dim=-1) * log_probs).sum(-1).mean()
            loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
            self.policy_opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 1.0)
            self.policy_opt.step()
            total_loss += loss.item()
        return total_loss / n_updates

def train_oprl(episodes=500, render_interval=100):
    env = MazeEnv(size=8)
    agent = OPRLAgent(state_dim=64, action_dim=4, lr=3e-4)
    returns, reward_losses, policy_losses = [], [], []
    success_rate = []
    for ep in range(episodes):
        epsilon = max(0.05, 0.5 - ep / 1000)
        traj = agent.collect_trajectory(env, epsilon)
        returns.append(traj['return'])
        if ep % 2 == 0 and ep > 10:
            agent.generate_preference()
        if ep > 20 and ep % 2 == 0:
            rew_loss = agent.train_reward_model(n_updates=3)
            reward_losses.append(rew_loss)
        if ep > 10:
            pol_loss = agent.train_policy(n_updates=2)
            policy_losses.append(pol_loss)
        success = 1 if traj['return'] > 5 else 0
        success_rate.append(success)
        if ep % render_interval == 0 and ep > 0:
            test_env = MazeEnv(size=8)
            agent.collect_trajectory(test_env, epsilon=0)
            print(test_env.render())
    return returns, reward_losses, policy_losses, success_rate

We train the policy using shaped rewards produced by the learned process reward model. We compute returns, advantages, value estimates, and entropy bonuses, enabling the agent to improve its strategy over time. We then build a full training loop in which exploration decays, preferences accumulate, and both the reward model and the policy are updated continuously. Check out the FULL CODE NOTEBOOK.

print("Training OPRL Agent on Sparse Reward Maze...\n")
returns, rew_losses, pol_losses, success = train_oprl(episodes=500, render_interval=250)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0,0].plot(returns, alpha=0.3)
axes[0,0].plot(np.convolve(returns, np.ones(20)/20, mode='valid'), linewidth=2)
axes[0,0].set_xlabel('Episode')
axes[0,0].set_ylabel('Return')
axes[0,0].set_title('Agent Performance')
axes[0,0].grid(alpha=0.3)

success_smooth = np.convolve(success, np.ones(20)/20, mode='valid')
axes[0,1].plot(success_smooth, linewidth=2, color='green')
axes[0,1].set_xlabel('Episode')
axes[0,1].set_ylabel('Success Rate')
axes[0,1].set_title('Goal Success Rate')
axes[0,1].grid(alpha=0.3)

axes[1,0].plot(rew_losses, linewidth=2, color='orange')
axes[1,0].set_xlabel('Update Step')
axes[1,0].set_ylabel('Loss')
axes[1,0].set_title('Reward Model Loss')
axes[1,0].grid(alpha=0.3)

axes[1,1].plot(pol_losses, linewidth=2, color='red')
axes[1,1].set_xlabel('Update Step')
axes[1,1].set_ylabel('Loss')
axes[1,1].set_title('Policy Loss')
axes[1,1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("OPRL Training Complete!")
print("Process rewards, preference learning, reward shaping, and online updates demonstrated.")

We visualize the learning dynamics by plotting returns, success rates, reward-model loss, and policy loss. We monitor how the agent’s performance evolves as OPRL shapes the reward landscape. By the end of the visualization, we clearly see the impact of process rewards on solving a challenging, sparse-reward maze.

In conclusion, we see how OPRL transforms sparse terminal outcomes into rich online feedback that continuously guides the agent’s behaviour. We watch the process reward model learn preferences, shape the return signal, and accelerate the policy’s ability to reach the goal. With larger mazes, varying shaping strengths, or even real human preference feedback, we appreciate how OPRL provides a flexible and powerful framework for credit assignment in complex decision-making tasks. We finish with a clear, hands-on understanding of how OPRL operates and how we can extend it to more advanced agentic RL settings.

Check out the FULL CODE NOTEBOOK and Paper.
The post How We Learn Step-Level Rewards from Preferences to Solve Sparse-Reward Environments Using Online Process Reward Learning appeared first on MarkTechPost.

Google DeepMind Researchers Introduce Evo-Memory Benchmark and ReMem F …

Large language model agents are starting to store everything they see, but can they actually improve their policies at test time from those experiences rather than just replaying context windows?

Researchers from University of Illinois Urbana Champaign and Google DeepMind propose Evo-Memory, a streaming benchmark and agent framework that targets this exact gap. Evo-Memory evaluates test-time learning with self-evolving memory, asking whether agents can accumulate and reuse strategies from continuous task streams instead of relying only on static conversational logs.

https://arxiv.org/pdf/2511.20857

Conversational Recall vs Experience Reuse

Most current agents implement conversational recall. They store dialogue history, tool traces, and retrieved documents, which are then reintegrated into the context window for future queries. This type of memory serves as a passive buffer, capable of recovering facts or recalling previous steps, but it does not actively modify the agent’s approach for related tasks.

Evo-Memory instead focuses on experience reuse. Here each interaction is treated as an experience that encodes not only inputs and outputs, but also whether a task succeeded and which strategies were effective. The benchmark checks if agents can retrieve those experiences in later tasks, apply them as reusable procedures, and refine the memory over time.

Benchmark Design and Task Streams

The research team formalizes a memory augmented agent as a tuple (F, U, R, C). The base model F generates outputs. The retrieval module R searches a memory store. The context constructor C synthesizes a working prompt from the current input and retrieved items. The update function U writes new experience entries and evolves the memory after every step.

Evo-Memory restructures conventional benchmarks into sequential task streams. Each dataset becomes an ordered sequence of tasks where early items carry strategies that are useful for later ones. The suite covers AIME 24, AIME 25, GPQA Diamond, MMLU-Pro economics, engineering, philosophy, and ToolBench for tool use, along with multi turn environments from AgentBoard including AlfWorld, BabyAI, ScienceWorld, Jericho, and PDDL planning.

Evaluation is done along four axes. Single turn tasks use exact match or answer accuracy. Embodied environments report success rate and progress rate. Step efficiency measures average steps per successful task. Sequence robustness tests whether performance is stable when task order changes.

https://arxiv.org/pdf/2511.20857

ExpRAG, a Minimal Experience Reuse Baseline

To set a lower bound, the research team defines ExpRAG. Each interaction becomes a structured experience text with the template ⟨x_i, ŷ_i, f_i⟩, where x_i is the input, ŷ_i is the model output and f_i is feedback, for example a correctness signal. At a new step t, the agent retrieves similar experiences from memory using a similarity score and concatenates them with the current input as in-context examples. Then it appends the new experience into memory.

ExpRAG does not change the agent control loop. It is still a single shot call to the backbone, but now augmented with explicitly stored prior tasks. The design is intentionally simple so that any gains on Evo-Memory can be attributed to task level experience retrieval, not to new planning or tool abstractions.
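The loop can be pictured with a short sketch. This is a simplified illustration of the ExpRAG idea, not the paper's code; llm, embed and grade stand in for a caller-supplied backbone call, embedding function and feedback signal.

import numpy as np
from dataclasses import dataclass

@dataclass
class Experience:
    x: str       # task input
    y_hat: str   # model output
    f: str       # feedback, e.g. "correct" or "incorrect"

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def exprag_step(x_t: str, memory: list, llm, embed, grade, top_k: int = 3) -> str:
    # retrieve the most similar stored experiences as in-context exemplars
    scored = sorted(memory, key=lambda e: -cosine(embed(e.x), embed(x_t)))[:top_k]
    exemplars = "\n\n".join(f"Task: {e.x}\nAnswer: {e.y_hat}\nFeedback: {e.f}" for e in scored)
    # single shot call to the backbone, augmented with prior tasks
    y_hat = llm(f"{exemplars}\n\nTask: {x_t}\nAnswer:")
    # write the new experience back into memory
    memory.append(Experience(x_t, y_hat, grade(x_t, y_hat)))
    return y_hat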

ReMem, Action Think Memory Refine

The main contribution on the agent side is ReMem, an action–think–memory refine pipeline built on top of the same backbone models. At each internal step, given the current input, memory state and past reasoning traces, the agent chooses one of three operations:

Think generates intermediate reasoning traces that decompose the task.

Act emits an environment action or final answer visible to the user.

Refine performs meta reasoning on memory by retrieving, pruning and reorganizing experience entries.

This loop induces a Markov decision process where the state includes the query, current memory and ongoing thoughts. Within a step the agent can interleave several Think and Refine operations, and the step terminates when an Act operation is issued. In contrast to standard ReAct style agents, memory is no longer a fixed buffer. It becomes an explicit object that the agent reasons about and edits during inference.
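A minimal sketch of that control cycle, assuming the backbone is wrapped in a policy callable that returns an operation name and a payload; this is an illustration of the loop structure, not the released ReMem code.

def remem_step(query, memory, policy, act_fn, max_inner=8):
    # One ReMem step: interleave Think and Refine until an Act operation is issued.
    thoughts = []
    for _ in range(max_inner):
        op, payload = policy(query, memory, thoughts)
        if op == "think":            # intermediate reasoning trace that decomposes the task
            thoughts.append(payload)
        elif op == "refine":         # meta reasoning on memory: retrieve, prune, reorganize entries
            memory = payload(memory)
        elif op == "act":            # environment action or final answer ends the step
            return act_fn(payload), memory
    return None, memory              # inner step budget exhausted without an Act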

https://arxiv.org/pdf/2511.20857

Results on Reasoning, Tools and Embodied Environments

The research team instantiates all methods on Gemini 2.5 Flash and Claude 3.7 Sonnet under a unified search-predict-evolve protocol. This isolates the effect of memory architecture, since prompting, search and feedback are held constant across baselines.

On single turn benchmarks, evolving memory methods produce consistent but moderate gains. For Gemini 2.5 Flash, ReMem reaches an average exact match of 0.65 across AIME 24, AIME 25, GPQA Diamond and MMLU Pro subsets, and 0.85 API score and 0.71 accuracy on ToolBench. ExpRAG also performs strongly, with an average of 0.60, and outperforms several more complex designs such as Agent Workflow Memory and Dynamic Cheatsheet variants.

The impact is larger in multi turn environments. On Claude 3.7 Sonnet, ReMem reaches success and progress 0.92 and 0.96 on AlfWorld, 0.73 and 0.83 on BabyAI, 0.83 and 0.95 on PDDL and 0.62 and 0.89 on ScienceWorld, giving average 0.78 success and 0.91 progress across datasets. On Gemini 2.5 Flash, ReMem achieves average 0.50 success and 0.64 progress, improving over history and ReAct style baselines in all four environments.

Step efficiency is also improved. In AlfWorld, average steps to complete a task drop from 22.6 for a history baseline to 11.5 for ReMem. Lightweight designs such as ExpRecent and ExpRAG reduce steps as well, which indicates that even simple task level experience reuse can make behaviour more efficient without architectural changes to the backbone.

A further analysis links gains to task similarity inside each dataset. Using embeddings from the retriever encoder, the research team compute average distance from tasks to their cluster center. ReMem’s margin over a history baseline correlates strongly with this similarity measure, with reported Pearson correlation about 0.72 on Gemini 2.5 Flash and 0.56 on Claude 3.7 Sonnet. Structured domains such as PDDL and AlfWorld show larger improvements than diverse sets like AIME 25 or GPQA Diamond.

Key Takeaways

Evo-Memory is a comprehensive streaming benchmark that converts standard datasets into ordered task streams, so agents can retrieve, integrate and update memory over time rather than rely on static conversational recall.

The framework formalizes memory augmented agents as a tuple (F, U, R, C) and implements more than 10 representative memory modules, including retrieval based, workflow and hierarchical memories, evaluated on 10 single turn and multi turn datasets across reasoning, question answering, tool use and embodied environments.

ExpRAG provides a minimal experience reuse baseline that stores each task interaction as a structured text record with input, model output and feedback, then retrieves similar experiences as in context exemplars for new tasks, already giving consistent improvements over pure history based baselines.

ReMem extends the standard ReAct style loop with an explicit Think, Act, Refine Memory control cycle, which lets the agent actively retrieve, prune and reorganize its memory during inference, leading to higher accuracy, higher success rate and fewer steps on both single turn reasoning and long horizon interactive environments.

Across Gemini 2.5 Flash and Claude 3.7 Sonnet backbones, self evolving memories such as ExpRAG and especially ReMem make smaller models behave like stronger agents at test time, improving exact match, success and progress metrics without any retraining of base model weights.

Editorial Notes

Evo-Memory is a useful step for evaluating self evolving memory in LLM agents. It forces models to operate on sequential task streams instead of isolated prompts. It compares more than 10 memory architectures under a single framework. Simple methods like ExpRAG already show clear gains. ReMem's action, think, refine memory loop improves exact match, success and progress without retraining base weights. Overall, this research work makes test time evolution a concrete design target for LLM agent systems.

Check out the Paper.
The post Google DeepMind Researchers Introduce Evo-Memory Benchmark and ReMem Framework for Experience Reuse in LLM Agents appeared first on MarkTechPost.

DeepSeek Researchers Introduce DeepSeek-V3.2 and DeepSeek-V3.2-Special …

How do you get GPT-5-level reasoning on real long-context, tool-using workloads without paying the quadratic attention and GPU cost that usually makes those systems impractical? DeepSeek research introduces DeepSeek-V3.2 and DeepSeek-V3.2-Speciale. They are reasoning-first models built for agents that target high quality reasoning, long context and agent workflows, with open weights and production APIs. The models combine DeepSeek Sparse Attention (DSA), a scaled GRPO reinforcement learning stack and an agent native tool protocol, and report performance comparable to GPT-5, with DeepSeek-V3.2-Speciale reaching Gemini 3.0 Pro level reasoning on public benchmarks and competitions.

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Sparse Attention with Near Linear Long Context Cost

Both DeepSeek-V3.2 and DeepSeek-V3.2-Speciale use the DeepSeek-V3 Mixture of Experts transformer with about 671B total parameters and 37B active parameters per token, inherited from V3.1 Terminus. The only structural change is DeepSeek Sparse Attention, introduced through continued pre-training.

DeepSeek Sparse Attention splits attention into two components. A lightning indexer runs a small number of low precision heads over all token pairs and produces relevance scores. A fine grained selector keeps the top-k key-value positions per query, and the main attention path runs Multi-Query Attention and Multi-Head Latent Attention on this sparse set.

This changes the dominant complexity from O(L²) to O(kL), where L is the sequence length and k is the number of selected tokens, with k much smaller than L. Based on the benchmarks, DeepSeek-V3.2 matches the dense Terminus baseline on accuracy while reducing long context inference cost by about 50 percent, with faster throughput and lower memory use on H800 class hardware and on vLLM and SGLang backends.
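The selection step can be sketched in a few lines. This simplified single-head version ignores causal masking and the MQA and MLA details, and assumes the lightning indexer scores are already available.

import torch
import torch.nn.functional as F

def sparse_attention_topk(q, k, v, indexer_scores, top_k=2048):
    # q, k, v: (L, d); indexer_scores: (L, L) cheap relevance scores from the lightning indexer.
    # Keeping only top_k key/value positions per query makes the cost scale as O(k * L).
    L, d = q.shape
    top_k = min(top_k, L)
    idx = indexer_scores.topk(top_k, dim=-1).indices             # (L, k) selected key positions per query
    k_sel, v_sel = k[idx], v[idx]                                # (L, k, d) gathered keys and values
    attn = torch.einsum("ld,lkd->lk", q, k_sel) / d ** 0.5       # scores computed on selected keys only
    w = F.softmax(attn, dim=-1)
    return torch.einsum("lk,lkd->ld", w, v_sel)                  # (L, d) sparse attention output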

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Continued Pre Training for DeepSeek Sparse Attention

DeepSeek Sparse Attention (DSA) is introduced by continued pre-training on top of DeepSeek-V3.1 Terminus. In the dense warm up stage, dense attention remains active, all backbone parameters are frozen and only the lightning indexer is trained with a Kullback Leibler loss to match the dense attention distribution on 128K context sequences. This stage uses a small number of steps and about 2B tokens, enough for the indexer to learn useful scores.

In the sparse stage, the selector keeps 2048 key-value entries per query, the backbone is unfrozen and the model continues training on about 944B tokens. Gradients for the indexer still come only from the alignment loss with dense attention on the selected positions. This schedule makes DeepSeek Sparse Attention (DSA) behave as a drop in replacement for dense attention with similar quality and lower long context cost.
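A minimal sketch of the warm up objective, assuming the dense attention probabilities from the frozen backbone are available as a target distribution:

import torch
import torch.nn.functional as F

def indexer_alignment_loss(indexer_scores: torch.Tensor, dense_attn_probs: torch.Tensor) -> torch.Tensor:
    # indexer_scores: (L, L) raw relevance scores from the lightning indexer
    # dense_attn_probs: (L, L) attention distribution from the frozen dense model
    log_q = F.log_softmax(indexer_scores, dim=-1)
    # KL(dense || indexer): only the indexer receives gradients during the warm up stage
    return F.kl_div(log_q, dense_attn_probs, reduction="batchmean")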

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

GRPO with more than 10 Percent RL Compute

On top of the sparse architecture, DeepSeek-V3.2 uses Group Relative Policy Optimization (GRPO) as the main reinforcement learning method. The research team states that post-training reinforcement learning (RL) compute exceeds 10 percent of pre-training compute.

RL is organized around specialist domains. The research team trains dedicated runs for mathematics, competitive programming, general logical reasoning, browsing and agent tasks and safety, then distills these specialists into the shared 685B parameter base for DeepSeek-V3.2 and DeepSeek-V3.2-Speciale. GRPO is implemented with an unbiased KL estimator, off policy sequence masking and mechanisms that keep Mixture of Experts (MoE) routing and sampling masks consistent between training and sampling.

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Agent Data, Thinking Mode and Tool Protocol

DeepSeek research team builds a large synthetic agent dataset by generating more than 1,800 environments and more than 85,000 tasks across code agents, search agents, general tools and code interpreter setups. Tasks are constructed to be hard to solve and easy to verify, and are used as RL targets together with real coding and search traces.

At inference time, DeepSeek-V3.2 introduces explicit thinking and non thinking modes. The deepseek-reasoner endpoint exposes thinking mode by default, where the model produces an internal chain of thought before the final answer. The thinking with tools guide describes how reasoning content is kept across tool calls and cleared when a new user message arrives, and how tool calls and tool results stay in the context even when reasoning text is trimmed for budget.

The chat template is updated around this behavior. The DeepSeek-V3.2 Speciale repository ships Python encoder and decoder helpers instead of a Jinja template. Messages can carry a reasoning_content field alongside content, controlled by a thinking parameter. A developer role is reserved for search agents and is not accepted in general chat flows by the official API, which protects this channel from accidental misuse.
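A hypothetical request payload that illustrates the fields described above; the message contents, tool name and exact endpoint arguments are placeholders, so check the official API reference before relying on them.

messages = [
    {"role": "user", "content": "Summarize this log file and flag anomalies."},
    {
        "role": "assistant",
        "reasoning_content": "The log shows repeated restarts, inspect timestamps first...",  # internal reasoning
        "content": None,
        "tool_calls": [{"name": "read_file", "arguments": {"path": "app.log"}}],              # hypothetical tool
    },
    {"role": "tool", "content": "2025-01-01 03:12 service restarted ..."},  # tool results stay in context
]
request = {"model": "deepseek-reasoner", "messages": messages, "thinking": True}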

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Benchmarks, Competitions And Open Artifacts

On standard reasoning and coding benchmarks, DeepSeek-V3.2 and especially DeepSeek-V3.2 Speciale are reported as comparable to GPT-5 and close to Gemini-3.0 Pro on suites such as AIME 2025, HMMT 2025, GPQA and LiveCodeBench, with improved cost efficiency on long context workloads.

For formal competitions, DeepSeek research team states that DeepSeek-V3.2 Speciale achieves gold medal level performance on the International Mathematical Olympiad 2025, the Chinese Mathematical Olympiad 2025 and the International Olympiad in Informatics 2025, and competitive gold medal level performance at the ICPC World Finals 2025.

Key Takeaways

DeepSeek-V3.2 adds DeepSeek Sparse Attention, which brings near linear O(kL) attention cost and delivers around 50% lower long context API cost compared to previous dense DeepSeek models, while keeping quality similar to DeepSeek-V3.1 Terminus.

The model family keeps the 671B parameter MoE backbone with 37B active parameters per token and exposes a full 128K context window in production APIs, which makes long documents, multi step chains and large tool traces practical rather than a lab only feature.

Post training uses Group Relative Policy Optimization (GRPO) with a compute budget that is more than 10 percent of pre-training, focused on math, code, general reasoning, browsing or agent workloads and safety, along with contest style specialists whose cases are released for external verification.

DeepSeek-V3.2 is the first model in the DeepSeek family to integrate thinking directly into tool use, supporting both thinking and non thinking tool modes and a protocol where internal reasoning persists across tool calls and is reset only on new user messages.

Check out the Paper and Model weights.
The post DeepSeek Researchers Introduce DeepSeek-V3.2 and DeepSeek-V3.2-Speciale for Long Context Reasoning and Agentic Workloads appeared first on MarkTechPost.

MiniMax-M2: Technical Deep Dive into Interleaved Thinking for Agentic …

The AI coding landscape just got a massive shake-up. If you’ve been relying on Claude 3.5 Sonnet or GPT-4o for your dev workflows, you know the pain: great performance often comes with a bill that makes your wallet weep, or latency that breaks your flow. This article provides a technical overview of MiniMax-M2, focusing on its core design choices and capabilities, and how it changes the price to performance baseline for agentic coding workflows.

Branded as ‘Mini Price, Max Performance,’ MiniMax-M2 targets agentic coding workloads with around 2x the speed of leading competitors at roughly 8% of their price. The key change is not only cost efficiency, but a different computational and reasoning pattern in how the model structures and executes its “thinking” during complex tool and code workflows.

The Secret Sauce: Interleaved Thinking

The standout feature of MiniMax-M2 is its native mastery of Interleaved Thinking. 

But what does that actually mean?

Most LLMs operate in a linear “Chain of Thought” (CoT) where they do all their planning upfront and then fire off a series of tool calls (like running code or searching the web). The problem? If the first tool call returns unexpected data, the initial plan becomes stale, leading to “state drift” where the model keeps hallucinating a path that no longer exists.

Interleaved Thinking changes the game by creating a dynamic Plan -> Act -> Reflect loop.

Instead of front-loading all the logic, MiniMax-M2 alternates between explicit reasoning and tool use. It reasons, executes a tool, reads the output, and then reasons again based on that fresh evidence. This allows the model to:

Self-Correct: If a shell command fails, it reads the error and adjusts its next move immediately.

Preserve State: It carries forward hypotheses and constraints between steps, preventing the “memory loss” common in long coding tasks.

Handle Long Horizons: This approach is critical for complex agentic workflows (like building an entire app feature) where the path isn’t clear from step one.

Benchmarks show the impact is real: enabling Interleaved Thinking boosted MiniMax-M2’s score on SWE-Bench Verified by over 3% and on BrowseComp by a massive 40%.
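The pattern itself is easy to sketch. The loop below is a generic illustration of interleaved reasoning and tool use, with llm and tools as caller-supplied stand-ins rather than the MiniMax-M2 API.

def plan_act_reflect(task: str, llm, tools: dict, max_steps: int = 20):
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = llm(context)  # assumed to return a dict with "thought" plus either "final_answer" or a tool call
        context.append({"role": "assistant", "content": step["thought"]})
        if step.get("final_answer") is not None:
            return step["final_answer"]
        result = tools[step["tool"]](**step.get("args", {}))       # execute one tool, e.g. shell or search
        context.append({"role": "tool", "content": str(result)})   # reflect on the real output next turn
    return None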

Powered by Mixture of Experts MoE: Speed Meets Smarts

How does MiniMax-M2 achieve low latency while being smart enough to replace a senior dev? The answer lies in its Mixture of Experts (MoE) architecture.

MiniMax-M2 is a massive model with 230 billion total parameters, but it utilizes a “sparse” activation technique. For any given token generation, it only activates 10 billion parameters.

This design delivers the best of both worlds:

Huge Knowledge Base: You get the deep world knowledge and reasoning capacity of a 200B+ model.

Blazing Speed: Inference runs with the lightness of a 10B model, enabling high throughput and low latency.

For interactive agents like Claude Code, Cursor, or Cline, this speed is non-negotiable. You need the model to think, code, and debug in real-time without the “thinking…” spinner of death.

Agent & Code Native

MiniMax-M2 wasn’t just trained on text; it was developed for end-to-end developer workflows. It excels at handling robust toolchains including MCP (Model Context Protocol), shell execution, browser retrieval, and complex codebases.

It is already being integrated into the heavy hitters of the AI coding world:

Claude Code

Cursor

Cline

Kilo Code

Droid

The Economics: 90% Cheaper than the Competition

The pricing structure is perhaps the most aggressive we’ve seen for a model of this caliber. MiniMax is practically giving away “intelligence” compared to the current market leaders.

API Pricing (vs Claude 3.5 Sonnet):

Input Tokens: $0.3 / Million (10% of Sonnet’s cost)

Cache Hits: $0.03 / Million (10% of Sonnet’s cost)

Output Tokens: $1.2 / Million (8% of Sonnet’s cost)

For individual developers, they offer tiered Coding Plans that undercut the market significantly:

Starter: $10/month (Includes a $2 first-month promo).

Pro: $20/month.

Max: $50/month (Up to 5x the usage limit of Claude Code Max).

As if that were not enough, MiniMax recently launched a Global Developer Ambassador Program, a global initiative designed to empower independent ML and LLM developers. The program invites builders to collaborate directly with the MiniMax R&D team to shape the future.

The company is seeking developers with proven open-source experience who are already familiar with MiniMax models and active on platforms like GitHub and Hugging Face.

Key Program Highlights:

The Incentives: Ambassadors receive complimentary access to the MiniMax-M2 Max Coding Plan, early access to unreleased video and audio models, direct feedback channels with product leads, and potential full-time career opportunities.

The Role: Participants are expected to build public demos, create open-source tools, and provide critical feedback on APIs before public launches.

You can sign up here.

Editorial Notes

MiniMax-M2 challenges the idea that “smarter” must mean “slower” or “more expensive.” By leveraging MOE efficiency and Interleaved Thinking, it offers a compelling alternative for developers who want to run autonomous agents without bankrupting their API budget.

As we move toward a world where AI agents don’t just write code but architect entire systems, the ability to “think, act, and reflect” continuously, at a price that allows for thousands of iterations, might just make M2 the new standard for AI engineering.

Thanks to the MiniMax AI team for the thought leadership and resources for this article. The MiniMax AI team has supported this content.
The post MiniMax-M2: Technical Deep Dive into Interleaved Thinking for Agentic Coding Workflows appeared first on MarkTechPost.

How to Design an Advanced Multi-Page Interactive Analytics Dashboard w …

In this tutorial, we build an advanced multi-page interactive dashboard using Panel. Through each component of implementation, we explore how to generate synthetic data, apply rich filters, visualize dynamic time-series trends, compare segments and regions, and even simulate live KPI updates. We design the system step by step so we can truly understand how each widget, callback, and plotting function comes together to create a smooth, reactive analytics experience. Check out the Full Codes here.

import sys, subprocess

def install_deps():
    pkgs = ["panel", "hvplot", "pandas", "numpy", "bokeh"]
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + pkgs)

try:
    import panel as pn
    import hvplot.pandas
    import pandas as pd
    import numpy as np
except ImportError:
    install_deps()
    import panel as pn
    import hvplot.pandas
    import pandas as pd
    import numpy as np

pn.extension()

rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", periods=365, freq="D")
segments = ["A", "B", "C"]
regions = ["North", "South", "East", "West"]

base = pd.DataFrame(
    {
        "date": np.tile(dates, len(segments) * len(regions)),
        "segment": np.repeat(segments, len(dates) * len(regions)),
        "region": np.repeat(np.tile(regions, len(segments)), len(dates)),
    }
)
base["traffic"] = (
    100
    + 40 * np.sin(2 * np.pi * base["date"].dt.dayofyear / 365)
    + rng.normal(0, 15, len(base))
)
trend = {"A": 1.0, "B": 1.5, "C": 2.0}
base["traffic"] *= base["segment"].map(trend)
base["conversions"] = (base["traffic"] * rng.uniform(0.01, 0.05, len(base))).astype(int)
base["revenue"] = base["conversions"] * rng.uniform(20, 60, len(base))
df = base.reset_index(drop=True)

We install all required dependencies and load Panel, hvPlot, Pandas, and NumPy so the dashboard runs smoothly in Colab. We generate a full year of synthetic time-series data across segments and regions, providing a rich dataset for exploration. By the end of this block, we will have a clean, ready-to-use dataframe for all upcoming visualizations. Check out the Full Codes here.

segment_sel = pn.widgets.CheckBoxGroup(name="Segment", value=segments[:2], options=segments, inline=True)
region_sel = pn.widgets.MultiChoice(name="Region", value=["North"], options=regions)
metric_sel = pn.widgets.Select(name="Metric", value="traffic", options=["traffic", "conversions", "revenue"])
date_range = pn.widgets.DateRangeSlider(
    name="Date Range",
    start=df["date"].min(),
    end=df["date"].max(),
    value=(df["date"].min(), df["date"].max()),
)
smooth_slider = pn.widgets.IntSlider(name="Rolling Window (days)", start=1, end=30, value=7)

def filtered_df(segment, region, drange):
    d1, d2 = drange
    mask = (
        df["segment"].isin(segment)
        & df["region"].isin(region or regions)
        & (df["date"] >= d1)
        & (df["date"] <= d2)
    )
    sub = df[mask].copy()
    if sub.empty:
        return df.iloc[:0]
    return sub

@pn.depends(segment_sel, region_sel, metric_sel, smooth_slider, date_range)
def timeseries_plot(segment, region, metric, window, drange):
    data = filtered_df(segment, region, drange)
    if data.empty:
        return pn.pane.Markdown("### No data for current filters")
    grouped = data.sort_values("date").groupby("date")[metric].sum()
    line = grouped.hvplot.line(title=f"{metric.title()} over time", ylabel=metric.title())
    if window > 1:
        smooth = grouped.rolling(window).mean().hvplot.line(line_width=3, alpha=0.6)
        return (line * smooth).opts(legend_position="top_left")
    return line

We build the interactive widgets and the filtering logic that controls the entire dashboard. We wire the time-series plot to the widgets using reactive @pn.depends, letting us change segments, regions, metrics, date ranges, and smoothing windows instantly. With this setup, we can switch perspectives fluidly and see the effects in real time. Check out the Full Codes here.

@pn.depends(segment_sel, region_sel, metric_sel, date_range)
def segment_bar(segment, region, metric, drange):
    data = filtered_df(segment, region, drange)
    if data.empty:
        return pn.pane.Markdown("### No data to aggregate")
    agg = data.groupby("segment")[metric].sum().sort_values(ascending=False)
    return agg.hvplot.bar(title=f"{metric.title()} by Segment", yaxis=None)

@pn.depends(segment_sel, region_sel, metric_sel, date_range)
def region_heatmap(segment, region, metric, drange):
    data = filtered_df(segment, region, drange)
    if data.empty:
        return pn.pane.Markdown("### No data to aggregate")
    pivot = data.pivot_table(index="segment", columns="region", values=metric, aggfunc="sum")
    return pivot.hvplot.heatmap(title=f"{metric.title()} Heatmap", clabel=metric.title())

We construct additional visual layers: a segment-level bar chart and a region-segment heatmap. We let these charts react to the same global filters, so they update automatically whenever we make a selection. This gives us a deeper breakdown of patterns across categories without writing redundant code. Check out the Full Codes here.

kpi_source = df.copy()
kpi_idx = [0]

def compute_kpi(slice_df):
    if slice_df.empty:
        return 0, 0, 0
    total_rev = slice_df["revenue"].sum()
    avg_conv = slice_df["conversions"].mean()
    cr = (slice_df["conversions"].sum() / slice_df["traffic"].sum()) * 100
    return total_rev, avg_conv, cr

kpi_value = pn.indicators.Number(name="Total Revenue (window)", value=0, format="$0,0")
conv_value = pn.indicators.Number(name="Avg Conversions", value=0, format="0.0")
cr_value = pn.indicators.Number(name="Conversion Rate", value=0, format="0.00%")

def update_kpis():
    step = 200
    start = kpi_idx[0]
    end = start + step
    if start >= len(kpi_source):
        kpi_idx[0] = 0
        start, end = 0, step
    window_df = kpi_source.iloc[start:end]
    kpi_idx[0] = end
    total_rev, avg_conv, cr = compute_kpi(window_df)
    kpi_value.value = total_rev
    conv_value.value = avg_conv
    cr_value.value = cr / 100

pn.state.add_periodic_callback(update_kpis, period=1000, start=True)

We simulate a rolling stream of KPIs that update every second, creating a live-dashboard experience. We compute total revenue, average conversions, and conversion rate inside a sliding window and push the values to Panel’s numeric indicators. This lets us observe how metrics evolve continuously, just like a real monitoring system. Check out the Full Codes here.

controls = pn.WidgetBox(
    "### Global Controls",
    segment_sel,
    region_sel,
    metric_sel,
    date_range,
    smooth_slider,
    sizing_mode="stretch_width",
)

page_overview = pn.Column(
    pn.pane.Markdown("## Overview: Filtered Time Series"),
    controls,
    timeseries_plot,
)

page_insights = pn.Column(
    pn.pane.Markdown("## Segment & Region Insights"),
    pn.Row(segment_bar, region_heatmap),
)

page_live = pn.Column(
    pn.pane.Markdown("## Live KPI Window (simulated streaming)"),
    pn.Row(kpi_value, conv_value, cr_value),
)

dashboard = pn.Tabs(
    ("Overview", page_overview),
    ("Insights", page_insights),
    ("Live KPIs", page_live),
)

dashboard

We assemble all components into a clean multi-page layout using Tabs. We organize the dashboard into an overview page, an insights page, and a live-KPI page, making navigation simple and intuitive. With this structure, we get a polished, interactive analytics application ready to run directly in Google Colab.

In conclusion, we see how seamlessly we can combine Panel widgets, hvPlot visualizations, and periodic callbacks to build a powerful analytics dashboard. We appreciate how every module, from filtering logic to bar charts to the live KPI stream, fits together to produce a cohesive multi-page interface that runs effortlessly. We finish with a complete, interactive system that we can extend into real-world reporting, experimentation, or production-grade dashboards.

Check out the Full Codes here.
The post How to Design an Advanced Multi-Page Interactive Analytics Dashboard with Dynamic Filtering, Live KPIs, and Rich Visual Exploration Using Panel appeared first on MarkTechPost.

Meta AI Researchers Introduce Matrix: A Ray Native a Decentralized Fra …

How do you keep synthetic data fresh and diverse for modern AI models without turning a single orchestration pipeline into the bottleneck? Meta AI researchers introduce Matrix, a decentralized framework where both control and data flow are serialized into messages that move through distributed queues. As LLM training increasingly relies on synthetic conversations, tool traces and reasoning chains, most existing systems still depend on a central controller or domain specific setups, which wastes GPU capacity, adds coordination overhead and limits data diversity. Matrix instead uses peer to peer agent scheduling on a Ray cluster and delivers 2 to 15 times higher token throughput on real workloads while maintaining comparable quality.

https://arxiv.org/pdf/2511.21686

From Centralized Controllers to Peer to Peer Agents

Traditional agent frameworks keep workflow state and control logic inside a central orchestrator. Every agent call, tool call and retry goes through that controller. This model is easy to reason about, but it does not scale well when you need tens of thousands of concurrent synthetic dialogues or tool trajectories.

Matrix takes a different approach. It serializes both control flow and data flow into a message object called an orchestrator. The orchestrator holds the task state, including conversation history, intermediate results and routing logic. Stateless agents, implemented as Ray actors, pull an orchestrator from a distributed queue, apply their role specific logic, update the state and then send it directly to the next agent selected by the orchestrator. There is no central scheduler in the inner loop. Each task advances independently at row level, rather than waiting for batch level barriers as in Spark or Ray Data.

This design reduces idle time when different trajectories have very different lengths. It also makes fault handling local to a task. If one orchestrator fails it does not stall a batch.
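The pattern is easy to mock up in a single process. The sketch below illustrates the orchestrator-as-message idea with plain Python queues and role functions standing in for Matrix's distributed Ray queues and stateless Ray actors; it is a conceptual illustration, not the Matrix implementation.

import queue
from dataclasses import dataclass, field

ROLES = ["draft_agent", "review_agent", "sink"]
inboxes = {name: queue.Queue() for name in ROLES}   # stand-ins for distributed per-role queues

@dataclass
class Orchestrator:
    # Serialized task state: control flow and data flow travel together as one message.
    task: str
    history: list = field(default_factory=list)
    step: int = 0

def advance(orch, handlers):
    # Each task advances row by row: pull, apply role logic, hand off to the next agent.
    while orch.step < len(ROLES):
        role = ROLES[orch.step]
        inboxes[role].put(orch)       # enqueue for that role
        orch = inboxes[role].get()    # a stateless worker for this role would pull here
        handlers[role](orch)          # update the serialized state
        orch.step += 1                # the orchestrator, not a central scheduler, picks the next hop
    return orch

handlers = {
    "draft_agent": lambda o: o.history.append(f"draft for: {o.task}"),
    "review_agent": lambda o: o.history.append("review: approved"),
    "sink": lambda o: o.history.append("metrics collected"),
}
done = advance(Orchestrator(task="write a dialogue"), handlers)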

https://arxiv.org/pdf/2511.21686

System Stack and Services

Matrix runs on a Ray cluster that is usually launched on SLURM. Ray provides distributed actors and queues. Ray Serve exposes LLM endpoints behind vLLM and SGLang, and can also route to external APIs such as Azure OpenAI or Gemini through proxy servers.

Tool calls and other complex services run inside Apptainer containers. This isolates the agent runtime from code execution sandboxes, HTTP tools or custom evaluators. Hydra manages configuration for agent roles, orchestrator types, resource allocations and I/O schemas. Grafana integrates with Ray metrics to track queue length, pending tasks, token throughput and GPU utilization in real time.

Matrix also introduces message offloading. When conversation history grows beyond a size threshold, large payloads are stored in Ray’s object store and only object identifiers are kept in the orchestrator. This reduces cluster bandwidth while still allowing agents to reconstruct prompts when needed.
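A minimal sketch of the offloading idea using Ray's object store; the size threshold and field names are assumptions for illustration only.

import ray

ray.init(ignore_reinit_error=True)
SIZE_THRESHOLD = 32_000  # bytes, assumed cutoff for illustration

def maybe_offload(message: dict) -> dict:
    # Replace a large conversation history with an object store reference.
    history = message.get("history") or ""
    if len(history.encode("utf-8")) > SIZE_THRESHOLD:
        message["history_ref"] = ray.put(history)   # payload lives in Ray's object store
        message["history"] = None                   # only a lightweight ObjectRef travels in the message
    return message

def reconstruct_history(message: dict) -> str:
    # An agent dereferences the stored payload only when it needs to rebuild the prompt.
    if message.get("history_ref") is not None:
        return ray.get(message["history_ref"])
    return message.get("history") or ""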

Case Study 1: Collaborative Reasoner

Collaborative Reasoner, also known as Coral, evaluates multi agent dialogue where two LLM agents discuss a question, disagree when needed and reach a final answer. In the original implementation a central controller manages thousands of self collaboration trajectories. Matrix reimplements the same protocol using peer to peer orchestrators and stateless agents.

On 31 A100 nodes, using LLaMA 3.1 8B Instruct, Matrix configures concurrency as 248 GPUs with 50 queries per GPU, so 12,400 concurrent conversations. The Coral baseline runs at its optimal concurrency of 5,000. Under identical hardware, Matrix generates about 2 billion tokens in roughly 4 hours, while Coral produces about 0.62 billion tokens in about 9 hours. That is a 6.8 times increase in token throughput with almost identical agreement correctness around 0.47.

https://arxiv.org/pdf/2511.21686

Case Study 2: NaturalReasoning Web Data Curation

NaturalReasoning constructs a reasoning dataset from large web corpora. Matrix models the pipeline with three agents. A Filter agent uses a smaller classifier model to select English passages that likely contain reasoning. A Score agent uses a larger instruction tuned model to assign quality scores. A Question agent extracts questions, answers and reasoning chains.

On 25 million DCLM web documents, only about 5.45 percent survive all filters, yielding around 1.19 million question answer pairs with associated reasoning steps. Matrix then compares different parallelism strategies on a 500 thousand document subset. The best configuration combines data parallelism and task parallelism, with 20 data partitions and 700 concurrent tasks per partition. This achieves about 1.61 times higher throughput than a setting that only scales task concurrency.

Over the full 25 million document run, Matrix reaches 5,853 tokens per second, compared to 2,778 tokens per second for a Ray Data batch baseline with 14,000 concurrent tasks. That corresponds to a 2.1 times throughput gain that comes purely from peer to peer row level scheduling, not from different models.

https://arxiv.org/pdf/2511.21686

Case Study 3: Tau2-Bench Tool Use Trajectories

Tau2-Bench evaluates conversational agents that must use tools and a database in a customer support setting. Matrix represents this environment with four agents, a user simulator, an assistant, a tool executor and a reward calculator, plus a sink that collects metrics. Tool APIs and reward logic are reused from the Tau2 reference implementation and are wrapped in containers.

On a cluster with 13 H100 nodes and dozens of LLM replicas, Matrix generates 22,800 trajectories in about 1.25 hours. That corresponds to roughly 41,000 tokens per second. The baseline Tau2-agent implementation on a single node, configured with 500 concurrent threads, reaches about 2,654 tokens per second and 1,519 trajectories. Average reward stays almost unchanged across both systems, which confirms that the speedup does not come from cutting corners in the environment. Overall, Matrix delivers about 15.4 times higher token throughput on this benchmark.

https://arxiv.org/pdf/2511.21686

Key Takeaways

Matrix replaces centralized orchestrators with a peer to peer, message driven agent architecture that treats each task as an independent state machine moving through stateless agents.

The framework is built entirely on an open source stack, SLURM, Ray, vLLM, SGLang and Apptainer, and scales to tens of thousands of concurrent multi agent workflows for synthetic data generation, benchmarking and data processing.

Across three case studies, Collaborative Reasoner, NaturalReasoning and Tau2-Bench, Matrix delivers about 2 to 15.4 times higher token throughput than specialized baselines under identical hardware, while maintaining comparable output quality and rewards.

Matrix offloads large conversation histories to Ray’s object store and keeps only lightweight references in messages, which reduces peak network bandwidth and supports high throughput LLM serving with gRPC based model backends.

Editorial Notes

Matrix is a pragmatic systems contribution that takes multi agent synthetic data generation from bespoke scripts to an operational runtime. By encoding control flow and data flow into orchestrators, then pushing execution into stateless P2P agents on Ray, it cleanly separates scheduling, LLM inference and tools. The case studies on Collaborative Reasoner, NaturalReasoning and Tau2-Bench show that careful systems design, not new model architectures, is now the main lever for scaling synthetic data pipelines.

Check out the Paper and Repo.
The post Meta AI Researchers Introduce Matrix: A Ray Native Decentralized Framework for Multi Agent Synthetic Data Generation appeared first on MarkTechPost.

StepFun AI Releases Step-Audio-R1: A New Audio LLM that Finally Benefi …

Why do current audio AI models often perform worse when they generate longer reasoning instead of grounding their decisions in the actual sound? The StepFun research team releases Step-Audio-R1, a new audio LLM designed for test time compute scaling, which addresses this failure mode by showing that the accuracy drop with chain of thought is not an audio limitation but a training and modality grounding problem.

https://arxiv.org/pdf/2511.15848

The Core Problem, Audio Models Reason over Text Surrogates

Most current audio models inherit their reasoning behavior from text training. They learn to reason as if they read transcripts, not as if they listen. The StepFun team calls this Textual Surrogate Reasoning. The model uses imagined words and descriptions instead of acoustic cues such as pitch contour, rhythm, timbre or background noise patterns.

This mismatch explains why longer chain of thought often hurts performance in audio. The model spends more tokens elaborating wrong or modality irrelevant assumptions. Step-Audio-R1 attacks this by forcing the model to justify answers using acoustic evidence. The training pipeline is organized around Modality Grounded Reasoning Distillation, MGRD, which selects and distills reasoning traces that explicitly reference audio features.

Architecture

The architecture stays close to the previous Step Audio systems:

A Qwen2 based audio encoder processes raw waveforms at 25 Hz.

An audio adaptor downsamples the encoder output by a factor of 2, to 12.5 Hz, and aligns frames to the language token stream.

A Qwen2.5 32B decoder consumes the audio features and generates text.

The decoder always produces an explicit reasoning block inside <think> and </think> tags, followed by the final answer. This separation lets training objectives shape the structure and content of reasoning without losing focus on task accuracy. The model is released as a 33B parameter audio text to text model on Hugging Face under Apache 2.0.
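Because the reasoning block is always wrapped in <think> tags, downstream code can separate deliberation from the final answer with a simple parser; the helper below is illustrative and not part of the official release.

import re

def split_reasoning(output: str):
    # Separate the explicit reasoning block from the final answer text.
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = output[match.end():].strip() if match else output.strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>The rising pitch contour and fast tempo suggest excitement.</think> The speaker sounds excited."
)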

https://arxiv.org/pdf/2511.15848

Training Pipeline, from Cold Start to Audio Grounded RL

The pipeline has a supervised cold start stage and a reinforcement learning stage that both mix text and audio tasks.

Cold start uses about 5 million examples, covering 1 billion tokens of text only data and 4 billion tokens from audio paired data. Audio tasks include automatic speech recognition, paralinguistic understanding and audio question text answer style dialogs. A fraction of the audio data carries audio chain of thought traces generated by an earlier model. Text data covers multi turn dialog, knowledge question answering, math and code reasoning. All samples share a format where reasoning is wrapped in <think> tags, even when the reasoning block is initially empty.

Supervised learning trains Step-Audio-R1 to follow this format and to generate useful reasoning for both audio and text. This gives a baseline chain of thought behavior, but it is still biased toward text based reasoning.

Modality Grounded Reasoning Distillation MGRD

MGRD is applied in several iterations. For each round, the research team samples audio questions where the label depends on real acoustic properties. For example, questions about speaker emotion, background events in sound scenes or musical structure. The current model produces multiple reasoning and answer candidates per question. A filter keeps only chains that meet three constraints:

They reference acoustic cues, not just textual descriptions or imagined transcripts.

They are logically coherent as short step by step explanations.

Their final answers are correct according to labels or programmatic checks.

These accepted traces form a distilled audio chain of thought dataset. The model is fine tuned on this dataset together with the original text reasoning data. This is followed by Reinforcement Learning with Verified Rewards (RLVR). For text questions, rewards are based on answer correctness. For audio questions, the reward mixes answer correctness and reasoning format, with a typical weighting of 0.8 for accuracy and 0.2 for reasoning. Training uses PPO with about 16 responses sampled per prompt and supports sequences up to around 10,240 tokens to allow long deliberation.
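The reward mixing for RLVR can be written down directly from the description above; this is a sketch, and the correctness and format checks are placeholders for the verifiers used in training.

def rlvr_reward(is_audio_task: bool, answer_correct: bool, reasoning_well_formed: bool) -> float:
    # Text questions are rewarded on correctness only; audio questions mix
    # answer accuracy (weight 0.8) with reasoning format (weight 0.2).
    accuracy = 1.0 if answer_correct else 0.0
    if not is_audio_task:
        return accuracy
    format_score = 1.0 if reasoning_well_formed else 0.0
    return 0.8 * accuracy + 0.2 * format_score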

https://arxiv.org/pdf/2511.15848

Benchmarks, closing the gap to Gemini 3 Pro

On a combined speech to text benchmark suite that includes Big Bench Audio, Spoken MQA, MMSU, MMAU and Wild Speech, Step-Audio-R1 reaches an average score of about 83.6 percent. Gemini 2.5 Pro reports about 81.5 percent and Gemini 3 Pro reaches about 85.1 percent. On Big Bench Audio alone, Step-Audio-R1 reaches about 98.7 percent, which is higher than both Gemini versions.

For speech to speech reasoning, the Step-Audio-R1 Realtime variant adopts listen while thinking and think while speaking style streaming. On Big Bench Audio speech to speech, it reaches about 96.1 percent reasoning accuracy with first packet latency around 0.92 seconds. This score surpasses GPT based realtime baselines and Gemini 2.5 Flash style native audio dialogs while keeping sub second interaction.

https://arxiv.org/pdf/2511.15848

Ablations, what matters for audio reasoning

The ablation section provides several design signals for engineers:

A reasoning format reward is necessary. Without it, reinforcement learning tends to shorten or remove chain of thought, which lowers audio benchmark scores.

RL data should target medium difficulty problems. Selecting questions where pass at 8 lies in a middle band gives more stable rewards and maintains long reasoning.

Scaling RL audio data without such selection does not help. Quality of prompts and labels matters more than raw size.

The researchers also describe a self cognition correction pipeline that reduces the frequency of answers such as ‘I can only read text and cannot hear audio’ in a model that is trained to process sound. This uses Direct Preference Optimization on curated preference pairs where correct behavior is to acknowledge and use audio input.

Key Takeaways

Step-Audio-R1 is one of the first audio language models that turn longer chain of thought into a consistent accuracy gain for audio tasks, solving the inverted scaling failure seen in previous audio LLMs.

The model explicitly targets Textual Surrogate Reasoning by using Modality Grounded Reasoning Distillation, which filters and distills only those reasoning traces that rely on acoustic cues such as pitch, timbre and rhythm instead of imagined transcripts.

Architecturally, Step-Audio-R1 combines a Qwen2 based audio encoder with an adaptor and a Qwen2.5 32B decoder that always generates <think> reasoning segments before answers, and is released as a 33B audio text to text model under Apache 2.0.

Across comprehensive audio understanding and reasoning benchmarks covering speech, environmental sounds and music, Step-Audio-R1 surpasses Gemini 2.5 Pro and reaches performance comparable to Gemini 3 Pro, while also supporting a realtime variant for low latency speech to speech interaction.

The training recipe combines large scale supervised chain of thought, modality grounded distillation and Reinforcement Learning with Verified Rewards, providing a concrete and reproducible blueprint for building future audio reasoning models that actually benefit from test time compute scaling.

Editorial Notes

Step-Audio-R1 is an important release because it converts chain of thought from a liability into a useful tool for audio reasoning by directly addressing Textual Surrogate Reasoning with Modality Grounded Reasoning Distillation and Reinforcement Learning with Verified Rewards. It shows that test time compute scaling can benefit audio models when reasoning is anchored in acoustic features and delivers benchmark results comparable to Gemini 3 Pro while remaining open and practically usable for engineers. Overall this research work turns extended deliberation in audio LLMs from a consistent failure mode into a controllable and reproducible design pattern.

Check out the Paper, Repo, Project Page and Model Weights.
The post StepFun AI Releases Step-Audio-R1: A New Audio LLM that Finally Benefits from Test Time Compute Scaling appeared first on MarkTechPost.

NVIDIA AI Releases Orchestrator-8B: A Reinforcement Learning Trained C …

How can an AI system learn to pick the right model or tool for each step of a task instead of always relying on one large model for everything? NVIDIA researchers release ToolOrchestra, a novel method for training a small language model to act as the orchestrator, the 'brain' of a heterogeneous tool-use agent.

https://arxiv.org/pdf/2511.21689

From Single Model Agents to an Orchestration Policy

Most current agents follow a simple pattern. A single large model such as GPT-5 receives a prompt that describes available tools, then decides when to call web search or a code interpreter. All high level reasoning still stays inside the same model. ToolOrchestra changes this setup. It trains a dedicated controller model, called Orchestrator-8B, that treats both classic tools and other LLMs as callable components.

A pilot study in the same research shows why naive prompting is not enough. When Qwen3-8B is prompted to route between GPT-5, GPT-5 mini, Qwen3-32B and Qwen2.5-Coder-32B, it delegates 73 percent of cases to GPT-5. When GPT-5 acts as its own orchestrator, it calls GPT-5 or GPT-5 mini in 98 percent of cases. The research team calls these self enhancement and other enhancement biases: the routing policy overuses strong models and ignores cost instructions.

ToolOrchestra instead trains a small orchestrator explicitly for this routing problem, using reinforcement learning over full multi turn trajectories.

What is Orchestrator-8B?

Orchestrator-8B is an 8B parameter decoder only Transformer. It is built by fine tuning Qwen3-8B as an orchestration model and released on Hugging Face.

At inference time, the system runs a multi turn loop that alternates reasoning and tool calls. The rollout has three main steps. First, Orchestrator-8B reads the user instruction and an optional natural language preference description, for example a request to prioritize low latency or to avoid web search. Second, it generates internal chain of thought style reasoning and plans an action. Third, it chooses a tool from the available set and emits a structured tool call in a unified JSON format. The environment executes that call, appends the result as an observation and feeds it back into the next step. The process stops when a termination signal is produced or a maximum of 50 turns is reached.
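
A simplified sketch of this rollout loop is shown below; the `orchestrator.generate` call, the tool registry and the JSON field names are assumptions for illustration, not the released API:

```python
import json

MAX_TURNS = 50  # the rollout stops on a termination signal or after 50 turns

def run_episode(orchestrator, tools, instruction, preference=None):
    """Alternate reasoning and structured tool calls until the task terminates."""
    history = [{"role": "user", "instruction": instruction,
                "preference": preference}]  # optional natural language preference
    for _ in range(MAX_TURNS):
        # Steps 1 and 2: read the history, reason internally, plan the next action.
        step = orchestrator.generate(history)  # hypothetical API
        if step.get("terminate"):
            return step.get("answer"), history
        # Step 3: emit a structured tool call in a unified JSON format.
        call = step["tool_call"]  # e.g. {"name": "web_search", "arguments": {...}}
        observation = tools[call["name"]](**call["arguments"])
        # The environment appends the result as an observation for the next turn.
        history.append({"role": "assistant",
                        "content": step.get("reasoning", ""),
                        "tool_call": json.dumps(call)})
        history.append({"role": "tool", "name": call["name"],
                        "observation": observation})
    return None, history  # turn budget exhausted without a final answer
```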

Tools cover three main groups. Basic tools include Tavily web search, a Python sandbox code interpreter and a local Faiss index built with Qwen3-Embedding-8B. Specialized LLMs include Qwen2.5-Math-72B, Qwen2.5-Math-7B and Qwen2.5-Coder-32B. Generalist LLM tools include GPT-5, GPT-5 mini, Llama 3.3-70B-Instruct and Qwen3-32B. All tools share the same schema with names, natural language descriptions and typed parameter specs.
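
Because classic tools and LLM tools share one schema, adding or swapping a tool does not change the orchestration loop. A sketch of what such schema entries could look like, assuming a JSON function-calling style; the exact field layout used by ToolOrchestra may differ:

```python
# Illustrative unified tool schema: every tool, whether a web search, a code
# interpreter or another LLM, exposes a name, a description and typed parameters.
TOOLS = [
    {
        "name": "tavily_web_search",
        "description": "Search the web and return the top results.",
        "parameters": {"query": {"type": "string"},
                       "max_results": {"type": "integer", "default": 5}},
    },
    {
        "name": "python_sandbox",
        "description": "Execute Python code and return stdout.",
        "parameters": {"code": {"type": "string"}},
    },
    {
        "name": "qwen2_5_math_72b",
        "description": "Specialized math LLM for hard symbolic problems.",
        "parameters": {"prompt": {"type": "string"}},
    },
    {
        "name": "gpt_5",
        "description": "Generalist frontier LLM, highest cost per call.",
        "parameters": {"prompt": {"type": "string"}},
    },
]
```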

End to End Reinforcement Learning with Multi Objective Rewards

ToolOrchestra formulates the whole workflow as a Markov Decision Process. The state contains the conversation history, past tool calls and observations, and user preferences. Actions are the next generated text step, including both reasoning tokens and a structured tool call in the unified schema. After up to 50 steps, the environment computes a scalar reward for the full trajectory.

The reward has three components. Outcome reward is binary and depends on whether the trajectory solves the task. For open-ended answers, GPT-5 is used as a judge to compare the model output with the reference. Efficiency rewards penalize both monetary cost and wall clock latency. Token usage for proprietary and open source tools is mapped to monetary cost using public API and Together AI pricing. Preference reward measures how well tool usage matches a user preference vector that can increase or decrease the weight on cost, latency or specific tools. These components are combined into a single scalar using the preference vector.
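
As a rough sketch, the scalar trajectory reward could be assembled as follows; the weights, signs and scales are placeholders rather than the paper's exact formulation:

```python
def trajectory_reward(solved, cost_usd, latency_min, pref):
    """Combine outcome, efficiency and preference terms into one scalar.

    `pref` is a user preference vector, for example
    {"outcome": 1.0, "cost": 0.5, "latency": 0.2}; the values and scales here
    are illustrative placeholders, not the paper's coefficients.
    """
    outcome = 1.0 if solved else 0.0        # binary; GPT-5 judges open-ended answers
    cost_term = pref["cost"] * cost_usd     # tokens mapped to dollars via API pricing
    latency_term = pref["latency"] * latency_min
    return pref["outcome"] * outcome - cost_term - latency_term
```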

The policy is optimized with Group Relative Policy Optimization (GRPO), a variant of policy gradient reinforcement learning that normalizes rewards within groups of trajectories for the same task. The training process includes filters that drop trajectories with invalid tool call format or weak reward variance to stabilize optimization.
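
The group-relative part of GRPO can be illustrated in a few lines: rewards for a group of rollouts sampled for the same task are turned into advantages by normalizing against the group mean and standard deviation. This is a minimal illustration that omits the clipping and KL terms of the full objective:

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same task."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    # Groups with near-zero reward variance carry little learning signal;
    # during training such trajectories are filtered out before optimization.
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: four rollouts for one task. Above-average rollouts get positive
# advantages, which increases the likelihood of the actions they took.
print(grpo_advantages([0.9, 0.2, 0.4, 0.7]))
```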

https://arxiv.org/pdf/2511.21689

To make this training possible at scale, the research team introduces ToolScale, a synthetic dataset of multi step tool calling tasks. For each domain, an LLM generates a database schema, database entries, domain specific APIs and then diverse user tasks with ground truth sequences of function calls and required intermediate information.

Benchmark results and cost profile

NVIDIA research team evaluates Orchestrator-8B on three challenging benchmarks, Humanity’s Last Exam, FRAMES and τ² Bench. These benchmarks target long horizon reasoning, factuality under retrieval and function calling in a dual control environment.

On Humanity’s Last Exam text only questions, Orchestrator-8B reaches 37.1 percent accuracy. GPT-5 with basic tools reaches 35.1 percent in the same setting. On FRAMES, Orchestrator-8B achieves 76.3 percent versus 74.0 percent for GPT-5 with tools. On τ² Bench, Orchestrator-8B scores 80.2 percent versus 77.7 percent for GPT-5 with basic tools.

https://arxiv.org/pdf/2511.21689

The efficiency gap is larger. In the configuration that uses basic tools plus specialized and generalist LLM tools, Orchestrator-8B has average cost 9.2 cents and latency 8.2 minutes per query, averaged over Humanity’s Last Exam and FRAMES. In the same configuration, GPT-5 costs 30.2 cents and takes 19.8 minutes on average. The model card summarizes this as about 30 percent of the monetary cost and 2.5 times faster for Orchestrator-8B compared to GPT-5.
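
For reference, 9.2 / 30.2 ≈ 0.30 and 19.8 / 8.2 ≈ 2.4, which is the arithmetic behind the roughly 30 percent cost figure and is consistent with the model card's rounded 2.5 times speed claim.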

Tool use analysis supports this picture. Claude Opus 4.1 used as an orchestrator calls GPT-5 most of the time. GPT-5 used as an orchestrator prefers GPT-5 mini. Orchestrator-8B spreads calls more evenly across strong models, cheaper models, search, local retrieval and the code interpreter, and reaches higher accuracy at lower cost for the same turn budget.

https://arxiv.org/pdf/2511.21689

Generalization experiments replace the training time tools with unseen models such as OpenMath Llama-2-70B, DeepSeek-Math-7B-Instruct, Codestral-22B-v0.1, Claude Sonnet-4.1 and Gemma-3-27B. Orchestrator-8B still achieves the best trade off between accuracy, cost and latency among all baselines in this setting. A separate preference aware test set shows that Orchestrator-8B also tracks user tool usage preferences more closely than GPT-5, Claude Opus-4.1 and Qwen3-235B-A22B under the same reward metric.

Key Takeaways

ToolOrchestra trains an 8B parameter orchestration model, Orchestrator-8B, that selects and sequences tools and LLMs to solve multi step agentic tasks using reinforcement learning with outcome, efficiency and preference aware rewards.

Orchestrator-8B is released as an open weight model on Hugging Face. It is designed to coordinate diverse tools such as web search, code execution, retrieval and specialist LLMs through a unified schema.

On Humanity’s Last Exam, Orchestrator-8B reaches 37.1 percent accuracy, surpassing GPT-5 at 35.1 percent, while being about 2.5 times more efficient, and on τ² Bench and FRAMES it outperforms GPT-5 while using roughly 30 percent of the cost.

The framework shows that naive prompting of a frontier LLM as its own router leads to self enhancement bias where it overuses itself or a small set of strong models, while a trained orchestrator learns a more balanced, cost aware routing policy over multiple tools.

Editorial Notes

NVIDIA’s ToolOrchestra is a practical step toward compound AI systems where an 8B orchestration model, Orchestrator-8B, learns an explicit routing policy over tools and LLMs instead of relying on a single frontier model. It shows clear gains on Humanity’s Last Exam, FRAMES and τ² Bench with about 30 percent of the cost and around 2.5 times better efficiency than GPT-5 based baselines, which makes it directly relevant for teams that care about accuracy, latency and budget. This launch makes orchestration policy a first class optimization target in AI systems.

Check out the Paper, Repo, Project Page and Model Weights.
The post NVIDIA AI Releases Orchestrator-8B: A Reinforcement Learning Trained Controller for Efficient Tool and Model Selection appeared first on MarkTechPost.