Google DeepMind Researchers Introduce Evo-Memory Benchmark and ReMem Framework for Experience Reuse in LLM Agents

Large language model agents are starting to store everything they see, but can they actually improve their policies at test time from those experiences rather than just replaying context windows?

Researchers from the University of Illinois Urbana-Champaign and Google DeepMind propose Evo-Memory, a streaming benchmark and agent framework that targets this exact gap. Evo-Memory evaluates test-time learning with self-evolving memory, asking whether agents can accumulate and reuse strategies from continuous task streams instead of relying only on static conversational logs.

https://arxiv.org/pdf/2511.20857

Conversational Recall vs Experience Reuse

Most current agents implement conversational recall. They store dialogue history, tool traces, and retrieved documents, which are then reintegrated into the context window for future queries. This type of memory serves as a passive buffer, capable of recovering facts or recalling previous steps, but it does not actively modify the agent’s approach for related tasks.

Evo-Memory instead focuses on experience reuse. Here each interaction is treated as an experience that encodes not only inputs and outputs, but also whether a task succeeded and which strategies were effective. The benchmark checks if agents can retrieve those experiences in later tasks, apply them as reusable procedures, and refine the memory over time.

Benchmark Design and Task Streams

The research team formalizes a memory augmented agent as a tuple (F, U, R, C). The base model F generates outputs. The retrieval module R searches a memory store. The context constructor C synthesizes a working prompt from the current input and retrieved items. The update function U writes new experience entries and evolves the memory after every step.

Evo-Memory restructures conventional benchmarks into sequential task streams. Each dataset becomes an ordered sequence of tasks where early items carry strategies that are useful for later ones. The suite covers AIME 24, AIME 25, GPQA Diamond, MMLU-Pro economics, engineering, philosophy, and ToolBench for tool use, along with multi turn environments from AgentBoard including AlfWorld, BabyAI, ScienceWorld, Jericho, and PDDL planning.

Evaluation is done along four axes. Single turn tasks use exact match or answer accuracy. Embodied environments report success rate and progress rate. Step efficiency measures average steps per successful task. Sequence robustness tests whether performance is stable when task order changes.

https://arxiv.org/pdf/2511.20857

ExpRAG, a Minimal Experience Reuse Baseline

To set a lower bound, the research team defines ExpRAG. Each interaction becomes a structured experience text with the template ⟨x_i, ŷ_i, f_i⟩, where x_i is the input, ŷ_i is the model output and f_i is feedback, for example a correctness signal. At a new step t, the agent retrieves similar experiences from memory using a similarity score and concatenates them with the current input as in-context examples. Then it appends the new experience into memory.

ExpRAG does not change the agent control loop. It is still a single shot call to the backbone, but now augmented with explicitly stored prior tasks. The design is intentionally simple so that any gains on Evo-Memory can be attributed to task level experience retrieval, not to new planning or tool abstractions.
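To make the mechanism concrete, here is a minimal sketch of an ExpRAG style loop, assuming a generic embedding function and dot-product similarity; the template and helper names are illustrative, not the authors' code.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in for the retriever encoder used in the paper.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

memory = []  # each entry stores input x, output y_hat, feedback f, and an embedding

def retrieve_similar(query: str, top_k: int = 3):
    q = embed(query)
    ranked = sorted(memory, key=lambda e: float(q @ e["vec"]), reverse=True)
    return ranked[:top_k]

def exprag_step(x: str, llm_call, feedback_fn) -> str:
    # Retrieved experiences are concatenated with the current input as in-context examples.
    exemplars = retrieve_similar(x)
    context = "\n\n".join(
        f"Past task: {e['x']}\nAnswer: {e['y']}\nFeedback: {e['f']}" for e in exemplars
    )
    y_hat = llm_call(f"{context}\n\nNew task: {x}\nAnswer:")
    f = feedback_fn(x, y_hat)  # e.g., a correctness signal
    memory.append({"x": x, "y": y_hat, "f": f, "vec": embed(x)})  # append the new experience
    return y_hat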

ReMem, Action Think Memory Refine

The main contribution on the agent side is ReMem, an action–think–memory refine pipeline built on top of the same backbone models. At each internal step, given the current input, memory state and past reasoning traces, the agent chooses one of three operations:

Think generates intermediate reasoning traces that decompose the task.

Act emits an environment action or final answer visible to the user.

Refine performs meta reasoning on memory by retrieving, pruning and reorganizing experience entries.

This loop induces a Markov decision process where the state includes the query, current memory and ongoing thoughts. Within a step the agent can interleave several Think and Refine operations, and the step terminates when an Act operation is issued. In contrast to standard ReAct style agents, memory is no longer a fixed buffer. It becomes an explicit object that the agent reasons about and edits during inference.
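A rough sketch of that control loop follows; the operation tags and the shape of the llm_step function are assumptions for illustration, not the exact interface used in the paper.

def remem_episode(query, memory, llm_step, env_execute, max_inner_steps=16):
    thoughts = []
    for _ in range(max_inner_steps):
        # State: current query, memory contents and ongoing reasoning traces.
        op, payload = llm_step(query=query, memory=memory, thoughts=thoughts)
        if op == "THINK":
            thoughts.append(payload)          # intermediate reasoning that decomposes the task
        elif op == "REFINE":
            memory = payload                  # retrieved, pruned or reorganized experience entries
        elif op == "ACT":
            result = env_execute(payload)     # environment action or final answer ends the step
            memory.append({"query": query, "action": payload, "result": result})
            return result, memory
    return None, memory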

https://arxiv.org/pdf/2511.20857

Results on Reasoning, Tools and Embodied Environments

The research team instantiate all methods on Gemini 2.5 Flash and Claude 3.7 Sonnet under a unified search–predict–evolve protocol. This isolates the effect of memory architecture, since prompting, search and feedback are held constant across baselines.

On single turn benchmarks, evolving memory methods produce consistent but moderate gains. For Gemini 2.5 Flash, ReMem reaches an average exact match of 0.65 across AIME 24, AIME 25, GPQA Diamond and MMLU Pro subsets, and scores 0.85 and 0.71 on ToolBench's API and accuracy metrics. ExpRAG also performs strongly, with an average of 0.60, and outperforms several more complex designs such as Agent Workflow Memory and Dynamic Cheatsheet variants.

The impact is larger in multi turn environments. On Claude 3.7 Sonnet, ReMem reaches success and progress rates of 0.92 and 0.96 on AlfWorld, 0.73 and 0.83 on BabyAI, 0.83 and 0.95 on PDDL, and 0.62 and 0.89 on ScienceWorld, giving an average of 0.78 success and 0.91 progress across datasets. On Gemini 2.5 Flash, ReMem achieves an average of 0.50 success and 0.64 progress, improving over history and ReAct style baselines in all four environments.

Step efficiency is also improved. In AlfWorld, average steps to complete a task drop from 22.6 for a history baseline to 11.5 for ReMem. Lightweight designs such as ExpRecent and ExpRAG reduce steps as well, which indicates that even simple task level experience reuse can make behaviour more efficient without architectural changes to the backbone.

A further analysis links gains to task similarity inside each dataset. Using embeddings from the retriever encoder, the research team compute average distance from tasks to their cluster center. ReMem’s margin over a history baseline correlates strongly with this similarity measure, with reported Pearson correlation about 0.72 on Gemini 2.5 Flash and 0.56 on Claude 3.7 Sonnet. Structured domains such as PDDL and AlfWorld show larger improvements than diverse sets like AIME 25 or GPQA Diamond.

Key Takeaways

Evo-Memory is a comprehensive streaming benchmark that converts standard datasets into ordered task streams, so agents can retrieve, integrate and update memory over time rather than rely on static conversational recall.

The framework formalizes memory augmented agents as a tuple (F, U, R, C) and implements more than 10 representative memory modules, including retrieval based, workflow and hierarchical memories, evaluated on 10 single turn and multi turn datasets across reasoning, question answering, tool use and embodied environments.

ExpRAG provides a minimal experience reuse baseline that stores each task interaction as a structured text record with input, model output and feedback, then retrieves similar experiences as in context exemplars for new tasks, already giving consistent improvements over pure history based baselines.

ReMem extends the standard ReAct style loop with an explicit Think, Act, Refine Memory control cycle, which lets the agent actively retrieve, prune and reorganize its memory during inference, leading to higher accuracy, higher success rate and fewer steps on both single turn reasoning and long horizon interactive environments.

Across Gemini 2.5 Flash and Claude 3.7 Sonnet backbones, self evolving memories such as ExpRAG and especially ReMem make smaller models behave like stronger agents at test time, improving exact match, success and progress metrics without any retraining of base model weights.

Editorial Notes

Evo-Memory is a useful step for evaluating self evolving memory in LLM agents. It forces models to operate on sequential task streams instead of isolated prompts. It compares more than 10 memory architectures under a single framework. Simple methods like ExpRAG already show clear gains. ReMem’s action, think, refine memory loop improves exact match, success and progress without retraining base weights. Overall, this research work makes test time evolution a concrete design target for LLM agent systems.

Check out the Paper.

DeepSeek Researchers Introduce DeepSeek-V3.2 and DeepSeek-V3.2-Speciale for Long Context Reasoning and Agentic Workloads

How do you get GPT-5-level reasoning on real long-context, tool-using workloads without paying the quadratic attention and GPU cost that usually makes those systems impractical? DeepSeek research introduces DeepSeek-V3.2 and DeepSeek-V3.2-Speciale, reasoning-first models built for agents that target high quality reasoning, long context and agent workflows, with open weights and production APIs. The models combine DeepSeek Sparse Attention (DSA), a scaled GRPO reinforcement learning stack and an agent native tool protocol, and report performance comparable to GPT-5, with DeepSeek-V3.2-Speciale reaching Gemini 3.0 Pro level reasoning on public benchmarks and competitions.

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Sparse Attention with Near Linear Long Context Cost

Both DeepSeek-V3.2 and DeepSeek-V3.2-Speciale use the DeepSeek-V3 Mixture of Experts transformer with about 671B total parameters and 37B active parameters per token, inherited from V3.1 Terminus. The only structural change is DeepSeek Sparse Attention, introduced through continued pre-training.

DeepSeek Sparse Attention splits attention into 2 components. A lightning indexer runs a small number of low precision heads over all token pairs and produces relevance scores. A fine grained selector keeps the top-k key-value positions per query, and the main attention path runs Multi-Query Attention and Multi-Head Latent Attention on this sparse set.

This changes the dominant complexity from O(L²) to O(kL), where L is the sequence length and k, the number of selected tokens, is much smaller than L. Based on the benchmarks, DeepSeek-V3.2 matches the dense Terminus baseline on accuracy while reducing long context inference cost by about 50 percent, with faster throughput and lower memory use on H800 class hardware and on vLLM and SGLang backends.
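As a toy illustration of the selection step, the sketch below uses a random projection as a stand-in for the lightning indexer and a single dense attention head over the selected positions; shapes and scoring are simplified assumptions, not DeepSeek's implementation.

import numpy as np

def sparse_attention_step(q, K, V, indexer_scores, k=2048):
    # Keep only the top-k key-value positions chosen by the indexer scores.
    k = min(k, K.shape[0])
    top = np.argpartition(indexer_scores, -k)[-k:]
    K_sel, V_sel = K[top], V[top]                      # O(k) work per query instead of O(L)
    logits = (K_sel @ q) / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V_sel

L, d = 8192, 64
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=d), rng.normal(size=(L, d)), rng.normal(size=(L, d))
indexer_scores = K @ rng.normal(size=d)                # cheap stand-in for lightning indexer relevance scores
out = sparse_attention_step(q, K, V, indexer_scores, k=2048)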

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Continued Pre Training for DeepSeek Sparse Attention

DeepSeek Sparse Attention (DSA) is introduced by continued pre-training on top of DeepSeek-V3.1 Terminus. In the dense warm up stage, dense attention remains active, all backbone parameters are frozen and only the lightning indexer is trained with a Kullback-Leibler loss to match the dense attention distribution on 128K context sequences. This stage uses a small number of steps and about 2B tokens, enough for the indexer to learn useful scores.

In the sparse stage, the selector keeps 2048 key-value entries per query, the backbone is unfrozen and the model continues training on about 944B tokens. Gradients for the indexer still come only from the alignment loss with dense attention on the selected positions. This schedule makes DeepSeek Sparse Attention (DSA) behave as a drop in replacement for dense attention with similar quality and lower long context cost.
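Conceptually, the warm up objective pulls the indexer's score distribution toward the frozen dense attention distribution for each query. The snippet below sketches that alignment loss in NumPy; it is a simplification of the described setup, not the released training code.

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def indexer_alignment_loss(indexer_scores, dense_attn_weights, eps=1e-9):
    # KL(dense || indexer): only the indexer receives gradients, the dense distribution is the frozen target.
    p = dense_attn_weights + eps
    q = softmax(indexer_scores) + eps
    return float(np.sum(p * np.log(p / q)))

dense = softmax(np.random.randn(1024))   # dense attention over 1024 positions, treated as the target
scores = np.random.randn(1024)           # lightning indexer scores for the same positions
loss = indexer_alignment_loss(scores, dense)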

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

GRPO with more than 10 Percent RL Compute

On top of the sparse architecture, DeepSeek-V3.2 uses Group Relative Policy Optimization (GRPO) as the main reinforcement learning method. The research team states that post training reinforcement learning (RL) compute exceeds 10 percent of pre-training compute.

RL is organized around specialist domains. The research team trains dedicated runs for mathematics, competitive programming, general logical reasoning, browsing and agent tasks and safety, then distills these specialists into the shared 685B parameter base for DeepSeek-V3.2 and DeepSeek-V3.2-Speciale. GRPO is implemented with an unbiased KL estimator, off policy sequence masking and mechanisms that keep Mixture of Experts (MoE) routing and sampling masks consistent between training and sampling.

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Agent Data, Thinking Mode and Tool Protocol

DeepSeek research team builds a large synthetic agent dataset by generating more than 1,800 environments and more than 85,000 tasks across code agents, search agents, general tools and code interpreter setups. Tasks are constructed to be hard to solve and easy to verify, and are used as RL targets together with real coding and search traces.

At inference time, DeepSeek-V3.2 introduces explicit thinking and non thinking modes. The deepseek-reasoner endpoint exposes thinking mode by default, where the model produces an internal chain of thought before the final answer. The thinking with tools guide describes how reasoning content is kept across tool calls and cleared when a new user message arrives, and how tool calls and tool results stay in the context even when reasoning text is trimmed for budget.

The chat template is updated around this behavior. The DeepSeek-V3.2 Speciale repository ships Python encoder and decoder helpers instead of a Jinja template. Messages can carry a reasoning_content field alongside content, controlled by a thinking parameter. A developer role is reserved for search agents and is not accepted in general chat flows by the official API, which protects this channel from accidental misuse.
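Illustratively, a conversation in this scheme might be shaped like the snippet below, with reasoning_content carried alongside content; the field names follow the description above, but the exact structure should be checked against the official encoder helpers, so treat this as an assumption rather than the canonical format.

# Hypothetical message list separating internal reasoning from visible content.
messages = [
    {"role": "user", "content": "Summarize the latest commit and flag risky changes."},
    {
        "role": "assistant",
        "reasoning_content": "I need the diff first, so call the repository tool before answering.",
        "content": None,
        "tool_calls": [{"name": "get_diff", "arguments": {"ref": "HEAD"}}],  # hypothetical tool
    },
    {"role": "tool", "name": "get_diff", "content": "<diff text>"},
    {
        "role": "assistant",
        "reasoning_content": "The diff touches auth middleware, so flag it.",
        "content": "The commit refactors auth middleware; review the token check carefully.",
    },
]
# Reasoning persists across tool calls within a turn and is cleared when a new
# user message arrives, while tool calls and tool results stay in the context.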

https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf

Benchmarks, Competitions And Open Artifacts

On standard reasoning and coding benchmarks, DeepSeek-V3.2 and especially DeepSeek-V3.2 Speciale are reported as comparable to GPT-5 and close to Gemini-3.0 Pro on suites such as AIME 2025, HMMT 2025, GPQA and LiveCodeBench, with improved cost efficiency on long context workloads.

For formal competitions, DeepSeek research team states that DeepSeek-V3.2 Speciale achieves gold medal level performance on the International Mathematical Olympiad 2025, the Chinese Mathematical Olympiad 2025 and the International Olympiad in Informatics 2025, and competitive gold medal level performance at the ICPC World Finals 2025.

Key Takeaways

DeepSeek-V3.2 adds DeepSeek Sparse Attention, which brings near linear O(kL) attention cost and delivers around 50% lower long context API cost compared to previous dense DeepSeek models, while keeping quality similar to DeepSeek-V3.1 Terminus.

The model family keeps the 671B parameter MoE backbone with 37B active parameters per token and exposes a full 128K context window in production APIs, which makes long documents, multi step chains and large tool traces practical rather than a lab only feature.

Post training uses Group Relative Policy Optimization (GRPO) with a compute budget that is more than 10 percent of pre-training, focused on math, code, general reasoning, browsing or agent workloads and safety, along with contest style specialists whose cases are released for external verification.

DeepSeek-V3.2 is the first model in the DeepSeek family to integrate thinking directly into tool use, supporting both thinking and non thinking tool modes and a protocol where internal reasoning persists across tool calls and is reset only on new user messages.

Check out the Paper and Model weights.

MiniMax-M2: Technical Deep Dive into Interleaved Thinking for Agentic Coding Workflows

The AI coding landscape just got a massive shake-up. If you’ve been relying on Claude 3.5 Sonnet or GPT-4o for your dev workflows, you know the pain: great performance often comes with a bill that makes your wallet weep, or latency that breaks your flow. This article provides a technical overview of MiniMax-M2, focusing on its core design choices and capabilities, and how it changes the price to performance baseline for agentic coding workflows.

Branded as ‘Mini Price, Max Performance,’ MiniMax-M2 targets agentic coding workloads with around 2x the speed of leading competitors at roughly 8% of their price. The key change is not only cost efficiency, but a different computational and reasoning pattern in how the model structures and executes its “thinking” during complex tool and code workflows.

The Secret Sauce: Interleaved Thinking

The standout feature of MiniMax-M2 is its native mastery of Interleaved Thinking. 

But what does that actually mean?

Most LLMs operate in a linear “Chain of Thought” (CoT) where they do all their planning upfront and then fire off a series of tool calls (like running code or searching the web). The problem? If the first tool call returns unexpected data, the initial plan becomes stale, leading to “state drift” where the model keeps hallucinating a path that no longer exists.

Interleaved Thinking changes the game by creating a dynamic Plan -> Act -> Reflect loop.

Instead of front-loading all the logic, MiniMax-M2 alternates between explicit reasoning and tool use. It reasons, executes a tool, reads the output, and then reasons again based on that fresh evidence. This allows the model to:

Self-Correct: If a shell command fails, it reads the error and adjusts its next move immediately.

Preserve State: It carries forward hypotheses and constraints between steps, preventing the “memory loss” common in long coding tasks.

Handle Long Horizons: This approach is critical for complex agentic workflows (like building an entire app feature) where the path isn’t clear from step one.
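In rough pseudo-code, the loop looks something like the sketch below; the return format of the llm call and the tool registry are assumptions for illustration, not MiniMax's actual agent scaffold.

def interleaved_agent(task, llm, tools, max_rounds=12):
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_rounds):
        step = llm(transcript)                           # reason over everything observed so far
        transcript.append({"role": "assistant", "content": step["reasoning"]})
        if step.get("final_answer"):
            return step["final_answer"]                  # plan is complete, stop acting
        observation = tools[step["tool"]](**step["arguments"])   # act with a tool
        transcript.append({"role": "tool", "content": str(observation)})  # reflect on fresh evidence next round
    return None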

Benchmarks show the impact is real: enabling Interleaved Thinking boosted MiniMax-M2’s score on SWE-Bench Verified by over 3% and on BrowseComp by a massive 40%.

Powered by Mixture of Experts (MoE): Speed Meets Smarts

How does MiniMax-M2 achieve low latency while being smart enough to replace a senior dev? The answer lies in its Mixture of Experts (MoE) architecture.

MiniMax-M2 is a massive model with 230 billion total parameters, but it utilizes a “sparse” activation technique. For any given token generation, it only activates 10 billion parameters.

This design delivers the best of both worlds:

Huge Knowledge Base: You get the deep world knowledge and reasoning capacity of a 200B+ model.

Blazing Speed: Inference runs with the lightness of a 10B model, enabling high throughput and low latency.

For interactive agents like Claude Code, Cursor, or Cline, this speed is non-negotiable. You need the model to think, code, and debug in real-time without the “thinking…” spinner of death.

Agent & Code Native

MiniMax-M2 wasn’t just trained on text; it was developed for end-to-end developer workflows. It excels at handling robust toolchains including MCP (Model Context Protocol), shell execution, browser retrieval, and complex codebases.

It is already being integrated into the heavy hitters of the AI coding world:

Claude Code

Cursor

Cline

Kilo Code

Droid

The Economics: 90% Cheaper than the Competition

The pricing structure is perhaps the most aggressive we’ve seen for a model of this caliber. MiniMax is practically giving away “intelligence” compared to the current market leaders.

API Pricing (vs Claude 3.5 Sonnet):

Input Tokens: $0.3 / Million (10% of Sonnet’s cost)

Cache Hits: $0.03 / Million (10% of Sonnet’s cost)

Output Tokens: $1.2 / Million (8% of Sonnet’s cost)

For individual developers, they offer tiered Coding Plans that undercut the market significantly:

Starter: $10/month (Includes a $2 first-month promo).

Pro: $20/month.

Max: $50/month (Up to 5x the usage limit of Claude Code Max).

As if that were not enough, MiniMax recently launched a Global Developer Ambassador Program, a global initiative designed to empower independent ML and LLM developers. The program invites builders to collaborate directly with the MiniMax R&D team to shape the future.

The company is seeking developers with proven open-source experience who are already familiar with MiniMax models and active on platforms like GitHub and Hugging Face.

Key Program Highlights:

The Incentives: Ambassadors receive complimentary access to the MiniMax-M2 Max Coding Plan, early access to unreleased video and audio models, direct feedback channels with product leads, and potential full-time career opportunities.

The Role: Participants are expected to build public demos, create open-source tools, and provide critical feedback on APIs before public launches.

You can sign up here.

Editorial Notes

MiniMax-M2 challenges the idea that “smarter” must mean “slower” or “more expensive.” By leveraging MoE efficiency and Interleaved Thinking, it offers a compelling alternative for developers who want to run autonomous agents without bankrupting their API budget.

As we move toward a world where AI agents don’t just write code but architect entire systems, the ability to “think, act, and reflect” continuously, at a price that allows for thousands of iterations, might just make M2 the new standard for AI engineering.

Thanks to the MiniMax AI team for the thought leadership and resources for this article. The MiniMax AI team has supported this content.

How to Design an Advanced Multi-Page Interactive Analytics Dashboard with Dynamic Filtering, Live KPIs, and Rich Visual Exploration Using Panel

In this tutorial, we build an advanced multi-page interactive dashboard using Panel. Through each component of implementation, we explore how to generate synthetic data, apply rich filters, visualize dynamic time-series trends, compare segments and regions, and even simulate live KPI updates. We design the system step by step so we can truly understand how each widget, callback, and plotting function comes together to create a smooth, reactive analytics experience. Check out the Full Codes here.

import sys, subprocess

def install_deps():
    pkgs = ["panel", "hvplot", "pandas", "numpy", "bokeh"]
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + pkgs)

try:
    import panel as pn
    import hvplot.pandas
    import pandas as pd
    import numpy as np
except ImportError:
    install_deps()
    import panel as pn
    import hvplot.pandas
    import pandas as pd
    import numpy as np

pn.extension()

rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", periods=365, freq="D")
segments = ["A", "B", "C"]
regions = ["North", "South", "East", "West"]

base = pd.DataFrame(
    {
        "date": np.tile(dates, len(segments) * len(regions)),
        "segment": np.repeat(segments, len(dates) * len(regions)),
        "region": np.repeat(np.tile(regions, len(segments)), len(dates)),
    }
)
base["traffic"] = (
    100
    + 40 * np.sin(2 * np.pi * base["date"].dt.dayofyear / 365)
    + rng.normal(0, 15, len(base))
)
trend = {"A": 1.0, "B": 1.5, "C": 2.0}
base["traffic"] *= base["segment"].map(trend)
base["conversions"] = (base["traffic"] * rng.uniform(0.01, 0.05, len(base))).astype(int)
base["revenue"] = base["conversions"] * rng.uniform(20, 60, len(base))
df = base.reset_index(drop=True)

We install all required dependencies and load Panel, hvPlot, Pandas, and NumPy so the dashboard runs smoothly in Colab. We generate a full year of synthetic time-series data across segments and regions, providing a rich dataset for exploration. By the end of this block, we will have a clean, ready-to-use dataframe for all upcoming visualizations. Check out the Full Codes here.

segment_sel = pn.widgets.CheckBoxGroup(name="Segment", value=segments[:2], options=segments, inline=True)
region_sel = pn.widgets.MultiChoice(name="Region", value=["North"], options=regions)
metric_sel = pn.widgets.Select(name="Metric", value="traffic", options=["traffic", "conversions", "revenue"])
date_range = pn.widgets.DateRangeSlider(
    name="Date Range",
    start=df["date"].min(),
    end=df["date"].max(),
    value=(df["date"].min(), df["date"].max()),
)
smooth_slider = pn.widgets.IntSlider(name="Rolling Window (days)", start=1, end=30, value=7)

def filtered_df(segment, region, drange):
    d1, d2 = drange
    mask = (
        df["segment"].isin(segment)
        & df["region"].isin(region or regions)
        & (df["date"] >= d1)
        & (df["date"] <= d2)
    )
    sub = df[mask].copy()
    if sub.empty:
        return df.iloc[:0]
    return sub

@pn.depends(segment_sel, region_sel, metric_sel, smooth_slider, date_range)
def timeseries_plot(segment, region, metric, window, drange):
    data = filtered_df(segment, region, drange)
    if data.empty:
        return pn.pane.Markdown("### No data for current filters")
    grouped = data.sort_values("date").groupby("date")[metric].sum()
    line = grouped.hvplot.line(title=f"{metric.title()} over time", ylabel=metric.title())
    if window > 1:
        smooth = grouped.rolling(window).mean().hvplot.line(line_width=3, alpha=0.6)
        return (line * smooth).opts(legend_position="top_left")
    return line

We build the interactive widgets and the filtering logic that controls the entire dashboard. We wire the time-series plot to the widgets using reactive @pn.depends, letting us change segments, regions, metrics, date ranges, and smoothing windows instantly. With this setup, we can switch perspectives fluidly and see the effects in real time. Check out the Full Codes here.

@pn.depends(segment_sel, region_sel, metric_sel, date_range)
def segment_bar(segment, region, metric, drange):
    data = filtered_df(segment, region, drange)
    if data.empty:
        return pn.pane.Markdown("### No data to aggregate")
    agg = data.groupby("segment")[metric].sum().sort_values(ascending=False)
    return agg.hvplot.bar(title=f"{metric.title()} by Segment", yaxis=None)

@pn.depends(segment_sel, region_sel, metric_sel, date_range)
def region_heatmap(segment, region, metric, drange):
    data = filtered_df(segment, region, drange)
    if data.empty:
        return pn.pane.Markdown("### No data to aggregate")
    pivot = data.pivot_table(index="segment", columns="region", values=metric, aggfunc="sum")
    return pivot.hvplot.heatmap(title=f"{metric.title()} Heatmap", clabel=metric.title())

We construct additional visual layers: a segment-level bar chart and a region-segment heatmap. We let these charts react to the same global filters, so they update automatically whenever we make a selection. This gives us a deeper breakdown of patterns across categories without writing redundant code. Check out the Full Codes here.

kpi_source = df.copy()
kpi_idx = [0]

def compute_kpi(slice_df):
    if slice_df.empty:
        return 0, 0, 0
    total_rev = slice_df["revenue"].sum()
    avg_conv = slice_df["conversions"].mean()
    cr = (slice_df["conversions"].sum() / slice_df["traffic"].sum()) * 100
    return total_rev, avg_conv, cr

kpi_value = pn.indicators.Number(name="Total Revenue (window)", value=0, format="$0,0")
conv_value = pn.indicators.Number(name="Avg Conversions", value=0, format="0.0")
cr_value = pn.indicators.Number(name="Conversion Rate", value=0, format="0.00%")

def update_kpis():
    step = 200
    start = kpi_idx[0]
    end = start + step
    if start >= len(kpi_source):
        kpi_idx[0] = 0
        start, end = 0, step
    window_df = kpi_source.iloc[start:end]
    kpi_idx[0] = end
    total_rev, avg_conv, cr = compute_kpi(window_df)
    kpi_value.value = total_rev
    conv_value.value = avg_conv
    cr_value.value = cr / 100

pn.state.add_periodic_callback(update_kpis, period=1000, start=True)

We simulate a rolling stream of KPIs that update every second, creating a live-dashboard experience. We compute total revenue, average conversions, and conversion rate inside a sliding window and push the values to Panel’s numeric indicators. This lets us observe how metrics evolve continuously, just like a real monitoring system. Check out the Full Codes here.

controls = pn.WidgetBox(
    "### Global Controls",
    segment_sel,
    region_sel,
    metric_sel,
    date_range,
    smooth_slider,
    sizing_mode="stretch_width",
)

page_overview = pn.Column(
    pn.pane.Markdown("## Overview: Filtered Time Series"),
    controls,
    timeseries_plot,
)

page_insights = pn.Column(
    pn.pane.Markdown("## Segment & Region Insights"),
    pn.Row(segment_bar, region_heatmap),
)

page_live = pn.Column(
    pn.pane.Markdown("## Live KPI Window (simulated streaming)"),
    pn.Row(kpi_value, conv_value, cr_value),
)

dashboard = pn.Tabs(
    ("Overview", page_overview),
    ("Insights", page_insights),
    ("Live KPIs", page_live),
)

dashboard

We assemble all components into a clean multi-page layout using Tabs. We organize the dashboard into an overview page, an insights page, and a live-KPI page, making navigation simple and intuitive. With this structure, we get a polished, interactive analytics application ready to run directly in Google Colab.

In conclusion, we see how seamlessly we can combine Panel widgets, hvPlot visualizations, and periodic callbacks to build a powerful analytics dashboard. We appreciate how every module, from filtering logic to bar charts to the live KPI stream, fits together to produce a cohesive multi-page interface that runs effortlessly. We finish with a complete, interactive system that we can extend into real-world reporting, experimentation, or production-grade dashboards.

Check out the Full Codes here.

Meta AI Researchers Introduce Matrix: A Ray Native Decentralized Framework for Multi Agent Synthetic Data Generation

How do you keep synthetic data fresh and diverse for modern AI models without turning a single orchestration pipeline into the bottleneck? Meta AI researchers introduce Matrix, a decentralized framework where both control and data flow are serialized into messages that move through distributed queues. As LLM training increasingly relies on synthetic conversations, tool traces and reasoning chains, most existing systems still depend on a central controller or domain specific setups, which wastes GPU capacity, adds coordination overhead and limits data diversity. Matrix instead uses peer to peer agent scheduling on a Ray cluster and delivers 2 to 15 times higher token throughput on real workloads while maintaining comparable quality.

https://arxiv.org/pdf/2511.21686

From Centralized Controllers to Peer to Peer Agents

Traditional agent frameworks keep workflow state and control logic inside a central orchestrator. Every agent call, tool call and retry goes through that controller. This model is easy to reason about, but it does not scale well when you need tens of thousands of concurrent synthetic dialogues or tool trajectories.

Matrix takes a different approach. It serializes both control flow and data flow into a message object called an orchestrator. The orchestrator holds the task state, including conversation history, intermediate results and routing logic. Stateless agents, implemented as Ray actors, pull an orchestrator from a distributed queue, apply their role specific logic, update the state and then send it directly to the next agent selected by the orchestrator. There is no central scheduler in the inner loop. Each task advances independently at row level, rather than waiting for batch level barriers as in Spark or Ray Data.

This design reduces idle time when different trajectories have very different lengths. It also makes fault handling local to a task. If one orchestrator fails it does not stall a batch.
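A drastically simplified, single-process sketch of this pattern is shown below, with plain Python queues standing in for Ray actors and distributed queues; the class and field names are illustrative, not Matrix's API.

from queue import Queue

class Orchestrator:
    # Serialized task state: history, intermediate results and routing travel together in the message.
    def __init__(self, task, route):
        self.task, self.route, self.history, self.step = task, list(route), [], 0

def make_agent(name, fn):
    def agent(msg, queues, done):
        msg.history.append((name, fn(msg.task, msg.history)))  # role-specific update of the task state
        msg.step += 1
        if msg.step < len(msg.route):
            queues[msg.route[msg.step]].put(msg)               # send directly to the next agent, no central scheduler
        else:
            done.append(msg)
    return agent

agents = {
    "filter": make_agent("filter", lambda t, h: f"filtered({t})"),
    "score": make_agent("score", lambda t, h: "score=0.9"),
    "question": make_agent("question", lambda t, h: "extracted Q/A pair"),
}
queues, done = {name: Queue() for name in agents}, []
for doc in ["doc-1", "doc-2"]:
    queues["filter"].put(Orchestrator(doc, ["filter", "score", "question"]))
while any(not q.empty() for q in queues.values()):
    for name, q in list(queues.items()):
        while not q.empty():
            agents[name](q.get(), queues, done)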

https://arxiv.org/pdf/2511.21686

System Stack and Services

Matrix runs on a Ray cluster that is usually launched on SLURM. Ray provides distributed actors and queues. Ray Serve exposes LLM endpoints behind vLLM and SGLang, and can also route to external APIs such as Azure OpenAI or Gemini through proxy servers.

Tool calls and other complex services run inside Apptainer containers. This isolates the agent runtime from code execution sandboxes, HTTP tools or custom evaluators. Hydra manages configuration for agent roles, orchestrator types, resource allocations and I/O schemas. Grafana integrates with Ray metrics to track queue length, pending tasks, token throughput and GPU utilization in real time.

Matrix also introduces message offloading. When conversation history grows beyond a size threshold, large payloads are stored in Ray’s object store and only object identifiers are kept in the orchestrator. This reduces cluster bandwidth while still allowing agents to reconstruct prompts when needed.

Case Study 1: Collaborative Reasoner

Collaborative Reasoner, also known as Coral, evaluates multi agent dialogue where two LLM agents discuss a question, disagree when needed and reach a final answer. In the original implementation a central controller manages thousands of self collaboration trajectories. Matrix reimplements the same protocol using peer to peer orchestrators and stateless agents.

On 31 A100 nodes, using LLaMA 3.1 8B Instruct, Matrix configures concurrency as 248 GPUs with 50 queries per GPU, so 12,400 concurrent conversations. The Coral baseline runs at its optimal concurrency of 5,000. Under identical hardware, Matrix generates about 2 billion tokens in roughly 4 hours, while Coral produces about 0.62 billion tokens in about 9 hours. That is a 6.8 times increase in token throughput with almost identical agreement correctness around 0.47.

https://arxiv.org/pdf/2511.21686

Case Study 2: NaturalReasoning Web Data Curation

NaturalReasoning constructs a reasoning dataset from large web corpora. Matrix models the pipeline with three agents. A Filter agent uses a smaller classifier model to select English passages that likely contain reasoning. A Score agent uses a larger instruction tuned model to assign quality scores. A Question agent extracts questions, answers and reasoning chains.

On 25 million DCLM web documents, only about 5.45 percent survive all filters, yielding around 1.19 million question answer pairs with associated reasoning steps. Matrix then compares different parallelism strategies on a 500 thousand document subset. The best configuration combines data parallelism and task parallelism, with 20 data partitions and 700 concurrent tasks per partition. This achieves about 1.61 times higher throughput than a setting that only scales task concurrency.

Over the full 25 million document run, Matrix reaches 5,853 tokens per second, compared to 2,778 tokens per second for a Ray Data batch baseline with 14,000 concurrent tasks. That corresponds to a 2.1 times throughput gain that comes purely from peer to peer row level scheduling, not from different models.

https://arxiv.org/pdf/2511.21686

Case Study 3: Tau2-Bench Tool Use Trajectories

Tau2-Bench evaluates conversational agents that must use tools and a database in a customer support setting. Matrix represents this environment with four agents, a user simulator, an assistant, a tool executor and a reward calculator, plus a sink that collects metrics. Tool APIs and reward logic are reused from the Tau2 reference implementation and are wrapped in containers.

On a cluster with 13 H100 nodes and dozens of LLM replicas, Matrix generates 22,800 trajectories in about 1.25 hours. That corresponds to roughly 41,000 tokens per second. The baseline Tau2-agent implementation on a single node, configured with 500 concurrent threads, reaches about 2,654 tokens per second and 1,519 trajectories. Average reward stays almost unchanged across both systems, which confirms that the speedup does not come from cutting corners in the environment. Overall, Matrix delivers about 15.4 times higher token throughput on this benchmark.

https://arxiv.org/pdf/2511.21686

Key Takeaways

Matrix replaces centralized orchestrators with a peer to peer, message driven agent architecture that treats each task as an independent state machine moving through stateless agents.

The framework is built entirely on an open source stack, SLURM, Ray, vLLM, SGLang and Apptainer, and scales to tens of thousands of concurrent multi agent workflows for synthetic data generation, benchmarking and data processing.

Across three case studies, Collaborative Reasoner, NaturalReasoning and Tau2-Bench, Matrix delivers about 2 to 15.4 times higher token throughput than specialized baselines under identical hardware, while maintaining comparable output quality and rewards.

Matrix offloads large conversation histories to Ray’s object store and keeps only lightweight references in messages, which reduces peak network bandwidth and supports high throughput LLM serving with gRPC based model backends.

Editorial Notes

Matrix is a pragmatic systems contribution that takes multi agent synthetic data generation from bespoke scripts to an operational runtime. By encoding control flow and data flow into orchestrators, then pushing execution into stateless P2P agents on Ray, it cleanly separates scheduling, LLM inference and tools. The case studies on Collaborative Reasoner, NaturalReasoning and Tau2-Bench show that careful systems design, not new model architectures, is now the main lever for scaling synthetic data pipelines.

Check out the Paper and Repo.

StepFun AI Releases Step-Audio-R1: A New Audio LLM that Finally Benefits from Test Time Compute Scaling

Why do current audio AI models often perform worse when they generate longer reasoning instead of grounding their decisions in the actual sound? The StepFun research team releases Step-Audio-R1, a new audio LLM designed for test time compute scaling. It addresses this failure mode by showing that the accuracy drop with chain of thought is not an audio limitation but a training and modality grounding problem.

https://arxiv.org/pdf/2511.15848

The Core Problem, Audio Models Reason over Text Surrogates

Most current audio models inherit their reasoning behavior from text training. They learn to reason as if they read transcripts, not as if they listen. The StepFun team calls this Textual Surrogate Reasoning. The model uses imagined words and descriptions instead of acoustic cues such as pitch contour, rhythm, timbre or background noise patterns.

This mismatch explains why longer chain of thought often hurts performance in audio. The model spends more tokens elaborating wrong or modality irrelevant assumptions. Step-Audio-R1 attacks this by forcing the model to justify answers using acoustic evidence. The training pipeline is organized around Modality Grounded Reasoning Distillation, MGRD, which selects and distills reasoning traces that explicitly reference audio features.

Architecture

The architecture stays close to the previous Step Audio systems:

A Qwen2 based audio encoder processes raw waveforms at 25 Hz.

An audio adaptor downsamples the encoder output by a factor of 2, to 12.5 Hz, and aligns frames to the language token stream.

A Qwen2.5 32B decoder consumes the audio features and generates text.

The decoder always produces an explicit reasoning block inside <think> and </think> tags, followed by the final answer. This separation lets training objectives shape the structure and content of reasoning without losing focus on task accuracy. The model is released as a 33B parameter audio text to text model on Hugging Face under Apache 2.0.

https://arxiv.org/pdf/2511.15848

Training Pipeline, from Cold Start to Audio Grounded RL

The pipeline has a supervised cold start stage and a reinforcement learning stage that both mix text and audio tasks.

Cold start uses about 5 million examples, covering 1 billion tokens of text only data and 4 billion tokens from audio paired data. Audio tasks include automatic speech recognition, paralinguistic understanding and audio question text answer style dialogs. A fraction of the audio data carries audio chain of thought traces generated by an earlier model. Text data covers multi turn dialog, knowledge question answering, math and code reasoning. All samples share a format where reasoning is wrapped in <think> tags, even when the reasoning block is initially empty.

Supervised learning trains Step-Audio-R1 to follow this format and to generate useful reasoning for both audio and text. This gives a baseline chain of thought behavior, but it is still biased toward text based reasoning.

Modality Grounded Reasoning Distillation MGRD

MGRD is applied in several iterations. For each round, the research team samples audio questions where the label depends on real acoustic properties. For example, questions about speaker emotion, background events in sound scenes or musical structure. The current model produces multiple reasoning and answer candidates per question. A filter keeps only chains that meet three constraints:

They reference acoustic cues, not just textual descriptions or imagined transcripts.

They are logically coherent as short step by step explanations.

Their final answers are correct according to labels or programmatic checks.

These accepted traces form a distilled audio chain of thought dataset. The model is fine tuned on this dataset together with the original text reasoning data. This is followed by Reinforcement Learning with Verified Rewards, RLVR. For text questions, rewards are based on answer correctness. For audio questions, the reward mixes answer correctness and reasoning format, with a typical weighting of 0.8 for accuracy and 0.2 for reasoning. Training uses PPO with about 16 responses sampled per prompt and supports sequences up to around 10,240 tokens to allow long deliberation.
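As an illustration of the reward mixing, a minimal sketch of the audio-side scalar reward is given below; the 0.8 and 0.2 weights come from the description above, while the format check itself is a simplified stand-in.

import re

def audio_rlvr_reward(response: str, answer_correct: bool,
                      w_acc: float = 0.8, w_format: float = 0.2) -> float:
    # Reward a well formed reasoning block: a <think>...</think> segment before the final answer.
    has_think = bool(re.search(r"<think>.+?</think>", response, flags=re.S))
    format_score = 1.0 if has_think else 0.0
    accuracy_score = 1.0 if answer_correct else 0.0
    return w_acc * accuracy_score + w_format * format_score

reward = audio_rlvr_reward("<think>The rising pitch contour suggests surprise.</think> Surprise.", True)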

https://arxiv.org/pdf/2511.15848

Benchmarks, closing the gap to Gemini 3 Pro

On a combined speech to text benchmark suite that includes Big Bench Audio, Spoken MQA, MMSU, MMAU and Wild Speech, Step-Audio-R1 reaches an average score of about 83.6 percent. Gemini 2.5 Pro reports about 81.5 percent and Gemini 3 Pro reaches about 85.1 percent. On Big Bench Audio alone, Step-Audio-R1 reaches about 98.7 percent, which is higher than both Gemini versions.

For speech to speech reasoning, the Step-Audio-R1 Realtime variant adopts listen while thinking and think while speaking style streaming. On Big Bench Audio speech to speech, it reaches about 96.1 percent reasoning accuracy with first packet latency around 0.92 seconds. This score surpasses GPT based realtime baselines and Gemini 2.5 Flash style native audio dialogs while keeping sub second interaction.

https://arxiv.org/pdf/2511.15848

Ablations, what matters for audio reasoning

The ablation section provides several design signals for engineers:

A reasoning format reward is necessary. Without it, reinforcement learning tends to shorten or remove chain of thought, which lowers audio benchmark scores.

RL data should target medium difficulty problems. Selecting questions where pass at 8 lies in a middle band gives more stable rewards and maintains long reasoning.

Scaling RL audio data without such selection does not help. Quality of prompts and labels matters more than raw size.

The researchers also describe a self cognition correction pipeline that reduces the frequency of answers such as ‘I can only read text and cannot hear audio’ in a model that is trained to process sound. This uses Direct Preference Optimization on curated preference pairs where correct behavior is to acknowledge and use audio input.

Key Takeaways

Step-Audio-R1 is one of the first audio language models that turns longer chain of thought into a consistent accuracy gain for audio tasks, solving the inverted scaling failure seen in previous audio LLMs.

The model explicitly targets Textual Surrogate Reasoning by using Modality Grounded Reasoning Distillation, which filters and distills only those reasoning traces that rely on acoustic cues such as pitch, timbre and rhythm instead of imagined transcripts.

Architecturally, Step-Audio-R1 combines a Qwen2 based audio encoder with an adaptor and a Qwen2.5 32B decoder that always generates <think> reasoning segments before answers, and is released as a 33B audio text to text model under Apache 2.0.

Across comprehensive audio understanding and reasoning benchmarks covering speech, environmental sounds and music, Step-Audio-R1 surpasses Gemini 2.5 Pro and reaches performance comparable to Gemini 3 Pro, while also supporting a realtime variant for low latency speech to speech interaction.

The training recipe combines large scale supervised chain of thought, modality grounded distillation and Reinforcement Learning with Verified Rewards, providing a concrete and reproducible blueprint for building future audio reasoning models that actually benefit from test time compute scaling.

Editorial Notes

Step-Audio-R1 is an important release because it converts chain of thought from a liability into a useful tool for audio reasoning by directly addressing Textual Surrogate Reasoning with Modality Grounded Reasoning Distillation and Reinforcement Learning with Verified Rewards. It shows that test time compute scaling can benefit audio models when reasoning is anchored in acoustic features and delivers benchmark results comparable to Gemini 3 Pro while remaining open and practically usable for engineers. Overall this research work turns extended deliberation in audio LLMs from a consistent failure mode into a controllable and reproducible design pattern.

Check out the Paper, Repo, Project Page and Model Weights.

NVIDIA AI Releases Orchestrator-8B: A Reinforcement Learning Trained Controller for Efficient Tool and Model Selection

How can an AI system learn to pick the right model or tool for each step of a task instead of always relying on one large model for everything? NVIDIA researchers release ToolOrchestra, a novel method for training a small language model to act as the orchestrator, the ‘brain’ of a heterogeneous tool-use agent.

https://arxiv.org/pdf/2511.21689

From Single Model Agents to an Orchestration Policy

Most current agents follow a simple pattern. A single large model such as GPT-5 receives a prompt that describes available tools, then decides when to call web search or a code interpreter. All high level reasoning still stays inside the same model. ToolOrchestra changes this setup. It trains a dedicated controller model, called Orchestrator-8B, that treats both classic tools and other LLMs as callable components.

A pilot study in the same research shows why naive prompting is not enough. When Qwen3-8B is prompted to route between GPT-5, GPT-5 mini, Qwen3-32B and Qwen2.5-Coder-32B, it delegates 73 percent of cases to GPT-5. When GPT-5 acts as its own orchestrator, it calls GPT-5 or GPT-5 mini in 98 percent of cases. The research team call these self enhancement and other enhancement biases. The routing policy over uses strong models and ignores cost instructions.

ToolOrchestra instead trains a small orchestrator explicitly for this routing problem, using reinforcement learning over full multi turn trajectories.

What is Orchestrator-8B?

Orchestrator-8B is an 8B parameter decoder only Transformer. It is built by fine tuning Qwen3-8B as an orchestration model and released on Hugging Face.

At inference time, the system runs a multi turn loop that alternates reasoning and tool calls. The rollout has three main steps. First, Orchestrator 8B reads the user instruction and an optional natural language preference description, for example a request to prioritize low latency or to avoid web search. Second, it generates internal chain of thought style reasoning and plans an action. Third, it chooses a tool from the available set and emits a structured tool call in a unified JSON format. The environment executes that call, appends the result as an observation and feeds it back into the next step. The process stops when a termination signal is produced or a maximum of 50 turns is reached.

Tools cover three main groups. Basic tools include Tavily web search, a Python sandbox code interpreter and a local Faiss index built with Qwen3-Embedding-8B. Specialized LLMs include Qwen2.5-Math-72B, Qwen2.5-Math-7B and Qwen2.5-Coder-32B. Generalist LLM tools include GPT-5, GPT-5 mini, Llama 3.3-70B-Instruct and Qwen3-32B. All tools share the same schema with names, natural language descriptions and typed parameter specs.
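To give a sense of the unified interface, the snippet below shows what a tool entry and a structured tool call could look like under the description above; the exact field names used by ToolOrchestra are not given here, so treat them as assumptions.

# Hypothetical unified schema: classic tools and LLM tools share the same shape.
tools = [
    {
        "name": "web_search",
        "description": "Tavily web search over the public internet.",
        "parameters": {"query": {"type": "string"}},
    },
    {
        "name": "qwen2_5_math_72b",
        "description": "Specialist math LLM for competition-style problems.",
        "parameters": {"prompt": {"type": "string"}},
    },
]

# One rollout step: the orchestrator emits a structured call, the environment
# executes it and appends the result as an observation for the next step.
tool_call = {
    "tool": "qwen2_5_math_72b",
    "arguments": {"prompt": "How many integer pairs (x, y) satisfy x^2 + y^2 = 50?"},
}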

End to End Reinforcement Learning with Multi Objective Rewards

ToolOrchestra formulates the whole workflow as a Markov Decision Process. The state contains the conversation history, past tool calls and observations, and user preferences. Actions are the next text step, including both reasoning tokens and a tool call schema. After up to 50 steps, the environment computes a scalar reward for the full trajectory.

The reward has three components. Outcome reward is binary and depends on whether the trajectory solves the task. For open-ended answers, GPT-5 is used as a judge to compare the model output with the reference. Efficiency rewards penalize both monetary cost and wall clock latency. Token usage for proprietary and open source tools is mapped to monetary cost using public API and Together AI pricing. Preference reward measures how well tool usage matches a user preference vector that can increase or decrease the weight on cost, latency or specific tools. These components are combined into a single scalar using the preference vector.
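A minimal sketch of how the three components could be folded into one scalar under a preference vector follows; the functional form and weights here are placeholders, not the exact formula from the paper.

def trajectory_reward(solved: bool, cost_usd: float, latency_min: float,
                      preference_match: float, prefs: dict) -> float:
    outcome = 1.0 if solved else 0.0                      # binary task outcome
    # Efficiency terms penalize monetary cost and wall clock latency, scaled by the preference vector.
    efficiency = -prefs.get("cost", 1.0) * cost_usd - prefs.get("latency", 1.0) * latency_min
    preference = prefs.get("preference", 1.0) * preference_match
    return outcome + efficiency + preference

r = trajectory_reward(True, cost_usd=0.09, latency_min=8.2,
                      preference_match=0.8, prefs={"cost": 0.5, "latency": 0.02, "preference": 0.3})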

The policy is optimized with Group Relative Policy Optimization (GRPO), a variant of policy gradient reinforcement learning that normalizes rewards within groups of trajectories for the same task. The training process includes filters that drop trajectories with invalid tool call format or weak reward variance to stabilize optimization.

https://arxiv.org/pdf/2511.21689

To make this training possible at scale, the research team introduces ToolScale, a synthetic dataset of multi step tool calling tasks. For each domain, an LLM generates a database schema, database entries, domain specific APIs and then diverse user tasks with ground truth sequences of function calls and required intermediate information.

Benchmark results and cost profile

NVIDIA research team evaluates Orchestrator-8B on three challenging benchmarks, Humanity’s Last Exam, FRAMES and τ² Bench. These benchmarks target long horizon reasoning, factuality under retrieval and function calling in a dual control environment.

On Humanity’s Last Exam text only questions, Orchestrator-8B reaches 37.1 percent accuracy. GPT-5 with basic tools reaches 35.1 percent in the same setting. On FRAMES, Orchestrator-8B achieves 76.3 percent versus 74.0 percent for GPT-5 with tools. On τ² Bench, Orchestrator-8B scores 80.2 percent versus 77.7 percent for GPT-5 with basic tools.

https://arxiv.org/pdf/2511.21689

The efficiency gap is larger. In the configuration that uses basic tools plus specialized and generalist LLM tools, Orchestrator-8B has average cost 9.2 cents and latency 8.2 minutes per query, averaged over Humanity’s Last Exam and FRAMES. In the same configuration, GPT-5 costs 30.2 cents and takes 19.8 minutes on average. The model card summarizes this as about 30 percent of the monetary cost and 2.5 times faster for Orchestrator-8B compared to GPT-5.

Tool use analysis supports this picture. Claude Opus 4.1 used as an orchestrator calls GPT-5 most of the time. GPT-5 used as an orchestrator prefers GPT-5 mini. Orchestrator-8B spreads calls more evenly across strong models, cheaper models, search, local retrieval and the code interpreter, and reaches higher accuracy at lower cost for the same turn budget.

https://arxiv.org/pdf/2511.21689

Generalization experiments replace the training time tools with unseen models such as OpenMath Llama-2-70B, DeepSeek-Math-7B-Instruct, Codestral-22B-v0.1, Claude Sonnet-4.1 and Gemma-3-27B. Orchestrator-8B still achieves the best trade off between accuracy, cost and latency among all baselines in this setting. A separate preference aware test set shows that Orchestrator-8B also tracks user tool usage preferences more closely than GPT-5, Claude Opus-4.1 and Qwen3-235B-A22B under the same reward metric.

Key Takeaways

ToolOrchestra trains an 8B parameter orchestration model, Orchestrator-8B, that selects and sequences tools and LLMs to solve multi step agentic tasks using reinforcement learning with outcome, efficiency and preference aware rewards.

Orchestrator-8B is released as an open weight model on Hugging Face. It is designed to coordinate diverse tools such as web search, code execution, retrieval and specialist LLMs through a unified schema.

On Humanity’s Last Exam, Orchestrator-8B reaches 37.1 percent accuracy, surpassing GPT-5 at 35.1 percent, while being about 2.5 times more efficient, and on τ² Bench and FRAMES it outperforms GPT-5 while using roughly 30 percent of the cost.

The framework shows that naive prompting of a frontier LLM as its own router leads to self enhancement bias where it overuses itself or a small set of strong models, while a trained orchestrator learns a more balanced, cost aware routing policy over multiple tools.

Editorial Notes

NVIDIA’s ToolOrchestra is a practical step toward compound AI systems where an 8B orchestration model, Orchestrator-8B, learns an explicit routing policy over tools and LLMs instead of relying on a single frontier model. It shows clear gains on Humanity’s Last Exam, FRAMES and τ² Bench with about 30 percent of the cost and around 2.5 times better efficiency than GPT-5 based baselines, which makes it directly relevant for teams that care about accuracy, latency and budget. This launch makes orchestration policy a first class optimization target in AI systems.

Check out the Paper, Repo, Project Page and Model Weights.

A Coding Guide to Design an Agentic AI System Using a Control-Plane Architecture

In this tutorial, we build an advanced Agentic AI using the control-plane design pattern, and we walk through each component step by step as we implement it. We treat the control plane as the central orchestrator that coordinates tools, manages safety rules, and structures the reasoning loop. We also set up a miniature retrieval system, define modular tools, and integrate an agentic reasoning layer that dynamically plans and executes actions. Finally, we observe how the entire system behaves like a disciplined, tool-aware AI capable of retrieving knowledge, assessing understanding, updating learner profiles, and logging all interactions through a unified, scalable architecture. Check out the FULL CODES here.

import subprocess
import sys

def install_deps():
    deps = ['anthropic', 'numpy', 'scikit-learn']
    for dep in deps:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', dep])

try:
    import anthropic
except ImportError:
    install_deps()
    import anthropic

import json
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from dataclasses import dataclass, asdict
from typing import List, Dict, Any, Optional
from datetime import datetime

@dataclass
class Document:
    id: str
    content: str
    metadata: Dict[str, Any]
    embedding: Optional[np.ndarray] = None

class SimpleRAGRetriever:
    def __init__(self):
        self.documents = self._init_knowledge_base()

    def _init_knowledge_base(self) -> List[Document]:
        docs = [
            Document("cs101", "Python basics: Variables store data. Use x=5 for integers, name='Alice' for strings. Print with print().", {"topic": "python", "level": "beginner"}),
            Document("cs102", "Functions encapsulate reusable code. Define with def func_name(params): and call with func_name(args).", {"topic": "python", "level": "intermediate"}),
            Document("cs103", "Object-oriented programming uses classes. class MyClass: defines structure, __init__ initializes instances.", {"topic": "python", "level": "advanced"}),
            Document("math101", "Linear algebra: Vectors are ordered lists of numbers. Matrix multiplication combines transformations.", {"topic": "math", "level": "intermediate"}),
            Document("ml101", "Machine learning trains models on data to make predictions. Supervised learning uses labeled examples.", {"topic": "ml", "level": "beginner"}),
            Document("ml102", "Neural networks are composed of layers. Each layer applies weights and activation functions to transform inputs.", {"topic": "ml", "level": "advanced"}),
        ]
        for i, doc in enumerate(docs):
            doc.embedding = np.random.rand(128)
            doc.embedding[i*20:(i+1)*20] += 2
        return docs

    def retrieve(self, query: str, top_k: int = 2) -> List[Document]:
        query_embedding = np.random.rand(128)
        scores = [cosine_similarity([query_embedding], [doc.embedding])[0][0] for doc in self.documents]
        top_indices = np.argsort(scores)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]

We set up all dependencies, import the libraries we rely on, and initialize the data structures for our knowledge base. We define a simple retriever and generate mock embeddings to simulate similarity search in a lightweight way. As we run this block, we prepare everything needed for retrieval-driven reasoning in the later components. Check out the FULL CODES here.

class ToolRegistry:
    def __init__(self, retriever: SimpleRAGRetriever):
        self.retriever = retriever
        self.interaction_log = []
        self.user_state = {"level": "beginner", "topics_covered": []}

    def search_knowledge(self, query: str, filters: Optional[Dict] = None) -> Dict:
        docs = self.retriever.retrieve(query, top_k=2)
        if filters:
            docs = [d for d in docs if all(d.metadata.get(k) == v for k, v in filters.items())]
        return {
            "tool": "search_knowledge",
            "results": [{"content": d.content, "metadata": d.metadata} for d in docs],
            "count": len(docs)
        }

    def assess_understanding(self, topic: str) -> Dict:
        questions = {
            "python": ["What keyword defines a function?", "How do you create a variable?"],
            "ml": ["What is supervised learning?", "Name two types of ML algorithms."],
            "math": ["What is a vector?", "Explain matrix multiplication."]
        }
        return {
            "tool": "assess_understanding",
            "topic": topic,
            "questions": questions.get(topic, ["General comprehension check."])
        }

    def update_learner_profile(self, topic: str, level: str) -> Dict:
        if topic not in self.user_state["topics_covered"]:
            self.user_state["topics_covered"].append(topic)
        self.user_state["level"] = level
        return {
            "tool": "update_learner_profile",
            "status": "updated",
            "profile": self.user_state.copy()
        }

    def log_interaction(self, event: str, details: Dict) -> Dict:
        log_entry = {
            "timestamp": datetime.now().isoformat(),
            "event": event,
            "details": details
        }
        self.interaction_log.append(log_entry)
        return {"tool": "log_interaction", "status": "logged", "entry_id": len(self.interaction_log)}

We build the tool registry that our agent uses while interacting with the system. We define tools such as knowledge search, assessments, profile updates, and logging, and we maintain a persistent user-state dictionary. As we use this layer, we see how each tool becomes a modular capability that the control plane can route to. Check out the FULL CODES here.

class ControlPlane:
    def __init__(self, tool_registry: ToolRegistry):
        self.tools = tool_registry
        self.safety_rules = {
            "max_tools_per_request": 4,
            "allowed_tools": ["search_knowledge", "assess_understanding",
                              "update_learner_profile", "log_interaction"]
        }
        self.execution_log = []

    def execute(self, plan: Dict[str, Any]) -> Dict[str, Any]:
        if not self._validate_request(plan):
            return {"error": "Safety validation failed", "plan": plan}

        action = plan.get("action")
        params = plan.get("parameters", {})
        result = self._route_and_execute(action, params)

        self.execution_log.append({
            "timestamp": datetime.now().isoformat(),
            "plan": plan,
            "result": result
        })

        return {
            "success": True,
            "action": action,
            "result": result,
            "metadata": {
                "execution_count": len(self.execution_log),
                "safety_checks_passed": True
            }
        }

    def _validate_request(self, plan: Dict) -> bool:
        action = plan.get("action")
        if action not in self.safety_rules["allowed_tools"]:
            return False
        if len(self.execution_log) >= 100:
            return False
        return True

    def _route_and_execute(self, action: str, params: Dict) -> Any:
        tool_map = {
            "search_knowledge": self.tools.search_knowledge,
            "assess_understanding": self.tools.assess_understanding,
            "update_learner_profile": self.tools.update_learner_profile,
            "log_interaction": self.tools.log_interaction
        }
        tool_func = tool_map.get(action)
        if tool_func:
            return tool_func(**params)
        return {"error": f"Unknown action: {action}"}

We implement the control plane that orchestrates tool execution, checks safety rules, and manages permissions. We validate every request, route actions to the right tool, and keep an execution log for transparency. As we run this snippet, we observe how the control plane becomes the governing system that ensures predictable and safe agentic behavior. Check out the FULL CODES here.

class TutorAgent:
    def __init__(self, control_plane: ControlPlane, api_key: str):
        self.control_plane = control_plane
        self.client = anthropic.Anthropic(api_key=api_key)
        self.conversation_history = []

    def teach(self, student_query: str) -> str:
        plan = self._plan_actions(student_query)
        results = []
        for action_plan in plan:
            result = self.control_plane.execute(action_plan)
            results.append(result)

        response = self._synthesize_response(student_query, results)

        self.conversation_history.append({
            "query": student_query,
            "plan": plan,
            "results": results,
            "response": response
        })
        return response

    def _plan_actions(self, query: str) -> List[Dict]:
        plan = []
        query_lower = query.lower()

        if any(kw in query_lower for kw in ["what", "how", "explain", "teach"]):
            plan.append({
                "action": "search_knowledge",
                "parameters": {"query": query},
                "context": {"intent": "knowledge_retrieval"}
            })

        if any(kw in query_lower for kw in ["test", "quiz", "assess", "check"]):
            topic = "python" if "python" in query_lower else "ml"
            plan.append({
                "action": "assess_understanding",
                "parameters": {"topic": topic},
                "context": {"intent": "assessment"}
            })

        plan.append({
            "action": "log_interaction",
            "parameters": {"event": "query_processed", "details": {"query": query}},
            "context": {"intent": "logging"}
        })

        return plan

    def _synthesize_response(self, query: str, results: List[Dict]) -> str:
        response_parts = [f"Student Query: {query}\n"]

        for result in results:
            if result.get("success") and "result" in result:
                tool_result = result["result"]

                if result["action"] == "search_knowledge":
                    response_parts.append("\nRetrieved Knowledge:")
                    for doc in tool_result.get("results", []):
                        response_parts.append(f"  • {doc['content']}")

                elif result["action"] == "assess_understanding":
                    response_parts.append("\nAssessment Questions:")
                    for q in tool_result.get("questions", []):
                        response_parts.append(f"  • {q}")

        return "\n".join(response_parts)

We implement the TutorAgent, which plans actions, communicates with the control plane, and synthesizes final responses. We analyze queries, generate multi-step plans, and combine tool outputs into meaningful answers for learners. As we execute this snippet, we see the agent behaving intelligently by coordinating retrieval, assessment, and logging. Check out the FULL CODES here.

def run_demo():
    print("=" * 70)
    print("Control Plane as a Tool: RAG AI Tutor Demo")
    print("=" * 70)

    API_KEY = "your-api-key-here"

    retriever = SimpleRAGRetriever()
    tool_registry = ToolRegistry(retriever)
    control_plane = ControlPlane(tool_registry)

    print("System initialized")
    print(f"Tools: {len(control_plane.safety_rules['allowed_tools'])}")
    print(f"Knowledge base: {len(retriever.documents)} documents")

    try:
        tutor = TutorAgent(control_plane, API_KEY)
    except Exception:
        print("Mock mode enabled")
        tutor = None

    demo_queries = [
        "Explain Python functions to me",
        "I want to learn about machine learning",
        "Test my understanding of Python basics"
    ]

    for query in demo_queries:
        print("\n— Query —")
        if tutor:
            print(tutor.teach(query))
        else:
            plan = [
                {"action": "search_knowledge", "parameters": {"query": query}},
                {"action": "log_interaction", "parameters": {"event": "query", "details": {}}}
            ]
            print(query)
            for action in plan:
                result = control_plane.execute(action)
                print(f"{action['action']}: {result.get('success', False)}")

    print("Summary")
    print(f"Executions: {len(control_plane.execution_log)}")
    print(f"Logs: {len(tool_registry.interaction_log)}")
    print(f"Profile: {tool_registry.user_state}")

if __name__ == "__main__":
    run_demo()

We run a complete demo that initializes all components, processes sample student queries, and prints system state summaries. We watch the agent step through retrieval and logging while the control plane enforces rules and tracks execution history. As we finish this block, we get a clear picture of how the entire architecture works together in a realistic teaching loop.

In conclusion, we gain a clear understanding of how the control-plane pattern simplifies orchestration, strengthens safety, and creates a clean separation between reasoning and tool execution. We now see how a retrieval system, tool registry, and agentic planning layer come together to form a coherent AI tutor that responds intelligently to student queries. As we experiment with the demo, we observe how the system routes tasks, applies rules, and synthesizes useful insights from tool outputs, all while remaining modular and extensible.

Check out the FULL CODES here.
The post A Coding Guide to Design an Agentic AI System Using a Control-Plane Architecture for Safe, Modular, and Scalable Tool-Driven Reasoning Workflows appeared first on MarkTechPost.

DeepSeek AI Releases DeepSeekMath-V2: The Open Weights Maths Model Tha …

How can an AI system prove complex olympiad level math problems in clear natural language while also checking that its own reasoning is actually correct? DeepSeek AI has released DeepSeekMath-V2, an open weights large language model that is optimized for natural language theorem proving with self verification. The model is built on DeepSeek-V3.2-Exp-Base, runs as a 685B parameter mixture of experts, and is available on Hugging Face under an Apache 2.0 license.

In evaluations, DeepSeekMath-V2 reaches gold level scores on IMO 2025 and CMO 2024, and achieves 118 of 120 points on Putnam 2024 when used with scaled test time compute.

Why Are Final Answer Rewards Not Enough?

Most recent math reasoning models use reinforcement learning that rewards only the final answer on benchmarks such as AIME and HMMT. This approach pushed models from weak baselines to near saturation on short answer contests in about one year. (Hugging Face)

However, the DeepSeek research team points out two structural problems:

A correct numeric answer does not guarantee correct reasoning. The model may reach the right number through algebraic mistakes that cancel out.

Many tasks, such as olympiad proofs and theorem proving, require a complete argument in natural language. These tasks do not have a single final numeric answer, so standard answer based rewards do not apply.

DeepSeekMath-V2 therefore optimizes proof quality instead of pure answer accuracy. The system evaluates whether a proof is complete and logically sound, and uses that evaluation as the main learning signal.

Training a Verifier before the Generator

The core design is verifier first. The DeepSeek research team trains an LLM based verifier that can read a problem and a candidate proof, then output both a natural language analysis and a discrete quality score in the set {0, 0.5, 1}.

The initial reinforcement learning data comes from Art of Problem Solving contests. The research team crawls 17,503 proof style problems from olympiads, team selection tests, and other post 2010 problems that explicitly require proofs. These problems form the base set for cold start RL. Candidate proofs come from a DeepSeek-V3.2 reasoning model that is prompted to iteratively refine its own solutions, which increases detail but also creates many imperfect proofs. Human experts label these proofs using the 0, 0.5, 1 rubric, based on rigor and completeness.

The verifier is trained with Group Relative Policy Optimization (GRPO). The reward has two components:

A format reward, which checks that the verifier output follows a fixed template, including an analysis section and a final score in a box.

A score reward, which penalizes the absolute difference between the predicted score and the expert score.

This stage produces a verifier that can grade olympiad style proofs in a consistent way.
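To make the two reward terms concrete, the snippet below is a minimal sketch that assumes the verifier ends its output with a boxed score after an explicit analysis section. The template check, the function name, and the equal weighting of the two terms are illustrative assumptions, not the released training code.

import re

def verifier_reward(output_text: str, expert_score: float) -> float:
    """Toy reward combining the format and score terms described above.

    Assumes the verifier is prompted to end with a boxed score such as \\boxed{0.5}.
    The exact template and the 50/50 weighting are illustrative assumptions.
    """
    # Format reward: output must contain an analysis section and a boxed score.
    match = re.search(r"\\boxed\{(0|0\.5|1)\}", output_text)
    has_analysis = "Analysis:" in output_text
    if not (match and has_analysis):
        return 0.0  # malformed outputs get no credit

    # Score reward: penalize the absolute gap to the expert label in {0, 0.5, 1}.
    predicted = float(match.group(1))
    score_reward = 1.0 - abs(predicted - expert_score)

    return 0.5 * 1.0 + 0.5 * score_reward

print(verifier_reward("Analysis: step 3 is unjustified. \\boxed{0.5}", expert_score=1.0))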

https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf

Meta Verification to Control Hallucinated Critiques

A verifier can still game the reward. It can output the correct final score while inventing fake issues in the analysis. This would satisfy the numeric objective but make the explanations unreliable.

To address this, the research team introduces a meta verifier. The meta verifier reads the original problem, the proof, and the verifier analysis, and then evaluates whether the analysis is faithful. It scores aspects such as restatement of steps, identification of real defects, and consistency between the narrative and the final score.

The meta verifier is also trained with GRPO, with its own format and score rewards. Its output, a meta quality score, is then used as an extra reward term for the base verifier. Analyses that hallucinate problems get low meta scores, even if the final proof score is correct. In experiments, this raises the average meta evaluated quality of analyses from around 0.85 to 0.96 on a validation split, while keeping proof score accuracy stable.

Self Verifying Proof Generator and Sequential Refinement

Once the verifier is strong, the DeepSeek research team trains the proof generator. The generator takes a problem and outputs both a solution and a self analysis that follows the same rubric as the verifier.

The reward for the generator combines three signals:

The verifier score on the generated proof.

The agreement between the self reported score and the verifier score.

The meta verification score of the self analysis.

Formally, the main reward uses weights α = 0.76 for the proof score and β = 0.24 for the self analysis component, multiplied by a format term that enforces the output structure. This pushes the generator to write proofs that the verifier accepts, and to be honest about remaining issues. If it claims that a flawed proof is perfect, it loses reward through disagreement and low meta scores.
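A minimal sketch of this combined reward is shown below. The way the agreement and meta scores are mixed inside the β term is an assumption for illustration, since the report only specifies the α and β weights and the multiplicative format term.

def generator_reward(proof_score: float,
                     self_score: float,
                     meta_score: float,
                     format_ok: bool,
                     alpha: float = 0.76,
                     beta: float = 0.24) -> float:
    """Sketch of the generator reward described above, not the released code.

    proof_score: verifier grade of the generated proof in {0, 0.5, 1}
    self_score:  the model's own grade of its proof
    meta_score:  meta verifier faithfulness score of the self analysis in [0, 1]
    """
    # Agreement term: how closely the self reported score tracks the verifier.
    agreement = 1.0 - abs(self_score - proof_score)
    # The 50/50 mix of agreement and meta quality inside the beta term is an
    # illustrative assumption.
    self_analysis = 0.5 * agreement + 0.5 * meta_score
    reward = alpha * proof_score + beta * self_analysis
    # A broken output format zeroes the reward via the multiplicative term.
    return reward if format_ok else 0.0

# An honest self assessment of a flawed proof still earns partial credit.
print(generator_reward(proof_score=0.5, self_score=0.5, meta_score=0.9, format_ok=True))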

DeepSeek also exploits the 128K token context limit of the base model. For hard problems, the generator often cannot repair all issues in a single pass, because the refined proof plus analysis would exceed context. In that case, the system runs sequential refinement. It generates a proof and self analysis, feeds them back as context, and asks the model to produce a new proof that fixes the previously detected issues. This loop can repeat several times, subject to the context budget.
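The loop below sketches this refinement procedure. The generate helper is a hypothetical stand-in for the DeepSeekMath-V2 call, and the crude character based budget check stands in for a real tokenizer; the round count and budget are illustrative assumptions.

def sequential_refinement(problem: str,
                          generate,           # callable: (prompt) -> (proof, self_analysis)
                          max_rounds: int = 4,
                          context_budget: int = 128_000,
                          tokens=len):        # crude token counter; use a tokenizer in practice
    """Minimal sketch of the sequential refinement loop described above."""
    context = problem
    proof, analysis = generate(context)
    for _ in range(max_rounds - 1):
        # Feed the previous proof and its self analysis back as context and ask
        # for a new proof that fixes the detected issues.
        context = f"{problem}\n\nPrevious proof:\n{proof}\n\nDetected issues:\n{analysis}"
        if tokens(context) > context_budget:
            break  # stop refining once the 128K context would be exceeded
        proof, analysis = generate(context)
    return proof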

https://github.com/deepseek-ai/DeepSeek-Math-V2/tree/main

Scaling Verification and Auto Labeling

As the generator improves, it produces harder proofs, which are costly to label by hand. To keep training data fresh, the research team introduces an automatic labeling pipeline based on scaled verification.

For each candidate proof, the system samples multiple independent verifier analyses, then evaluates each analysis using the meta verifier. If several high quality analyses converge on the same serious issues, the proof is labeled as incorrect. If no valid issues survive meta checking, the proof is labeled as correct. In the final training iterations this pipeline replaces human labels, with spot checks confirming good agreement with experts.
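The function below sketches that labeling rule. The verifier, meta verifier, and issue parser are passed in as hypothetical callables, and the thresholds are illustrative assumptions rather than the paper's settings.

def auto_label(proof: str,
               sample_analyses,      # callable: (proof) -> list of analysis strings
               meta_score,           # callable: (analysis) -> float in [0, 1]
               extract_issues,       # callable: (analysis) -> set of issue ids
               quality_threshold: float = 0.8,
               min_agreement: int = 2) -> str:
    """Sketch of the scaled verification labeling rule described above."""
    analyses = sample_analyses(proof)
    # Keep only analyses the meta verifier considers faithful.
    trusted = [a for a in analyses if meta_score(a) >= quality_threshold]
    # Count how often each concrete issue is raised across trusted analyses.
    issue_counts = {}
    for a in trusted:
        for issue in extract_issues(a):
            issue_counts[issue] = issue_counts.get(issue, 0) + 1
    # If several high quality analyses converge on the same issue, reject the proof.
    if any(count >= min_agreement for count in issue_counts.values()):
        return "incorrect"
    return "correct"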

Competition and Benchmark Results

The research team evaluated DeepSeekMath-V2 on several fronts:

On an internal set of 91 CNML level problems covering algebra, geometry, number theory, combinatorics, and inequalities, DeepSeekMath-V2 achieves the highest mean proof score in every category when compared against Gemini 2.5 Pro and GPT-5 Thinking High, as measured by their verifier.

https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf

On IMO Shortlist 2024, sequential refinement with self verification improves both pass at 1 and best of 32 quality metrics as the maximum number of refinement iterations increases.

https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf

On IMO ProofBench, expert evaluation (shown in the figure above) indicates that DeepSeekMath-V2 outperforms DeepMind DeepThink IMO Gold on the Basic subset and remains competitive on the Advanced subset, while clearly beating other large models.

https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf

For full competitions, it reports:

IMO 2025: 5 of 6 problems solved, gold medal level.

CMO 2024: 4 problems fully solved plus partial credit on 1 more, gold medal level.

Putnam 2024: 11 of 12 problems solved completely and the remaining problem with minor errors, for 118 of 120 points, above the best human score of 90.

Key Takeaways

DeepSeekMath V2 is a 685B parameter model built on DeepSeek V3.2 Exp Base, designed for natural language theorem proving with self verification, and released as open weights under the Apache 2.0 license.

The main innovation is a verifier first training pipeline with a GRPO trained verifier and meta verifier that score proofs on rigor, not only final answers, which directly addresses the gap between correct answers and correct reasoning.

A proof generator is then trained against this verifier and meta verifier, using rewards that combine proof quality, agreement with self evaluation, and analysis faithfulness, plus sequential refinement under 128K context to iteratively repair proofs.

With scaled test time compute and large verification budgets, DeepSeekMath V2 reaches gold level performance on IMO 2025 and CMO 2024 and scores 118 of 120 on Putnam 2024, surpassing the best human score that year.

Editorial Notes

DeepSeekMath-V2 is an important step toward self verifiable mathematical reasoning, because it directly tackles the gap between correct final answers and correct reasoning, using a verifier, meta verifier and proof generator trained with GRPO on olympiad style proofs and deployed at 685B scale to reach gold level performance on IMO 2025, CMO 2024 and a near perfect 118 of 120 score on Putnam 2024. Overall, this release shows that self verifiable mathematical reasoning with open weights is now practically achievable for competition level problems.

Check out the Full Paper, Model Weights on HF and Repo.
The post DeepSeek AI Releases DeepSeekMath-V2: The Open Weights Maths Model That Scored 118/120 on Putnam 2024 appeared first on MarkTechPost.

A Coding Implementation for an Agentic AI Framework that Performs Lite …

In this tutorial, we build a complete scientific discovery agent step by step and experience how each component works together to form a coherent research workflow. We begin by loading our literature corpus, constructing retrieval and LLM modules, and then assembling agents that search papers, generate hypotheses, design experiments, and produce structured reports. Through snippets mentioned below, we see how an agentic pipeline emerges naturally, allowing us to explore a scientific question from initial curiosity to a full analysis within a single, integrated system. Check out the FULL CODES here.

import sys, subprocess

def install_deps():
    pkgs = ["transformers", "scikit-learn", "numpy"]
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + pkgs)

try:
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np
except ImportError:
    install_deps()
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np

from dataclasses import dataclass
from typing import List, Dict, Any

np.random.seed(42)

LITERATURE = [
    {"id": "P1", "title": "Self-Supervised Protein Language Models for Structure Prediction", "field": "computational biology",
     "abstract": "We explore transformer-based protein language models trained on millions of sequences. The models learn residue-level embeddings that improve secondary structure prediction and stability estimation."},
    {"id": "P2", "title": "CRISPR Off-Target Detection Using Deep Learning", "field": "genome editing",
     "abstract": "We propose a convolutional neural network architecture for predicting CRISPR-Cas9 off-target effects directly from genomic sequences, achieving state-of-the-art accuracy on GUIDE-seq datasets."},
    {"id": "P3", "title": "Foundation Models for Scientific Equation Discovery", "field": "scientific ML",
     "abstract": "Large language models are combined with symbolic regression to recover governing equations from noisy experimental observations in physics and fluid dynamics."},
    {"id": "P4", "title": "Active Learning for Materials Property Optimization", "field": "materials science",
     "abstract": "We integrate Bayesian optimization with graph neural networks to actively select candidate materials that maximize target properties while reducing experimental cost."},
    {"id": "P5", "title": "Graph-Based Retrieval for Cross-Domain Literature Review", "field": "NLP for science",
     "abstract": "We construct a heterogeneous citation and concept graph over multi-domain scientific papers and show that graph-aware retrieval improves cross-domain literature exploration."},
]

corpus_texts = [p["abstract"] + " " + p["title"] for p in LITERATURE]
vectorizer = TfidfVectorizer(stop_words="english")
corpus_matrix = vectorizer.fit_transform(corpus_texts)

MODEL_NAME = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate_text(prompt: str, max_new_tokens: int = 256) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

We lay the foundation for our scientific agent by loading libraries, preparing the literature corpus, and initializing our language model. We build the TF-IDF vectorizer and embed all abstracts to later retrieve relevant papers. With the model loaded and data structured, we create the computational backbone for everything that follows. Check out the FULL CODES here.

@dataclass
class PaperHit:
    paper: Dict[str, Any]
    score: float

class LiteratureAgent:
    def __init__(self, vectorizer, corpus_matrix, papers: List[Dict[str, Any]]):
        self.vectorizer = vectorizer
        self.corpus_matrix = corpus_matrix
        self.papers = papers

    def search(self, query: str, k: int = 3) -> List[PaperHit]:
        q_vec = self.vectorizer.transform([query])
        sims = cosine_similarity(q_vec, self.corpus_matrix)[0]
        idxs = np.argsort(-sims)[:k]
        hits = [PaperHit(self.papers[i], float(sims[i])) for i in idxs]
        return hits

We implement the literature-search component of our agent. We convert user queries into a vector space and identify the most relevant scientific papers using cosine similarity. Through this, we give our system the ability to ground its reasoning in the closest-matching prior work. Check out the FULL CODES here.

@dataclass
class ExperimentPlan:
    system: str
    hypothesis: str
    variables: Dict[str, Any]
    protocol: List[str]

@dataclass
class ExperimentResult:
    plan: ExperimentPlan
    metrics: Dict[str, float]

class ExperimentAgent:
    def design_experiment(self, question: str, hypothesis: str, hits: List[PaperHit]) -> ExperimentPlan:
        top_field = hits[0].paper["field"] if hits else "computational science"
        protocol = [
            f"Construct dataset combining ideas from: {', '.join(h.paper['id'] for h in hits)}.",
            "Split data into train/validation/test.",
            "Compare baseline model vs. augmented model implementing the hypothesis.",
            "Evaluate using appropriate metrics and perform ablation analysis.",
        ]
        variables = {
            "baseline_model": "sequence CNN",
            "augmented_model": "protein language model + CNN",
            "n_train_samples": 5000,
            "n_validation_samples": 1000,
            "metric": "AUROC",
        }
        system = f"{top_field} system related to: {question}"
        return ExperimentPlan(system=system, hypothesis=hypothesis, variables=variables, protocol=protocol)

    def run_experiment(self, plan: ExperimentPlan) -> ExperimentResult:
        base = 0.78 + 0.02 * np.random.randn()
        gain = abs(0.05 + 0.01 * np.random.randn())
        metrics = {
            "baseline_AUROC": round(base, 3),
            "augmented_AUROC": round(base + gain, 3),
            "estimated_gain": round(gain, 3),
        }
        return ExperimentResult(plan=plan, metrics=metrics)

We design and simulate experiments based on the retrieved literature and the generated hypothesis. We automatically define variables, build a protocol, and generate synthetic metrics that imitate the dynamics of a real scientific evaluation. This lets us move from theoretical ideas to an actionable experimental plan. Check out the FULL CODES here.

class ReportAgent:
    def write_report(self, question: str, hits: List[PaperHit], plan: ExperimentPlan, result: ExperimentResult) -> str:
        related_work = "\n".join(f"- {h.paper['title']} ({h.paper['field']})" for h in hits)
        protocol_str = "\n".join(f"- {step}" for step in plan.protocol)
        prompt = f"""
You are an AI research assistant writing a concise research-style report.

Research question:
{question}

Hypothesis:
{plan.hypothesis}

Relevant prior work:
{related_work}

Planned experiment:
System: {plan.system}
Variables: {plan.variables}
Protocol:
{protocol_str}

Simulated results:
{result.metrics}

Write a clear report with the following sections:
1. Background
2. Proposed Approach
3. Experimental Setup
4. Results and Discussion
5. Limitations and Future Work
"""
        return generate_text(prompt.strip(), max_new_tokens=320)

We generate a full research-style report using the LLM. We assemble the hypothesis, protocol, results, and related work into a structured document with clearly defined sections. This allows us to turn the pipeline’s raw outputs into polished scientific communication. Check out the FULL CODES here.

class ScientificAgent:
    def __init__(self):
        self.lit_agent = LiteratureAgent(vectorizer, corpus_matrix, LITERATURE)
        self.exp_agent = ExperimentAgent()
        self.report_agent = ReportAgent()

    def propose_hypothesis(self, question: str, hits: List[PaperHit]) -> str:
        context = " ".join(h.paper["abstract"] for h in hits)
        prompt = f"""
You are an AI scientist. Given a research question and related abstracts,
propose a single, testable hypothesis in 2-3 sentences.

Research question:
{question}

Related abstracts:
{context}
"""
        return generate_text(prompt.strip(), max_new_tokens=96)

    def run_pipeline(self, question: str) -> str:
        hits = self.lit_agent.search(question, k=3)
        hypothesis = self.propose_hypothesis(question, hits)
        plan = self.exp_agent.design_experiment(question, hypothesis, hits)
        result = self.exp_agent.run_experiment(plan)
        report = self.report_agent.write_report(question, hits, plan, result)
        return report

if __name__ == "__main__":
    research_question = (
        "How can protein language model embeddings improve CRISPR off-target "
        "prediction compared to sequence-only CNN baselines?"
    )
    agent = ScientificAgent()
    final_report = agent.run_pipeline(research_question)
    print(final_report)

We orchestrate the entire pipeline, searching the literature, generating a hypothesis, designing the experiment, running the simulation, and writing the report. We then execute the system on a real research question and observe the complete workflow in action. This step brings all the modules together into a unified scientific agent.

In conclusion, we see how a compact codebase can evolve into a functioning AI co-researcher capable of searching, reasoning, simulating, and summarizing. We understand how each snippet contributes to the full pipeline and how agentic components amplify one another when combined. Also, we place ourselves in a strong position to extend the agent with richer literature sources, more realistic models, and more sophisticated experimental logic, pushing our scientific exploration further with every iteration.

Check out the FULL CODES here.
The post A Coding Implementation for an Agentic AI Framework that Performs Literature Analysis, Hypothesis Generation, Experimental Planning, Simulation, and Scientific Reporting appeared first on MarkTechPost.

OceanBase Releases seekdb: An Open Source AI Native Hybrid Search Data …

AI applications rarely deal with one clean table. They mix user profiles, chat logs, JSON metadata, embeddings, and sometimes spatial data. Most teams answer this with a patchwork of an OLTP database, a vector store, and a search engine. OceanBase released seekdb, an open source AI focused database (under the Apache 2.0 license). seekdb is described as an AI native search database that unifies relational data, vector data, text, JSON, and GIS in one engine and exposes hybrid search and in database AI workflows. 

What is seekdb?

seekdb is positioned as the lightweight, embedded version of the OceanBase engine, aimed at AI applications rather than general purpose distributed deployments. It runs as a single node database, supports embedded mode and client or server mode, and remains compatible with MySQL drivers and SQL syntax.

In the capability matrix, seekdb is marked as:

Embedded database supported

Standalone database supported

Distributed database not supported

while the full OceanBase product covers the distributed case.

From a data model perspective, seekdb supports:

Relational data with standard SQL

Vector search

Full text search

JSON data

Spatial GIS data

all inside one storage and indexing layer.

Hybrid search as the core feature

The main feature OceanBase pushes is hybrid search. This is search that combines vector based semantic retrieval, full text keyword retrieval, and scalar filters in a single query and a single ranking step.

seekdb implements hybrid search through a system package named DBMS_HYBRID_SEARCH with two entry points:

DBMS_HYBRID_SEARCH.SEARCH which returns results as JSON, sorted by relevance

DBMS_HYBRID_SEARCH.GET_SQL which returns the concrete SQL string used for execution

The hybrid search path can run:

pure vector search

pure full text search

combined hybrid search

and can push relational filters and joins down into storage. It also supports query reranking strategies like weighted scores and reciprocal rank fusion and can plug in large language model based re-rankers.
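Reciprocal rank fusion itself is a simple, generic algorithm. The Python sketch below illustrates how a vector ranking and a full text ranking can be fused by rank position; it is meant to clarify the strategy, not to reproduce seekdb's internal implementation, and the k=60 constant is the common default in the RRF literature.

def reciprocal_rank_fusion(result_lists, k: int = 60):
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    Generic RRF sketch to illustrate the reranking strategy seekdb exposes;
    not the engine's internal code.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a vector-search ranking with a full-text (BM25) ranking.
vector_hits = ["doc3", "doc1", "doc7"]
fulltext_hits = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([vector_hits, fulltext_hits]))  # doc1 and doc3 rise to the top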

For retrieval augmented generation (RAG) and agent memory, this means you can write a single SQL query that does semantic matching on embeddings, exact matching on product codes or proper nouns, and relational filtering on user or tenant scopes.

Vector and full text engine details

At its core, seekdb exposes a modern vector and full text stack.

For vectors, seekdb:

supports dense vectors and sparse vectors

supports Manhattan, Euclidean, inner product, and cosine distance metrics

provides in memory index types such as HNSW, HNSW SQ, HNSW BQ

provides disk based index types including IVF and IVF PQ

Hybrid vector indexes show how you can store raw text, let seekdb call an embedding model automatically, and have the system maintain the corresponding vector index without a separate preprocessing pipeline.

For text, seekdb offers full text search with:

keyword, phrase, and Boolean queries

BM25 ranking for relevance

multiple tokenizer modes

The key point is that full text and vector indexes are first class and are integrated in the same query planner as scalar indexes and GIS indexes, so hybrid search does not need external orchestration.

AI functions inside the database

seekdb includes built in AI function expressions that let you call models directly from SQL, without a separate application service mediating every call. The main functions are:

AI_EMBED to convert text into embeddings

AI_COMPLETE for text generation using a chat or completion model

AI_RERANK to rerank a list of candidates

AI_PROMPT to assemble prompt templates and dynamic values into a JSON object for AI_COMPLETE

Model metadata and endpoints are managed by the DBMS_AI_SERVICE package, which lets you register external providers, set URLs, and configure keys, all on the database side. 

Multimodal data and workloads

seekdb is built to handle multiple data modalities in one node. It has a multimodal data and indexing layer that covers vectors, text, JSON, and GIS, and a multi-model compute layer for hybrid workloads across vector, full text, and scalar conditions.

It also provides JSON indexes for metadata queries and GIS indexes for spatial conditions. This allows queries like:

find semantically similar documents

filter by JSON metadata like tenant, region, or category

constrain by spatial range or polygon

without leaving the same engine.

Because seekdb is derived from the OceanBase engine, it inherits ACID transactions, row and column hybrid storage, and vectorized execution, although high scale distributed deployments remain a job for the full OceanBase database.

Comparison Table

Key Takeaways

AI native hybrid search: seekdb unifies vector search, full text search and relational filtering in a single SQL and DBMS_HYBRID_SEARCH interface, so RAG and agent workloads can run multi signal retrieval in one query instead of stitching together multiple engines.

Multimodal data in one engine: seekdb stores and indexes relational data, vectors, text, JSON and GIS in the same engine, which lets AI applications keep documents, embeddings and metadata consistent without maintaining separate databases.

In database AI functions for RAG: With AI_EMBED, AI_COMPLETE, AI_RERANK and AI_PROMPT, seekdb can call embedding models, LLMs and rerankers directly from SQL, which simplifies RAG pipelines and moves more orchestration logic into the database layer.

Single node, embedded friendly design: seekdb is a single node, MySQL compatible engine that supports embedded and standalone modes, while distributed, large scale deployments remain the role of full OceanBase, which makes seekdb suitable for local, edge and service embedded AI workloads.

Open source and tool ecosystem: seekdb is open sourced under Apache 2.0 and integrates with a growing ecosystem of AI tools and frameworks, with Python support via pyseekdb and MCP based integration for code assistants and agents, so it can act as a unified data plane for AI applications.

Check out the Repo and Project.
The post OceanBase Releases seekdb: An Open Source AI Native Hybrid Search Database for Multi-model RAG and AI Agents appeared first on MarkTechPost.

Tencent Hunyuan Releases HunyuanOCR: a 1B Parameter End to End OCR Exp …

Tencent Hunyuan has released HunyuanOCR, a 1B parameter vision language model that is specialized for OCR and document understanding. The model is built on Hunyuan’s native multimodal architecture and runs spotting, parsing, information extraction, visual question answering, and text image translation through a single end to end pipeline.

HunyuanOCR is a lightweight alternative to general VLMs such as Gemini 2.5 and Qwen3 VL that still matches or surpasses them on OCR centric tasks. It targets production use cases like document parsing, card and receipt extraction, video subtitle extraction, and multilingual document translation.

https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/HunyuanOCR_Technical_Report.pdf

Architecture, Native Resolution ViT plus Lightweight LLM

HunyuanOCR uses 3 main modules: a Native Resolution Visual Encoder called Hunyuan ViT, an Adaptive MLP Connector, and a lightweight language model. The encoder is based on SigLIP-v2-400M and is extended to support arbitrary input resolutions through adaptive patching that preserves the original aspect ratio. Images are split into patches according to their native proportions and processed with global attention, which improves recognition on long text lines, long documents, and low quality scans.

The Adaptive MLP Connector performs learnable pooling on the spatial dimension. It compresses the dense visual tokens into a shorter sequence, while keeping information from text dense regions. This reduces sequence length passed to the language model and lowers compute, while preserving OCR relevant details.

The language model is based on the densely architected Hunyuan 0.5B model and uses XD RoPE. XD RoPE splits rotary position embeddings into 4 subspaces for text, height, width, and time. This gives the model a native way to align 1D token order with 2D layout and 3D spatiotemporal structure. As a result, the same stack can handle multi column pages, cross page flows, and sequences of video frames.

Training and inference follow a fully end to end paradigm. There is no external layout analysis or post processing model in the loop. All tasks are expressed as natural language prompts and handled in a single forward pass. This design removes error propagation across pipeline stages and simplifies deployment.

Data and Pre Training Recipe

The data pipeline builds more than 200M image text pairs, across 9 real world scenarios, including street views, documents, advertisements, handwritten text, screenshots, cards and certificates and invoices, game interfaces, video frames, and artistic typography. The corpus covers more than 130 languages.

Synthetic data comes from a multilingual generator that supports right to left scripts and paragraph level rendering. The pipeline controls font, language, rotation, and RGB values, and applies warping, blur, and local lighting changes to simulate mobile captures and other hard conditions.

https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/HunyuanOCR_Technical_Report.pdf

Pre training follows 4 stages. Stage-1 performs vision language alignment with pure text, synthetic parsing and recognition data, and general caption data, using 50B tokens and 8k context. Stage-2 runs multimodal pre training on 300B tokens that mix pure text with synthetic spotting, parsing, translation, and VQA samples. Stage-3 extends context length to 32k with 80B tokens focused on long documents and long text. Stage-4 is application oriented supervised fine tuning on 24B tokens of human annotated and hard negative data, keeping 32k context and unified instruction templates.

Reinforcement Learning with Verifiable Rewards

After supervised training, HunyuanOCR is further optimized with reinforcement learning. The research team uses Group Relative Policy Optimization (GRPO) and a Reinforcement Learning with Verifiable Rewards setup for structured tasks. For text spotting, the reward is based on intersection over union matching of boxes combined with normalized edit distance over text. For document parsing, the reward uses normalized edit distance between the generated structure and the reference.

For VQA and translation, the system uses an LLM as a judge. VQA uses a binary reward that checks semantic match. Translation uses a COMET style scoring LLM with scores in [0, 5], normalized to [0, 1]. The training framework enforces length limits and strict formats, and assigns zero reward when outputs overflow or break schema, which stabilizes optimization and encourages valid JSON or structured outputs.
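As a rough illustration of the spotting reward described above, the sketch below matches predicted and gold boxes by IoU and scores matched text with one minus the normalized edit distance. The per gold best-match strategy, the 0.5 IoU threshold, and the averaging are illustrative assumptions, not HunyuanOCR's exact recipe.

def iou(box_a, box_b):
    """Intersection over union for axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def spotting_reward(pred, gold, iou_threshold: float = 0.5) -> float:
    """Toy text-spotting reward: IoU-match boxes, then score matched text."""
    if not gold:
        return 1.0 if not pred else 0.0
    total = 0.0
    for g_box, g_text in gold:
        best = 0.0
        for p_box, p_text in pred:
            if iou(p_box, g_box) >= iou_threshold:
                ned = edit_distance(p_text, g_text) / max(len(g_text), 1)
                best = max(best, 1.0 - min(ned, 1.0))
        total += best
    return total / len(gold)

gold = [((0, 0, 100, 30), "INVOICE")]
pred = [((2, 1, 98, 29), "INVO1CE")]
print(spotting_reward(pred, gold))  # about 0.857, one character error over seven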

Benchmark Results, a 1B Model Competing with Larger VLMs

On the internal text spotting benchmark of 900 images across 9 categories, HunyuanOCR reaches an overall score of 70.92. It outperforms traditional pipeline methods like PaddleOCR and BaiduOCR and also general VLMs such as Gemini 2.5 Pro, Qwen3 VL 2B, Qwen3 VL 235B, and Seed 1.6 Vision, despite using far fewer parameters.

On OmniDocBench, HunyuanOCR achieves 94.10 overall, with 94.73 on formulas and 91.81 on tables. On the Wild OmniDocBench variant, which prints and recaptures documents under folds and lighting changes, it scores 85.21 overall. On DocML, a multilingual parsing benchmark across 14 non Chinese and non English languages, it reaches 91.03, and the paper reports state of the art results across all 14 languages.

For information extraction and VQA, HunyuanOCR reaches 92.29 accuracy on cards, 92.53 on receipts, and 92.87 on video subtitles. On OCRBench, it scores 860, higher than DeepSeek OCR at similar scale and close to larger general VLMs like Qwen3 VL 2B Instruct and Gemini 2.5 Pro.

In text image translation, HunyuanOCR uses the DoTA benchmark and a DocML based internal set. It achieves a strong COMET score on DoTA for English to Chinese document translation, and the model wins first place in Track 2.2 OCR free Small Model of the ICDAR 2025 DIMT competition.

https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/HunyuanOCR_Technical_Report.pdf

Key Takeaways

Compact end to end OCR VLM: HunyuanOCR is a 1B parameter OCR focused vision language model that connects a 0.4B native resolution ViT to a 0.5B Hunyuan language model through an MLP adapter, and runs spotting, parsing, information extraction, VQA and translation in one end to end instruction driven pipeline without external layout or detection modules.

Unified support for diverse OCR scenarios: The model is trained on more than 200M image text pairs across 9 scenarios, including documents, street views, advertisements, handwritten content, screenshots, cards and invoices, game interfaces and video frames, with coverage of over 130 languages in training and support for more than 100 languages in deployment.

Data pipeline plus reinforcement learning: Training uses a 4 stage recipe, vision language alignment, multimodal pre training, long context pre training and application oriented supervised fine tuning, followed by reinforcement learning with group relative policy optimization and verifiable rewards for spotting, parsing, VQA and translation.

Strong benchmark results for sub 3B models: HunyuanOCR reaches 94.1 on OmniDocBench for document understanding, and achieves 860 on OCRBench, which is reported as state of the art among vision language models with fewer than 3B parameters, while also outperforming several commercial OCR APIs and larger open models such as Qwen3 VL 4B on core OCR benchmarks.

Editorial Notes

HunyuanOCR is a strong signal that OCR specific VLMs are maturing into practical infrastructure, not just benchmarks. Tencent combines a 1B parameter end to end architecture with Native Vision Transformer, Adaptive MLP Connector and RL with verifiable rewards to deliver a single model that covers spotting, parsing, IE, VQA and translation across more than 100 languages, and it does so while reaching leading scores on OCRBench for sub 3B models and 94.1 on OmniDocBench. Overall, HunyuanOCR marks an important shift toward compact, instruction driven OCR engines that are realistic for production deployment.

Check out the Paper, Model Weights and Repo.
The post Tencent Hunyuan Releases HunyuanOCR: a 1B Parameter End to End OCR Expert VLM appeared first on MarkTechPost.

How to Implement Functional Components of Transformer and Mini-GPT Mod …

In this tutorial, we explore how to build neural networks from scratch using Tinygrad while remaining fully hands-on with tensors, autograd, attention mechanisms, and transformer architectures. We progressively build every component ourselves, from basic tensor operations to multi-head attention, transformer blocks, and, finally, a working mini-GPT model. Through each stage, we observe how Tinygrad’s simplicity helps us understand what happens under the hood when models train, optimize, and fuse kernels for performance. Check out the FULL CODES here.

import subprocess, sys, os
print("Installing dependencies...")
subprocess.check_call(["apt-get", "install", "-qq", "clang"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "git+https://github.com/tinygrad/tinygrad.git"])

import numpy as np
from tinygrad import Tensor, nn, Device
from tinygrad.nn import optim
import time

print(f"Using device: {Device.DEFAULT}")
print("=" * 60)

print("\nPART 1: Tensor Operations & Autograd")
print("-" * 60)

x = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = Tensor([[2.0, 0.0], [1.0, 2.0]], requires_grad=True)

z = (x @ y).sum() + (x ** 2).mean()
z.backward()

print(f"x:\n{x.numpy()}")
print(f"y:\n{y.numpy()}")
print(f"z (scalar): {z.numpy()}")
print(f"∂z/∂x:\n{x.grad.numpy()}")
print(f"∂z/∂y:\n{y.grad.numpy()}")

We set up Tinygrad in our Colab environment and immediately begin experimenting with tensors and automatic differentiation. We create a small computation graph and observe how gradients flow through matrix operations. As we print the outputs, we gain an intuitive understanding of how Tinygrad handles backpropagation under the hood. Check out the FULL CODES here.

print("\n\nPART 2: Building Custom Layers")
print("-" * 60)

class MultiHeadAttention:
    def __init__(self, dim, num_heads):
        self.num_heads = num_heads
        self.dim = dim
        self.head_dim = dim // num_heads
        self.qkv = Tensor.glorot_uniform(dim, 3 * dim)
        self.out = Tensor.glorot_uniform(dim, dim)

    def __call__(self, x):
        B, T, C = x.shape[0], x.shape[1], x.shape[2]
        qkv = x.reshape(B * T, C).dot(self.qkv).reshape(B, T, 3, self.num_heads, self.head_dim)
        q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]
        scale = (self.head_dim ** -0.5)
        attn = (q @ k.transpose(-2, -1)) * scale
        attn = attn.softmax(axis=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return out.reshape(B * T, C).dot(self.out).reshape(B, T, C)

class TransformerBlock:
    def __init__(self, dim, num_heads):
        self.attn = MultiHeadAttention(dim, num_heads)
        self.ff1 = Tensor.glorot_uniform(dim, 4 * dim)
        self.ff2 = Tensor.glorot_uniform(4 * dim, dim)
        self.ln1_w = Tensor.ones(dim)
        self.ln2_w = Tensor.ones(dim)

    def __call__(self, x):
        x = x + self.attn(self._layernorm(x, self.ln1_w))
        ff = x.reshape(-1, x.shape[-1])
        ff = ff.dot(self.ff1).gelu().dot(self.ff2)
        x = x + ff.reshape(x.shape)
        return self._layernorm(x, self.ln2_w)

    def _layernorm(self, x, w):
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        return w * (x - mean) / (var + 1e-5).sqrt()

We design our own multi-head attention module and a transformer block entirely from scratch. We implement the projections, attention scores, softmax, feedforward layers, and layer normalization manually. As we run this code, we see how each component contributes to a transformer layer’s overall behavior. Check out the FULL CODES here.

print("\nPART 3: Mini-GPT Architecture")
print("-" * 60)

class MiniGPT:
    def __init__(self, vocab_size=256, dim=128, num_heads=4, num_layers=2, max_len=32):
        self.vocab_size = vocab_size
        self.dim = dim
        self.tok_emb = Tensor.glorot_uniform(vocab_size, dim)
        self.pos_emb = Tensor.glorot_uniform(max_len, dim)
        self.blocks = [TransformerBlock(dim, num_heads) for _ in range(num_layers)]
        self.ln_f = Tensor.ones(dim)
        self.head = Tensor.glorot_uniform(dim, vocab_size)

    def __call__(self, idx):
        B, T = idx.shape[0], idx.shape[1]
        tok_emb = self.tok_emb[idx.flatten()].reshape(B, T, self.dim)
        pos_emb = self.pos_emb[:T].reshape(1, T, self.dim)
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        x = self.ln_f * (x - mean) / (var + 1e-5).sqrt()
        return x.reshape(B * T, self.dim).dot(self.head).reshape(B, T, self.vocab_size)

    def get_params(self):
        params = [self.tok_emb, self.pos_emb, self.ln_f, self.head]
        for block in self.blocks:
            params.extend([block.attn.qkv, block.attn.out, block.ff1, block.ff2, block.ln1_w, block.ln2_w])
        return params

model = MiniGPT(vocab_size=256, dim=64, num_heads=4, num_layers=2, max_len=16)
params = model.get_params()
total_params = sum(p.numel() for p in params)
print(f"Model initialized with {total_params:,} parameters")

We assemble the full MiniGPT architecture using the components built earlier. We embed tokens, add positional information, stack multiple transformer blocks, and project the final outputs back to vocab logits. As we initialize the model, we begin to appreciate how a compact transformer can be built with surprisingly few moving parts. Check out the FULL CODES here.

print("\n\nPART 4: Training Loop")
print("-" * 60)

def gen_data(batch_size, seq_len):
    x = np.random.randint(0, 256, (batch_size, seq_len))
    y = np.roll(x, 1, axis=1)
    y[:, 0] = x[:, 0]
    return Tensor(x, dtype='int32'), Tensor(y, dtype='int32')

optimizer = optim.Adam(params, lr=0.001)
losses = []

print("Training to predict previous token in sequence...")
with Tensor.train():
    for step in range(20):
        start = time.time()
        x_batch, y_batch = gen_data(batch_size=16, seq_len=16)
        logits = model(x_batch)
        B, T, V = logits.shape[0], logits.shape[1], logits.shape[2]
        loss = logits.reshape(B * T, V).sparse_categorical_crossentropy(y_batch.reshape(B * T))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.numpy())
        elapsed = time.time() - start
        if step % 5 == 0:
            print(f"Step {step:3d} | Loss: {loss.numpy():.4f} | Time: {elapsed*1000:.1f}ms")

print("\n\nPART 5: Lazy Evaluation & Kernel Fusion")
print("-" * 60)

N = 512
a = Tensor.randn(N, N)
b = Tensor.randn(N, N)

print("Creating computation: (A @ B.T + A).sum()")
lazy_result = (a @ b.T + a).sum()
print("→ No computation done yet (lazy evaluation)")

print("\nCalling .realize() to execute...")
start = time.time()
realized = lazy_result.realize()
elapsed = time.time() - start

print(f"✓ Computed in {elapsed*1000:.2f}ms")
print(f"Result: {realized.numpy():.4f}")
print("\nNote: Operations were fused into optimized kernels!")

We train the MiniGPT model on simple synthetic data and observe the loss decreasing across steps. We also explore Tinygrad’s lazy execution model by creating a fused kernel that executes only when it is realized. As we monitor timings, we understand how kernel fusion improves performance. Check out the FULL CODES here.

print("\n\nPART 6: Custom Operations")
print("-" * 60)

def custom_activation(x):
    return x * x.sigmoid()

x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]], requires_grad=True)
y = custom_activation(x)
loss = y.sum()
loss.backward()

print(f"Input: {x.numpy()}")
print(f"Swish(x): {y.numpy()}")
print(f"Gradient: {x.grad.numpy()}")

print("\n\n" + "=" * 60)
print("Tutorial Complete!")
print("=" * 60)
print("""
Key Concepts Covered:
1. Tensor operations with automatic differentiation
2. Custom neural network layers (Attention, Transformer)
3. Building a mini-GPT language model from scratch
4. Training loop with Adam optimizer
5. Lazy evaluation and kernel fusion
6. Custom activation functions
""")

We implement a custom activation function and verify that gradients propagate correctly through it. We then print a summary of all major concepts covered in the tutorial. As we finish, we reflect on how each section builds our ability to understand, modify, and extend deep learning internals using Tinygrad.

In conclusion, we reinforce our understanding of how neural networks truly operate beneath modern abstractions, and we experience firsthand how Tinygrad empowers us to tinker with every internal detail. We have built a transformer, trained it on synthetic data, experimented with lazy evaluation and kernel fusion, and even created custom operations, all within a minimal, transparent framework. At last, we recognize how this workflow prepares us for deeper experimentation, whether we extend the model, integrate real datasets, or continue exploring Tinygrad’s low-level capabilities.

Check out the FULL CODES here.
The post How to Implement Functional Components of Transformer and Mini-GPT Model from Scratch Using Tinygrad to Understand Deep Learning Internals appeared first on MarkTechPost.

Black Forest Labs Releases FLUX.2: A 32B Flow Matching Transformer for …

Black Forest Labs has released FLUX.2, its second generation image generation and editing system. FLUX.2 targets real world creative workflows such as marketing assets, product photography, design layouts, and complex infographics, with editing support up to 4 megapixels and strong control over layout, logos, and typography.

FLUX.2 product family and FLUX.2 [dev]

The FLUX.2 family spans hosted APIs and open weights:

FLUX.2 [pro] is the managed API tier. It targets state of the art quality relative to closed models, with high prompt adherence and low inference cost, and is available in the BFL Playground, BFL API, and partner platforms.

FLUX.2 [flex] exposes parameters such as number of steps and guidance scale, so developers can trade off latency, text rendering accuracy, and visual detail.

FLUX.2 [dev] is the open weight checkpoint, derived from the base FLUX.2 model. It is described as the most powerful open weight image generation and editing model, combining text to image and multi image editing in one checkpoint, with 32 billion parameters.

FLUX.2 [klein] is a coming open source Apache 2.0 variant, size distilled from the base model for smaller setups, with many of the same capabilities.

All variants support image editing from text and multiple references in a single model, which removes the need to maintain separate checkpoints for generation and editing.

Architecture, latent flow, and the FLUX.2 VAE

FLUX.2 uses a latent flow matching architecture. The core design couples a Mistral-3 24B vision language model with a rectified flow transformer that operates on latent image representations. The vision language model provides semantic grounding and world knowledge, while the transformer backbone learns spatial structure, materials, and composition.

The model is trained to map noise latents to image latents under text conditioning, so the same architecture supports both text driven synthesis and editing. For editing, latents are initialized from existing images, then updated under the same flow process while preserving structure.
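To make the training objective concrete, the following minimal Python sketch shows the generic rectified-flow (flow matching) loss described above. It is not FLUX.2's actual training code; velocity_model is a stand-in for the transformer backbone and text_cond for the text conditioning.

import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_model, image_latents, text_cond):
    # Source samples: pure noise latents with the same shape as the image latents.
    noise = torch.randn_like(image_latents)
    # Random interpolation times, one per sample in the batch.
    t = torch.rand(image_latents.shape[0], device=image_latents.device)
    t_ = t.view(-1, 1, 1, 1)
    # Straight-line path between noise and data, the rectified-flow interpolant.
    x_t = (1.0 - t_) * noise + t_ * image_latents
    # The model is trained to predict the constant velocity along that path.
    target_velocity = image_latents - noise
    pred = velocity_model(x_t, t, text_cond)
    return F.mse_loss(pred, target_velocity)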

A new FLUX.2 VAE defines the latent space. It is designed to balance learnability, reconstruction quality, and compression, and is released separately on Hugging Face under an Apache 2.0 license. This autoencoder is the backbone for all FLUX.2 flow models and can also be reused in other generative systems.
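As a rough illustration of reusing the released autoencoder on its own, the sketch below loads it with Diffusers and round-trips a dummy image through the latent space. The repository id and the AutoencoderKL loading class are assumptions for illustration; check the Hugging Face release for the actual names.

import torch
from diffusers import AutoencoderKL

# Hypothetical repository id; the actual FLUX.2 VAE repo and class may differ.
vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.2-VAE", torch_dtype=torch.float16).to("cuda")

# Dummy image batch in [-1, 1], standing in for a real preprocessed photo.
image = torch.rand(1, 3, 512, 512, dtype=torch.float16, device="cuda") * 2 - 1
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # image -> latent space
    recon = vae.decode(latents).sample                # latent -> reconstructed image
print(latents.shape, recon.shape)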

https://bfl.ai/blog/flux-2

Capabilities for production workflows

The FLUX.2 Docs and Diffusers integration highlight several key capabilities:

Multi reference support: FLUX.2 can combine up to 10 reference images to maintain character identity, product appearance, and style across outputs.

Photoreal detail at 4MP: the model can edit and generate images up to 4 megapixels, with improved textures, skin, fabrics, hands, and lighting suitable for product shots and photo like use cases.

Robust text and layout rendering: it can render complex typography, infographics, memes, and user interface layouts with small legible text, which is a common weakness in many older models.

World knowledge and spatial logic: the model is trained for more grounded lighting, perspective, and scene composition, which reduces artifacts and the synthetic look.

https://bfl.ai/blog/flux-2

Key Takeaways

FLUX.2 is a 32B latent flow matching transformer that unifies text to image, image editing, and multi reference composition in a single checkpoint.

FLUX.2 [dev] is the open weight variant, paired with the Apache 2.0 FLUX.2 VAE, while the core model weights use the FLUX.2-dev Non Commercial License with mandatory safety filtering.

The system supports up to 4 megapixel generation and editing, robust text and layout rendering, and up to 10 visual references for consistent characters, products, and styles.

Full precision inference requires more than 80GB VRAM, but 4 bit and FP8 quantized pipelines with offloading make FLUX.2 [dev] usable on 18GB to 24GB GPUs and even 8GB cards with sufficient system RAM.

Editorial Notes

FLUX.2 is an important step for open weight visual generation, since it combines a 32B rectified flow transformer, a Mistral 3 24B vision language model, and the FLUX.2 VAE into a single high fidelity pipeline for text to image and editing. The clear VRAM profiles, quantized variants, and strong integrations with Diffusers, ComfyUI, and Cloudflare Workers make it practical for real workloads, not only benchmarks. This release pushes open image models closer to production grade creative infrastructure.

Check out the Technical details, Model weight and Repo.
The post Black Forest Labs Releases FLUX.2: A 32B Flow Matching Transformer for Production Image Pipelines appeared first on MarkTechPost.

How Myriad Genetics achieved fast, accurate, and cost-efficient docume …

This post was written with Martyna Shallenberg and Brode Mccrady from Myriad Genetics.
Healthcare organizations face challenges in processing and managing high volumes of complex medical documentation while maintaining quality in patient care. These organizations need solutions to process documents effectively to meet growing demands. Myriad Genetics, a provider of genetic testing and precision medicine solutions serving healthcare providers and patients worldwide, addresses this challenge.
Myriad’s Revenue Engineering Department processes thousands of healthcare documents daily across Women’s Health, Oncology, and Mental Health divisions. The company classifies incoming documents into classes such as Test Request Forms, Lab Results, Clinical Notes, and Insurance to automate Prior Authorization workflows. The system routes these documents to appropriate external vendors for processing based on their identified document class. Myriad’s staff manually perform Key Information Extraction (KIE), capturing insurance details, patient information, and test results to determine Medicare eligibility and support downstream processes.
As document volumes increased, Myriad faced challenges with its existing system. The automated document classification solution worked but was costly and time-consuming. Information extraction remained manual due to complexity. To address high costs and slow processing, Myriad needed a better solution.
This post explores how Myriad Genetics partnered with the AWS Generative AI Innovation Center (GenAIIC) to transform their healthcare document processing pipeline using Amazon Bedrock and Amazon Nova foundation models. We detail the challenges with their existing solution, and how generative AI reduced costs and improved processing speed.
We examine the technical implementation using AWS’s open source GenAI Intelligent Document Processing (GenAI IDP) Accelerator solution, the optimization strategies used for document classification and key information extraction, and the measurable business impact on Myriad’s prior authorization workflows. We cover how we used prompt engineering techniques, model selection strategies, and architectural decisions to build a scalable solution that processes complex medical documents with high accuracy while reducing operational costs.
Document processing bottlenecks limiting healthcare operations
Myriad Genetics’ daily operations depend on efficiently processing complex medical documents containing critical information for patient care workflows and regulatory compliance. Their existing solution combined Amazon Textract for Optical Character Recognition (OCR) with Amazon Comprehend for document classification.
Despite 94% classification accuracy, this solution had operational challenges:

Operational costs: 3 cents per page resulting in $15,000 monthly expenses per business unit
Classification latency: 8.5 minutes per document, delaying downstream prior authorization workflows

Information extraction was entirely manual, requiring contextual understanding to differentiate critical clinical distinctions (like “is metastatic” versus “is not metastatic”) and to locate information like insurance numbers and patient details across varying document formats. The processing burden was substantial: customer service in the Women’s Health business unit alone required up to 10 full-time employees contributing 78 hours daily.
Myriad needed a solution to:

Reduce document classification costs while maintaining or improving accuracy
Accelerate document processing to eliminate workflow bottlenecks
Automate information extraction for medical documents
Scale across multiple business units and document types

Amazon Bedrock and generative AI
Modern large language models (LLMs) process complex healthcare documents with high accuracy due to pre-training on massive text corpora. This pre-training enables LLMs to understand language patterns and document structures without feature engineering or large labeled datasets. Amazon Bedrock is a fully managed service that offers a broad range of high-performing LLMs from leading AI companies. It provides the security, privacy, and responsible AI capabilities that healthcare organizations require when processing sensitive medical information. For this solution, we used Amazon’s newest foundation models:

Amazon Nova Pro: A cost-effective, low-latency model ideal for document classification
Amazon Nova Premier: An advanced model with reasoning capabilities for information extraction

Solution overview
We implemented a solution with Myriad using AWS’s open source GenAI IDP Accelerator. The accelerator provides a scalable, serverless architecture that converts unstructured documents into structured data. The accelerator processes multiple documents in parallel through configurable concurrency limits without overwhelming downstream services. Its built-in evaluation framework lets users provide expected output through the user interface (UI) and evaluate generated results to iteratively customize configuration and improve accuracy.

The accelerator offers 1-click deployment with a choice of pre-built patterns, each optimized for different workloads and offering different configurability, cost, and accuracy trade-offs:

Pattern 1 – Uses Amazon Bedrock Data Automation, a fully managed service that offers rich out-of-the-box features, ease of use, and straightforward per-page pricing. This pattern is recommended for most use cases.
Pattern 2 – Uses Amazon Textract and Amazon Bedrock with Amazon Nova, Anthropic’s Claude, or custom fine-tuned Amazon Nova models. This pattern is ideal for complex documents requiring custom logic.
Pattern 3 – Uses Amazon Textract, Amazon SageMaker with a fine-tuned model for classification, and Amazon Bedrock for extraction. This pattern is ideal for documents requiring specialized classification.

Pattern 2 proved most suitable for this project, meeting the critical requirement of low cost while offering flexibility to optimize accuracy through prompt engineering and LLM selection. This pattern offers no-code configuration: document types, extraction fields, and processing logic are customized through a configuration file that is editable in the web UI.
We customized the definitions of document classes, the key attributes and their definitions per document class, the LLM choice, LLM hyperparameters, and the classification and extraction LLM prompts via Pattern 2’s config file. In production, Myriad integrated this solution into their existing event-driven architecture. The production pipeline works as follows:

Document Ingestion: Incoming order events trigger document retrieval from source document management systems, with cache optimization for previously processed documents.
Concurrency Management: DynamoDB tracks concurrent AWS Step Functions jobs, while Amazon Simple Queue Service (SQS) queues files that exceed the concurrency limits for document processing.
Text Extraction: Amazon Textract extracts text, layout information, tables, and forms from the normalized documents.
Classification: The configured LLM analyzes the extracted content based on the customized document classification prompt in the config file and classifies documents into the appropriate categories (a minimal sketch of this call follows the list).
Key Information Extraction: The configured LLM extracts medical information using the extraction prompt provided in the config file.
Structured Output: The pipeline formats the results in a structured manner and delivers them to Myriad’s Authorization System via RESTful operations.
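As a rough illustration of the Classification step, the sketch below sends the Textract-extracted text to a Bedrock model through the Converse API. The model id, prompt wording, and class list are illustrative assumptions for this post, not the accelerator's actual configuration, which lives in the Pattern 2 config file.

import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative prompt; the real class definitions and wording live in the config file.
CLASSIFICATION_PROMPT = """You are classifying healthcare documents.
Possible classes: Test Request Form, Lab Results, Clinical Notes, Insurance.
Return only the class name.

Document text:
{document_text}"""

def classify(document_text: str) -> str:
    response = bedrock.converse(
        modelId="us.amazon.nova-pro-v1:0",  # assumed Amazon Nova Pro inference profile id
        messages=[{
            "role": "user",
            "content": [{"text": CLASSIFICATION_PROMPT.format(document_text=document_text)}],
        }],
        inferenceConfig={"temperature": 0.0, "maxTokens": 50},
    )
    return response["output"]["message"]["content"][0]["text"].strip()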

Document classification with generative AI
While Myriad’s existing solution achieved 94% accuracy, misclassifications occurred due to structural similarities, overlapping content, and shared formatting patterns across document types. This semantic ambiguity made it difficult to distinguish between similar documents. We guided Myriad on prompt optimization techniques that used the LLM’s contextual understanding capabilities. This approach moved beyond pattern matching to enable semantic analysis of document context and purpose, identifying distinguishing features that human experts recognize but previous automated systems missed.
AI-driven prompt engineering for document classification
We developed class definitions with distinguishing characteristics between similar document types. To identify these differentiators, we provided document samples from each class to Anthropic Claude Sonnet 3.7 on Amazon Bedrock with model reasoning enabled (a feature that allows the model to demonstrate its step-by-step analysis process). The model identified distinguishing features between similar document classes, which Myriad’s subject matter experts refined and incorporated into the GenAI IDP Accelerator’s Pattern 2 config file for document classification prompts.
Format-based classification strategies
We used document structure and formatting as key differentiators to distinguish between similar document types that shared comparable content but differed in structure. We enabled the classification models to recognize format-specific characteristics such as layout structures, field arrangements, and visual elements, allowing the system to differentiate between documents that textual content alone cannot distinguish. For example, lab reports and test results both contain patient information and medical data, but lab reports display numerical values in tabular format while test results follow a narrative format. We instructed the LLM: “Lab reports contain numerical results organized in tables with reference ranges and units. Test results present findings in paragraph format with clinical interpretations.”
Implementing negative prompting for enhanced accuracy
We implemented negative prompting techniques to resolve confusion between similar documents by explicitly instructing the model what classifications to avoid. This approach added exclusionary language to classification prompts, specifying characteristics that should not be associated with each document type. Initially, the system frequently misclassified Test Request Forms (TRFs) as Test Results due to confusion between patient medical history and lab measurements. Adding a negative prompt like “These forms contain patient medical history. DO NOT confuse them with test results which contain current/recent lab measurements” to the TRF definition improved the classification accuracy by 4%. By providing explicit guidance on common misclassification patterns, the system avoided typical errors and confusion between similar document types.
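The snippet below illustrates how such class definitions, format cues, and negative prompts might be organized before being injected into the classification prompt. The wording is paraphrased for illustration and is not Myriad's actual configuration.

# Paraphrased class definitions combining format cues and negative prompting;
# not Myriad's actual configuration.
CLASS_DEFINITIONS = {
    "Test Request Form": (
        "Structured intake forms with checkboxes and patient medical history. "
        "DO NOT confuse them with test results, which contain current/recent lab measurements."
    ),
    "Lab Results": "Numerical results organized in tables with reference ranges and units.",
    "Test Results": "Findings presented in paragraph format with clinical interpretations.",
}

def class_definition_block() -> str:
    # Rendered into the classification prompt so the LLM sees explicit
    # distinguishing features and exclusions for each class.
    return "\n".join(f"- {name}: {text}" for name, text in CLASS_DEFINITIONS.items())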
Model selection for cost and performance optimization
Model selection drives optimal cost-performance at scale, so we conducted comprehensive benchmarking using the GenAI IDP Accelerator’s evaluation framework. We tested four foundation models—Amazon Nova Lite, Amazon Nova Pro, Amazon Nova Premier, and Anthropic Claude Sonnet 3.7—using 1,200 healthcare documents across three document classes: Test Request Forms, Lab Results, and Insurance. We assessed each model using three critical metrics: classification accuracy, processing latency, and cost per document. The accelerator’s cost tracking enabled direct comparison of operational expenses across different model configurations, ensuring performance improvements translate into measurable business value at scale.
The evaluation results demonstrated that Amazon Nova Pro achieved optimal balance for Myriad’s use case. We transitioned from Myriad’s Amazon Comprehend implementation to Amazon Nova Pro with optimized prompts for document classification, achieving significant improvements: classification accuracy increased from 94% to 98%, processing costs decreased by 77%, and processing speed improved by 80%—reducing classification time from 8.5 minutes to 1.5 minutes per document.
Automating Key Information Extraction with generative AI
Myriad’s information extraction was manual, requiring up to 10 full-time employees contributing 78 hours daily in the Women’s Health unit alone, which created operational bottlenecks and scalability constraints. Automating healthcare KIE presented challenges: checkbox fields required distinguishing between marking styles (checkmarks, X’s, handwritten marks); documents contained ambiguous visual elements like overlapping marks or content spanning multiple fields; extraction needed contextual understanding to differentiate clinical distinctions and locate information across varying document formats. We worked with Myriad to develop an automated KIE solution, implementing the following optimization techniques to address extraction complexity.
Enhanced OCR configuration for checkbox recognition
To address checkbox identification challenges, we enabled Amazon Textract’s specialized TABLES and FORMS features on the GenAI IDP Accelerator portal to improve OCR discrimination between selected and unselected checkbox elements. These features enhanced the system’s ability to detect and interpret the marking styles found in medical forms.

We enhanced accuracy by incorporating visual cues into the extraction prompts. We updated the prompts with instructions such as “look for visible marks in or around the small square boxes (✓, x, or handwritten marks)” to guide the language model in identifying checkbox selections. This combination of enhanced OCR capabilities and targeted prompting improved checkbox extraction in medical forms.
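The following sketch shows how Textract's TABLES and FORMS features surface checkbox state as selection elements in the API response. It is our own illustration, not the accelerator's code, and the input file name is hypothetical; in the accelerator the document bytes come from the ingestion step.

import boto3

textract = boto3.client("textract")

# Hypothetical single-page form image.
with open("form_page.png", "rb") as f:
    document_bytes = f.read()

response = textract.analyze_document(
    Document={"Bytes": document_bytes},
    FeatureTypes=["TABLES", "FORMS"],
)

# Selection elements carry a SELECTED / NOT_SELECTED status for each checkbox.
for block in response["Blocks"]:
    if block["BlockType"] == "SELECTION_ELEMENT":
        print(block["Id"], block["SelectionStatus"])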
Visual context learning through few-shot examples
Configuring Textract and improving prompts alone could not handle complex visual elements effectively. We implemented a multimodal approach that sent both document images and extracted text from Textract to the foundation model, enabling simultaneous analysis of visual layout and textual content for accurate extraction decisions. We implemented few-shot learning by providing example document images paired with their expected extraction outputs to guide the model’s understanding of various form layouts and marking styles. Multiple document image examples with their correct extraction patterns create lengthy LLM prompts. We leveraged the GenAI IDP Accelerator’s built-in integration with Amazon Bedrock’s prompt caching feature to reduce costs and latency. Prompt caching stores lengthy few-shot examples in memory for 5 minutes—when processing multiple similar documents within that timeframe, Bedrock reuses cached examples instead of reprocessing them, reducing both cost and processing time.
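The sketch below illustrates the idea: few-shot example content (an example image plus its expected extraction) is placed before a cache point in the Converse API request, so Bedrock can reuse it across similar documents processed within the cache window. The model id, file names, and expected-output JSON are illustrative assumptions, not Myriad's actual prompts.

import boto3

bedrock = boto3.client("bedrock-runtime")

with open("example_trf.png", "rb") as f:      # hypothetical few-shot example form
    example_image = f.read()
with open("new_document.png", "rb") as f:     # document currently being processed
    new_image = f.read()

messages = [{
    "role": "user",
    "content": [
        {"text": "Example form and its expected extraction:"},
        {"image": {"format": "png", "source": {"bytes": example_image}}},
        {"text": '{"patient_name": "Jane Doe", "is_metastatic": false}'},  # illustrative expected output
        # Content above this cache point is cached and reused for similar documents
        # processed within the cache window, cutting cost and latency.
        {"cachePoint": {"type": "default"}},
        {"text": "Now extract the same fields from this document as JSON:"},
        {"image": {"format": "png", "source": {"bytes": new_image}}},
    ],
}]

response = bedrock.converse(
    modelId="us.amazon.nova-premier-v1:0",  # assumed Amazon Nova Premier model id
    messages=messages,
)
print(response["output"]["message"]["content"][0]["text"])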
Chain of thought reasoning for complex extraction
While this multimodal approach improved extraction accuracy, we still faced challenges with overlapping and ambiguous tick marks in complex form layouts. To perform well in ambiguous and complex situations, we used Amazon Nova Premier and implemented Chain of Thought reasoning to have the model think through extraction decisions step-by-step using thinking tags. For example:

Analyze the checkbox marks in this form:

<thinking>
1. What checkboxes are present? [List all visible options]
2. Where are the marks positioned? [Describe mark locations]
3. Which marks are clear vs ambiguous? [Assess mark quality]
4. For overlapping marks: Which checkbox contains most of the mark?
5. Are marks positioned in the center or touching edges? [Prioritize center positioning]
</thinking>

Additionally, we included reasoning explanations in the few-shot examples, demonstrating how we reached conclusions in ambiguous cases. This approach enabled the model to work through complex visual evidence and contextual clues before making final determinations, improving performance with ambiguous tick marks.
Testing across 32 document samples with varying complexity levels via the GenAI IDP Accelerator revealed that Amazon Textract with Layout, TABLES, and FORMS features enabled, paired with Amazon Nova Premier’s advanced reasoning capabilities and the inclusion of few-shot examples, delivered the best results. The solution achieved 90% accuracy (same as human evaluator baseline accuracy) while processing documents in approximately 1.3 minutes each.
Results and business impact
Through our new solution, we delivered measurable improvements that met the business goals established at the project outset:
Document classification performance:

We increased accuracy from 94% to 98% through prompt optimization techniques for Amazon Nova Pro, including AI-driven prompt engineering, document-format based classification strategies, and negative prompting.
We reduced classification costs by 77% (from 3.1 to 0.7 cents per page) by migrating from Amazon Comprehend to Amazon Nova Pro with optimized prompts.
We reduced classification time by 80% (from 8.5 to 1.5 minutes per document) by choosing Amazon Nova Pro to provide a low-latency and cost-effective solution.

New automated Key Information Extraction performance:

We achieved 90% extraction accuracy (matching the baseline manual process), delivered through a combination of Amazon Textract’s document analysis capabilities, visual context learning through few-shot examples, and Amazon Nova Premier’s reasoning for complex data interpretation.
We achieved processing costs of 9 cents per page and a processing time of 1.3 minutes per document, compared to a manual baseline requiring up to 10 full-time employees working 78 hours daily per business unit.

Business impact and rollout
Myriad plans a phased rollout beginning with document classification, launching the new classification solution in the Women’s Health business unit first, followed by the Oncology and Mental Health divisions. As a result of this work, Myriad will realize up to $132K in annual savings on document classification costs. The solution also reduces each prior authorization submission time by 2 minutes; specialists now complete orders in four minutes instead of six due to faster access to tagged documents. This improvement saves 300 hours monthly across 9,000 prior authorizations in Women’s Health alone, equivalent to 50 hours per prior authorization specialist.
These measurable improvements have transformed Myriad’s operations, as their engineering leadership confirms:

“Partnering with the GenAIIC to migrate our Intelligent Document Processing solution from AWS Comprehend to Bedrock has been a transformative step forward. By improving both performance and accuracy, the solution is projected to deliver savings of more than $10,000 per month. The team’s close collaboration with Myriad’s internal engineering team delivered a high-quality, scalable solution, while their deep expertise in advanced language models has elevated our capabilities. This has been an excellent example of how innovation and partnership can drive measurable business impact.” – Martyna Shallenberg, Senior Director of Software Engineering, Myriad Genetics

Conclusion
The AWS GenAI IDP Accelerator enabled Myriad’s rapid implementation, providing a flexible framework that reduced development time. Healthcare organizations need tailored solutions—the accelerator delivers extensive customization capabilities that let users adapt solutions to specific document types and workflows without requiring extensive code changes or frequent redeployment during development. Our approach demonstrates the power of strategic prompt engineering and model selection. We achieved high accuracy in a specialized domain by focusing on prompt design, including negative prompting and visual cues. We optimized both cost and performance by selecting Amazon Nova Pro for classification and Nova Premier for complex extraction—matching the right model to each specific task.
Explore the solution for yourself
Organizations looking to improve their document processing workflows can experience these benefits firsthand. The open source GenAI IDP Accelerator that powered Myriad’s transformation is available to deploy and test in your environment. The accelerator’s straightforward setup process lets users quickly evaluate how generative AI can transform document processing challenges.
Once you’ve explored the accelerator and seen its potential impact on your workflows, reach out to the AWS GenAIIC team to explore how the GenAI IDP Accelerator can be customized and optimized for your specific use case. This hands-on approach ensures you can make informed decisions about implementing intelligent document processing in your organization.

About the authors
Priyashree Roy is a Data Scientist II at the AWS Generative AI Innovation Center, where she applies her expertise in machine learning and generative AI to develop innovative solutions for strategic AWS customers. She brings a rigorous scientific approach to complex business challenges, informed by her PhD in experimental particle physics from Florida State University and postdoctoral research at the University of Michigan.
Mofijul Islam is an Applied Scientist II and Tech Lead at the AWS Generative AI Innovation Center, where he helps customers tackle customer-centric research and business challenges using generative AI, large language models (LLM), multi-agent learning, code generation, and multimodal learning. He holds a PhD in machine learning from the University of Virginia, where his work focused on multimodal machine learning, multilingual natural language processing (NLP), and multitask learning. His research has been published in top-tier conferences like NeurIPS, International Conference on Learning Representations (ICLR), Empirical Methods in Natural Language Processing (EMNLP), Society for Artificial Intelligence and Statistics (AISTATS), and Association for the Advancement of Artificial Intelligence (AAAI), as well as Institute of Electrical and Electronics Engineers (IEEE) and Association for Computing Machinery (ACM) Transactions.
Nivedha Balakrishnan is a Deep Learning Architect II at the AWS Generative AI Innovation Center, where she helps customers design and deploy generative AI applications to solve complex business challenges. Her expertise spans large language models (LLMs), multimodal learning, and AI-driven automation. She holds a Master’s in Applied Data Science from San Jose State University and a Master’s in Biomedical Engineering from Linköping University, Sweden. Her previous research focused on AI for drug discovery and healthcare applications, bridging life sciences with machine learning.
Martyna Shallenberg is a Senior Director of Software Engineering at Myriad Genetics, where she leads cross-functional teams in building AI-driven enterprise solutions that transform revenue cycle operations and healthcare delivery. With a unique background spanning genomics, molecular diagnostics, and software engineering, she has scaled innovative platforms ranging from Intelligent Document Processing (IDP) to modular LIMS solutions. Martyna is also the Founder & President of BioHive’s HealthTech Hub, fostering cross-domain collaboration to accelerate precision medicine and healthcare innovation.
Brode Mccrady is a Software Engineering Manager at Myriad Genetics, where he leads initiatives in AI, revenue systems, and intelligent document processing. With over a decade of experience in business intelligence and strategic analytics, Brode brings deep expertise in translating complex business needs into scalable technical solutions. He holds a degree in Economics, which informs his data-driven approach to problem-solving and business strategy.
Randheer Gehlot is a Principal Customer Solutions Manager at AWS who specializes in healthcare and life sciences transformation. With a deep focus on AI/ML applications in healthcare, he helps enterprises design and implement efficient cloud solutions that address real business challenges. His work involves partnering with organizations to modernize their infrastructure, enable innovation, and accelerate their cloud adoption journey while ensuring practical, sustainable outcomes.
Acknowledgements
We would like to thank Bob Strahan, Kurt Mason, Akhil Nooney and Taylor Jensen for their significant contributions, strategic decisions and guidance throughout.