A Coding and Experimental Analysis of Decentralized Federated Learning with Gossip Protocols and Differential Privacy

In this tutorial, we explore how federated learning behaves when the traditional centralized aggregation server is removed and replaced with a fully decentralized, peer-to-peer gossip mechanism. We implement both centralized FedAvg and decentralized Gossip Federated Learning from scratch and introduce client-side differential privacy by injecting calibrated noise into local model updates. By running controlled experiments on non-IID MNIST data, we examine how privacy strength, as measured by different epsilon values, directly affects convergence speed, stability, and final model accuracy. Also, we study the practical trade-offs between privacy guarantees and learning efficiency in real-world decentralized learning systems. Check out the Full Codes here.

import os, math, random, time
from dataclasses import dataclass
from typing import Dict, List, Tuple
import subprocess, sys

def pip_install(pkgs):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + pkgs)

pip_install(["torch", "torchvision", "numpy", "matplotlib", "networkx", "tqdm"])

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import networkx as nx
from tqdm import trange

SEED = 7
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.benchmark = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

transform = transforms.Compose([transforms.ToTensor()])

train_ds = datasets.MNIST(root="/content/data", train=True, download=True, transform=transform)
test_ds = datasets.MNIST(root="/content/data", train=False, download=True, transform=transform)

We set up the execution environment and install all required dependencies. We initialize random seeds and device settings to maintain reproducibility across experiments. We also load the MNIST dataset, which serves as a lightweight yet effective benchmark for federated learning experiments. Check out the Full Codes here.

def make_noniid_clients(dataset, num_clients=20, shards_per_client=2, seed=SEED):
    rng = np.random.default_rng(seed)
    y = np.array([dataset[i][1] for i in range(len(dataset))])
    idx = np.arange(len(dataset))
    idx_sorted = idx[np.argsort(y)]
    num_shards = num_clients * shards_per_client
    shard_size = len(dataset) // num_shards
    shards = [idx_sorted[i*shard_size:(i+1)*shard_size] for i in range(num_shards)]
    rng.shuffle(shards)
    client_indices = []
    for c in range(num_clients):
        take = shards[c*shards_per_client:(c+1)*shards_per_client]
        client_indices.append(np.concatenate(take))
    return client_indices

NUM_CLIENTS = 20
client_indices = make_noniid_clients(train_ds, num_clients=NUM_CLIENTS, shards_per_client=2)

test_loader = DataLoader(test_ds, batch_size=1024, shuffle=False, num_workers=2, pin_memory=True)

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28*28, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

We construct a non-IID data distribution by partitioning the training dataset into label-based shards across multiple clients. We define a compact neural network model that balances expressiveness and computational efficiency. This setup lets us realistically simulate data heterogeneity, a critical challenge in federated learning systems. Check out the Full Codes here.
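As an optional sanity check on the sharding, we can count the digit classes each client holds using the train_ds and client_indices objects defined above; with two shards per client, most clients see only one or two labels.

labels = np.array(train_ds.targets)
for cid in range(3):
    # unique classes and their sample counts for this client's shard indices
    classes, counts = np.unique(labels[client_indices[cid]], return_counts=True)
    print(f"client {cid}: classes={classes.tolist()}, counts={counts.tolist()}")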

def get_model_params(model):
    return {k: v.detach().clone() for k, v in model.state_dict().items()}

def set_model_params(model, params):
    model.load_state_dict(params, strict=True)

def add_params(a, b):
    return {k: a[k] + b[k] for k in a.keys()}

def sub_params(a, b):
    return {k: a[k] - b[k] for k in a.keys()}

def scale_params(a, s):
    return {k: a[k] * s for k in a.keys()}

def mean_params(params_list):
    out = {k: torch.zeros_like(params_list[0][k]) for k in params_list[0].keys()}
    for p in params_list:
        for k in out.keys():
            out[k] += p[k]
    for k in out.keys():
        out[k] /= len(params_list)
    return out

def l2_norm_params(delta):
    sq = 0.0
    for v in delta.values():
        sq += float(torch.sum(v.float() * v.float()).item())
    return math.sqrt(sq)

def dp_sanitize_update(delta, clip_norm, epsilon, delta_dp, rng):
    norm = l2_norm_params(delta)
    scale = min(1.0, clip_norm / (norm + 1e-12))
    clipped = scale_params(delta, scale)
    if epsilon is None or math.isinf(epsilon) or epsilon <= 0:
        return clipped
    sigma = clip_norm * math.sqrt(2.0 * math.log(1.25 / delta_dp)) / epsilon
    noised = {}
    for k, v in clipped.items():
        noise = torch.normal(mean=0.0, std=sigma, size=v.shape, generator=rng, device=v.device, dtype=v.dtype)
        noised[k] = v + noise
    return noised

We implement parameter manipulation utilities that enable addition, subtraction, scaling, and averaging of model weights across clients. We introduce differential privacy by clipping local updates and injecting Gaussian noise whose scale is calibrated to the chosen privacy budget. This serves as the core privacy mechanism that enables us to study the privacy–utility trade-off in both centralized and decentralized settings. Check out the Full Codes here.
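For intuition, we can evaluate the same noise formula from dp_sanitize_update for the epsilon values used later in the sweep; the standard deviation grows quickly as the privacy budget tightens.

clip_norm, delta_dp = 2.0, 1e-5
for eps in [8.0, 4.0, 2.0, 1.0]:
    # Gaussian mechanism scale: sigma = C * sqrt(2 ln(1.25/delta)) / epsilon
    sigma = clip_norm * math.sqrt(2.0 * math.log(1.25 / delta_dp)) / eps
    print(f"epsilon={eps}: sigma={sigma:.3f}")  # roughly 1.21 at eps=8 up to 9.69 at eps=1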

def local_train_one_client(base_params, client_id, epochs, lr, batch_size, weight_decay=0.0):
    model = MLP().to(device)
    set_model_params(model, base_params)
    model.train()
    loader = DataLoader(
        Subset(train_ds, client_indices[client_id].tolist() if hasattr(client_indices[client_id], "tolist") else client_indices[client_id]),
        batch_size=batch_size,
        shuffle=True,
        num_workers=2,
        pin_memory=True
    )
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=weight_decay)
    for _ in range(epochs):
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            opt.zero_grad(set_to_none=True)
            logits = model(xb)
            loss = F.cross_entropy(logits, yb)
            loss.backward()
            opt.step()
    return get_model_params(model)

@torch.no_grad()
def evaluate(params):
    model = MLP().to(device)
    set_model_params(model, params)
    model.eval()
    total, correct = 0, 0
    loss_sum = 0.0
    for xb, yb in test_loader:
        xb, yb = xb.to(device), yb.to(device)
        logits = model(xb)
        loss = F.cross_entropy(logits, yb, reduction="sum")
        loss_sum += float(loss.item())
        pred = torch.argmax(logits, dim=1)
        correct += int((pred == yb).sum().item())
        total += int(yb.numel())
    return loss_sum / total, correct / total

We define the local training loop that each client executes independently on its private data. We also implement a unified evaluation routine to measure test loss and accuracy for any given model state. Together, these functions simulate realistic federated learning behavior where training and evaluation are fully decoupled from data ownership. Check out the Full Codes here.

@dataclass
class FedAvgConfig:
    rounds: int = 25
    clients_per_round: int = 10
    local_epochs: int = 1
    lr: float = 0.06
    batch_size: int = 64
    clip_norm: float = 2.0
    epsilon: float = math.inf
    delta_dp: float = 1e-5

def run_fedavg(cfg):
    global_params = get_model_params(MLP().to(device))
    history = {"test_loss": [], "test_acc": []}
    for r in trange(cfg.rounds):
        chosen = random.sample(range(NUM_CLIENTS), k=cfg.clients_per_round)
        start_params = global_params
        updates = []
        for cid in chosen:
            local_params = local_train_one_client(start_params, cid, cfg.local_epochs, cfg.lr, cfg.batch_size)
            delta = sub_params(local_params, start_params)
            rng = torch.Generator(device=device)
            rng.manual_seed(SEED * 10000 + r * 100 + cid)
            delta_dp = dp_sanitize_update(delta, cfg.clip_norm, cfg.epsilon, cfg.delta_dp, rng)
            updates.append(delta_dp)
        avg_update = mean_params(updates)
        global_params = add_params(start_params, avg_update)
        tl, ta = evaluate(global_params)
        history["test_loss"].append(tl)
        history["test_acc"].append(ta)
    return history, global_params

We implement the centralized FedAvg algorithm, where a subset of clients trains locally and sends differentially private updates to a central aggregator. We track model performance across communication rounds to observe convergence behavior under varying privacy budgets. This serves as the baseline against which decentralized gossip-based learning is compared. Check out the Full Codes here.

@dataclass
class GossipConfig:
    rounds: int = 25
    local_epochs: int = 1
    lr: float = 0.06
    batch_size: int = 64
    clip_norm: float = 2.0
    epsilon: float = math.inf
    delta_dp: float = 1e-5
    topology: str = "ring"
    p: float = 0.2
    gossip_pairs_per_round: int = 10

def build_topology(cfg):
    if cfg.topology == "ring":
        G = nx.cycle_graph(NUM_CLIENTS)
    elif cfg.topology == "erdos_renyi":
        G = nx.erdos_renyi_graph(NUM_CLIENTS, cfg.p, seed=SEED)
        if not nx.is_connected(G):
            comps = list(nx.connected_components(G))
            for i in range(len(comps) - 1):
                a = next(iter(comps[i]))
                b = next(iter(comps[i+1]))
                G.add_edge(a, b)
    else:
        raise ValueError(f"Unknown topology: {cfg.topology}")
    return G

def run_gossip(cfg):
    node_params = [get_model_params(MLP().to(device)) for _ in range(NUM_CLIENTS)]
    G = build_topology(cfg)
    history = {"avg_test_loss": [], "avg_test_acc": []}
    for r in trange(cfg.rounds):
        new_params = []
        for cid in range(NUM_CLIENTS):
            p0 = node_params[cid]
            p_local = local_train_one_client(p0, cid, cfg.local_epochs, cfg.lr, cfg.batch_size)
            delta = sub_params(p_local, p0)
            rng = torch.Generator(device=device)
            rng.manual_seed(SEED * 10000 + r * 100 + cid)
            delta_dp = dp_sanitize_update(delta, cfg.clip_norm, cfg.epsilon, cfg.delta_dp, rng)
            p_local_dp = add_params(p0, delta_dp)
            new_params.append(p_local_dp)
        node_params = new_params
        edges = list(G.edges())
        for _ in range(cfg.gossip_pairs_per_round):
            i, j = random.choice(edges)
            avg = mean_params([node_params[i], node_params[j]])
            node_params[i] = avg
            node_params[j] = avg
        losses, accs = [], []
        for cid in range(NUM_CLIENTS):
            tl, ta = evaluate(node_params[cid])
            losses.append(tl)
            accs.append(ta)
        history["avg_test_loss"].append(float(np.mean(losses)))
        history["avg_test_acc"].append(float(np.mean(accs)))
    return history, node_params

We implement decentralized Gossip Federated Learning, in which peers exchange model parameters over a predefined network topology. We simulate repeated local training and pairwise parameter averaging without relying on a central server. This allows us to analyze how privacy noise propagates through decentralized communication patterns and affects convergence. Check out the Full Codes here.
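Before running the full experiments, a tiny optional demo with scalars in place of model weights illustrates how slowly information mixes on a ring: repeated pairwise averaging pulls all nodes toward the global mean, but only gradually.

values = [float(i) for i in range(NUM_CLIENTS)]
ring_edges = [(i, (i + 1) % NUM_CLIENTS) for i in range(NUM_CLIENTS)]
for _ in range(200):
    i, j = random.choice(ring_edges)
    avg = (values[i] + values[j]) / 2.0
    values[i] = values[j] = avg
print(min(values), max(values))  # both drift toward the global mean 9.5 as mixing continues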

eps_sweep = [math.inf, 8.0, 4.0, 2.0, 1.0]
ROUNDS = 20

fedavg_results = {}
gossip_results = {}

common_local_epochs = 1
common_lr = 0.06
common_bs = 64
common_clip = 2.0
common_delta = 1e-5

for eps in eps_sweep:
    fcfg = FedAvgConfig(
        rounds=ROUNDS,
        clients_per_round=10,
        local_epochs=common_local_epochs,
        lr=common_lr,
        batch_size=common_bs,
        clip_norm=common_clip,
        epsilon=eps,
        delta_dp=common_delta
    )
    hist_f, _ = run_fedavg(fcfg)
    fedavg_results[eps] = hist_f

    gcfg = GossipConfig(
        rounds=ROUNDS,
        local_epochs=common_local_epochs,
        lr=common_lr,
        batch_size=common_bs,
        clip_norm=common_clip,
        epsilon=eps,
        delta_dp=common_delta,
        topology="ring",
        gossip_pairs_per_round=10
    )
    hist_g, _ = run_gossip(gcfg)
    gossip_results[eps] = hist_g

plt.figure(figsize=(10, 5))
for eps in eps_sweep:
    plt.plot(fedavg_results[eps]["test_acc"], label=f"FedAvg eps={eps}")
plt.xlabel("Round")
plt.ylabel("Accuracy")
plt.legend()
plt.grid(True)
plt.show()

plt.figure(figsize=(10, 5))
for eps in eps_sweep:
    plt.plot(gossip_results[eps]["avg_test_acc"], label=f"Gossip eps={eps}")
plt.xlabel("Round")
plt.ylabel("Avg Accuracy")
plt.legend()
plt.grid(True)
plt.show()

final_fed = [fedavg_results[eps]["test_acc"][-1] for eps in eps_sweep]
final_gos = [gossip_results[eps]["avg_test_acc"][-1] for eps in eps_sweep]

x = [100.0 if math.isinf(eps) else eps for eps in eps_sweep]

plt.figure(figsize=(8, 5))
plt.plot(x, final_fed, marker="o", label="FedAvg")
plt.plot(x, final_gos, marker="o", label="Gossip")
plt.xlabel("Epsilon")
plt.ylabel("Final Accuracy")
plt.legend()
plt.grid(True)
plt.show()

def rounds_to_threshold(acc_curve, threshold):
    for i, a in enumerate(acc_curve):
        if a >= threshold:
            return i + 1
    return None

best_f = fedavg_results[math.inf]["test_acc"][-1]
best_g = gossip_results[math.inf]["avg_test_acc"][-1]

th_f = 0.9 * best_f
th_g = 0.9 * best_g

for eps in eps_sweep:
    rf = rounds_to_threshold(fedavg_results[eps]["test_acc"], th_f)
    rg = rounds_to_threshold(gossip_results[eps]["avg_test_acc"], th_g)
    print(eps, rf, rg)

We run controlled experiments across multiple privacy levels and collect results for both centralized and decentralized training strategies. We visualize convergence trends and final accuracy to clearly expose the privacy–utility trade-off. We also compute convergence speed metrics to quantitatively compare how different aggregation schemes respond to increasing privacy constraints.

In conclusion, we demonstrated that decentralization fundamentally changes how differential privacy noise propagates through a federated system. We observed that while centralized FedAvg typically converges faster under weak privacy constraints, gossip-based federated learning is more robust to noisy updates at the cost of slower convergence. Our experiments highlighted that stronger privacy guarantees significantly slow learning in both settings, but the effect is amplified in decentralized topologies due to delayed information mixing. Overall, we showed that designing privacy-preserving federated systems requires jointly reasoning about aggregation topology, communication patterns, and privacy budgets rather than treating them as independent choices.

Check out the Full Codes here.

Robbyant Open Sources LingBot World: a Real Time World Model for Interactive Simulation and Embodied AI

Robbyant, the embodied AI unit inside Ant Group, has open sourced LingBot-World, a large scale world model that turns video generation into an interactive simulator for embodied agents, autonomous driving and games. The system is designed to render controllable environments with high visual fidelity, strong dynamics and long temporal horizons, while staying responsive enough for real time control.

From text-to-video to text-to-world

Most text to video models generate short clips that look realistic but behave like passive movies. They do not model how actions change the environment over time. LingBot-World is built instead as an action conditioned world model. It learns the transition dynamics of a virtual world, so that keyboard and mouse inputs, together with camera motion, drive the evolution of future frames.

Formally, the model learns the conditional distribution of future video tokens, given past frames, language prompts and discrete actions. At training time, it predicts sequences up to about 60 seconds. At inference time, it can autoregressively roll out coherent video streams that extend to around 10 minutes, while keeping scene structure stable.

Data engine, from web video to interactive trajectories

A core design in LingBot-World is a unified data engine. It provides rich, aligned supervision for how actions change the world while covering diverse real scenes.

The data acquisition pipeline combines 3 sources:

Large scale web videos of humans, animals and vehicles, from both first person and third person views

Game data, where RGB frames are strictly paired with user controls such as W, A, S, D and camera parameters

Synthetic trajectories rendered in Unreal Engine, where clean frames, camera intrinsics and extrinsics and object layouts are all known

After collection, a profiling stage standardizes this heterogeneous corpus. It filters for resolution and duration, segments videos into clips and estimates missing camera parameters using geometry and pose models. A vision language model scores clips for quality, motion magnitude and view type, then selects a curated subset.

On top of this, a hierarchical captioning module builds 3 levels of text supervision:

Narrative captions for whole trajectories, including camera motion

Scene static captions that describe environment layout without motion

Dense temporal captions for short time windows that focus on local dynamics

This separation lets the model disentangle static structure from motion patterns, which is important for long horizon consistency.

Architecture, MoE video backbone and action conditioning

LingBot-World starts from Wan2.2, a 14B parameter image to video diffusion transformer. This backbone already captures strong open domain video priors. The Robbyant team extends it into a mixture of experts DiT with 2 experts. Each expert has about 14B parameters, so the total parameter count is 28B, but only 1 expert is active at each denoising step. This keeps inference cost similar to a dense 14B model while expanding capacity.

A curriculum extends training sequences from 5 seconds to 60 seconds. The schedule increases the proportion of high noise timesteps, which stabilizes global layouts over long contexts and reduces mode collapse for long rollouts.

To make the model interactive, actions are injected directly into the transformer blocks. Camera rotations are encoded with Plücker embeddings. Keyboard actions are represented as multi hot vectors over keys such as W, A, S, D. These encodings are fused and passed through adaptive layer normalization modules, which modulate hidden states in the DiT. Only the action adapter layers are fine tuned while the main video backbone stays frozen, so the model retains visual quality from pre training while learning action responsiveness from a smaller interactive dataset.
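The released code is the authoritative reference, but a minimal PyTorch sketch of the general mechanism, assuming a 4-key multi-hot vector and a placeholder 6-dimensional camera encoding, looks roughly like this: the fused action condition produces scale and shift parameters that modulate normalized hidden states.

import torch
import torch.nn as nn

class ActionAdaLN(nn.Module):
    # Illustrative adapter: maps a multi-hot keyboard vector plus a camera encoding
    # to per-sample scale and shift parameters applied to normalized hidden states.
    def __init__(self, num_keys=4, cam_dim=6, hidden_dim=512):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Sequential(
            nn.Linear(num_keys + cam_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 2 * hidden_dim),
        )

    def forward(self, hidden, keys_multi_hot, cam_encoding):
        cond = torch.cat([keys_multi_hot, cam_encoding], dim=-1)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(hidden) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

adaln = ActionAdaLN()
hidden = torch.randn(2, 16, 512)                            # (batch, video tokens, hidden dim)
keys = torch.tensor([[1., 0., 0., 1.], [0., 1., 0., 0.]])   # multi-hot over W, A, S, D
cam = torch.randn(2, 6)                                     # placeholder camera motion encoding
print(adaln(hidden, keys, cam).shape)                       # torch.Size([2, 16, 512])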

Training uses both image to video and video to video continuation tasks. Given a single image, the model can synthesize future frames. Given a partial clip, it can extend the sequence. This results in an internal transition function that can start from arbitrary time points.

LingBot-World-Fast, distillation for real time use

The mid-trained model, LingBot-World-Base, still relies on multi step diffusion and full temporal attention, which are expensive for real time interaction. The Robbyant team introduces LingBot-World-Fast as an accelerated variant.

The fast model is initialized from the high noise expert and replaces full temporal attention with block causal attention. Inside each temporal block, attention is bidirectional. Across blocks, it is causal. This design supports key value caching, so the model can stream frames autoregressively with lower cost.
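As a generic illustration of block causal attention, not the LingBot-World-Fast implementation, the mask below allows bidirectional attention inside each temporal block and causal attention across blocks, which is the property that enables key value caching.

import torch

def block_causal_mask(num_frames, frames_per_block):
    # True marks an allowed query-key pair: same block (bidirectional) or an earlier block (causal).
    block_id = torch.arange(num_frames) // frames_per_block
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)

print(block_causal_mask(num_frames=6, frames_per_block=3).int())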

Distillation uses a diffusion forcing strategy. The student is trained on a small set of target timesteps, including timestep 0, so it sees both noisy and clean latents. Distribution Matching Distillation is combined with an adversarial discriminator head. The adversarial loss updates only the discriminator. The student network is updated with the distillation loss, which stabilizes training while preserving action following and temporal coherence.

In experiments, LingBot-World-Fast reaches 16 frames per second when processing 480p videos on a single GPU node, and maintains end to end interaction latency under 1 second for real time control.

Emergent memory and long horizon behavior

One of the most interesting properties of LingBot-World is emergent memory. The model maintains global consistency without explicit 3D representations such as Gaussian splatting. When the camera moves away from a landmark such as Stonehenge and returns after about 60 seconds, the structure reappears with consistent geometry. When a car leaves the frame and later reenters, it appears at a physically plausible location, not frozen or reset.

The model can also sustain ultra long sequences. The research team shows coherent video generation that extends up to 10 minutes, with stable layout and narrative structure.

VBench results and comparison to other world models

For quantitative evaluation, the research team used VBench on a curated set of 100 generated videos, each longer than 30 seconds. LingBot-World is compared to 2 recent world models, Yume-1.5 and HY-World-1.5.

On VBench, LingBot-World's reported scores are summarized in the results table of the paper (https://arxiv.org/pdf/2601.20540v1).

These scores are higher than both baselines for imaging quality, aesthetic quality and dynamic degree. The dynamic degree margin is large, 0.8857 compared to 0.7612 and 0.7217, which indicates richer scene transitions and more complex motion that respond to user inputs. Motion smoothness and temporal flicker are comparable to the best baseline, and the method achieves the best overall consistency metric among the 3 models.

A separate comparison with other interactive systems such as Matrix-Game-2.0, Mirage-2 and Genie-3 highlights that LingBot-World is one of the few fully open sourced world models that combines general domain coverage, long generation horizon, high dynamic degree, 720p resolution and real time capabilities.

https://arxiv.org/pdf/2601.20540v1

Applications, promptable worlds, agents and 3D reconstruction

Beyond video synthesis, LingBot-World is positioned as a testbed for embodied AI. The model supports promptable world events, where text instructions change weather, lighting, style or inject local events such as fireworks or moving animals over time, while preserving spatial structure.

It can also train downstream action agents, for example with a small vision language action model like Qwen3-VL-2B predicting control policies from images. Because the generated video streams are geometrically consistent, they can be used as input to 3D reconstruction pipelines, which produce stable point clouds for indoor, outdoor and synthetic scenes.

Key Takeaways

LingBot-World is an action conditioned world model that extends text to video into text to world simulation, where keyboard actions and camera motion directly control long horizon video rollouts up to around 10 minutes.

The system is trained on a unified data engine that combines web videos, game logs with action labels and Unreal Engine trajectories, plus hierarchical narrative, static scene and dense temporal captions to separate layout from motion.

The core backbone is a 28B parameter mixture of experts diffusion transformer, built from Wan2.2, with 2 experts of 14B each, and action adapters that are fine tuned while the visual backbone remains frozen.

LingBot-World-Fast is a distilled variant that uses block causal attention, diffusion forcing and distribution matching distillation to achieve about 16 frames per second at 480p on 1 GPU node, with reported end to end latency under 1 second for interactive use.

On VBench with 100 generated videos longer than 30 seconds, LingBot-World reports higher imaging quality, aesthetic quality and dynamic degree than Yume-1.5 and HY-World-1.5, and the model shows emergent memory and stable long range structure suitable for embodied agents and 3D reconstruction.

Check out the Paper, Repo, Project page and Model Weights.

AI2 Releases SERA, Soft Verified Coding Agents Built with Supervised Training Only for Practical Repository Level Automation Workflows

Allen Institute for AI (AI2) researchers introduce SERA, Soft Verified Efficient Repository Agents, as a coding agent family that aims to match much larger closed systems using only supervised training and synthetic trajectories.

What is SERA?

SERA is the first release in AI2’s Open Coding Agents series. The flagship model, SERA-32B, is built on the Qwen 3 32B architecture and is trained as a repository level coding agent.

On SWE bench Verified at 32K context, SERA-32B reaches 49.5 percent resolve rate. At 64K context it reaches 54.2 percent. These numbers place it in the same performance band as open weight systems such as Devstral-Small-2 with 24B parameters and GLM-4.5 Air with 110B parameters, while SERA remains fully open in code, data, and weights.

The series includes four models today, SERA-8B, SERA-8B GA, SERA-32B, and SERA-32B GA. All are released on Hugging Face under an Apache 2.0 license.

Soft Verified Generation

The training pipeline relies on Soft Verified Generation, SVG. SVG produces agent trajectories that look like realistic developer workflows, then uses patch agreement between two rollouts as a soft signal of correctness.

The process is:

First rollout: A function is sampled from a real repository. The teacher model, GLM-4.6 in the SERA-32B setup, receives a bug style or change description and operates with tools to view files, edit code, and run commands. It produces a trajectory T1 and a patch P1.

Synthetic pull request: The system converts the trajectory into a pull request like description. This text summarizes intent and key edits in a format similar to real pull requests.

Second rollout: The teacher starts again from the original repository, but now it only sees the pull request description and the tools. It produces a new trajectory T2 and patch P2 that tries to implement the described change.

Soft verification: The patches P1 and P2 are compared line by line. A recall score r is computed as the fraction of modified lines in P1 that appear in P2. When r equals 1 the trajectory is hard verified. For intermediate values, the sample is soft verified.

The key result from the ablation study is that strict verification is not required. When models are trained on T2 trajectories with different thresholds on r, even r equals 0, performance on SWE bench Verified is similar at a fixed sample count. This suggests that realistic multi step traces, even if noisy, are valuable supervision for coding agents.
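To make the patch agreement signal concrete, a rough sketch of the recall computation, using a hypothetical helper rather than AI2's implementation, could look like this:

def patch_modified_lines(patch_text):
    # Collect added and removed lines from a unified diff, skipping file headers.
    lines = set()
    for line in patch_text.splitlines():
        if line.startswith(("+++", "---")):
            continue
        if line.startswith(("+", "-")):
            lines.add(line[1:].strip())
    return lines

def soft_verification_recall(p1, p2):
    # r = fraction of modified lines in P1 that also appear in P2.
    ref, cand = patch_modified_lines(p1), patch_modified_lines(p2)
    if not ref:
        return 0.0
    return len(ref & cand) / len(ref)

p1 = """--- a/math.py
+++ b/math.py
-    return a - b
+    return a + b"""
p2 = """--- a/math.py
+++ b/math.py
-    return a - b
+    return a + b
+    # handle ints only"""
print(soft_verification_recall(p1, p2))  # 1.0, so this toy pair would count as hard verified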

https://allenai.org/blog/open-coding-agents

Data scale, training, and cost

SVG is applied to 121 Python repositories derived from the SWE-smith corpus. Across GLM-4.5 Air and GLM-4.6 teacher runs, the full SERA datasets contain more than 200,000 trajectories from both rollouts, making this one of the largest open coding agent datasets.

SERA-32B is trained on a subset of 25,000 T2 trajectories from the Sera-4.6-Lite T2 dataset. Training uses standard supervised fine tuning with Axolotl on Qwen-3-32B for 3 epochs, learning rate 1e-5, weight decay 0.01, and maximum sequence length 32,768 tokens.

Many trajectories are longer than the context limit. The research team define a truncation ratio, the fraction of steps that fit into 32K tokens. They then prefer trajectories that already fit, and for the rest they select slices with high truncation ratio. This ordered truncation strategy clearly outperforms random truncation when they compare SWE bench Verified scores.
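A simple way to picture the ordered truncation idea, sketched here with made-up token counts rather than the actual SERA data, is to score each trajectory by the fraction of its steps that fit into the 32K budget and sort by that score:

def truncation_ratio(step_token_counts, max_tokens=32_768):
    # Fraction of leading steps that fit within the context budget.
    total, kept = 0, 0
    for n in step_token_counts:
        if total + n > max_tokens:
            break
        total += n
        kept += 1
    return kept / len(step_token_counts) if step_token_counts else 0.0

trajectories = [
    [4000] * 6,        # fits entirely -> ratio 1.0
    [9000] * 5,        # only 3 of 5 steps fit -> ratio 0.6
    [20000, 20000],    # 1 of 2 steps fits -> ratio 0.5
]
ordered = sorted(trajectories, key=truncation_ratio, reverse=True)
print([truncation_ratio(t) for t in ordered])  # [1.0, 0.6, 0.5]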

The reported compute budget for SERA-32B, including data generation and training, is about 40 GPU days. Using a scaling law over dataset size and performance, the research team estimated that the SVG approach is around 26 times cheaper than reinforcement learning based systems such as SkyRL-Agent and 57 times cheaper than earlier synthetic data pipelines such as SWE-smith for reaching similar SWE-bench scores.

https://allenai.org/blog/open-coding-agents

Repository specialization

A central use case is adapting an agent to a specific repository. The research team studies this on three major SWE-bench Verified projects, Django, SymPy, and Sphinx.

For each repository, SVG generates on the order of 46,000 to 54,000 trajectories. Due to compute limits, the specialization experiments train on 8,000 trajectories per repository, mixing 3,000 soft verified T2 trajectories with 5,000 filtered T1 trajectories.

At 32K context, these specialized students match or slightly outperform the GLM-4.5-Air teacher, and also compare well with Devstral-Small-2 on those repository subsets. For Django, a specialized student reaches 52.23 percent resolve rate versus 51.20 percent for GLM-4.5-Air. For SymPy, the specialized model reaches 51.11 percent versus 48.89 percent for GLM-4.5-Air.

Key Takeaways

SERA turns coding agents into a supervised learning problem: SERA-32B is trained with standard supervised fine tuning on synthetic trajectories from GLM-4.6, with no reinforcement learning loop and no dependency on repository test suites.

Soft Verified Generation removes the need for tests: SVG uses two rollouts and patch overlap between P1 and P2 to compute a soft verification score, and the research team show that even unverified or weakly verified trajectories can train effective coding agents.

Large, realistic agent dataset from real repositories: The pipeline applies SVG to 121 Python projects from the SWE smith corpus, producing more than 200,000 trajectories and creating one of the largest open datasets for coding agents.

Efficient training with explicit cost and scaling analysis: SERA-32B trains on 25,000 T2 trajectories and the scaling study shows that SVG is about 26 times cheaper than SkyRL-Agent and 57 times cheaper than SWE-smith at similar SWE bench Verified performance.

Check out the Paper, Repo and Model Weights.

A Coding Implementation to Training, Optimizing, Evaluating, and Interpreting Knowledge Graph Embeddings with PyKEEN

In this tutorial, we walk through an end-to-end, advanced workflow for knowledge graph embeddings using PyKEEN, actively exploring how modern embedding models are trained, evaluated, optimized, and interpreted in practice. We start by understanding the structure of a real knowledge graph dataset, then systematically train and compare multiple embedding models, tune their hyperparameters, and analyze their performance using robust ranking metrics. Also, we focus not just on running pipelines but on building intuition for link prediction, negative sampling, and embedding geometry, ensuring we understand why each step matters and how it affects downstream reasoning over graphs. Check out the FULL CODES here.

!pip install -q pykeen torch torchvision

import warnings
warnings.filterwarnings('ignore')

import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple

from pykeen.pipeline import pipeline
from pykeen.datasets import Nations, FB15k237, get_dataset
from pykeen.models import TransE, ComplEx, RotatE, DistMult
from pykeen.training import SLCWATrainingLoop, LCWATrainingLoop
from pykeen.evaluation import RankBasedEvaluator
from pykeen.triples import TriplesFactory
from pykeen.hpo import hpo_pipeline
from pykeen.sampling import BasicNegativeSampler
from pykeen.losses import MarginRankingLoss, BCEWithLogitsLoss
from pykeen.trackers import ConsoleResultTracker

print("PyKEEN setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

We set up the complete experimental environment by installing PyKEEN and its deep learning dependencies, and by importing all required libraries for modeling, evaluation, visualization, and optimization. We ensure a clean, reproducible workflow by suppressing warnings and verifying the PyTorch and CUDA configurations for efficient computation. Check out the FULL CODES here.

print("\n" + "="*80)
print("SECTION 2: Dataset Exploration")
print("="*80 + "\n")

dataset = Nations()

print(f"Dataset: {dataset}")
print(f"Number of entities: {dataset.num_entities}")
print(f"Number of relations: {dataset.num_relations}")
print(f"Training triples: {dataset.training.num_triples}")
print(f"Testing triples: {dataset.testing.num_triples}")
print(f"Validation triples: {dataset.validation.num_triples}")

print("\nSample triples (head, relation, tail):")
for i in range(5):
    h, r, t = dataset.training.mapped_triples[i]
    head = dataset.training.entity_id_to_label[h.item()]
    rel = dataset.training.relation_id_to_label[r.item()]
    tail = dataset.training.entity_id_to_label[t.item()]
    print(f"  {head} --[{rel}]--> {tail}")

def analyze_dataset(triples_factory: TriplesFactory) -> pd.DataFrame:
    """Compute basic statistics about the knowledge graph."""
    stats = {
        'Metric': [],
        'Value': []
    }

    stats['Metric'].extend(['Entities', 'Relations', 'Triples'])
    stats['Value'].extend([
        triples_factory.num_entities,
        triples_factory.num_relations,
        triples_factory.num_triples
    ])

    unique, counts = torch.unique(triples_factory.mapped_triples[:, 1], return_counts=True)
    stats['Metric'].extend(['Avg triples per relation', 'Max triples for a relation'])
    stats['Value'].extend([counts.float().mean().item(), counts.max().item()])

    return pd.DataFrame(stats)

stats_df = analyze_dataset(dataset.training)
print("\nDataset Statistics:")
print(stats_df.to_string(index=False))

We load and explore the Nations knowledge graph to understand its scale, structure, and relational complexity before training any models. We inspect sample triples to build intuition about how entities and relations are represented internally using indexed mappings. We then compute core statistics such as relation frequency and triple distribution, allowing us to reason about graph sparsity and modeling difficulty upfront. Check out the FULL CODES here.

print("\n" + "="*80)
print("SECTION 3: Training Multiple Models")
print("="*80 + "\n")

models_config = {
    'TransE': {
        'model': 'TransE',
        'model_kwargs': {'embedding_dim': 50},
        'loss': 'MarginRankingLoss',
        'loss_kwargs': {'margin': 1.0}
    },
    'ComplEx': {
        'model': 'ComplEx',
        'model_kwargs': {'embedding_dim': 50},
        'loss': 'BCEWithLogitsLoss',
    },
    'RotatE': {
        'model': 'RotatE',
        'model_kwargs': {'embedding_dim': 50},
        'loss': 'MarginRankingLoss',
        'loss_kwargs': {'margin': 3.0}
    }
}

training_config = {
    'training_loop': 'sLCWA',
    'negative_sampler': 'basic',
    'negative_sampler_kwargs': {'num_negs_per_pos': 5},
    'training_kwargs': {
        'num_epochs': 100,
        'batch_size': 128,
    },
    'optimizer': 'Adam',
    'optimizer_kwargs': {'lr': 0.001}
}

results = {}

for model_name, config in models_config.items():
    print(f"\nTraining {model_name}...")

    result = pipeline(
        dataset=dataset,
        model=config['model'],
        model_kwargs=config.get('model_kwargs', {}),
        loss=config.get('loss'),
        loss_kwargs=config.get('loss_kwargs', {}),
        **training_config,
        random_seed=42,
        device='cuda' if torch.cuda.is_available() else 'cpu'
    )

    results[model_name] = result

    print(f"\n{model_name} Results:")
    print(f"  MRR: {result.metric_results.get_metric('mean_reciprocal_rank'):.4f}")
    print(f"  Hits@1: {result.metric_results.get_metric('hits_at_1'):.4f}")
    print(f"  Hits@3: {result.metric_results.get_metric('hits_at_3'):.4f}")
    print(f"  Hits@10: {result.metric_results.get_metric('hits_at_10'):.4f}")

We define a consistent training configuration and systematically train multiple knowledge graph embedding models to enable fair comparison. We use the same dataset, negative sampling strategy, optimizer, and training loop while allowing each model to leverage its own inductive bias and loss formulation. We then evaluate and record standard ranking metrics, such as MRR and Hits@K, to quantitatively assess each embedding approach’s performance on link prediction. Check out the FULL CODES here.
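For intuition about these numbers, MRR and Hits@K can be computed directly from a list of ranks, one rank per test triple with lower being better; PyKEEN's rank-based evaluator derives them from (filtered) ranks internally.

def ranking_metrics(ranks, ks=(1, 3, 10)):
    # MRR is the mean of 1/rank; Hits@K is the fraction of ranks at or below K.
    ranks = np.asarray(ranks, dtype=float)
    out = {"mean_reciprocal_rank": float(np.mean(1.0 / ranks))}
    for k in ks:
        out[f"hits_at_{k}"] = float(np.mean(ranks <= k))
    return out

print(ranking_metrics([1, 2, 5, 12, 1, 3]))  # e.g. MRR ~0.52, hits_at_1 ~0.33 for these toy ranks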

print("\n" + "="*80)
print("SECTION 4: Model Comparison")
print("="*80 + "\n")

metrics_to_compare = ['mean_reciprocal_rank', 'hits_at_1', 'hits_at_3', 'hits_at_10']
comparison_data = {metric: [] for metric in metrics_to_compare}
model_names = []

for model_name, result in results.items():
    model_names.append(model_name)
    for metric in metrics_to_compare:
        comparison_data[metric].append(
            result.metric_results.get_metric(metric)
        )

comparison_df = pd.DataFrame(comparison_data, index=model_names)
print("Model Comparison:")
print(comparison_df.to_string())

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Model Performance Comparison', fontsize=16)

for idx, metric in enumerate(metrics_to_compare):
    ax = axes[idx // 2, idx % 2]
    comparison_df[metric].plot(kind='bar', ax=ax, color='steelblue')
    ax.set_title(metric.replace('_', ' ').title())
    ax.set_ylabel('Score')
    ax.set_xlabel('Model')
    ax.grid(axis='y', alpha=0.3)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45)

plt.tight_layout()
plt.show()

We aggregate evaluation metrics from all trained models into a unified comparison table for direct performance analysis. We visualize key ranking metrics using bar charts, allowing us to quickly identify strengths and weaknesses across different embedding approaches. Check out the FULL CODES here.

print("\n" + "="*80)
print("SECTION 5: Hyperparameter Optimization")
print("="*80 + "\n")

hpo_result = hpo_pipeline(
    dataset=dataset,
    model='TransE',
    n_trials=10,
    training_loop='sLCWA',
    training_kwargs={'num_epochs': 50},
    device='cuda' if torch.cuda.is_available() else 'cpu',
)

print("\nBest Configuration Found:")
print(f"  Embedding Dim: {hpo_result.study.best_params.get('model.embedding_dim', 'N/A')}")
print(f"  Learning Rate: {hpo_result.study.best_params.get('optimizer.lr', 'N/A')}")
print(f"  Best MRR: {hpo_result.study.best_value:.4f}")

print("\n" + "="*80)
print("SECTION 6: Link Prediction")
print("="*80 + "\n")

best_model_name = comparison_df['mean_reciprocal_rank'].idxmax()
best_result = results[best_model_name]
model = best_result.model

print(f"Using {best_model_name} for predictions")

def predict_tails(model, dataset, head_label: str, relation_label: str, top_k: int = 5):
    """Predict most likely tail entities for a given head and relation."""
    head_id = dataset.entity_to_id[head_label]
    relation_id = dataset.relation_to_id[relation_label]

    num_entities = dataset.num_entities
    heads = torch.tensor([head_id] * num_entities).unsqueeze(1)
    relations = torch.tensor([relation_id] * num_entities).unsqueeze(1)
    tails = torch.arange(num_entities).unsqueeze(1)

    batch = torch.cat([heads, relations, tails], dim=1)

    with torch.no_grad():
        scores = model.predict_hrt(batch)

    top_scores, top_indices = torch.topk(scores.squeeze(), k=top_k)

    predictions = []
    for score, idx in zip(top_scores, top_indices):
        tail_label = dataset.entity_id_to_label[idx.item()]
        predictions.append((tail_label, score.item()))

    return predictions

if dataset.training.num_entities > 10:
    sample_head = list(dataset.entity_to_id.keys())[0]
    sample_relation = list(dataset.relation_to_id.keys())[0]

    print(f"\nTop predictions for: {sample_head} --[{sample_relation}]--> ?")
    predictions = predict_tails(
        best_result.model,
        dataset.training,
        sample_head,
        sample_relation,
        top_k=5
    )

    for rank, (entity, score) in enumerate(predictions, 1):
        print(f"  {rank}. {entity} (score: {score:.4f})")

We apply automated hyperparameter optimization to systematically search for a stronger TransE configuration that improves ranking performance without manual tuning. We then select the best-performing model based on MRR and use it to perform practical link prediction by scoring all possible tail entities for a given head–relation pair. Check out the FULL CODES here.

print("\n" + "="*80)
print("SECTION 7: Model Interpretation")
print("="*80 + "\n")

entity_embeddings = model.entity_representations[0]()
entity_embeddings_tensor = entity_embeddings.detach().cpu()

print(f"Entity embeddings shape: {entity_embeddings_tensor.shape}")
print(f"Embedding dtype: {entity_embeddings_tensor.dtype}")

if entity_embeddings_tensor.is_complex():
    print("Detected complex embeddings - converting to real representation")
    entity_embeddings_np = np.concatenate([
        entity_embeddings_tensor.real.numpy(),
        entity_embeddings_tensor.imag.numpy()
    ], axis=1)
    print(f"Converted embeddings shape: {entity_embeddings_np.shape}")
else:
    entity_embeddings_np = entity_embeddings_tensor.numpy()

from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(entity_embeddings_np)

def find_similar_entities(entity_label: str, top_k: int = 5):
    """Find most similar entities based on embedding similarity."""
    entity_id = dataset.training.entity_to_id[entity_label]
    similarities = similarity_matrix[entity_id]

    similar_indices = np.argsort(similarities)[::-1][1:top_k+1]

    similar_entities = []
    for idx in similar_indices:
        label = dataset.training.entity_id_to_label[idx]
        similarity = similarities[idx]
        similar_entities.append((label, similarity))

    return similar_entities

if dataset.training.num_entities > 5:
    example_entity = list(dataset.entity_to_id.keys())[0]
    print(f"\nEntities most similar to '{example_entity}':")
    similar = find_similar_entities(example_entity, top_k=5)
    for rank, (entity, sim) in enumerate(similar, 1):
        print(f"  {rank}. {entity} (similarity: {sim:.4f})")

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(entity_embeddings_np)

plt.figure(figsize=(12, 8))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], alpha=0.6)

num_labels = min(10, len(dataset.training.entity_id_to_label))
for i in range(num_labels):
    label = dataset.training.entity_id_to_label[i]
    plt.annotate(label, (embeddings_2d[i, 0], embeddings_2d[i, 1]),
                 fontsize=8, alpha=0.7)

plt.title('Entity Embeddings (2D PCA Projection)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n" + "="*80)
print("TUTORIAL SUMMARY")
print("="*80 + "\n")

print("""
Key Takeaways:
1. PyKEEN provides easy-to-use pipelines for KG embeddings
2. Multiple models can be compared with minimal code
3. Hyperparameter optimization improves performance
4. Models can predict missing links in knowledge graphs
5. Embeddings capture semantic relationships
6. Always use filtered evaluation for fair comparison
7. Consider multiple metrics (MRR, Hits@K)

Next Steps:
- Try different models (ConvE, TuckER, etc.)
- Use larger datasets (FB15k-237, WN18RR)
- Implement custom loss functions
- Experiment with relation prediction
- Use your own knowledge graph data

For more information, visit: https://pykeen.readthedocs.io
""")

print("\n✓ Tutorial Complete!")

We interpret the learned entity embeddings by measuring semantic similarity and identifying closely related entities in the vector space. We project high-dimensional embeddings into two dimensions using PCA to visually inspect structural patterns and clustering behavior within the knowledge graph. We then consolidate key takeaways and outline clear next steps, reinforcing how embedding analysis connects model performance to meaningful graph-level insights.

In conclusion, we developed a complete, practical understanding of how to work with knowledge graph embeddings at an advanced level, from raw triples to interpretable vector spaces. We demonstrated how to rigorously compare models, apply hyperparameter optimization, perform link prediction, and analyze embeddings to uncover semantic structure within the graph. Also, we showed how PyKEEN enables rapid experimentation while still allowing fine-grained control over training and evaluation, making it suitable for both research and real-world knowledge graph applications.

Check out the FULL CODES here.

Ant Group Releases LingBot-VLA, A Vision Language Action Foundation Model For Real World Robot Manipulation

How do you build a single vision language action model that can control many different dual arm robots in the real world? LingBot-VLA is Ant Group Robbyant’s new Vision Language Action foundation model that targets practical robot manipulation in the real world. It is trained on about 20,000 hours of teleoperated bimanual data collected from 9 dual arm robot embodiments and is evaluated on the large scale GM-100 benchmark across 3 platforms. The model is designed for cross morphology generalization, data efficient post training, and high training throughput on commodity GPU clusters.

https://arxiv.org/pdf/2601.18692

Large scale dual arm dataset across 9 robot embodiments

The pre-training dataset is built from real world teleoperation on 9 popular dual arm configurations. These include AgiBot G1, AgileX, Galaxea R1Lite, Galaxea R1Pro, Realman Rs 02, Leju KUAVO 4 Pro, Qinglong humanoid, ARX Lift2, and a Bimanual Franka setup. All systems have dual 6 or 7 degree of freedom arms with parallel grippers and multiple RGB-D cameras that provide multi view observations.

Teleoperation uses VR control for AgiBot G1 and isomorphic arm control for AgileX. For each scene the recorded videos from all views are segmented by human annotators into clips that correspond to atomic actions. Static frames at the start and end of each clip are removed to reduce redundancy. Task level and sub task level language instructions are then generated with Qwen3-VL-235B-A22B. This pipeline yields synchronized sequences of images, instructions, and action trajectories for pre-training. 

To characterize action diversity, the research team visualizes the most frequent atomic actions in the training and test sets through word clouds. About 50 percent of atomic actions in the test set do not appear within the top 100 most frequent actions in the training set. This gap ensures that evaluation stresses cross task generalization rather than frequency based memorization.

https://arxiv.org/pdf/2601.18692

Architecture, Mixture of Transformers, and Flow Matching actions

LingBot-VLA combines a strong multimodal backbone with an action expert through a Mixture of Transformers architecture. The vision language backbone is Qwen2.5-VL. It encodes multi-view operational images and the natural language instruction into a sequence of multimodal tokens. In parallel, the action expert receives robot proprioceptive state and chunks of past actions. Both branches share a self attention module that performs layer wise joint sequence modeling over observation and action tokens.

At each time step the model forms an observation sequence that concatenates tokens from 3 camera views, the task instruction, and the robot state. The action sequence is a future action chunk with a temporal horizon set to 50 during pre-training. The training objective is conditional Flow Matching. The model learns a vector field that transports Gaussian noise to the ground truth action trajectory along a linear probability path. This gives a continuous action representation and produces smooth, temporally coherent control suitable for precise dual arm manipulation. 
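To make the objective concrete, here is a minimal sketch of conditional flow matching on a linear path, using hypothetical shapes and a toy velocity network rather than the actual LingBot-VLA action expert: sample a timestep, interpolate between Gaussian noise and the ground truth action chunk, and regress the constant target velocity.

import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    # Toy stand-in for the action expert: predicts a velocity for every action step.
    def __init__(self, action_dim=14, obs_dim=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + 1 + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )
    def forward(self, x_t, t, obs):
        horizon = x_t.size(1)
        t_exp = t.expand(-1, horizon, 1)
        obs_exp = obs.unsqueeze(1).expand(-1, horizon, -1)
        return self.net(torch.cat([x_t, t_exp, obs_exp], dim=-1))

def flow_matching_loss(velocity_net, actions, obs):
    noise = torch.randn_like(actions)        # x_0 ~ N(0, I)
    t = torch.rand(actions.size(0), 1, 1)    # one timestep per sample
    x_t = (1 - t) * noise + t * actions      # linear probability path
    target_velocity = actions - noise        # constant velocity along that path
    return nn.functional.mse_loss(velocity_net(x_t, t, obs), target_velocity)

net = TinyVelocityNet()
actions = torch.randn(4, 50, 14)             # action chunk, horizon 50 as in pre-training
obs = torch.randn(4, 32)                     # placeholder multimodal observation embedding
print(flow_matching_loss(net, actions, obs).item())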

LingBot-VLA uses blockwise causal attention over the joint sequence. Observation tokens can attend to each other bidirectionally. Action tokens can attend to all observation tokens and only to past action tokens. This mask prevents information leakage from future actions into current observations while still allowing the action expert to exploit the full multimodal context at each decision step.

Spatial perception via LingBot Depth distillation

Many VLA models struggle with depth reasoning when depth sensors fail or return sparse measurements. LingBot-VLA addresses this by integrating LingBot-Depth, a separate spatial perception model based on Masked Depth Modeling. LingBot-Depth is trained in a self supervised way on a large RGB-D corpus and learns to reconstruct dense metric depth when parts of the depth map are masked, often in regions where physical sensors tend to fail.

In LingBot-VLA the visual queries from each camera view are aligned with LingBot-Depth tokens through a projection layer and a distillation loss. Cross attention maps VLM queries into the depth latent space and the training minimizes their difference from LingBot-Depth features. This injects geometry aware information into the policy and improves performance on tasks that require accurate 3D spatial reasoning, such as insertion, stacking, and folding under clutter and occlusion. 

GM-100 real world benchmark across 3 platforms

The main evaluation uses GM-100, a real world benchmark with 100 manipulation tasks and 130 filtered teleoperated trajectories per task on each of 3 hardware platforms. Experiments compare LingBot-VLA with π0.5, GR00T N1.6, and WALL-OSS under a shared post training protocol. All methods fine tune from public checkpoints with the same dataset, batch size 256, and 20 epochs. Success Rate measures completion of all subtasks within 3 minutes and Progress Score tracks partial completion.

On GM-100, LingBot-VLA with depth achieves state of the art averages across the 3 platforms. The average Success Rate is 17.30 percent and the average Progress Score is 35.41 percent. π0.5 reaches 13.02 percent SR (success rate) and 27.65 percent PS (progress score). GR00T N1.6 and WALL-OSS are lower at 7.59 percent SR, 15.99 percent PS and 4.05 percent SR, 10.35 percent PS respectively. LingBot-VLA without depth already outperforms GR00T N1.6 and WALL-OSS and the depth variant adds further gains.

In RoboTwin 2.0 simulation with 50 tasks, models are trained on 50 demonstrations per task in clean scenes and 500 per task in randomized scenes. LingBot-VLA with depth reaches 88.56 percent average Success Rate in clean scenes and 86.68 percent in randomized scenes. π0.5 reaches 82.74 percent and 76.76 percent in the same settings. This shows consistent gains from the same architecture and depth integration when domain randomization is strong.

https://arxiv.org/pdf/2601.18692

Scaling behavior and data efficient post training

The research team analyzes scaling laws by varying pre-training data from 3,000 to 20,000 hours on a subset of 25 tasks. Both Success Rate and Progress Score increase monotonically with data volume, with no saturation at the largest scale studied. This is the first empirical evidence that VLA models maintain favorable scaling on real robot data at this size.

They also study data efficiency of post training on AgiBot G1 using 8 representative GM-100 tasks. With only 80 demonstrations per task, LingBot-VLA already surpasses π0.5 trained on the full 130 demonstration set, in both Success Rate and Progress Score. As more trajectories are added, the performance gap widens. This confirms that the pre-trained policy transfers with only dozens to around 100 task specific trajectories, which directly reduces adaptation cost for new robots or tasks.

Training throughput and open source toolkit

LingBot-VLA comes with a training stack optimized for multi-node efficiency. The codebase uses a FSDP style strategy for parameters and optimizer states, hybrid sharding for the action expert, mixed precision with float32 reductions and bfloat16 storage, and operator level acceleration with fused attention kernels and torch compile. 

On an 8 GPU setup the research team reported throughput of 261 samples per second per GPU for Qwen2.5-VL-3B and PaliGemma-3B-pt-224 model configurations. This corresponds to a 1.5 times to 2.8 times speedup compared with existing VLA oriented codebases such as StarVLA, Dexbotic, and OpenPI evaluated on the same Libero based benchmark. Throughput scales close to linearly when moving from 8 to 256 GPUs. The full post training toolkit is released as open source. 

Key Takeaways

LingBot-VLA is a Qwen2.5-VL based vision language action foundation model trained on about 20,000 hours of real world dual arm teleoperation across 9 robot embodiments, which enables strong cross morphology and cross task generalization.

The model integrates LingBot Depth through feature distillation so vision tokens are aligned with a depth completion expert, which significantly improves 3D spatial understanding for insertion, stacking, folding, and other geometry sensitive tasks.

On the GM-100 real world benchmark, LingBot-VLA with depth achieves about 17.30 percent average Success Rate and 35.41 percent average Progress Score, which is higher than π0.5, GR00T N1.6, and WALL OSS under the same post training protocol.

LingBot-VLA shows high data efficiency in post training, since on AgiBot G1 it can surpass π0.5 that uses 130 demonstrations per task while using only about 80 demonstrations per task, and performance continues to improve as more trajectories are added.

Check out the Paper, Model Weight, Repo and Project Page.

Beyond the Chatbox: Generative UI, AG-UI, and the Stack Behind Agent-Driven Interfaces

Most AI applications still showcase the model as a chat box. That interface is simple, but it hides what agents are actually doing, such as planning steps, calling tools, and updating state. Generative UI is about letting the agent drive real interface elements, for example tables, charts, forms, and progress indicators, so the experience feels like a product, not a log of tokens.

https://www.copilotkit.ai/blog/the-state-of-agentic-ui-comparing-ag-ui-mcp-ui-and-a2ui-protocols

What is Generative UI?

The CopilotKit team describes Generative UI as any user interface that is partially or fully produced by an AI agent. Instead of only returning text, the agent can drive:

stateful components such as forms and filters

visualizations such as charts and tables

multistep flows such as wizards

status surfaces such as progress and intermediate results

https://www.copilotkit.ai/blog/the-state-of-agentic-ui-comparing-ag-ui-mcp-ui-and-a2ui-protocols

The key idea is that the UI is still implemented by the application. The agent describes what should change, and the UI layer chooses how to render it and how to keep state consistent. 

Three main patterns of Generative UI:

Static generative UI: the agent selects from a fixed catalog of components and fills props

Declarative generative UI:  the agent returns a structured schema that a renderer maps to components

Fully generated UI:  the model emits raw markup such as HTML or JSX

Most production systems today use static or declarative forms, because they are easier to secure and test.
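As a purely illustrative example, using a hypothetical payload rather than the actual A2UI, Open-JSON-UI, or AG-UI schema, a declarative round trip might look like this: the agent emits structured UI intent that the app maps onto its own component catalog, and the user's interaction comes back as a structured event.

import json

ui_intent = {
    "type": "render_component",
    "component": "table",                  # chosen from the app's fixed catalog
    "props": {
        "title": "Open support tickets",
        "columns": ["id", "status", "assignee"],
        "rows": [
            {"id": "T-101", "status": "open", "assignee": "dana"},
            {"id": "T-102", "status": "pending", "assignee": "lee"},
        ],
    },
}

user_action = {
    "type": "user_action",                 # streamed back to the agent as a structured event
    "component": "table",
    "event": "row_selected",
    "payload": {"id": "T-102"},
}

print(json.dumps(ui_intent, indent=2))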

You can also download the Generative UI Guide here.

But why is it needed for Devs?

The main pain point in agent applications is the connection between the model and the product. Without a standard approach, every team builds custom WebSocket plumbing, ad-hoc event formats, and one off ways to stream tool calls and state.

Generative UI, together with a protocol like AG-UI, gives a consistent mental model:

the agent backend exposes state, tool activity, and UI intent as structured events

the frontend consumes those events and updates components

user interactions are converted back into structured signals that the agent can reason over

CopilotKit packages this in its SDKs with hooks, shared state, typed actions, and Generative UI helpers for React and other frontends. This lets you focus on the agent logic and domain specific UI instead of inventing a protocol.

https://www.copilotkit.ai/blog/the-state-of-agentic-ui-comparing-ag-ui-mcp-ui-and-a2ui-protocols

How does it affect End Users?

For end users, the difference is visible as soon as the workflow becomes non-trivial.

A data analysis copilot can show filters, metric pickers, and live charts instead of describing plots in text. A support agent can surface record editing forms and status timelines instead of long explanations of what it did. An operations agent can show task queues, error badges, and retry buttons that the user can act on.

This is what CopilotKit and the AG-UI ecosystem call agentic UI, user interfaces where the agent is embedded in the product and updates the UI in real time, while users stay in control through direct interaction.

The Protocol Stack, AG-UI, MCP Apps, A2UI, Open-JSON-UI

Several specifications define how agents express UI intent. CopilotKit’s documentation and the AG-UI docs summarize three main generative UI specs:

A2UI from Google, a declarative, JSON based Generative UI spec designed for streaming and platform agnostic rendering

Open-JSON-UI from OpenAI, an open standardization of OpenAI’s internal declarative Generative UI schema for structured interfaces 

MCP Apps from Anthropic and OpenAI, a Generative UI layer on top of MCP where tools can return iframe based interactive surfaces 

These are payload formats. They describe what UI to render, for example a card, table, or form, and the associated data.

AG-UI sits at a different layer. It is the Agent User Interaction protocol, an event-driven, bidirectional runtime that connects any agent backend to any frontend over transports such as server-sent events or WebSockets. AG-UI carries:

lifecycle and message events

state snapshots and deltas

tool activity

user actions

generative UI payloads such as A2UI, Open-JSON-UI, or MCP Apps

MCP connects agents to tools and data, A2A connects agents to each other, A2UI or Open-JSON-UI define declarative UI payloads, MCP Apps defines iframe based UI payloads, and AG-UI moves all of those between agent and UI.
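
As a rough mental model of this layering, the sketch below shows an AG-UI-flavored event stream in Python, with a declarative UI payload riding alongside state deltas and tool activity. The event type names and field names are invented for illustration; the real AG-UI specification defines its own event vocabulary.

python
import json

# Invented, AG-UI-flavored events for illustration only: one event stream
# carries lifecycle, state, tool activity, and generative UI payloads.
events = [
    {"type": "run_started", "run_id": "r-1"},
    {"type": "tool_call", "name": "search_orders", "args": {"status": "late"}},
    {"type": "state_delta", "patch": {"orders_found": 3}},
    {
        "type": "ui",                       # generative UI intent
        "format": "declarative",            # could also be an iframe reference
        "payload": {"component": "chart", "props": {"metric": "late_orders"}},
    },
    {"type": "run_finished", "run_id": "r-1"},
]

def handle(event):
    # The frontend switches on the transport-level event type; only the
    # "ui" events reach a generative UI renderer like the one sketched earlier.
    if event["type"] == "ui":
        print("render:", json.dumps(event["payload"]))
    elif event["type"] == "state_delta":
        print("apply state patch:", event["patch"])
    else:
        print("log:", event["type"])

for e in events:
    handle(e)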

Key Takeaways

Generative UI is structured UI, not just chat: Agents emit structured UI intent, such as forms, tables, charts, and progress, which the app renders as real components, so the model controls stateful views, not only text streams.

AG-UI is the runtime pipe; A2UI, Open JSON UI, and MCP Apps are payloads: AG-UI carries events between agent and frontend, while A2UI, Open JSON UI, and MCP Apps define how UI is described as JSON or iframe based payloads that the UI layer renders.

CopilotKit standardizes agent-to-UI wiring: CopilotKit provides SDKs, shared state, typed actions, and Generative UI helpers so developers do not build custom protocols for streaming state, tool activity, and UI updates.

Static and declarative Generative UI are production friendly: Most real apps use static catalogs of components or declarative specs such as A2UI or Open JSON UI, which keep security, testing, and layout control in the host application.

User interactions become first class events for the agent: Clicks, edits, and submissions are converted into structured AG-UI events, and the agent consumes them as inputs for planning and tool calls, which closes the human-in-the-loop control cycle.

Generative UI sounds abstract until you see it running.

If you’re curious how these ideas translate into real applications, CopilotKit is open source and actively used to build agent-native interfaces – from simple workflows to more complex systems. Dive into the repo and explore the patterns on GitHub. It’s all built in the open.

You can find here additional learning materials for Generative UI. You can also download the Generative UI Guide here.

The post Beyond the Chatbox: Generative UI, AG-UI, and the Stack Behind Agent-Driven Interfaces appeared first on MarkTechPost.

Google DeepMind Unveils AlphaGenome: A Unified Sequence-to-Function Model Using Hybrid Transformers and U-Nets to Decode the Human Genome

Google DeepMind is expanding its biological toolkit beyond the world of protein folding. After the success of AlphaFold, the Google DeepMind research team has introduced AlphaGenome, a unified deep learning model designed for sequence-to-function genomics. This represents a major shift in how we model the human genome. AlphaGenome does not treat DNA as simple text. Instead, it processes 1,000,000 base pair windows of raw DNA to predict the functional state of a cell.

Bridging the Scale Gap with Hybrid Architectures

The complexity of the human genome comes from its scale. Most existing models struggle to see the big picture while keeping track of fine details. AlphaGenome solves this by using a hybrid architecture that combines a U-Net backbone with Transformer blocks. This allows the model to capture long-range interactions across 1 megabase of sequence while maintaining base-pair resolution. It is like building a system that can read a thousand-page book and still remember the exact location of a single comma.
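
As a heavily simplified illustration of this hybrid idea, the PyTorch sketch below downsamples one-hot DNA with strided convolutions (the U-Net-style encoder), applies a Transformer at the coarse resolution to mix information across the whole window, and then upsamples back toward base-pair resolution. All layer sizes and the 4,096 bp toy window are placeholders, not AlphaGenome's actual configuration.

python
import torch
import torch.nn as nn

class ToyHybridGenomeModel(nn.Module):
    """Toy U-Net + Transformer hybrid; all sizes are illustrative only."""
    def __init__(self, channels=64, n_heads=4):
        super().__init__()
        # Encoder: shrink the sequence so attention is affordable.
        self.down = nn.Sequential(
            nn.Conv1d(4, channels, kernel_size=15, stride=4, padding=7),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=15, stride=4, padding=7),
            nn.ReLU(),
        )
        # Transformer captures long-range interactions at the coarse scale.
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=n_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Decoder: upsample back toward base-pair resolution.
        self.up = nn.Sequential(
            nn.ConvTranspose1d(channels, channels, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.ConvTranspose1d(channels, channels, kernel_size=4, stride=4),
        )

    def forward(self, x):              # x: (batch, 4, seq_len), one-hot DNA
        h = self.down(x)               # (batch, C, seq_len / 16)
        h = self.transformer(h.transpose(1, 2)).transpose(1, 2)
        return self.up(h)              # (batch, C, seq_len)

model = ToyHybridGenomeModel()
dna = torch.randn(1, 4, 4096)          # toy stand-in for a 1 Mb one-hot window
print(model(dna).shape)                # torch.Size([1, 64, 4096])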

Mapping Sequences to Functional Biological Modalities

AlphaGenome is a sequence to function model. This means its primary goal is to map DNA sequences directly to biological activities. These activities are measured in genomic tracks. The research team trained AlphaGenome to predict 11 different genomic modalities. These modalities include RNA-seq, CAGE, and ATAC-seq. They also include ChIP-seq for various transcription factors and chromatin contact maps. By predicting all these tracks at once, the model gains a holistic understanding of how DNA regulates the cell.

The Power of Multi-Task Learning in Genomics

The technical advancement of AlphaGenome lies in its ability to handle 11 distinct types of data simultaneously. In the past, researchers often built separate models for each task. AlphaGenome uses a multi-task learning approach. This helps the model learn shared features across different biological processes. If the model understands how a protein binds to DNA, it can better predict how that DNA will be expressed as RNA. This unified approach reduces the need for multiple specialized models.
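
A compressed way to picture the multi-task setup, continuing the toy backbone above: one shared trunk feeds a lightweight head per modality, and the per-track losses are summed so the shared features must serve every task at once. The modality names and head shapes are placeholders.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    """One 1x1-conv head per (placeholder) genomic modality."""
    def __init__(self, channels=64, modalities=("rna_seq", "cage", "atac_seq")):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: nn.Conv1d(channels, 1, kernel_size=1) for name in modalities}
        )

    def forward(self, shared_features):           # (batch, C, seq_len)
        return {name: head(shared_features) for name, head in self.heads.items()}

heads = MultiTaskHeads()
features = torch.randn(1, 64, 4096)               # output of a shared backbone
predictions = heads(features)

# Toy joint loss: every modality pulls on the same shared representation.
targets = {name: torch.randn_like(pred) for name, pred in predictions.items()}
loss = sum(F.mse_loss(predictions[n], targets[n]) for n in predictions)
print(loss.item())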

Advancing Variant Effect Prediction via Distillation

One of the most critical applications for AlphaGenome is Variant Effect Prediction, or VEP. This process determines how a single mutation in DNA affects the body. Mutations can lead to diseases like cancer or heart disease. AlphaGenome excels at this by using a specific training method called Teacher Student distillation. The research team first created an ensemble of ‘all folds’ teacher models. These teachers were trained on vast amounts of genomic data. Then, they distilled that knowledge into a single student model.

Compressing Knowledge for Precision Medicine

This distillation process makes the model both faster and more robust. This is a standard way to compress knowledge. However, applying it to genomics at this scale is a new milestone. The student model learns to replicate the high quality predictions of the teacher ensemble. This allows it to identify harmful mutations with high accuracy. The model can even predict how a mutation in a distant regulatory element might impact a gene far away on the DNA strand.
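
In outline, this kind of teacher-student distillation can be written as a regression of the student's predictions onto the averaged predictions of a frozen teacher ensemble. The sketch below uses dummy linear modules to show the shape of that loop; it is not the AlphaGenome training code.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dummy stand-ins: in practice these would be the cross-fold teacher models
# and the single student model operating on encoded 1 Mb DNA windows.
teachers = [nn.Linear(16, 8) for _ in range(4)]   # frozen ensemble
student = nn.Linear(16, 8)                        # model being distilled
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for t in teachers:
    t.requires_grad_(False)

for step in range(100):
    dna_batch = torch.randn(32, 16)               # placeholder encoded sequences
    with torch.no_grad():
        # Distillation target = mean prediction of the teacher ensemble.
        teacher_target = torch.stack([t(dna_batch) for t in teachers]).mean(dim=0)
    student_pred = student(dna_batch)
    loss = F.mse_loss(student_pred, teacher_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final distillation loss:", loss.item())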

High-Performance Computing with JAX and TPUs

The architecture is implemented using JAX. JAX is a high performance numerical computing library. It is often used for high scale machine learning at Google. Using JAX allows AlphaGenome to run efficiently on Tensor Processing Units, or TPUs. The research team used sequence parallelism to handle the massive 1 Megabase input windows. This ensures that the memory requirements do not explode as the sequence length increases. This shows the importance of selecting the right framework for large scale biological data.

Transfer Learning for Data-Scarce Cell Types

AlphaGenome also addresses the challenge of data scarcity in certain cell types. Because it is a foundation model, it can be fine tuned for specific tasks. The model learns general biological rules from large public datasets. These rules can then be applied to rare diseases or specific tissues where data is hard to find. This transfer learning capability is one of the reasons why AlphaGenome is so versatile. It can predict how a gene will behave in a brain cell even if it was primarily trained on liver cell data.

Toward a New Era of Personalized Care

In the future, AlphaGenome could lead to a new era of personalized medicine. Doctors could use the model to scan a patient’s entire genome in 1,000,000 base pair chunks. They could identify exactly which variants are likely to cause health issues. This would allow for treatments that are tailored to a person’s specific genetic code. AlphaGenome moves us closer to this reality by providing a clear and accurate map of the functional genome.

Setting the Standard for Biological AI

AlphaGenome also marks a turning point for AI in genomics. It proves that we can model the most complex biological systems using the same principles used in modern AI. By combining U-Net structures with Transformers and using teacher student distillation, Google DeepMind team has set a new standard.

Key Takeaways

Hybrid Sequence Architecture: AlphaGenome uses a specialized hybrid design that combines a U-Net backbone with Transformer blocks. This allows the model to process massive windows of 1,000,000 base pairs while maintaining the high resolution needed to identify single mutations.

Multi-Modal Functional Prediction: The model is trained to predict 11 different genomic modalities simultaneously, which include RNA-seq, CAGE, and ATAC-seq. By learning these various biological tracks together, the system gains a holistic understanding of how DNA regulates cellular activity across different tissues.

Teacher-Student Distillation: To achieve industry leading accuracy in Variant Effect Prediction (VEP), researchers used a distillation method. They transferred the knowledge from an ensemble of high performing ‘teacher’ models into a single, efficient ‘student’ model that is faster and more robust for identifying disease-causing mutations.

Built for High Performance Computing: The framework is implemented in JAX and optimized for TPUs. By using sequence parallelism, AlphaGenome can handle the computational load of analyzing megabase scale DNA sequences without exceeding memory limits, making it a powerful tool for large scale research.

Check out the Paper and Repo.
The post Google DeepMind Unveils AlphaGenome: A Unified Sequence-to-Function Model Using Hybrid Transformers and U-Nets to Decode the Human Genome appeared first on MarkTechPost.

Scaling content review operations with a multi-agent workflow

Enterprises are managing ever-growing volumes of content, ranging from product catalogs and support articles to knowledge bases and technical documentation. Ensuring this information remains accurate, relevant, and aligned with the latest business facts is a formidable challenge. Manual content review processes are often slow, costly, and unable to keep pace with dynamic business needs. According to a McKinsey study, organizations that use generative AI for knowledge work, including content review and quality assurance, can boost productivity by up to 30–50% and dramatically reduce time spent on repetitive verification tasks. Similarly, research from Deloitte highlights that AI-driven content operations not only increase efficiency but also help organizations maintain higher content accuracy and reduce operational risk.
Amazon Bedrock AgentCore, a purpose-built infrastructure for deploying and operating AI agents at scale, combined with Strands Agents, an open source SDK for building AI agents, empowers organizations to automate comprehensive content review workflows. This agent-based approach enables businesses to evaluate content for accuracy, verify information against authoritative sources, and generate actionable recommendations for improvement. By using specialized agents that work together autonomously, human experts can focus on strategic review tasks while the AI agent system handles large-scale content validation.
The agent-based approach we present is applicable to any type of enterprise content, from product documentation and knowledge bases to marketing materials and technical specifications. To demonstrate these concepts in action, we walk through a practical example of reviewing blog content for technical accuracy. These patterns and techniques can be directly adapted to various content review needs by adjusting the agent configurations, tools, and verification sources.
Solution overview
The content review solution implements a multi-agent workflow pattern, where three specialized AI agents built with Strands Agents and deployed on Amazon Bedrock AgentCore work in a coordinated pipeline. Each agent receives the output from the previous agent, processes it according to its specialized function, and passes enriched information to the next agent in the sequence. This creates a progressive refinement process where:

Content scanner agent analyzes raw content and extracts relevant information
Content verification agent takes these extracted elements and validates them against authoritative sources
Recommendation agent transforms verification findings into actionable content updates

Technical content maintenance requires multiple specialized agents because manually scanning, verifying, and updating documentation is inefficient and error prone. Each agent has a focused role – the scanner identifies time-sensitive elements, the verifier checks current accuracy, and the recommendation agent crafts precise updates. The system's modular design, with clear interfaces and responsibilities, makes it easy to add new agents or expand capabilities as content complexity grows.

To illustrate how this agent-based content review system works in practice, we walk through an implementation that reviews technical blog posts for accuracy. Tech companies frequently publish blog posts detailing new features, updates, and best practices. However, the rapid pace of innovation means some features become deprecated or updated, making it challenging to keep information current across hundreds or thousands of published posts. While we demonstrate this pattern with blog content, the architecture is content agnostic and supports any content type by configuring the agents with appropriate prompts, tools, and data sources.
Practical example: Blog content review solution
We use three specialized agents that communicate sequentially to automatically review posts and identify outdated technical information. Users can trigger the system manually or schedule it to run periodically.

Figure 1: Blog content review architecture
The workflow begins when a blog URL is provided to the blog scanner agent, which retrieves the content using Strands http_request tool and extracts key technical claims requiring verification. The verification agent then queries the AWS documentation MCP server to fetch the latest documentation and validate the technical claims against current documentation. Finally, the recommendation agent synthesizes the findings and generates a comprehensive review report with actionable recommendations for the blog team.
The code is open source and hosted on GitHub.
Multi-agent workflow
Content scanner agent: Intelligent extraction for obsolescence detection
The content scanner agent serves as the entry point to the multi-agent workflow. It is responsible for identifying potentially obsolete technical information. This agent specifically targets elements that are likely to become outdated over time. The agent analyzes content and produces structured output that categorizes each technical element by type, location in the blog, and time-sensitivity. This structured format enables the verification agent to receive well-organized data it can efficiently process.
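
As a rough illustration, the scanner's structured output could look like the dictionary below. The field names are our own placeholders, not the schema used in the sample repository.

python
# Illustrative shape of the content scanner agent's output: each technical
# element is tagged with a type, a location, and a time-sensitivity rating
# so the verification agent can prioritize it and query documentation.
scanner_output = {
    "source_url": "https://example.com/blog/launch-post",
    "elements": [
        {
            "id": "elem-001",
            "type": "region_availability",
            "claim": "Amazon Bedrock is available in us-east-1 and us-west-2 regions only",
            "location": "section: Getting started, paragraph 2",
            "time_sensitivity": "high",
        },
        {
            "id": "elem-002",
            "type": "cli_syntax",
            "claim": "aws bedrock list-foundation-models --region us-east-1",
            "location": "code block 1",
            "time_sensitivity": "medium",
        },
    ],
}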
Content verification agent: Evidence-based validation
The content verification agent receives the structured technical elements from the scanner agent and performs validation against authoritative sources. The verification agent uses the AWS documentation MCP server to access current technical documentation. For each technical element received from the scanner agent, it follows a systematic verification process guided by specific prompts that focus on objective, measurable criteria.
The agent is prompted to check for:

Version-specific information: Does the mentioned version number, API endpoint, or configuration parameter still exist?
Feature availability: Is the described service feature still available in the specified regions or tiers?
Syntax accuracy: Do code examples, CLI commands, or configuration snippets match current documentation?
Prerequisite validity: Are the listed requirements, dependencies, or setup steps still accurate?
Pricing and limits: Do mentioned costs, quotas, or service limits align with current published information?

For each technical element received from the scanner agent, the agent performs the following steps:

Generates targeted search queries based on the element type and content
Queries the documentation server for current information
Compares the original claim against authoritative sources using the specific criteria above
Classifies the verification result as CURRENT, PARTIALLY_OBSOLETE, or FULLY_OBSOLETE
Documents specific discrepancies with evidence

Example verification in action: When the scanner agent identifies the claim “Amazon Bedrock is available in us-east-1 and us-west-2 regions only,” the Verification Agent generates the search query “Amazon Bedrock available regions” and retrieves current regional availability from AWS documentation. Upon finding that Bedrock is now available in 8+ regions including eu-west-1 and ap-southeast-1, it classifies this as PARTIALLY_OBSOLETE with the evidence: “Original claim lists 2 regions, but current documentation shows availability in us-east-1, us-west-2, eu-west-1, ap-southeast-1, and 4 additional regions as of the verification date.”
The verification agent’s output maintains the element structure from the scanner agent while adding these verification details and evidence-based classifications.
Recommendation agent: Actionable update generation
The recommendation agent represents the final stage in the multi-agent workflow, transforming verification findings into ready-to-implement content updates. This agent receives the verification results and generates specific recommendations that maintain the original content’s style while correcting technical inaccuracies.
Adapting the multi-agent workflow pattern for your content review use cases
The multi-agent workflow pattern can be quickly adapted to any content review scenario without architectural changes. Whether reviewing product documentation, marketing materials, or regulatory compliance documents, the same three-agent sequential workflow applies. The system prompts need to be modified for each agent to focus on domain-specific elements, and the tools or knowledge sources may need to be swapped out. For instance, while our blog review example uses an http_request tool to fetch the blog content and the AWS Documentation MCP Server for verification, a product catalog review system might use a database connector tool to retrieve product information and query inventory management APIs for verification. Similarly, a compliance review system would adjust the scanner agent's prompt to identify regulatory statements instead of technical claims, connect the verification agent to legal databases rather than technical documentation, and configure the recommendation agent to generate audit-ready reports instead of content updates. The core sequential steps, extraction, verification, and recommendation, remain constant across all these scenarios, providing a proven pattern that scales from technical blogs to any enterprise content type. We recommend the following changes to customize the solution for other content types.

Replace the values of CONTENT_SCANNER_PROMPT, CONTENT_VERIFICATION_PROMPT, and RECOMMENDATION_PROMPT variables with your custom prompt instructions:

python
CONTENT_SCANNER_PROMPT = """<replace with your prompt instructions>"""
CONTENT_VERIFICATION_PROMPT = """<replace with your prompt instructions>"""
RECOMMENDATION_PROMPT = """<replace with your prompt instructions>"""

Update the official documentation MCP server for content verification agent:

python
product_db_mcp_client = MCPClient(
    lambda: stdio_client(StdioServerParameters(
        command="uvx", args=["<replace with your official documentation MCP server>"]
    ))
)

Add appropriate content access tools such as database_query_tool and cms_api_tool for the content scanner agent when the http_request tool is insufficient:

python
scanner_agent = Agent(
    model="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    system_prompt=CONTENT_SCANNER_PROMPT,
    tools=[database_query_tool, cms_api_tool]  # Replace http_request
)

These targeted modifications enable the same architectural pattern to handle any content type while maintaining the proven three-agent workflow structure, ensuring reliability and consistency across different content domains without requiring changes to the core orchestration logic.
Conclusion and next steps
In this post, we explained how to architect an AI agent powered content review system using Amazon Bedrock AgentCore and Strands Agents. We demonstrated the multi-agent workflow pattern where specialized agents work together to scan content, verify technical accuracy against authoritative sources, and generate actionable recommendations. Additionally, we discussed how to adapt this multi-agent pattern for different content types by modifying agent prompts, tools, and data sources while maintaining the same architectural framework.
We encourage you to test the sample code available on GitHub in your own account to gain first-hand experience with the solution. As next steps, consider starting with a pilot project on a subset of your content, customizing the agent prompts for your specific domain, and integrating appropriate verification sources for your use case. The modular nature of this architecture allows you to iteratively refine each agent’s capabilities as you expand the system to handle your organization’s full content review needs.

About the authors
Sarath Krishnan is a Senior Gen AI/ML Specialist Solutions Architect at Amazon Web Services, where he helps enterprise customers design and deploy generative AI and machine learning solutions that deliver measurable business outcomes. He brings deep expertise in Generative AI, Machine Learning, and MLOps to build scalable, secure, and production-ready AI systems.
Santhosh Kuriakose is an AI/ML Specialist Solutions Architect at Amazon Web Services, where he leverages his expertise in AI and ML to build technology solutions that deliver strategic business outcomes for his customers.
Ravi Vijayan is a Customer Solutions Manager with Amazon Web Services. He brings expertise as a Developer, Tech Program Manager, and Client Partner, and is currently focused on helping customers fully realize the potential and benefits of migrating to the cloud and modernizing with Generative AI.

Alibaba Introduces Qwen3-Max-Thinking, a Test Time Scaled Reasoning Model with Native Tool Use Powering Agentic Workloads

Qwen3-Max-Thinking is Alibaba's new flagship reasoning model. It not only scales parameters; it also changes how inference is done, with explicit control over thinking depth and built-in tools for search, memory, and code execution.

https://qwen.ai/blog?id=qwen3-max-thinking

Model scale, data, and deployment

Qwen3-Max-Thinking is a trillion-parameter MoE flagship LLM pretrained on 36T tokens and built on the Qwen3 family as the top tier reasoning model. The model targets long horizon reasoning and code, not only casual chat. It runs with a context window of 260k tokens, which supports repository scale code, long technical reports, and multi document analysis within a single prompt.

Qwen3-Max-Thinking is a closed model served through Qwen-Chat and Alibaba Cloud Model Studio with an OpenAI compatible HTTP API. The same endpoint can be called in a Claude style tool schema, so existing Anthropic or Claude Code flows can swap in Qwen3-Max-Thinking with minimal changes. There are no public weights, so usage is API based, which matches its positioning.

Smart Test Time Scaling and experience cumulative reasoning

Most large language models improve reasoning by simple test time scaling, for example best of N sampling with several parallel chains of thought. That approach increases quality but cost grows almost linearly with the number of samples. Qwen3-Max-Thinking introduces an experience cumulative, multi round test time scaling strategy.

Instead of only sampling more in parallel, the model iterates within a single conversation, reusing intermediate reasoning traces as structured experience. After each round, it extracts useful partial conclusions, then focuses subsequent computation on unresolved parts of the question. This process is controlled by an explicit thinking budget that developers can adjust via API parameters such as enable_thinking and additional configuration fields.

The reported effect is that accuracy rises without a proportional increase in token count. For example, Qwen's own ablations show GPQA Diamond increasing from around 90 to about 92.8, and LiveCodeBench v6 rising from about 88.0 to 91.4 under the experience cumulative strategy at similar token budgets. This matters because it means higher reasoning quality can be driven by more efficient scheduling of compute, not only by more samples.
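
Because the model is exposed through an OpenAI-compatible endpoint, a call that turns on deeper thinking might look roughly like the sketch below. The base URL, model identifier, and the placement of enable_thinking inside extra_body are assumptions for illustration; the Alibaba Cloud Model Studio documentation defines the exact parameter names.

python
# Hypothetical call sketch: OpenAI-compatible client pointed at Alibaba Cloud
# Model Studio. Base URL, model id, and extra_body fields are assumptions.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",                                   # placeholder
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

response = client.chat.completions.create(
    model="qwen3-max-thinking",                                         # assumed model id
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"enable_thinking": True},  # thinking control; exact field name assumed
)
print(response.choices[0].message.content)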

Native agent stack with Adaptive Tool Use

Qwen3-Max-Thinking integrates three tools as first class capabilities: Search, Memory, and a Code Interpreter. Search connects to web retrieval so the model can fetch fresh pages, extract content, and ground its answers. Memory stores user or session specific state, which supports personalized reasoning over longer workflows. The Code Interpreter executes Python, which allows numeric verification, data transforms, and program synthesis with runtime checks.

The model uses Adaptive Tool Use to decide when to invoke these tools during a conversation. Tool calls are interleaved with internal thinking segments, rather than being orchestrated by an external agent. This design reduces the need for separate routers or planners and tends to reduce hallucinations, because the model can explicitly fetch missing information or verify calculations instead of guessing.

Tool ability is also benchmarked. On Tau² Bench, which measures function calling and tool orchestration, Qwen3-Max-Thinking reports a score of 82.1, comparable with other frontier models in this category.

Benchmark profile across knowledge, reasoning, and search

On 19 public benchmarks, Qwen3-Max-Thinking is positioned at or near the same level as GPT 5.2 Thinking, Claude Opus 4.5, and Gemini 3 Pro. For knowledge tasks, reported scores include 85.7 on MMLU-Pro, 92.8 on MMLU-Redux, and 93.7 on C-Eval, where Qwen leads the group on Chinese language evaluation.

For hard reasoning, it records 87.4 on GPQA, 98.0 on HMMT Feb 25, 94.7 on HMMT Nov 25, and 83.9 on IMOAnswerBench, which puts it in the top tier of current math and science models. On coding and software engineering it reaches 85.9 on LiveCodeBench v6 and 75.3 on SWE Verified.

In the base HLE configuration Qwen3-Max-Thinking scores 30.2, below Gemini 3 Pro at 37.5 and GPT 5.2 Thinking at 35.5. In a tool enabled HLE setup, the official comparison table that includes web search integration shows Qwen3-Max-Thinking at 49.8, ahead of GPT 5.2 Thinking at 45.5 and Gemini 3 Pro at 45.8. With its most aggressive experience cumulative test time scaling configuration on HLE with tools, Qwen3-Max-Thinking reaches 58.3 while GPT 5.2 Thinking remains at 45.5, although that higher number is for a heavier inference mode than the standard comparison table.

Key Takeaways

Qwen3-Max-Thinking is a closed, API only flagship reasoning model from Alibaba, built on a more than 1 trillion parameter backbone trained on about 36 trillion tokens with a 262144 token context window.

The model introduces experience cumulative test time scaling, where it reuses intermediate reasoning across multiple rounds, improving benchmarks such as GPQA Diamond and LiveCodeBench v6 at similar token budgets.

Qwen3-Max-Thinking integrates Search, Memory, and a Code Interpreter as native tools and uses Adaptive Tool Use so the model itself decides when to browse, recall state, or execute Python during a conversation.

On public benchmarks it reports scores competitive with GPT 5.2 Thinking, Claude Opus 4.5, and Gemini 3 Pro, including strong results on MMLU Pro, GPQA, HMMT, IMOAnswerBench, LiveCodeBench v6, SWE Bench Verified, and Tau² Bench.

Check out the API and Technical details.
The post Alibaba Introduces Qwen3-Max-Thinking, a Test Time Scaled Reasoning Model with Native Tool Use Powering Agentic Workloads appeared first on MarkTechPost.

How to Design Self-Reflective Dual-Agent Governance Systems with Constitutional AI for Secure and Compliant Financial Operations

In this tutorial, we implement a dual-agent governance system that applies Constitutional AI principles to financial operations. We demonstrate how we separate execution and oversight by pairing a Worker Agent that performs financial actions with an Auditor Agent that enforces policy, safety, and compliance. By encoding governance rules directly into a formal constitution and combining rule-based checks with AI-assisted reasoning, we can build systems that are self-reflective, auditable, and resilient to risky or non-compliant behavior in high-stakes financial workflows. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser!pip install -q pydantic anthropic python-dotenv

import json
import re
from typing import List, Dict, Any, Optional, Literal
from pydantic import BaseModel, Field, validator
from enum import Enum
from datetime import datetime
import os

We install and import the core libraries required to structure, validate, and govern our agent-based system. We rely on Pydantic for strongly typed data models, enums, and validation, while standard Python utilities handle timestamps, parsing, and environment configuration. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass PolicyViolationType(str, Enum):
“””Types of policy violations”””
PII_EXPOSURE = “pii_exposure”
BUDGET_EXCEEDED = “budget_exceeded”
UNAUTHORIZED_ACTION = “unauthorized_action”
MISSING_JUSTIFICATION = “missing_justification”
SUSPICIOUS_PATTERN = “suspicious_pattern”

class SafetyPolicy(BaseModel):
“””Individual safety policy rule”””
name: str
description: str
severity: Literal[“low”, “medium”, “high”, “critical”]
check_function: str

class Constitution(BaseModel):
“””The ‘Constitution’ – A set of rules that govern agent behavior”””
policies: List[SafetyPolicy]
max_transaction_amount: float = 10000.0
require_approval_above: float = 5000.0
allowed_pii_fields: List[str] = [“name”, “account_id”]

def get_policy_by_name(self, name: str) -> Optional[SafetyPolicy]:
return next((p for p in self.policies if p.name == name), None)

FINANCIAL_CONSTITUTION = Constitution(
policies=[
SafetyPolicy(
name=”PII Protection”,
description=”Must not expose sensitive PII (SSN, full credit card, passwords)”,
severity=”critical”,
check_function=”Scan for SSN patterns, credit card numbers, passwords”
),
SafetyPolicy(
name=”Budget Limits”,
description=”Transactions must not exceed predefined budget limits”,
severity=”high”,
check_function=”Check if transaction amount exceeds max_transaction_amount”
),
SafetyPolicy(
name=”Action Authorization”,
description=”Only pre-approved action types are allowed”,
severity=”high”,
check_function=”Verify action type is in approved list”
),
SafetyPolicy(
name=”Justification Required”,
description=”All transactions above threshold must have justification”,
severity=”medium”,
check_function=”Check for justification field in high-value transactions”
),
SafetyPolicy(
name=”Pattern Detection”,
description=”Detect suspicious patterns (multiple rapid transactions, round numbers)”,
severity=”medium”,
check_function=”Analyze transaction patterns for anomalies”
)
],
max_transaction_amount=10000.0,
require_approval_above=5000.0
)

We define the core constitutional framework that governs agent behavior by formalizing policy types, severities, and enforcement rules. We encode financial safety constraints such as PII protection, budget limits, authorization checks, and justification requirements as first-class, machine-readable policies. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass FinancialRequest(BaseModel):
“””Input request to the Worker Agent”””
action: str
amount: Optional[float] = None
recipient: Optional[str] = None
description: str
justification: Optional[str] = None
metadata: Dict[str, Any] = Field(default_factory=dict)

class WorkerOutput(BaseModel):
“””Output from the Worker Agent”””
request_id: str
action_taken: str
details: Dict[str, Any]
raw_response: str
timestamp: str = Field(default_factory=lambda: datetime.now().isoformat())

class PolicyViolation(BaseModel):
“””Detected policy violation”””
policy_name: str
violation_type: PolicyViolationType
severity: str
description: str
suggested_fix: Optional[str] = None

class AuditResult(BaseModel):
“””Result from the Auditor Agent”””
approved: bool
violations: List[PolicyViolation] = Field(default_factory=list)
risk_score: float # 0-100
feedback: str
revision_needed: bool

@validator("risk_score")
def validate_risk_score(cls, v):
if isinstance(v, (int, float)):
return max(0.0, min(100.0, v))
return v

We define strongly typed data models that structure how financial requests, agent outputs, and audit findings flow through the system. We use these schemas to ensure every action, decision, and violation is captured in a consistent, machine-validated format with full traceability. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass MockAIClient:
“””Simulates the Anthropic API for this tutorial”””

def __init__(self):
self.call_count = 0

def messages_create(self, model: str, max_tokens: int, messages: List[Dict]) -> Any:
“””Simulate API call”””
self.call_count += 1
user_msg = messages[-1][“content”]

if “WORKER AGENT” in user_msg or “financial request” in user_msg.lower():
return self._worker_response(user_msg)

elif “AUDITOR AGENT” in user_msg or “audit” in user_msg.lower():
return self._auditor_response(user_msg)

return self._default_response()

def _worker_response(self, msg: str) -> Any:
“””Simulate worker agent processing a request”””

amount_match = re.search(r'\$?(\d+(?:,\d{3})*(?:\.\d{2})?)', msg)
amount = float(amount_match.group(1).replace(',', '')) if amount_match else 0

if ‘transfer’ in msg.lower():
action = ‘transfer’
elif ‘payment’ in msg.lower() or ‘pay’ in msg.lower():
action = ‘payment’
elif ‘report’ in msg.lower():
action = ‘report’
else:
action = ‘general_query’

response = {
“action_taken”: action,
“amount”: amount,
“status”: “completed”,
“recipient”: “John Doe” if amount > 0 else None,
“account_id”: “ACC-12345”,
“timestamp”: datetime.now().isoformat()
}

if amount > 5000:
response[“ssn”] = “123-45-6789”

if amount > 8000:
response[“credit_card”] = “4532-1234-5678-9010”

class MockResponse:
def __init__(self, content):
self.content = [type(‘obj’, (object,), {
‘type’: ‘text’,
‘text’: json.dumps(content, indent=2)
})]

return MockResponse(response)

def _auditor_response(self, msg: str) -> Any:
“””Simulate auditor agent checking policies”””

violations = []

if 'ssn' in msg.lower() or re.search(r'\d{3}-\d{2}-\d{4}', msg):
violations.append({
“policy”: “PII Protection”,
“type”: “pii_exposure”,
“severity”: “critical”,
“detail”: “SSN detected in output”
})

if 'credit_card' in msg.lower() or re.search(r'\d{4}-\d{4}-\d{4}-\d{4}', msg):
violations.append({
“policy”: “PII Protection”,
“type”: “pii_exposure”,
“severity”: “critical”,
“detail”: “Credit card number detected”
})

amount_match = re.search(r'"amount":\s*(\d+(?:\.\d+)?)', msg)
if amount_match:
amount = float(amount_match.group(1))
if amount > 10000:
violations.append({
“policy”: “Budget Limits”,
“type”: “budget_exceeded”,
“severity”: “high”,
“detail”: f”Amount ${amount} exceeds limit of $10,000″
})
elif amount > 5000 and ‘justification’ not in msg.lower():
violations.append({
“policy”: “Justification Required”,
“type”: “missing_justification”,
“severity”: “medium”,
“detail”: “High-value transaction lacks justification”
})

audit_result = {
“approved”: len(violations) == 0,
“violations”: violations,
“risk_score”: min(len(violations) * 30, 100),
“feedback”: “Transaction approved” if len(violations) == 0 else “Violations detected – revision required”
}

class MockResponse:
def __init__(self, content):
self.content = [type(‘obj’, (object,), {
‘type’: ‘text’,
‘text’: json.dumps(content, indent=2)
})]

return MockResponse(audit_result)

def _default_response(self) -> Any:
class MockResponse:
def __init__(self):
self.content = [type(‘obj’, (object,), {
‘type’: ‘text’,
‘text’: ‘{“status”: “acknowledged”}’
})]
return MockResponse()

We simulate the behavior of a large language model by implementing a mock AI client that differentiates between worker and auditor roles. We intentionally inject policy violations such as PII leakage and budget issues to stress-test the governance logic under realistic failure conditions. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass WorkerAgent:
“””Agent A – The Worker that processes financial requests”””

def __init__(self, client: MockAIClient):
self.client = client
self.role = “Financial Operations Worker”
self.processed_requests = []

def process_request(self, request: FinancialRequest) -> WorkerOutput:
“””Process a financial request”””
print(f”n{‘=’*60}”)
print(f” WORKER AGENT: Processing request…”)
print(f”{‘=’*60}”)
print(f”Action: {request.action}”)
if request.amount:
print(f”Amount: ${request.amount:,.2f}”)
else:
print(“Amount: N/A”)
print(f”Description: {request.description}”)

prompt = self._build_worker_prompt(request)

response = self.client.messages_create(
model=”claude-sonnet-4-20250514″,
max_tokens=1000,
messages=[{“role”: “user”, “content”: prompt}]
)

raw_response = response.content[0].text

try:
details = json.loads(raw_response)
except json.JSONDecodeError:
details = {“raw”: raw_response}

output = WorkerOutput(
request_id=f”REQ-{len(self.processed_requests)+1:04d}”,
action_taken=request.action,
details=details,
raw_response=raw_response
)

self.processed_requests.append(output)
print(f”n Worker completed processing (ID: {output.request_id})”)

return output

def _build_worker_prompt(self, request: FinancialRequest) -> str:
“””Build prompt for worker agent”””
amount_str = f”${request.amount:,.2f}” if request.amount else “$0.00″
return f”””You are a WORKER AGENT processing a financial request.

Request Details:
– Action: {request.action}
– Amount: {amount_str}
– Recipient: {request.recipient or ‘N/A’}
– Description: {request.description}
– Justification: {request.justification or ‘None provided’}

Process this request and return a JSON response with:
– action_taken
– amount
– status
– recipient
– account_id
– timestamp
– Any other relevant details

Return ONLY valid JSON.”””

class AuditorAgent:
“””Agent B – The Auditor that validates worker output”””

def __init__(self, client: MockAIClient, constitution: Constitution):
self.client = client
self.constitution = constitution
self.role = “Governance Auditor”
self.audit_history = []

def audit(self, worker_output: WorkerOutput) -> AuditResult:
“””Audit the worker’s output against the constitution”””
print(f”n{‘=’*60}”)
print(f” AUDITOR AGENT: Auditing output…”)
print(f”{‘=’*60}”)

violations = self._check_rules(worker_output)

prompt = self._build_auditor_prompt(worker_output, violations)

response = self.client.messages_create(
model=”claude-sonnet-4-20250514″,
max_tokens=1000,
messages=[{“role”: “user”, “content”: prompt}]
)

raw_audit = response.content[0].text
try:
audit_data = json.loads(raw_audit)
except json.JSONDecodeError:
audit_data = {“approved”: False, “violations”: violations, “risk_score”: 50}

result = AuditResult(
approved=audit_data.get(“approved”, False) and len(violations) == 0,
violations=violations,
risk_score=audit_data.get(“risk_score”, len(violations) * 25),
feedback=audit_data.get(“feedback”, “Audit completed”),
revision_needed=not audit_data.get(“approved”, False) or len(violations) > 0
)

self.audit_history.append(result)

self._display_audit_result(result)

return result

def _check_rules(self, output: WorkerOutput) -> List[PolicyViolation]:
“””Perform rule-based constitutional checks”””
violations = []
details_str = json.dumps(output.details)

if re.search(r'\d{3}-\d{2}-\d{4}', details_str):
violations.append(PolicyViolation(
policy_name=”PII Protection”,
violation_type=PolicyViolationType.PII_EXPOSURE,
severity=”critical”,
description=”Social Security Number detected in output”,
suggested_fix=”Remove or mask SSN field”
))

if re.search(r'\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}', details_str):
violations.append(PolicyViolation(
policy_name=”PII Protection”,
violation_type=PolicyViolationType.PII_EXPOSURE,
severity=”critical”,
description=”Credit card number detected in output”,
suggested_fix=”Remove or tokenize credit card number”
))

amount = output.details.get(“amount”, 0)
if amount > self.constitution.max_transaction_amount:
violations.append(PolicyViolation(
policy_name=”Budget Limits”,
violation_type=PolicyViolationType.BUDGET_EXCEEDED,
severity=”high”,
description=f”Amount ${amount:,.2f} exceeds limit of ${self.constitution.max_transaction_amount:,.2f}”,
suggested_fix=f”Reduce amount to ${self.constitution.max_transaction_amount:,.2f} or request approval”
))

if amount > self.constitution.require_approval_above:
if “justification” not in details_str.lower():
violations.append(PolicyViolation(
policy_name=”Justification Required”,
violation_type=PolicyViolationType.MISSING_JUSTIFICATION,
severity=”medium”,
description=f”Transaction of ${amount:,.2f} requires justification”,
suggested_fix=”Add justification field explaining the transaction”
))

return violations

def _build_auditor_prompt(self, output: WorkerOutput, violations: List[PolicyViolation]) -> str:
“””Build prompt for auditor agent”””
return f”””You are an AUDITOR AGENT validating financial operations against a Constitution.

Constitution Policies:
{json.dumps([p.dict() for p in self.constitution.policies], indent=2)}

Worker Output to Audit:
{output.raw_response}

Already Detected Violations:
{json.dumps([v.dict() for v in violations], indent=2)}

Perform additional analysis and return JSON with:
– approved (boolean)
– risk_score (0-100)
– feedback (string)
– Any additional concerns

Return ONLY valid JSON.”””

def _display_audit_result(self, result: AuditResult):
“””Display audit results in a readable format”””
print(f”n AUDIT RESULTS:”)
print(f”Status: {‘ APPROVED’ if result.approved else ‘ REJECTED’}”)
print(f”Risk Score: {result.risk_score:.1f}/100″)
print(f”Violations Found: {len(result.violations)}”)

if result.violations:
print(f”n POLICY VIOLATIONS:”)
for i, v in enumerate(result.violations, 1):
print(f”n {i}. {v.policy_name} [{v.severity.upper()}]”)
print(f” Type: {v.violation_type.value}”)
print(f” Issue: {v.description}”)
if v.suggested_fix:
print(f” Fix: {v.suggested_fix}”)

print(f”n Feedback: {result.feedback}”)
print(f”Revision Needed: {‘Yes’ if result.revision_needed else ‘No’}”)

We implement the core dual-agent logic by separating execution and governance responsibilities between a Worker Agent and an Auditor Agent. We allow the worker to focus purely on fulfilling financial requests, while we enforce constitutional rules through deterministic checks and AI-assisted auditing. By combining structured prompts, rule-based validation, and clear audit feedback, we create a self-reflective control loop that prioritizes safety, accountability, and compliance. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass GovernanceSystem:
“””Orchestrates the dual-agent governance workflow”””

def __init__(self, constitution: Constitution):
self.client = MockAIClient()
self.worker = WorkerAgent(self.client)
self.auditor = AuditorAgent(self.client, constitution)
self.constitution = constitution
self.max_revision_attempts = 3

def process_with_governance(self, request: FinancialRequest) -> Dict[str, Any]:
“””Main workflow: Worker processes, Auditor validates, loop if needed”””
print(f”n{‘#’*60}”)
print(f”# GOVERNANCE SYSTEM: New Request”)
print(f”{‘#’*60}”)

attempt = 0
while attempt < self.max_revision_attempts:
attempt += 1
print(f”n Attempt {attempt}/{self.max_revision_attempts}”)

worker_output = self.worker.process_request(request)

audit_result = self.auditor.audit(worker_output)

if audit_result.approved:
print(f”n{‘=’*60}”)
print(f” FINAL RESULT: APPROVED”)
print(f”{‘=’*60}”)
return {
“status”: “approved”,
“output”: worker_output.dict(),
“audit”: audit_result.dict(),
“attempts”: attempt
}

critical_violations = [v for v in audit_result.violations if v.severity == “critical”]
if critical_violations:
print(f”n{‘=’*60}”)
print(f” FINAL RESULT: REJECTED (Critical Violations)”)
print(f”{‘=’*60}”)
return {
“status”: “rejected”,
“reason”: “critical_violations”,
“audit”: audit_result.dict(),
“attempts”: attempt
}

if attempt >= self.max_revision_attempts:
print(f”n{‘=’*60}”)
print(f” FINAL RESULT: REJECTED (Max Attempts)”)
print(f”{‘=’*60}”)
return {
“status”: “rejected”,
“reason”: “max_attempts_exceeded”,
“audit”: audit_result.dict(),
“attempts”: attempt
}

return {“status”: “error”, “message”: “Unexpected exit from loop”}

We orchestrate the complete governance workflow by coordinating the worker and auditor agents within a controlled revision loop. We evaluate each attempt against constitutional rules and immediately halt execution when critical violations are detected. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef run_examples():
“””Run demonstration examples”””

print(“=”*80)
print(” DUAL-AGENT GOVERNANCE SYSTEM WITH CONSTITUTIONAL AI”)
print(” Tutorial: Self-Reflective Financial Operations Agents”)
print(“=”*80)

system = GovernanceSystem(FINANCIAL_CONSTITUTION)

print(“nn” + “=”*80)
print(“EXAMPLE 1: Safe Transaction ($2,500)”)
print(“=”*80)

request1 = FinancialRequest(
action=”payment”,
amount=2500.00,
recipient=”Vendor Corp”,
description=”Monthly software license payment”,
justification=”Regular recurring payment for essential services”
)

result1 = system.process_with_governance(request1)

print(“nn” + “=”*80)
print(“EXAMPLE 2: High-Value Transaction with PII Leak ($7,500)”)
print(“=”*80)

request2 = FinancialRequest(
action=”transfer”,
amount=7500.00,
recipient=”Executive”,
description=”Bonus payment to executive”,
justification=”Q4 performance bonus”
)

result2 = system.process_with_governance(request2)

print(“nn” + “=”*80)
print(“EXAMPLE 3: Budget-Exceeding Transaction ($15,000)”)
print(“=”*80)

request3 = FinancialRequest(
action=”transfer”,
amount=15000.00,
recipient=”Supplier”,
description=”Large equipment purchase”,
justification=”New manufacturing equipment for production line”
)

result3 = system.process_with_governance(request3)

print(“nn” + “=”*80)
print(” SUMMARY OF RESULTS”)
print(“=”*80)
print(f”nExample 1: {result1[‘status’].upper()}”)
print(f”Example 2: {result2[‘status’].upper()} – {result2.get(‘reason’, ‘N/A’)}”)
print(f”Example 3: {result3[‘status’].upper()} – {result3.get(‘reason’, ‘N/A’)}”)

print(f”nnTotal API Calls: {system.client.call_count}”)
print(f”Worker Processed: {len(system.worker.processed_requests)} requests”)
print(f”Auditor Performed: {len(system.auditor.audit_history)} audits”)

print(“nn” + “=”*80)
print(” ACTIVE CONSTITUTION”)
print(“=”*80)
for policy in FINANCIAL_CONSTITUTION.policies:
print(f”n {policy.name} [{policy.severity.upper()}]”)
print(f” {policy.description}”)

We demonstrate the system end-to-end by running realistic financial scenarios that exercise both safe and unsafe behaviors. We show how the governance loop responds differently to compliant transactions, PII leaks, and budget violations while producing transparent audit outcomes. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserif __name__ == “__main__”:
run_examples()

print(“nn” + “=”*80)
print(” TUTORIAL COMPLETE!”)
print(“=”*80)
print(“nKey Concepts Demonstrated:”)
print(“✓ Constitutional AI – Rule-based governance”)
print(“✓ Dual-Agent System – Worker + Auditor pattern”)
print(“✓ Policy Violation Detection – PII, Budget, Authorization”)
print(“✓ Iterative Revision Loop – Self-correction mechanism”)
print(“✓ Risk Scoring – Quantitative safety assessment”)
print(“nNext Steps:”)
print(“• Replace MockAIClient with real Anthropic API”)
print(“• Implement actual revision logic in Worker Agent”)
print(“• Add more sophisticated pattern detection”)
print(“• Integrate with real financial systems”)
print(“• Build logging and monitoring dashboard”)
print(“=”*80)

We conclude the tutorial by executing all examples and clearly surfacing the core concepts demonstrated by the system. We recap how constitutional rules, dual-agent governance, violation detection, and risk scoring work together in practice.

In conclusion, we demonstrated how to operationalize Constitutional AI beyond theory and embed it into real-world financial decision-making pipelines. We illustrated how we detect and respond to PII leakage, budget overruns, and missing justifications while quantifying risk and enforcing hard governance boundaries. By orchestrating iterative review loops between worker and auditor agents, we demonstrated a practical blueprint for building trustworthy, compliant, and scalable AI-driven financial systems where safety and accountability are first-class design goals rather than afterthoughts.

Check out the FULL CODES here.
The post How to Design Self-Reflective Dual-Agent Governance Systems with Constitutional AI for Secure and Compliant Financial Operations appeared first on MarkTechPost.

MBZUAI Releases K2 Think V2: A Fully Sovereign 70B Reasoning Model For Math, Code, And Science

Can a fully sovereign open reasoning model match state-of-the-art systems when every part of its training pipeline is transparent? Researchers from Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) release K2 Think V2, a fully sovereign reasoning model designed to test how far open, fully documented pipelines can push long horizon reasoning on math, code, and science when the entire stack is reproducible. K2 Think V2 takes the 70 billion parameter K2 V2 Instruct base model and applies a carefully engineered reinforcement learning approach to turn it into a high precision reasoning model that remains fully open in both weights and data.

https://arxiv.org/pdf/2512.06201

From K2 V2 base model to reasoning specialist

K2 V2 is a dense decoder only transformer with 80 layers, hidden size 8192, and 64 attention heads with grouped query attention and rotary position embeddings. It is trained on around 12 trillion tokens drawn from the TxT360 corpus and related curated datasets that cover web text, math, code, multilingual data, and scientific literature.

Training proceeds in three phases. Pretraining runs at context length 8192 tokens on natural data to establish robust general knowledge. Mid training then extends context up to 512k tokens using TxT360 Midas, which mixes long documents, synthetic thinking traces, and diverse reasoning behaviors while carefully keeping at least 30 percent short context data in every stage. Finally, supervised fine tuning, called TxT360 3efforts, injects instruction following and structured reasoning signals.

The important point is that K2 V2 is not a generic base model. It is explicitly optimized for long context consistency and exposure to reasoning behaviors during mid training. That makes it a natural foundation for a post training stage that focuses only on reasoning quality, which is exactly what K2 Think V2 does.

Fully sovereign RLVR on GURU dataset

K2 Think V2 is trained with a GRPO style RLVR recipe on top of K2 V2 Instruct. The team uses the Guru dataset, version 1.5, which focuses on math, code, and STEM questions. Guru is derived from permissively licensed sources, expanded in STEM coverage, and decontaminated against key evaluation benchmarks before use. This is important for a sovereign claim, because both the base model data and the RL data are curated and documented by the same institute.

The GRPO setup removes the usual KL and entropy auxiliary losses and uses asymmetric clipping of the policy ratio with the high clip set to 0.28. Training runs fully on policy with temperature 1.2 to increase rollout diversity, global batch size 256, and no micro batching. This avoids off policy corrections that are known to introduce instability in GRPO like training.

RLVR itself runs in two stages. In the first stage, response length is capped at 32k tokens and the model trains for about 200 steps. In the second stage, the maximum response length is increased to 64k tokens and training continues for about 50 steps with the same hyperparameters. This schedule specifically exploits the long context capability inherited from K2 V2 so that the model can practice full chain of thought trajectories rather than short solutions.
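
The described objective can be sketched as a clipped policy-ratio loss with group-normalized advantages, an asymmetric upper clip of 0.28, and no KL or entropy terms. The PyTorch snippet below is a schematic version under those stated assumptions, not the released Reasoning360 implementation.

python
import torch

def grpo_style_loss(logp_new, logp_old, rewards, clip_low=0.2, clip_high=0.28):
    """Schematic GRPO-style loss for one group of rollouts of the same prompt.

    logp_new, logp_old: summed log-probs of each sampled response, shape (G,)
    rewards: verifiable rewards for each response, shape (G,)
    """
    # Group-relative advantage: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    ratio = torch.exp(logp_new - logp_old.detach())
    # Asymmetric clipping: the ratio may move further upward (0.28) than
    # downward (0.2); no KL or entropy auxiliary terms are added.
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    return -torch.min(ratio * adv, clipped * adv).mean()

logp_new = torch.randn(8, requires_grad=True)
logp_old = logp_new.detach() + 0.05 * torch.randn(8)
rewards = torch.randint(0, 2, (8,)).float()   # e.g., 1 if the answer verified
loss = grpo_style_loss(logp_new, logp_old, rewards)
loss.backward()
print(loss.item())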

https://mbzuai.ac.ae/news/k2-think-v2-a-fully-sovereign-reasoning-model/

Benchmark profile

K2 Think V2 targets reasoning benchmarks rather than purely knowledge benchmarks. On AIME 2025 it reaches pass at 1 of 90.42. On HMMT 2025 it scores 84.79. On GPQA Diamond, a difficult graduate level science benchmark, it reaches 72.98. On SciCode it records 33.00, and on Humanity’s Last Exam it reaches 9.5 under the benchmark settings.

These scores are reported as averages over 16 runs and are directly comparable only within the same evaluation protocol. The MBZUAI team also highlights improvements on IFBench and on the Artificial Analysis evaluation suite, with particular gains in hallucination rate and long context reasoning compared with the previous K2 Think release.

Safety and openness

The research team reports a Safety 4 style analysis that aggregates four safety surfaces. Content and public safety, truthfulness and reliability, and societal alignment all reach macro average risk levels in the low range. Data and infrastructure risks remain higher and are marked as critical, which reflects concerns about sensitive personal information handling rather than model behavior alone. The team states that K2 Think V2 still shares the generic limitations of large language models despite these mitigations. On Artificial Analysis’s Openness Index, K2 Think V2 sits at the frontier together with K2 V2 and Olmo-3.

Key Takeaways

K2 Think V2 is a fully sovereign 70B reasoning model: Built on K2 V2 Instruct, with open weights, open data recipes, detailed training logs, and full RL pipeline released via Reasoning360.

Base model is optimized for long context and reasoning before RL: K2 V2 is a dense decoder transformer trained on around 12T tokens, with mid training extending context length to 512K tokens and supervised ‘3 efforts’ SFT targeting structured reasoning.

Reasoning is aligned using GRPO based RLVR on the Guru dataset: Training uses a 2 stage on policy GRPO setup on Guru v1.5, with asymmetric clipping, temperature 1.2, and response caps at 32K then 64K tokens to learn long chain of thought solutions.

Competitive results on hard reasoning benchmarks: K2 Think V2 reports strong pass at 1 scores such as 90.42 on AIME 2025, 84.79 on HMMT 2025, and 72.98 on GPQA Diamond, positioning it as a high precision open reasoning model for math, code, and science.

Check out the Paper, Model Weight, Repo and Technical details.
The post MBZUAI Releases K2 Think V2: A Fully Sovereign 70B Reasoning Model For Math, Code, And Science appeared first on MarkTechPost.

Moonshot AI Releases Kimi K2.5: An Open Source Visual Agentic Intellig …

Moonshot AI has released Kimi K2.5 as an open source visual agentic intelligence model. It combines a large Mixture of Experts language backbone, a native vision encoder, and a parallel multi agent system called Agent Swarm. The model targets coding, multimodal reasoning, and deep web research with strong benchmark results on agentic, vision, and coding suites.

Model Architecture and Training

Kimi K2.5 is a Mixture of Experts model with 1T total parameters and about 32B activated parameters per token. The network has 61 layers. It uses 384 experts, with 8 experts selected per token plus 1 shared expert. The attention hidden size is 7168 and there are 64 attention heads.

The model uses MLA attention and the SwiGLU activation function. The tokenizer vocabulary size is 160K. The maximum context length during training and inference is 256K tokens. This supports long tool traces, long documents, and multi step research workflows.

Vision is handled by a MoonViT encoder with about 400M parameters. Visual tokens are trained together with text tokens in a single multimodal backbone. Kimi K2.5 is obtained by continual pretraining on about 15T tokens of mixed vision and text data on top of Kimi K2 Base. This native multimodal training is important because the model learns joint structure over images, documents, and language from the start.

The released checkpoints support standard inference stacks such as vLLM, SGLang, and KTransformers with transformers version 4.57.1 or newer. Quantized INT4 variants are available, reusing the method from Kimi K2 Thinking. This allows deployment on commodity GPUs with lower memory budgets.
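
As a rough illustration of serving such a checkpoint with one of the named stacks, the following uses vLLM's offline API. The model identifier, parallelism degree, and context length are placeholders for whatever the released checkpoint actually specifies.

# Hypothetical serving sketch with vLLM; the model id and engine arguments are
# placeholders and should be replaced with the actual released checkpoint settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-K2.5",      # placeholder repo id
    tensor_parallel_size=8,            # adjust to the available GPUs
    max_model_len=262144,              # the 256K-token context reported for K2.5
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.6, max_tokens=2048)
outputs = llm.generate(["Describe the layout of this dashboard mockup."], params)
print(outputs[0].outputs[0].text)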

Coding and Multimodal Capabilities

Kimi K2.5 is positioned as a strong open source coding model, especially when code generation depends on visual context. The model can read UI mockups, design screenshots, or even videos, then emit structured frontend code with layout, styling, and interaction logic.

Moonshot shows examples where the model reads a puzzle image, reasons about the shortest path, and then writes code that produces a visualized solution. This demonstrates cross modal reasoning, where the model combines image understanding, algorithmic planning, and code synthesis in a single flow.

Because K2.5 has a 256K context window, it can keep long specification histories in context. A practical workflow for developers is to mix design assets, product docs, and existing code in one prompt. The model can then refactor or extend the codebase while keeping visual constraints aligned with the original design.

Source: https://www.kimi.com/blog/kimi-k2-5.html

Agent Swarm and Parallel Agent Reinforcement Learning

A key feature of Kimi K2.5 is Agent Swarm. This is a multi agent system trained with Parallel Agent Reinforcement Learning, PARL. In this setup an orchestrator agent decomposes a complex goal into many subtasks. It then spins up domain specific sub agents to work in parallel.

The Kimi team reports that K2.5 can manage up to 100 sub agents within a single task and up to 1,500 coordinated steps or tool calls in one run. This parallelism gives about 4.5 times faster completion than a single agent pipeline on wide search tasks.

PARL introduces a metric called Critical Steps. The system rewards policies that reduce the number of serial steps needed to solve the task. This discourages naive sequential planning and pushes the agent to split work into parallel branches while still maintaining consistency.
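
A minimal way to picture this is an orchestrator that fans subtasks out to sub agents concurrently while counting serial steps, the quantity that the Critical Steps signal penalizes. The decomposition and sub agent functions below are hypothetical stand-ins, not Moonshot's PARL training code.

# Illustrative orchestrator sketch: parallel sub-agent fan-out with a serial-step counter.
# run_sub_agent and decompose are hypothetical stand-ins, not the PARL implementation.
import asyncio

async def run_sub_agent(subtask: str) -> str:
    await asyncio.sleep(0.1)                 # placeholder for tool calls or web research
    return f"findings for: {subtask}"

def decompose(goal: str, n: int = 8) -> list[str]:
    return [f"{goal} (region {i})" for i in range(n)]   # toy decomposition

async def orchestrate(goal: str) -> tuple[list[str], int]:
    critical_steps = 0
    subtasks = decompose(goal)
    critical_steps += 1                      # decomposition is one serial step
    results = await asyncio.gather(*(run_sub_agent(s) for s in subtasks))
    critical_steps += 1                      # the whole parallel wave counts as one step
    merged = sorted(results)
    critical_steps += 1                      # final merge step
    return merged, critical_steps

results, steps = asyncio.run(orchestrate("find niche creators"))
print(len(results), "sub-agent reports in", steps, "critical steps")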

One example by the Kimi team is a research workflow where the system needs to discover many niche creators. The orchestrator uses Agent Swarm to spawn a large number of researcher agents. Each agent explores different regions of the web, and the system merges results into a structured table.

Source: https://www.kimi.com/blog/kimi-k2-5.html

Benchmark Performance

On agentic benchmarks, Kimi K2.5 reports strong numbers. On HLE Full with tools the score is 50.2. On BrowseComp with context management the score is 74.9. In Agent Swarm mode the BrowseComp score increases further to 78.4 and WideSearch metrics also improve. The Kimi team compares these values with GPT 5.2, Claude 4.5, Gemini 3 Pro, and DeepSeek V3, and K2.5 shows the highest scores among the listed models on these specific agentic suites.

On vision and video benchmarks K2.5 also reports high scores. MMMU Pro is 78.5 and VideoMMMU is 86.6. The model performs well on OmniDocBench, OCRBench, WorldVQA, and other document and scene understanding tasks. These results indicate that the MoonViT encoder and long context training are effective for real world multimodal problems, such as reading complex documents and reasoning over videos.

Source: https://www.kimi.com/blog/kimi-k2-5.html

For coding benchmarks it lists SWE Bench Verified at 76.8, SWE Bench Pro at 50.7, SWE Bench Multilingual at 73.0, Terminal Bench 2.0 at 50.8, and LiveCodeBench v6 at 85.0. These numbers place K2.5 among the strongest open source coding models currently reported on these tasks.

On long context language benchmarks, K2.5 reaches 61.0 on LongBench V2 and 70.0 on AA LCR under standard evaluation settings. For reasoning benchmarks it achieves high scores on AIME 2025, HMMT 2025 February, GPQA Diamond, and MMLU Pro when used in thinking mode.

Key Takeaways

Mixture of Experts at trillion scale: Kimi K2.5 uses a Mixture of Experts architecture with 1T total parameters and about 32B active parameters per token, 61 layers, 384 experts, and 256K context length, optimized for long multimodal and tool heavy workflows.

Native multimodal training with MoonViT: The model integrates a MoonViT vision encoder of about 400M parameters and is trained on about 15T mixed vision and text tokens, so images, documents, and language are handled in a single unified backbone.

Parallel Agent Swarm with PARL: Agent Swarm, trained with Parallel Agent Reinforcement Learning, can coordinate up to 100 sub agents and about 1,500 tool calls per task, giving around 4.5 times faster execution versus a single agent on wide research tasks.

Strong benchmark results in coding, vision, and agents: K2.5 reports 76.8 on SWE Bench Verified, 78.5 on MMMU Pro, 86.6 on VideoMMMU, 50.2 on HLE Full with tools, and 74.9 on BrowseComp, matching or exceeding listed closed models on several agentic and multimodal suites.

Check out the Technical details and Model Weight.
The post Moonshot AI Releases Kimi K2.5: An Open Source Visual Agentic Intelligence Model with Native Swarm Execution appeared first on MarkTechPost.

DSGym Offers a Reusable Container Based Substrate for Building and Benchmarking Data Science Agents

Data science agents should inspect datasets, design workflows, run code, and return verifiable answers, not just autocomplete Pandas code. DSGym, introduced by researchers from Stanford University, Together AI, Duke University, and Harvard University, is a framework that evaluates and trains such agents across more than 1,000 data science challenges with expert curated ground truth and a consistent post training pipeline.

https://arxiv.org/pdf/2601.16344

Why do existing benchmarks fall short?

The research team first probes existing benchmarks that claim to test data aware agents. When data files are hidden, models still answer many questions correctly: the reported average drop is 40.5 percent on QRData, 86.8 percent on DAEval, and 44.4 percent on DiscoveryBench. Many questions are solvable using priors and pattern matching on the text alone instead of genuine data analysis, and the team also finds annotation errors and inconsistent numerical tolerances.
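
One way to picture this probe is an ablation that answers each question twice, with and without the data files, and flags items that are solved in both settings. The helper functions below are assumed placeholders, not DSGym's actual evaluation code.

# Hedged sketch of a data-ablation shortcut probe: answer each question with and
# without its data files and flag items solvable from the text alone.
def shortcut_probe(questions, answer_fn, grade_fn):
    """answer_fn(question, data_files) and grade_fn(question, answer) are stand-ins."""
    shortcut_items, with_data, without_data = [], 0, 0
    for q in questions:
        ok_with = grade_fn(q, answer_fn(q, data_files=q["files"]))
        ok_without = grade_fn(q, answer_fn(q, data_files=None))
        with_data += int(ok_with)
        without_data += int(ok_without)
        if ok_with and ok_without:
            shortcut_items.append(q["id"])   # solvable without looking at the data
    drop = 100.0 * (with_data - without_data) / max(with_data, 1)
    return shortcut_items, drop              # drop is the accuracy lost without data access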

Task, Agent, and Environment

DSGym standardizes evaluation into three objects, Task, Agent, and Environment. Tasks are either Data Analysis or Data Prediction. Data Analysis tasks provide one or more files along with a natural language question that must be answered through code. Data Prediction tasks provide train and test splits along with an explicit metric and require the agent to build a modeling pipeline and output predictions.

Each task is packed into a Task Object that holds the data files, query prompt, scoring function, and metadata. Agents interact through a CodeAct style loop. At each turn, the agent writes a reasoning block that describes its plan, a code block that runs inside the environment, and an answer block when it is ready to commit. The Environment is implemented as a manager and worker cluster of Docker containers, where each worker mounts data as read only volumes, exposes a writable workspace, and ships with domain specific Python libraries.
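
The shape of these pieces can be sketched in plain Python, with the containerized worker reduced to a local stub. Field names, the executor, and the loop signature below are illustrative assumptions rather than DSGym's exact interfaces.

# Hedged sketch of the Task / Agent / Environment split; names are illustrative,
# and the Docker manager/worker cluster is reduced to a local stub.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class TaskObject:
    data_files: List[str]
    query_prompt: str
    scoring_fn: Callable[[str], float]
    metadata: Dict = field(default_factory=dict)

class StubEnvironment:
    """Stands in for a containerized worker: runs code and returns a result string."""
    def execute(self, code: str) -> str:
        local_ns: Dict = {}
        exec(code, {}, local_ns)             # in DSGym this would run inside an isolated container
        return str(local_ns.get("result", ""))

def codeact_loop(task: TaskObject, propose_step: Callable, env: StubEnvironment, max_turns=5):
    """Each turn the agent emits a reasoning block, a code block, and optionally an answer."""
    history = []
    for _ in range(max_turns):
        reasoning, code, answer = propose_step(task.query_prompt, history)
        if answer is not None:
            return answer, task.scoring_fn(answer)
        history.append((reasoning, code, env.execute(code)))
    return None, 0.0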

DSGym Tasks, DSBio, and DSPredict

On top of this runtime, DSGym Tasks aggregates and refines existing datasets and adds new ones. The research team cleans QRData, DAEval, DABStep, MLEBench Lite, and others by dropping unscorable items and applying a shortcut filter that removes questions that multiple models solve easily without data access.

To cover scientific discovery, they introduce DSBio, a suite of 90 bioinformatics tasks derived from peer reviewed papers and open source datasets. Tasks cover single cell analysis, spatial and multi-omics, and human genetics, with deterministic numerical or categorical answers supported by expert reference notebooks.

DSPredict targets modeling on real Kaggle competitions. A crawler collects recent competitions that accept CSV submissions and satisfy size and clarity rules. After preprocessing, the suite is split into DSPredict Easy with 38 playground style and introductory competitions, and DSPredict Hard with 54 high complexity challenges. In total, DSGym Tasks includes 972 data analysis tasks and 114 prediction tasks.

What current agents can and cannot do

The evaluation covers closed source models such as GPT-5.1, GPT-5, and GPT-4o, open weights models such as Qwen3-Coder-480B, Qwen3-235B-Instruct, and GPT-OSS-120B, and smaller models such as Qwen2.5-7B-Instruct and Qwen3-4B-Instruct. All are run with the same CodeAct agent, temperature 0, and tools disabled.

On cleaned general analysis benchmarks, such as QRData Verified, DAEval Verified, and the easier split of DABStep, top models reach between 60 percent and 90 percent exact match accuracy. On DABStep Hard, accuracy drops for every model, which shows that multi step quantitative reasoning over financial tables is still brittle.

DSBio exposes a more severe weakness. Kimi-K2-Instruct achieves the best overall accuracy of 43.33 percent. For all models, between 85 and 96 percent of inspected failures on DSBio are domain grounding errors, including misuse of specialized libraries and incorrect biological interpretations, rather than basic coding mistakes.

On MLEBench Lite and DSPredict Easy, most frontier models achieve Valid Submission Rates above 80 percent, often near perfect. On DSPredict Hard, valid submissions rarely exceed 70 percent and medal rates on Kaggle leaderboards are near 0 percent. This pattern supports the research team’s observation of a simplicity bias, where agents stop after a baseline solution instead of exploring more competitive models and hyperparameters.

DSGym as a data factory and training ground

The same environment can also synthesize training data. Starting from a subset of QRData and DABStep, the research team asks agents to explore datasets, propose questions, solve them with code, and record trajectories, which yields 3,700 synthetic queries. A judge model filters these down to 2,000 high quality query plus trajectory pairs, called DSGym-SFT. Fine-tuning a 4B Qwen3 based model on DSGym-SFT produces an agent whose performance is competitive with GPT-4o on standardized analysis benchmarks despite having far fewer parameters.
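
A minimal sketch of that filtering step might look like the following, where judge_score stands in for the LLM judge and the threshold is an assumed value.

# Hedged sketch of judge-based filtering of synthetic trajectories into an SFT set.
# judge_score is a stand-in for an LLM judge call; the threshold is illustrative.
def build_sft_set(trajectories, judge_score, threshold=0.8, max_items=2000):
    """Keep the highest-scoring (query, trajectory) pairs, up to max_items."""
    scored = [(judge_score(t["query"], t["trajectory"]), t) for t in trajectories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    kept = [t for score, t in scored if score >= threshold][:max_items]
    return [{"prompt": t["query"], "completion": t["trajectory"]} for t in kept]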

Source: marktechpost.com

Key Takeaways

DSGym provides a unified Task, Agent, and Environment framework, with containerized execution and a CodeAct style loop, to evaluate data science agents on real code based workflows instead of static prompts.

The benchmark suite, DSGym-Tasks, consolidates and cleans prior datasets and adds DSBio and DSPredict, reaching 972 data analysis tasks and 114 prediction tasks across domains such as finance, bioinformatics, and earth science.

Shortcut analysis on existing benchmarks shows that removing data access only moderately reduces accuracy in many cases, which confirms that prior evaluations often measure pattern matching on text rather than genuine data analysis.

Frontier models achieve strong performance on cleaned general analysis tasks and on easier prediction tasks, but they perform poorly on DSBio and DSPredict-Hard, where most errors come from domain grounding issues and conservative, under tuned modeling pipelines.

The DSGym-SFT dataset, built from 2,000 filtered synthetic trajectories, enables a 4B Qwen3 based agent to approach GPT-4o level accuracy on several analysis benchmarks, which shows that execution grounded supervision on structured tasks is an effective way to improve data science agents.

Check out the Paper and Repo.
The post DSGym Offers a Reusable Container Based Substrate for Building and Benchmarking Data Science Agents appeared first on MarkTechPost.

How Tree-KG Enables Hierarchical Knowledge Graphs for Contextual Navigation and Explainable Multi-Hop Reasoning Beyond Traditional RAG

In this tutorial, we implement Tree-KG, an advanced hierarchical knowledge graph system that goes beyond traditional retrieval-augmented generation by combining semantic embeddings with explicit graph structure. We show how we can organize knowledge in a tree-like hierarchy that mirrors how humans learn, from broad domains to fine-grained concepts, and then reason across this structure using controlled multi-hop exploration. By building the graph from scratch, enriching nodes with embeddings, and designing a reasoning agent that navigates ancestors, descendants, and related concepts, we demonstrate how we can achieve contextual navigation and explainable reasoning rather than flat, chunk-based retrieval. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser!pip install networkx matplotlib anthropic sentence-transformers scikit-learn numpy

import networkx as nx
import matplotlib.pyplot as plt
from typing import List, Dict, Tuple, Optional, Set
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from collections import defaultdict, deque
import json

We install and import all the core libraries required to build and reason over the Tree-KG system. We set up tools for graph construction and visualization, semantic embedding and similarity search, and efficient data handling for traversal and scoring. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass TreeKnowledgeGraph:
“””
Hierarchical Knowledge Graph that mimics human learning patterns.
Supports multi-hop reasoning and contextual navigation.
“””

def __init__(self, embedding_model: str = ‘all-MiniLM-L6-v2’):
self.graph = nx.DiGraph()
self.embedder = SentenceTransformer(embedding_model)
self.node_embeddings = {}
self.node_metadata = {}

def add_node(self,
node_id: str,
content: str,
node_type: str = ‘concept’,
metadata: Optional[Dict] = None):
“””Add a node with semantic embedding and metadata.”””

embedding = self.embedder.encode(content, convert_to_tensor=False)

self.graph.add_node(node_id,
content=content,
node_type=node_type,
metadata=metadata or {})

self.node_embeddings[node_id] = embedding
self.node_metadata[node_id] = {
‘content’: content,
‘type’: node_type,
‘metadata’: metadata or {}
}

def add_edge(self,
parent: str,
child: str,
relationship: str = ‘contains’,
weight: float = 1.0):
“””Add hierarchical or associative edge between nodes.”””
self.graph.add_edge(parent, child,
relationship=relationship,
weight=weight)

def get_ancestors(self, node_id: str, max_depth: int = 5) -> List[str]:
“””Get all ancestor nodes (hierarchical context).”””
ancestors = []
current = node_id
depth = 0

while depth < max_depth:
predecessors = list(self.graph.predecessors(current))
if not predecessors:
break
current = predecessors[0]
ancestors.append(current)
depth += 1

return ancestors

def get_descendants(self, node_id: str, max_depth: int = 2) -> List[str]:
“””Get all descendant nodes.”””
descendants = []
queue = deque([(node_id, 0)])
visited = {node_id}

while queue:
current, depth = queue.popleft()
if depth >= max_depth:
continue

for child in self.graph.successors(current):
if child not in visited:
visited.add(child)
descendants.append(child)
queue.append((child, depth + 1))

return descendants

def semantic_search(self, query: str, top_k: int = 5) -> List[Tuple[str, float]]:
“””Find most semantically similar nodes to query.”””
query_embedding = self.embedder.encode(query, convert_to_tensor=False)

similarities = []
for node_id, embedding in self.node_embeddings.items():
sim = cosine_similarity(
query_embedding.reshape(1, -1),
embedding.reshape(1, -1)
)[0][0]
similarities.append((node_id, float(sim)))

similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:top_k]

def get_subgraph_context(self, node_id: str, depth: int = 2) -> Dict:
“””Get rich contextual information around a node.”””
context = {
‘node’: self.node_metadata.get(node_id, {}),
‘ancestors’: [],
‘descendants’: [],
‘siblings’: [],
‘related’: []
}

ancestors = self.get_ancestors(node_id)
context[‘ancestors’] = [
self.node_metadata.get(a, {}) for a in ancestors
]

descendants = self.get_descendants(node_id, depth)
context[‘descendants’] = [
self.node_metadata.get(d, {}) for d in descendants
]

parents = list(self.graph.predecessors(node_id))
if parents:
siblings = list(self.graph.successors(parents[0]))
siblings = [s for s in siblings if s != node_id]
context[‘siblings’] = [
self.node_metadata.get(s, {}) for s in siblings
]

return context

We define the core TreeKnowledgeGraph class that structures knowledge as a directed hierarchy enriched with semantic embeddings. We store both graph relationships and dense representations to navigate concepts structurally while also performing similarity-based retrieval. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass MultiHopReasoningAgent:
“””
Agent that performs intelligent multi-hop reasoning across the knowledge graph.
“””

def __init__(self, kg: TreeKnowledgeGraph):
self.kg = kg
self.reasoning_history = []

def reason(self,
query: str,
max_hops: int = 3,
exploration_width: int = 3) -> Dict:
“””
Perform multi-hop reasoning to answer a query.

Strategy:
1. Find initial relevant nodes (semantic search)
2. Explore graph context around these nodes
3. Perform breadth-first exploration with relevance scoring
4. Aggregate information from multiple hops
“””

reasoning_trace = {
‘query’: query,
‘hops’: [],
‘final_context’: {},
‘reasoning_path’: []
}

initial_nodes = self.kg.semantic_search(query, top_k=exploration_width)
reasoning_trace[‘hops’].append({
‘hop_number’: 0,
‘action’: ‘semantic_search’,
‘nodes_found’: initial_nodes
})

visited = set()
current_frontier = [node_id for node_id, _ in initial_nodes]
all_relevant_nodes = set(current_frontier)

for hop in range(1, max_hops + 1):
next_frontier = []
hop_info = {
‘hop_number’: hop,
‘explored_nodes’: [],
‘new_discoveries’: []
}

for node_id in current_frontier:
if node_id in visited:
continue

visited.add(node_id)

context = self.kg.get_subgraph_context(node_id, depth=1)

connected_nodes = []
for ancestor in context[‘ancestors’]:
if ‘content’ in ancestor:
connected_nodes.append(ancestor)

for descendant in context[‘descendants’]:
if ‘content’ in descendant:
connected_nodes.append(descendant)

for sibling in context[‘siblings’]:
if ‘content’ in sibling:
connected_nodes.append(sibling)

relevant_connections = self._score_relevance(
query, connected_nodes, top_k=exploration_width
)

hop_info[‘explored_nodes’].append({
‘node_id’: node_id,
‘content’: self.kg.node_metadata[node_id][‘content’][:100],
‘connections_found’: len(relevant_connections)
})

for conn_content, score in relevant_connections:
for nid, meta in self.kg.node_metadata.items():
if meta[‘content’] == conn_content and nid not in visited:
next_frontier.append(nid)
all_relevant_nodes.add(nid)
hop_info[‘new_discoveries’].append({
‘node_id’: nid,
‘relevance_score’: score
})
break

reasoning_trace[‘hops’].append(hop_info)
current_frontier = next_frontier

if not current_frontier:
break

final_context = self._aggregate_context(query, all_relevant_nodes)
reasoning_trace[‘final_context’] = final_context
reasoning_trace[‘reasoning_path’] = list(all_relevant_nodes)

self.reasoning_history.append(reasoning_trace)
return reasoning_trace

def _score_relevance(self,
query: str,
candidates: List[Dict],
top_k: int = 3) -> List[Tuple[str, float]]:
“””Score candidate nodes by relevance to query.”””
if not candidates:
return []

query_embedding = self.kg.embedder.encode(query)

scores = []
for candidate in candidates:
content = candidate.get(‘content’, ”)
if not content:
continue

candidate_embedding = self.kg.embedder.encode(content)
similarity = cosine_similarity(
query_embedding.reshape(1, -1),
candidate_embedding.reshape(1, -1)
)[0][0]
scores.append((content, float(similarity)))

scores.sort(key=lambda x: x[1], reverse=True)
return scores[:top_k]

def _aggregate_context(self, query: str, node_ids: Set[str]) -> Dict:
“””Aggregate and rank information from all discovered nodes.”””

aggregated = {
‘total_nodes’: len(node_ids),
‘hierarchical_paths’: [],
‘key_concepts’: [],
‘synthesized_answer’: []
}

for node_id in node_ids:
ancestors = self.kg.get_ancestors(node_id)
if ancestors:
path = ancestors[::-1] + [node_id]
path_contents = [
self.kg.node_metadata[n][‘content’]
for n in path if n in self.kg.node_metadata
]
aggregated[‘hierarchical_paths’].append(path_contents)

for node_id in node_ids:
meta = self.kg.node_metadata.get(node_id, {})
aggregated[‘key_concepts’].append({
‘id’: node_id,
‘content’: meta.get(‘content’, ”),
‘type’: meta.get(‘type’, ‘unknown’)
})

for node_id in node_ids:
content = self.kg.node_metadata.get(node_id, {}).get(‘content’, ”)
if content:
aggregated[‘synthesized_answer’].append(content)

return aggregated

def explain_reasoning(self, trace: Dict) -> str:
“””Generate human-readable explanation of reasoning process.”””

explanation = [f”Query: {trace[‘query’]}n”]
explanation.append(f”Total hops performed: {len(trace[‘hops’]) – 1}n”)
explanation.append(f”Total relevant nodes discovered: {len(trace[‘reasoning_path’])}nn”)

for hop_info in trace[‘hops’]:
hop_num = hop_info[‘hop_number’]
explanation.append(f”— Hop {hop_num} —“)

if hop_num == 0:
explanation.append(f”Action: Initial semantic search”)
explanation.append(f”Found {len(hop_info[‘nodes_found’])} candidate nodes”)
for node_id, score in hop_info[‘nodes_found’][:3]:
explanation.append(f” – {node_id} (relevance: {score:.3f})”)
else:
explanation.append(f”Explored {len(hop_info[‘explored_nodes’])} nodes”)
explanation.append(f”Discovered {len(hop_info[‘new_discoveries’])} new relevant nodes”)

explanation.append(“”)

explanation.append(“n— Final Aggregated Context —“)
context = trace[‘final_context’]
explanation.append(f”Total concepts integrated: {context[‘total_nodes’]}”)
explanation.append(f”Hierarchical paths found: {len(context[‘hierarchical_paths’])}”)

return “n”.join(explanation)

We implement a multi-hop reasoning agent that actively navigates the knowledge graph instead of passively retrieving nodes. We start from semantically relevant concepts, expand through ancestors, descendants, and siblings, and iteratively score connections to guide exploration across hops. By aggregating hierarchical paths and synthesizing content, we produce both an explainable reasoning trace and a coherent, context-rich answer. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef build_software_development_kb() -> TreeKnowledgeGraph:
“””Build a comprehensive software development knowledge graph.”””

kg = TreeKnowledgeGraph()

kg.add_node(‘root’, ‘Software Development and Computer Science’, ‘domain’)

kg.add_node(‘programming’,
‘Programming encompasses writing, testing, and maintaining code to create software applications’,
‘domain’)
kg.add_node(‘architecture’,
‘Software Architecture involves designing the high-level structure and components of software systems’,
‘domain’)
kg.add_node('devops',
'DevOps combines software development and IT operations practices to automate building, testing, deploying, and monitoring applications',
'domain')

kg.add_edge(‘root’, ‘programming’, ‘contains’)
kg.add_edge(‘root’, ‘architecture’, ‘contains’)
kg.add_edge(‘root’, ‘devops’, ‘contains’)

kg.add_node('python',
'Python is a versatile, high-level language widely used for scripting, web backends, automation, data science, and machine learning',
'language')
kg.add_node('javascript',
'JavaScript is a dynamic language primarily used for web development, enabling interactive client-side and server-side applications',
'language')
kg.add_node('rust',
'Rust is a systems programming language focused on memory safety, concurrency, and performance without a garbage collector',
'language')

kg.add_edge(‘programming’, ‘python’, ‘includes’)
kg.add_edge(‘programming’, ‘javascript’, ‘includes’)
kg.add_edge(‘programming’, ‘rust’, ‘includes’)

kg.add_node(‘python_basics’,
‘Python basics include variables, data types, control flow, functions, and object-oriented programming fundamentals’,
‘concept’)
kg.add_node(‘python_performance’,
‘Python Performance optimization involves techniques like profiling, caching, using C extensions, and leveraging async programming’,
‘concept’)
kg.add_node(‘python_data’,
‘Python for Data Science uses libraries like NumPy, Pandas, and Scikit-learn for data manipulation, analysis, and machine learning’,
‘concept’)

kg.add_edge(‘python’, ‘python_basics’, ‘contains’)
kg.add_edge(‘python’, ‘python_performance’, ‘contains’)
kg.add_edge(‘python’, ‘python_data’, ‘contains’)

kg.add_node(‘async_io’,
‘Asynchronous IO in Python allows non-blocking operations using async/await syntax with asyncio library for concurrent tasks’,
‘technique’)
kg.add_node(‘multiprocessing’,
‘Python Multiprocessing uses separate processes to bypass GIL, enabling true parallel execution for CPU-bound tasks’,
‘technique’)
kg.add_node(‘cython’,
‘Cython compiles Python to C for significant performance gains, especially in numerical computations and tight loops’,
‘tool’)
kg.add_node(‘profiling’,
‘Python Profiling identifies performance bottlenecks using tools like cProfile, line_profiler, and memory_profiler’,
‘technique’)

kg.add_edge(‘python_performance’, ‘async_io’, ‘contains’)
kg.add_edge(‘python_performance’, ‘multiprocessing’, ‘contains’)
kg.add_edge(‘python_performance’, ‘cython’, ‘contains’)
kg.add_edge(‘python_performance’, ‘profiling’, ‘contains’)

kg.add_node(‘event_loop’,
‘Event Loop is the core of asyncio that manages and schedules asynchronous tasks, handling callbacks and coroutines’,
‘concept’)
kg.add_node(‘coroutines’,
‘Coroutines are special functions defined with async def that can pause execution with await, enabling cooperative multitasking’,
‘concept’)
kg.add_node(‘asyncio_patterns’,
‘AsyncIO patterns include gather for concurrent execution, create_task for background tasks, and queues for producer-consumer’,
‘pattern’)

kg.add_edge(‘async_io’, ‘event_loop’, ‘contains’)
kg.add_edge(‘async_io’, ‘coroutines’, ‘contains’)
kg.add_edge(‘async_io’, ‘asyncio_patterns’, ‘contains’)

kg.add_node(‘microservices’,
‘Microservices architecture decomposes applications into small, independent services that communicate via APIs’,
‘pattern’)
kg.add_edge(‘architecture’, ‘microservices’, ‘contains’)
kg.add_edge(‘async_io’, ‘microservices’, ‘related_to’)

kg.add_node(‘containers’,
‘Containers package applications with dependencies into isolated units, ensuring consistency across environments’,
‘technology’)
kg.add_edge(‘devops’, ‘containers’, ‘contains’)
kg.add_edge(‘microservices’, ‘containers’, ‘deployed_with’)

kg.add_node(‘numpy_optimization’,
‘NumPy optimization uses vectorization and broadcasting to avoid Python loops, leveraging optimized C and Fortran libraries’,
‘technique’)
kg.add_edge(‘python_data’, ‘numpy_optimization’, ‘contains’)
kg.add_edge(‘python_performance’, ‘numpy_optimization’, ‘related_to’)

return kg

We construct a rich, hierarchical software development knowledge base that progresses from high-level domains down to concrete techniques and tools. We explicitly encode parent–child and cross-domain relationships so that concepts such as Python performance, async I/O, and microservices are structurally connected rather than isolated. This setup allows us to simulate how knowledge is learned and revisited across layers, enabling meaningful multi-hop reasoning over real-world software topics. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef visualize_knowledge_graph(kg: TreeKnowledgeGraph,
highlight_nodes: Optional[List[str]] = None):
“””Visualize the knowledge graph structure.”””

plt.figure(figsize=(16, 12))

pos = nx.spring_layout(kg.graph, k=2, iterations=50, seed=42)

node_colors = []
for node in kg.graph.nodes():
if highlight_nodes and node in highlight_nodes:
node_colors.append(‘yellow’)
else:
node_type = kg.graph.nodes[node].get(‘node_type’, ‘concept’)
color_map = {
‘domain’: ‘lightblue’,
‘language’: ‘lightgreen’,
‘concept’: ‘lightcoral’,
‘technique’: ‘lightyellow’,
‘tool’: ‘lightpink’,
‘pattern’: ‘lavender’,
‘technology’: ‘peachpuff’
}
node_colors.append(color_map.get(node_type, ‘lightgray’))

nx.draw_networkx_nodes(kg.graph, pos,
node_color=node_colors,
node_size=2000,
alpha=0.9)

nx.draw_networkx_edges(kg.graph, pos,
edge_color=’gray’,
arrows=True,
arrowsize=20,
alpha=0.6,
width=2)

nx.draw_networkx_labels(kg.graph, pos,
font_size=8,
font_weight=’bold’)

plt.title(“Tree-KG: Hierarchical Knowledge Graph”, fontsize=16, fontweight=’bold’)
plt.axis(‘off’)
plt.tight_layout()
plt.show()

def run_demo():
“””Run complete demonstration of Tree-KG system.”””

print(“=” * 80)
print(“Tree-KG: Hierarchical Knowledge Graph Demo”)
print(“=” * 80)
print()

print(“Building knowledge graph…”)
kg = build_software_development_kb()
print(f”✓ Created graph with {kg.graph.number_of_nodes()} nodes and {kg.graph.number_of_edges()} edgesn”)

print(“Visualizing knowledge graph…”)
visualize_knowledge_graph(kg)

agent = MultiHopReasoningAgent(kg)

queries = [
“How can I improve Python performance for IO-bound tasks?”,
“What are the best practices for async programming?”,
“How does microservices architecture relate to Python?”
]

for i, query in enumerate(queries, 1):
print(f”n{‘=’ * 80}”)
print(f”QUERY {i}: {query}”)
print(‘=’ * 80)

trace = agent.reason(query, max_hops=3, exploration_width=3)

explanation = agent.explain_reasoning(trace)
print(explanation)

print(“n— Sample Hierarchical Paths —“)
for j, path in enumerate(trace[‘final_context’][‘hierarchical_paths’][:3], 1):
print(f”nPath {j}:”)
for k, concept in enumerate(path):
indent = ” ” * k
print(f”{indent}→ {concept[:80]}…”)

print(“n— Synthesized Context —“)
answer_parts = trace[‘final_context’][‘synthesized_answer’][:5]
for part in answer_parts:
print(f”• {part[:150]}…”)

print()

print(“nVisualizing reasoning path for last query…”)
last_trace = agent.reasoning_history[-1]
visualize_knowledge_graph(kg, highlight_nodes=last_trace[‘reasoning_path’])

print(“n” + “=” * 80)
print(“Demo complete!”)
print(“=” * 80)

We visualize the hierarchical structure of the knowledge graph using color and layout to distinguish domains, concepts, techniques, and tools, and optionally highlight the reasoning path. We then run an end-to-end demo in which we build the graph, execute multi-hop reasoning on realistic queries, and print both the reasoning trace and the synthesized context. It allows us to observe how the agent navigates the graph, surfaces hierarchical paths, and explains its conclusions in a transparent and interpretable manner. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass AdvancedTreeKG(TreeKnowledgeGraph):
“””Extended Tree-KG with advanced features.”””

def __init__(self, embedding_model: str = ‘all-MiniLM-L6-v2’):
super().__init__(embedding_model)
self.node_importance = {}

def compute_node_importance(self):
“””Compute importance scores using PageRank-like algorithm.”””
if self.graph.number_of_nodes() == 0:
return

pagerank = nx.pagerank(self.graph)
betweenness = nx.betweenness_centrality(self.graph)

for node in self.graph.nodes():
self.node_importance[node] = {
‘pagerank’: pagerank.get(node, 0),
‘betweenness’: betweenness.get(node, 0),
‘combined’: pagerank.get(node, 0) * 0.7 + betweenness.get(node, 0) * 0.3
}

def find_shortest_path_with_context(self,
source: str,
target: str) -> Dict:
“””Find shortest path and extract all context along the way.”””
try:
path = nx.shortest_path(self.graph, source, target)

context = {
‘path’: path,
‘path_length’: len(path) – 1,
‘nodes_detail’: []
}

for node in path:
detail = {
‘id’: node,
‘content’: self.node_metadata.get(node, {}).get(‘content’, ”),
‘importance’: self.node_importance.get(node, {}).get(‘combined’, 0)
}
context[‘nodes_detail’].append(detail)

return context
except nx.NetworkXNoPath:
return {‘path’: [], ‘error’: ‘No path exists’}

We extend the base Tree-KG with graph-level intelligence by computing node importance using centrality measures. We combine PageRank and betweenness scores to identify concepts that play a structurally critical role in connecting knowledge across the graph. It also allows us to retrieve shortest paths enriched with contextual and importance information, enabling more informed and explainable reasoning between any two concepts. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserif __name__ == “__main__”:
run_demo()

print(“nn” + “=” * 80)
print(“ADVANCED FEATURES DEMO”)
print(“=” * 80)

print(“nBuilding advanced Tree-KG…”)
adv_kg = build_software_development_kb()

adv_kg_new = AdvancedTreeKG()
adv_kg_new.graph = adv_kg.graph
adv_kg_new.node_embeddings = adv_kg.node_embeddings
adv_kg_new.node_metadata = adv_kg.node_metadata

print(“Computing node importance scores…”)
adv_kg_new.compute_node_importance()

print(“nTop 5 most important nodes:”)
sorted_nodes = sorted(
adv_kg_new.node_importance.items(),
key=lambda x: x[1][‘combined’],
reverse=True
)[:5]

for node, scores in sorted_nodes:
content = adv_kg_new.node_metadata[node][‘content’][:60]
print(f” {node}: {content}…”)
print(f” Combined score: {scores[‘combined’]:.4f}”)

print(“n✓ Tree-KG Tutorial Complete!”)
print(“nKey Takeaways:”)
print(“1. Tree-KG enables contextual navigation vs simple chunk retrieval”)
print(“2. Multi-hop reasoning discovers relevant information across graph structure”)
print(“3. Hierarchical organization mirrors human learning patterns”)
print(“4. Semantic search + graph traversal = powerful RAG alternative”)

We execute the full Tree-KG demo and then showcase the advanced features to close the loop on the system’s capabilities. We compute node importance scores to surface the most influential concepts in the graph and inspect how structural centrality aligns with semantic relevance. 

In conclusion, we demonstrated how Tree-KG enables richer understanding by unifying semantic search, hierarchical context, and multi-hop reasoning within a single framework. We showed that, instead of merely retrieving isolated text fragments, we can traverse meaningful knowledge paths, aggregate insights across levels, and produce explanations that reflect how conclusions are formed. By extending the system with importance scoring and path-aware context extraction, we illustrated how Tree-KG can serve as a strong foundation for building intelligent agents, research assistants, or domain-specific reasoning systems that demand structure, transparency, and depth beyond conventional RAG approaches.

Check out the FULL CODES here.
The post How Tree-KG Enables Hierarchical Knowledge Graphs for Contextual Navigation and Explainable Multi-Hop Reasoning Beyond Traditional RAG appeared first on MarkTechPost.

Build reliable Agentic AI solution with Amazon Bedrock: Learn from Pushpay

This post was co-written with Saurabh Gupta and Todd Colby from Pushpay.
Pushpay is a market-leading digital giving and engagement platform designed to help churches and faith-based organizations drive community engagement, manage donations, and strengthen generosity fundraising processes efficiently. Pushpay’s church management system provides church administrators and ministry leaders with insight-driven reporting, donor development dashboards, and automation of financial workflows.
Using the power of generative AI, Pushpay developed an innovative agentic AI search feature built for the unique needs of ministries. The approach uses natural language processing so ministry staff can ask questions in plain English and generate real-time, actionable insights from their community data. The AI search feature addresses a critical challenge faced by ministry leaders: the need for quick access to community insights without requiring technical expertise. For example, ministry leaders can enter “show me people who are members in a group, but haven’t given this year” or “show me people who are not engaged in my church,” and use the results to take meaningful action to better support individuals in their community. Most community leaders are time-constrained and lack technical backgrounds; they can use this solution to obtain meaningful data about their congregations in seconds using natural language queries.
By empowering ministry staff with faster access to community insights, the AI search feature supports Pushpay’s mission to encourage generosity and connection between churches and their community members. Early adoption users report that this solution has shortened their time to insights from minutes to seconds. To achieve this result, the Pushpay team built the feature using agentic AI capabilities on Amazon Web Services (AWS) while implementing robust quality assurance measures and establishing a rapid iterative feedback loop for continuous improvements.
In this post, we walk you through Pushpay’s journey in building this solution and explore how Pushpay used Amazon Bedrock to create a custom generative AI evaluation framework for continuous quality assurance and establishing rapid iteration feedback loops on AWS.
Solution overview: AI powered search architecture
The solution consists of several key components that work together to deliver an enhanced search experience. The following figure shows the solution architecture diagram and the overall workflow.

Figure 1: AI Search Solution Architecture

User interface layer: The solution begins with Pushpay users submitting natural language queries through the existing Pushpay application interface. By using natural language queries, church ministry staff can obtain data insights using AI capabilities without learning new tools or interfaces.
AI search agent: At the heart of the system lies the AI search agent, which consists of two key components:

System prompt: Contains the large language model (LLM) role definitions, instructions, and application descriptions that guide the agent’s behavior.
Dynamic prompt constructor (DPC): Automatically constructs additional customized system prompt content based on user-specific information, such as church context, sample queries, and the application filter inventory. It also uses semantic search to select only the relevant filters from hundreds of available application filters. The DPC improves response accuracy and user experience.

Amazon Bedrock advanced features: The solution uses the following Amazon Bedrock managed capabilities:

Prompt caching: Reduces latency and costs by caching the frequently reused system prompt.
LLM processing: Uses Claude Sonnet 4.5 to process prompts and generate JSON output required by the application to display the desired query results as insights to users.

Evaluation system: The evaluation system implements a closed-loop improvement solution where user interactions are instrumented, captured and evaluated offline. The evaluation results feed into a dashboard for product and engineering teams to analyze and drive iterative improvements to the AI search agent. During this process, the data science team collects a golden dataset and continuously curates this dataset based on the actual user queries coupled with validated responses.

The challenges of initial solution without evaluation
To create the AI search feature, Pushpay developed the first iteration of the AI search agent. The solution implements a single agent configured with a carefully tuned system prompt that includes the system role, instructions, and how the user interface works with detailed explanation of each filter tool and their sub-settings. The system prompt is cached using Amazon Bedrock prompt caching to reduce token cost and latency. The agent uses the system prompt to invoke an Amazon Bedrock LLM which generates the JSON document that Pushpay’s application uses to apply filters and present query results to users.
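
As a rough sketch of this call pattern, the snippet below uses the Amazon Bedrock Converse API with a cache point on the large, stable system prompt and parses the model's JSON output. The model identifier, prompt text, and output schema are placeholders, not Pushpay's production configuration.

# Hedged sketch of the agent call: cached system prompt + Converse API returning JSON.
# MODEL_ID and both prompts are placeholders, not Pushpay's production values.
import boto3, json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-sonnet-4-5"     # placeholder model identifier

SYSTEM_PROMPT = "You translate ministry questions into JSON filter definitions..."  # large, stable

def ai_search(user_query: str) -> dict:
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[
            {"text": SYSTEM_PROMPT},
            {"cachePoint": {"type": "default"}},   # cache the stable system prompt
        ],
        messages=[{"role": "user", "content": [{"text": user_query}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])

# Example: filters = ai_search("show me people who are members in a group but haven't given this year")
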
However, this first iteration quickly revealed some limitations. While it demonstrated a 60-70% success rate on basic business queries, the team reached an accuracy plateau, and evaluating the agent was a manual and tedious process. Tuning the system prompt beyond this accuracy threshold proved challenging, given the diverse spectrum of user queries and the application’s coverage of over 100 distinct configurable filters. These issues were critical blockers on the team’s path to production.
Figure 2: AI Search First Solution
Improving the solution by adding a custom generative AI evaluation framework
To address the challenges of measuring and improving agent accuracy, the team implemented a generative AI evaluation framework integrated into the existing architecture, shown in the following figure. This framework consists of four key components that work together to provide comprehensive performance insights and enable data-driven improvements.

Figure 3: Introducing the GenAI Evaluation Framework

The golden dataset: A curated golden dataset containing over 300 representative queries, each paired with its corresponding expected output, forms the foundation of automated evaluation. The product and data science teams carefully developed and validated this dataset to achieve comprehensive coverage of real-world use cases and edge cases. Additionally, there is a continuous curation process of adding representative actual user queries with validated results.
The evaluator: The evaluator component processes user input queries and compares the agent-generated output against the golden dataset using the LLM-as-a-judge pattern (a minimal sketch follows this list). This approach generates core accuracy metrics while capturing detailed logs and performance data, such as latency, for further analysis and debugging.
Domain category: Domain categories are developed using a combination of generative AI domain summarization and human-defined regular expressions to effectively categorize user queries. The evaluator determines the domain category for each query, enabling nuanced, category-based evaluation as an additional dimension of evaluation metrics.
Generative AI evaluation dashboard: The dashboard serves as the mission control for Pushpay’s product and engineering teams, displaying domain category-level metrics to assess performance and latency and guide decisions. It shifts the team from single aggregate scores to nuanced, domain-based performance insights.
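
A minimal sketch of the evaluator loop, with the judge call, golden dataset format, and domain categorizer left as assumed stand-ins, might look like this:

# Hedged sketch of the offline evaluator: LLM-as-a-judge comparison against a golden set.
# judge(query, expected, actual) stands in for a Bedrock-hosted judge model call.
import time

def evaluate(golden_dataset, agent_fn, judge, categorize):
    """golden_dataset: list of {"query": ..., "expected": ...}; returns per-domain metrics."""
    by_domain = {}
    for item in golden_dataset:
        start = time.perf_counter()
        actual = agent_fn(item["query"])
        latency = time.perf_counter() - start
        correct = judge(item["query"], item["expected"], actual)   # 1 if equivalent, else 0
        domain = categorize(item["query"])
        stats = by_domain.setdefault(domain, {"n": 0, "correct": 0, "latencies": []})
        stats["n"] += 1
        stats["correct"] += int(correct)
        stats["latencies"].append(latency)
    return {
        d: {
            "accuracy": s["correct"] / s["n"],
            "p50_latency": sorted(s["latencies"])[len(s["latencies"]) // 2],
        }
        for d, s in by_domain.items()
    }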

The accuracy dashboard: Pinpointing weaknesses by domain
Because user queries are grouped into domain categories, the dashboard incorporates statistical confidence visualization using a 95% Wilson score interval to display accuracy metrics and query volumes at each domain level. By using categories, the team can pinpoint the AI agent’s weaknesses by domain. In the following example, the “activity” domain shows significantly lower accuracy than other categories.
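
For reference, the 95% Wilson score interval for a domain with a given number of correct answers out of total queries can be computed with the standard formula, shown here only to make the dashboard's confidence bands concrete:

# Standard 95% Wilson score interval for a domain-level accuracy estimate.
import math

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Return (lower, upper) bounds for the true accuracy given correct/total queries."""
    if total == 0:
        return (0.0, 0.0)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))

# Example: wilson_interval(57, 60) -> roughly (0.86, 0.98)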

Figure 4: Pinpointing Agent Weaknesses by Domain
Additionally, a performance dashboard, shown in the following figure, visualizes latency indicators at the domain category level, including latency distributions from p50 to p90 percentiles. In the following example, the activity domain exhibits notably higher latency than others.

Figure 5: Identifying Latency Bottlenecks by Domain
Strategic rollout through domain-level insights
Domain-based metrics revealed varying performance levels across semantic domains, providing crucial insights into agent effectiveness. Pushpay used this granular visibility to make strategic feature rollout decisions. By temporarily suppressing underperforming categories, such as activity queries, while they were being optimized, the system achieved 95% overall accuracy. With this approach, users experienced only the highest-performing features while the team refined the remaining categories to production standards.

Figure 6: Achieving 95% Accuracy with Domain-Level Feature Rollout
Strategic prioritization: Focusing on high-impact domains
To prioritize improvements systematically, Pushpay employed a 2×2 matrix framework plotting topics against two dimensions (shown in the following figure): business priority (vertical axis) and current performance or feasibility (horizontal axis). This visualization placed topics with both high business value and strong existing performance in the top-right quadrant. The team focused on these areas first because they required the least effort to lift accuracy from already-good levels to an exceptional 95% for the business-focused topics.
The implementation followed an iterative cycle: after each round of enhancements, the team re-analyzed the results to identify the next set of high-potential topics. This systematic, cyclical approach enabled continuous optimization while maintaining focus on business-critical areas.

Figure 7: Strategic Prioritization Framework for Domain Category Optimization
Dynamic prompt construction
The insights gained from the evaluation framework led to an architectural enhancement: the introduction of a dynamic prompt constructor. This component enabled rapid iterative improvements by allowing fine-grained control over which domain categories the agent could address. The structured field inventory, previously embedded in the system prompt, was transformed into a dynamic element that uses semantic search to construct a contextually relevant prompt for each user query. This approach tailors the prompt filter inventory along three contextual dimensions: query content, user persona, and tenant-specific requirements. The result is a more precise and efficient system that generates highly relevant responses while maintaining the flexibility needed for continuous optimization.
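
A simplified sketch of this kind of semantic filter selection is shown below; the filter inventory, embedding model choice, and prompt wording are illustrative placeholders rather than Pushpay's implementation.

# Hedged sketch of a dynamic prompt constructor: semantic selection of relevant filters.
# The filter inventory, prompt wording, and top_k value are illustrative placeholders.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("all-MiniLM-L6-v2")

FILTER_INVENTORY = {                      # placeholder subset of the 100+ application filters
    "group_membership": "Filter people by whether they belong to a group",
    "giving_history": "Filter people by donation amounts and dates",
    "engagement_score": "Filter people by overall engagement level",
}
filter_ids = list(FILTER_INVENTORY)
filter_vecs = embedder.encode([FILTER_INVENTORY[f] for f in filter_ids])

def build_dynamic_prompt(base_prompt: str, user_query: str, church_context: str, top_k: int = 2) -> str:
    sims = cosine_similarity(embedder.encode([user_query]), filter_vecs)[0]
    top = sorted(zip(filter_ids, sims), key=lambda x: x[1], reverse=True)[:top_k]
    relevant = "\n".join(f"- {fid}: {FILTER_INVENTORY[fid]}" for fid, _ in top)
    return f"{base_prompt}\n\nChurch context: {church_context}\n\nRelevant filters:\n{relevant}"

print(build_dynamic_prompt("You translate questions into filters.",
                           "members in a group who haven't given this year",
                           "mid-size congregation"))
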
Business impact
The generative AI evaluation framework became the cornerstone of Pushpay’s AI feature development, delivering measurable value across three dimensions:

User experience: The AI search feature reduced time-to-insight from approximately 120 seconds (experienced users manually navigating a complex UX) to under 4 seconds, a more than 15-fold acceleration that directly enhances ministry leaders’ productivity and decision-making speed. The feature democratized data insights, so users at different technical levels can access meaningful intelligence without requiring specialized expertise.
Development velocity: The scientific evaluation approach transformed optimization cycles. Rather than debating prompt modifications, the team now validates changes and measures domain-specific impacts within minutes, replacing prolonged deliberations with data-driven iteration.
Production readiness: Improving accuracy from 60–70% to more than 95% on the enabled high-performing domains provided the quantitative confidence required for customer-facing deployment, while the framework’s architecture enables continuous refinement across the remaining domain categories.

Key takeaways for your AI agent journey
The following are key takeaways from Pushpay’s experience that you can use in your own AI agent journey.
1/ Build with production in mind from day one
Building agentic AI systems is straightforward, but scaling them to production is challenging. Developers should adopt a scaling mindset during the proof-of-concept phase, not after. Implementing robust tracing and evaluation frameworks early provides a clear pathway from experimentation to production, and it lets teams identify and address accuracy issues systematically before they become blockers.
2/ Take advantage of the advanced features of Amazon Bedrock
Amazon Bedrock prompt caching significantly reduces token costs and latency by caching frequently used system prompts. For agents with large, stable system prompts, this feature is essential for production-grade performance.
3/ Think beyond aggregate metrics
Aggregate accuracy scores can sometimes mask critical performance variations. By evaluating agent performance at the domain category level, Pushpay uncovered weaknesses beyond what a single accuracy metric can capture. This granular approach enables targeted optimization and informed rollout decisions, making sure users only experience high-performing features while others are refined.
4/ Data security and responsible AI
When developing agentic AI systems, consider information protection and LLM security considerations from the outset, following the AWS Shared Responsibility Model, because security requirements fundamentally impact the architectural design. Pushpay’s customers are churches and faith-based organizations who are stewards of sensitive information—including pastoral care conversations, financial giving patterns, family struggles, prayer requests and more. In this implementation example, Pushpay set a clear approach to incorporating AI ethically within its product ecosystem, maintaining strict security standards to ensure church data and personally identifiable information (PII) remains within its secure partnership ecosystem. Data is shared only with secure and appropriate data protections applied and is never used to train external models. To learn more about Pushpay’s standards for incorporating AI within their products, visit the Pushpay Knowledge Center for a more in-depth review of company standards.
Conclusion: Your Path to Production-Ready AI Agents
Pushpay’s journey from a 60–70% accuracy prototype to a 95% accurate, production-ready AI agent demonstrates that building reliable agentic AI systems requires more than sophisticated prompts: it demands a scientific, data-driven approach to evaluation and optimization. The key breakthrough was not the AI technology itself but the implementation of a comprehensive evaluation framework, built on a strong observability foundation, that provided granular visibility into agent performance across different domains. This systematic approach enabled rapid iteration, strategic rollout decisions, and continuous improvement.
Ready to build your own production-ready AI agent?

Explore Amazon Bedrock: Begin building your agent with Amazon Bedrock
Implement LLM-as-a-judge: Create your own evaluation system using the patterns described in LLM-as-a-judge on Amazon Bedrock Model Evaluation
Build your golden dataset: Start curating representative queries and expected outputs for your specific use case

About the authors
Roger Wang is a Senior Solution Architect at AWS. He is a seasoned architect with over 20 years of experience in the software industry. He helps New Zealand and global software and SaaS companies use cutting-edge technology at AWS to solve complex business challenges. Roger is passionate about bridging the gap between business drivers and technological capabilities and thrives on facilitating conversations that drive impactful results.
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Frank Huang, PhD, is a Senior Analytics Specialist Solutions Architect at AWS based in Auckland, New Zealand. He focuses on helping customers deliver advanced analytics and AI/ML solutions. Throughout his career, Frank has worked across a variety of industries such as financial services, Web3, hospitality, media and entertainment, and telecommunications. Frank is eager to use his deep expertise in cloud architecture, AIOps, and end-to-end solution delivery to help customers achieve tangible business outcomes with the power of data and AI.
Saurabh Gupta is a data science and AI professional at Pushpay based in Auckland, New Zealand, where he focuses on implementing practical AI solutions and statistical modeling. He has extensive experience in machine learning, data science, and Python for data science applications, with specialized experience training in database agents and AI implementation. Prior to his current role, he gained experience in telecom, retail and financial services, developing expertise in marketing analytics and customer retention programs. He has a Master’s in Statistics from University of Auckland and a Master’s in Business Administration from the Indian Institute of Management, Calcutta.
Todd Colby is a Senior Software Engineer at Pushpay based in Seattle. His expertise is focused on evolving complex legacy applications with AI, and translating user needs into structured, high-accuracy solutions. He leverages AI to increase delivery velocity and produce cutting edge metrics and business decision tools.