Forget Keyword Imitation: ByteDance AI Maps Molecular Bonds in AI Reas …

ByteDance Seed recently released a research paper that could change how we build reasoning AI. For years, developers and AI researchers have struggled to ‘cold-start’ Large Language Models (LLMs) into Long Chain-of-Thought (Long CoT) models: most models lose their way or fail to transfer patterns during multi-step reasoning.

The ByteDance team identified the root of the problem: we have been looking at reasoning the wrong way. Effective AI reasoning is not just a sequence of words or graph nodes; it has a stable, molecule-like structure.

https://arxiv.org/pdf/2601.06002

The 3 ‘Chemical Bonds’ of Thought

The researchers posit that high-quality reasoning trajectories are held together by 3 interaction types. These mirror the forces found in organic chemistry:

Deep Reasoning as Covalent Bonds: This forms the primary ‘backbone’ of the thought process. It encodes strong logical dependencies where Step A must justify Step B. Breaking this bond destabilizes the entire answer.

Self-Reflection as Hydrogen Bonds: This acts as a stabilizer. Just as proteins gain stability when chains fold, reasoning stabilizes when later steps (like Step 100) revise or reinforce earlier premises (like Step 10). In their tests, 81.72% of reflection steps successfully reconnected to previously formed clusters.

Self-Exploration as Van der Waals Forces: These are weak bridges between distant clusters of logic. They allow the model to probe new possibilities or alternative hypotheses before enforcing stronger logical constraints.

Why ‘Wait, Let Me Think’ Isn’t Enough

Most AI developers and researchers try to fix reasoning by training models to imitate keywords like ‘wait’ or ‘maybe’. The ByteDance team showed that models actually learn the underlying reasoning behavior, not the surface words.

The research team identifies a phenomenon called Semantic Isomers. These are reasoning chains that solve the same task and use the same concepts but differ in how their logical ‘bonds’ are distributed.

Key findings include:

Imitation Fails: Fine-tuning on human-annotated traces or using In-Context Learning (ICL) from weak models fails to build stable Long CoT structures.

Structural Conflict: Mixing reasoning data from different strong teachers (like DeepSeek-R1 and OpenAI-OSS) actually destabilizes the model. Even if the data is similar, the different “molecular” structures cause structural chaos and drop performance.

Information Flow: Unlike humans, who have uniform information gain, strong reasoning models exhibit metacognitive oscillation. They alternate between high-entropy exploration and stable convergent validation.

https://arxiv.org/pdf/2601.06002

MOLE-SYN: The Synthesis Method

To fix these issues, the ByteDance team introduced MOLE-SYN, a ‘distribution-transfer-graph’ method. Instead of directly copying a teacher’s text, it transfers the behavioral structure to the student model.

It works by estimating a behavior transition graph from strong models and guiding a cheaper model to synthesize its own effective Long CoT structures. This decoupling of structure from surface text yields consistent gains across 6 major benchmarks, including GSM8K, MATH-500, and OlymBench.
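The idea of estimating a behavior transition graph and sampling new structures from it can be illustrated with a toy Markov chain over reasoning "moves". This is a hypothetical sketch, not the paper's implementation: the behavior labels and the `estimate_transitions` / `sample_plan` helpers are our own illustrative names.

```python
import random

# Toy teacher traces: each step is labeled with a reasoning behavior.
teacher_traces = [
    ["reason", "reason", "explore", "reason", "reflect", "reason"],
    ["reason", "explore", "reason", "reason", "reflect"],
]

def estimate_transitions(traces):
    # Count behavior-to-behavior transitions, then normalize to probabilities.
    counts = {}
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts.setdefault(a, {}).setdefault(b, 0)
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def sample_plan(graph, start="reason", length=6, seed=0):
    # Sample a structural plan for the student to fill with its own text.
    rng = random.Random(seed)
    plan, state = [start], start
    for _ in range(length - 1):
        nxt = graph.get(state)
        if not nxt:
            break
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        plan.append(state)
    return plan
```

The key design point is the decoupling: only the transition structure is transferred, so the student generates its own surface text under the teacher's behavioral distribution.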

Protecting the ‘Thought Molecule’

This research also sheds light on how private AI companies protect their models. Exposing full reasoning traces allows others to clone the model’s internal procedures.

ByteDance team found that summarization and reasoning compression are effective defenses. By reducing the token count—often by more than 45%—companies disrupt the reasoning bond distributions. This creates a gap between what the model outputs and its internal ‘error-bounded transitions,’ making it much harder to distill the model’s capabilities.

Key Takeaways

Reasoning as ‘Molecular’ Bonds: Effective Long Chain-of-Thought (Long CoT) is defined by three specific ‘chemical’ bonds: Deep Reasoning (covalent-like) forms the logical backbone, Self-Reflection (hydrogen-bond-like) provides global stability through logical folding, and Self-Exploration (van der Waals-like) bridges distant semantic concepts.

Behavior Over Keywords: Models internalize underlying reasoning structures and transition distributions rather than just surface-level lexical cues like ‘wait’ or ‘maybe’. Replacing keywords with synonyms does not significantly impact performance, proving that true reasoning depth comes from learned behavioral motifs.

The ‘Semantic Isomer’ Conflict: Combining heterogeneous reasoning data from different strong models (e.g., DeepSeek-R1 and OpenAI-OSS) can trigger ‘structural chaos’. Even if data sources are statistically similar, incompatible behavioral distributions can break logical coherence and degrade model performance.

MOLE-SYN Methodology: This ‘distribution-transfer-graph’ framework enables models to synthesize effective Long CoT structures from scratch using cheaper instruction LLMs. By transferring the behavioral transition graph instead of direct text, MOLE-SYN achieves performance close to expensive distillation while stabilizing Reinforcement Learning (RL).

Protection via Structural Disruption: Private LLMs can protect their internal reasoning processes through summarization and compression. Reducing token count by roughly 45% or more effectively ‘breaks’ the bond distributions, making it significantly harder for unauthorized models to clone internal reasoning procedures via distillation.

Check out the Paper. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and subscribe to our Newsletter. You can also join us on Telegram.
The post Forget Keyword Imitation: ByteDance AI Maps Molecular Bonds in AI Reasoning to Stabilize Long Chain-of-Thought Performance and Reinforcement Learning (RL) Training appeared first on MarkTechPost.

A New Google AI Research Proposes Deep-Thinking Ratio to Improve LLM A …

For the last few years, the AI world has followed a simple rule: if you want a Large Language Model (LLM) to solve a harder problem, make its Chain-of-Thought (CoT) longer. But new research from the University of Virginia and Google proves that ‘thinking long’ is not the same as ‘thinking hard’.

The research team reveals that simply adding more tokens to a response can actually make an AI less accurate. Instead of counting words, the Google researchers introduce a new measurement: the Deep-Thinking Ratio (DTR).

https://arxiv.org/pdf/2602.13517

The Failure of ‘Token Maxing‘

Engineers often use token count as a proxy for the effort an AI puts into a task. However, the researchers found that raw token count has an average correlation of r = -0.59 with accuracy.

This negative number means that as the model generates more text, it is more likely to be wrong. This happens because of ‘overthinking,’ where the model gets stuck in loops, repeats redundant steps, or amplifies its own mistakes. Relying on length alone wastes expensive compute on uninformative tokens.

What are Deep-Thinking Tokens?

The research team argued that real ‘thinking’ happens inside the layers of the model, not just in the final output. When a model predicts a token, it processes data through a series of transformer layers (L).

Shallow Tokens: For easy words, the model’s prediction stabilizes early. The ‘guess’ doesn’t change much from layer 5 to layer 36.

Deep-Thinking Tokens: For difficult logic or math symbols, the prediction shifts significantly in the deeper layers.

How to Measure Depth

To identify these tokens, the research team uses a technique to peek at the model’s internal ‘drafts’ at every layer. They project the intermediate hidden states (h_{t,l}) into the vocabulary space using the model’s unembedding matrix (W_U). This produces a probability distribution (p_{t,l}) for every layer.

They then calculate the Jensen-Shannon Divergence (JSD) between each intermediate layer distribution and the final layer distribution (p_{t,L}):

D_{t,l} := JSD(p_{t,L} || p_{t,l})

A token is a deep-thinking token if its prediction only settles in the ‘late regime’, defined by a depth fraction (ρ). In their tests, they set ρ = 0.85, meaning the token only stabilized in the final 15% of the layers.

The Deep-Thinking Ratio (DTR) is the percentage of these ‘hard’ tokens in a full sequence. Across models like DeepSeek-R1-70B, Qwen3-30B-Thinking, and GPT-OSS-120B, DTR showed a strong average positive correlation of r = 0.683 with accuracy.
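Under these definitions, flagging a deep-thinking token can be sketched as follows. This is a minimal illustration, not the paper's code: the settling threshold `tol` is our assumption, since the exact stabilization criterion is not reproduced here.

```python
import numpy as np

def js_divergence(p, q):
    # Jensen-Shannon divergence between two probability distributions.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def is_deep_thinking_token(layer_dists, rho=0.85, tol=0.1):
    # layer_dists: per-layer next-token distributions p_{t,l}, l = 1..L.
    # The token is "deep-thinking" if its prediction has not settled
    # (JSD to the final layer above `tol`) until after depth fraction rho.
    final = layer_dists[-1]
    L = len(layer_dists)
    settle_layer = L
    for l, dist in enumerate(layer_dists, start=1):
        if js_divergence(final, dist) <= tol:
            settle_layer = l
            break
    return settle_layer / L > rho
```

Computing this flag for every token and taking the fraction flagged yields the sequence-level DTR.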

https://arxiv.org/pdf/2602.13517

Think@n: Better Accuracy at 50% the Cost

The research team used this innovative approach to create Think@n, a new way to scale AI performance during inference.

Most devs use Self-Consistency (Cons@n), where they sample 48 different answers and use majority voting to pick the best one. This is very expensive because you have to generate every single token for every answer.

Think@n changes the game by using ‘early halting’:

The model starts generating multiple candidate answers.

After just 50 prefix tokens, the system calculates the DTR for each candidate.

It immediately stops generating the ‘unpromising’ candidates with low DTR.

It only finishes the candidates with high deep-thinking scores.
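The selection loop above can be sketched like this. It is a hedged illustration with placeholder interfaces (`dtr_of_prefix` and `finish` stand in for DTR scoring and full decoding), not the paper's code.

```python
from collections import Counter

def think_at_n(candidates, dtr_of_prefix, finish, keep_k=4):
    # Rank candidate generations by the deep-thinking ratio of their
    # short prefix, halt the low-DTR ones early, finish only the rest,
    # and majority-vote over the surviving answers.
    ranked = sorted(candidates, key=dtr_of_prefix, reverse=True)
    survivors = ranked[:keep_k]
    answers = [finish(c) for c in survivors]
    return Counter(answers).most_common(1)[0][0]
```

Because DTR is estimated from only a short prefix, the halted candidates never incur the cost of full generation, which is where the reported savings come from.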

The Results on AIME 2025

Method | Accuracy | Avg. Cost (k tokens)
Cons@n (Majority Vote) | 92.7% | 307.6
Think@n (DTR-based Selection) | 94.7% | 155.4

On the AIME 25 math benchmark, Think@n achieved higher accuracy than standard voting while reducing the inference cost by 49%.

Key Takeaways

Token count is a poor predictor of accuracy: Raw output length has an average negative correlation (r = -0.59) with performance, meaning longer reasoning traces often signal ‘overthinking’ rather than higher quality.

Deep-thinking tokens define true effort: Unlike simple tokens that stabilize in early layers, deep-thinking tokens are those whose internal predictions undergo significant revision in deeper model layers before converging.

The Deep-Thinking Ratio (DTR) is a superior metric: DTR measures the proportion of deep-thinking tokens in a sequence and exhibits a robust positive correlation with accuracy (average r = 0.683), consistently outperforming length-based or confidence-based baselines.

Think@n enables efficient test-time scaling: By prioritizing and finishing only the samples with high deep-thinking ratios, the Think@n strategy matches or exceeds the performance of standard majority voting (Cons@n).

Massive cost reduction via early halting: Because DTR can be estimated from a short prefix of just 50 tokens, unpromising generations can be rejected early, reducing total inference costs by approximately 50%.

Check out the Paper.
The post A New Google AI Research Proposes Deep-Thinking Ratio to Improve LLM Accuracy While Cutting Total Inference Costs by Half appeared first on MarkTechPost.

How to Design an Agentic Workflow for Tool-Driven Route Optimization w …

In this tutorial, we build a production-style Route Optimizer Agent for a logistics dispatch center using the latest LangChain agent APIs. We design a tool-driven workflow in which the agent reliably computes distances, ETAs, and optimal routes rather than guessing, and we enforce structured outputs to make the results directly usable in downstream systems. We integrate geographic calculations, configurable speed profiles, traffic buffers, and multi-stop route optimization, ensuring the agent behaves deterministically while still reasoning flexibly through tools.

!pip -q install -U langchain langchain-openai pydantic

import os
from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY (input hidden): ")

from typing import Dict, List, Optional, Tuple, Any
from math import radians, sin, cos, sqrt, atan2

from pydantic import BaseModel, Field, ValidationError

from langchain_openai import ChatOpenAI
from langchain.tools import tool
from langchain.agents import create_agent

We set up the execution environment and ensure all required libraries are installed and imported correctly. We securely load the OpenAI API key so the agent can interact with the language model without hardcoding credentials. We also prepare the core dependencies that power tools, agents, and structured outputs.

SITES: Dict[str, Dict[str, Any]] = {
    "Rig_A": {"lat": 23.5880, "lon": 58.3829, "type": "rig"},
    "Rig_B": {"lat": 23.6100, "lon": 58.5400, "type": "rig"},
    "Rig_C": {"lat": 23.4500, "lon": 58.3000, "type": "rig"},
    "Yard_Main": {"lat": 23.5700, "lon": 58.4100, "type": "yard"},
    "Depot_1": {"lat": 23.5200, "lon": 58.4700, "type": "depot"},
    "Depot_2": {"lat": 23.6400, "lon": 58.4300, "type": "depot"},
}

SPEED_PROFILES: Dict[str, float] = {
    "highway": 90.0,
    "arterial": 65.0,
    "local": 45.0,
}

DEFAULT_TRAFFIC_MULTIPLIER = 1.10

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    R = 6371.0  # Earth radius in km
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))  # angular distance in radians
    return R * c

We define the core domain data representing rigs, yards, and depots along with their geographic coordinates. We establish speed profiles and a default traffic multiplier to reflect realistic driving conditions. We also implement the Haversine distance function, which serves as the mathematical backbone of all routing decisions.

def _normalize_site_name(name: str) -> str:
    return name.strip()

def _assert_site_exists(name: str) -> None:
    if name not in SITES:
        raise ValueError(f"Unknown site '{name}'. Use list_sites() or suggest_site().")

def _distance_between(a: str, b: str) -> float:
    _assert_site_exists(a)
    _assert_site_exists(b)
    sa, sb = SITES[a], SITES[b]
    return float(haversine_km(sa["lat"], sa["lon"], sb["lat"], sb["lon"]))

def _eta_minutes(distance_km: float, speed_kmph: float, traffic_multiplier: float) -> float:
    speed = max(float(speed_kmph), 1e-6)
    base_minutes = (distance_km / speed) * 60.0
    return float(base_minutes * max(float(traffic_multiplier), 0.0))

def compute_route_metrics(path: List[str], speed_kmph: float, traffic_multiplier: float) -> Dict[str, Any]:
    if len(path) < 2:
        raise ValueError("Route path must include at least origin and destination.")
    for s in path:
        _assert_site_exists(s)
    legs = []
    total_km = 0.0
    total_min = 0.0
    for i in range(len(path) - 1):
        a, b = path[i], path[i + 1]
        d_km = _distance_between(a, b)
        t_min = _eta_minutes(d_km, speed_kmph, traffic_multiplier)
        legs.append({"from": a, "to": b, "distance_km": d_km, "eta_minutes": t_min})
        total_km += d_km
        total_min += t_min
    return {"route": path, "distance_km": float(total_km), "eta_minutes": float(total_min), "legs": legs}

We build the low-level utility functions that validate site names and compute distances and travel times. We implement logic to calculate per-leg and total route metrics deterministically. This ensures that every ETA and distance returned by the agent is based on explicit computation rather than inference.

def _all_paths_with_waypoints(origin: str, destination: str, waypoints: List[str], max_stops: int) -> List[List[str]]:
    from itertools import permutations
    waypoints = [w for w in waypoints if w not in (origin, destination)]
    max_stops = int(max(0, max_stops))
    candidates = []
    for k in range(0, min(len(waypoints), max_stops) + 1):
        for perm in permutations(waypoints, k):
            candidates.append([origin, *perm, destination])
    if [origin, destination] not in candidates:
        candidates.insert(0, [origin, destination])
    return candidates

def find_best_route(origin: str, destination: str, allowed_waypoints: Optional[List[str]], max_stops: int, speed_kmph: float, traffic_multiplier: float, objective: str, top_k: int) -> Dict[str, Any]:
    origin = _normalize_site_name(origin)
    destination = _normalize_site_name(destination)
    _assert_site_exists(origin)
    _assert_site_exists(destination)
    allowed_waypoints = allowed_waypoints or []
    for w in allowed_waypoints:
        _assert_site_exists(_normalize_site_name(w))
    objective = (objective or "eta").strip().lower()
    if objective not in {"eta", "distance"}:
        raise ValueError("objective must be one of: 'eta', 'distance'")
    top_k = max(1, int(top_k))
    candidates = _all_paths_with_waypoints(origin, destination, allowed_waypoints, max_stops=max_stops)
    scored = []
    for path in candidates:
        metrics = compute_route_metrics(path, speed_kmph=speed_kmph, traffic_multiplier=traffic_multiplier)
        score = metrics["eta_minutes"] if objective == "eta" else metrics["distance_km"]
        scored.append((score, metrics))
    scored.sort(key=lambda x: x[0])
    best = scored[0][1]
    alternatives = [m for _, m in scored[1:top_k]]
    return {"best": best, "alternatives": alternatives, "objective": objective}

We introduce multi-stop routing logic by generating candidate paths with optional waypoints. We evaluate each candidate route against a clear optimization objective, such as ETA or distance. We then rank routes and extract the best option along with a set of strong alternatives.

@tool
def list_sites(site_type: Optional[str] = None) -> List[str]:
    """List all site names, optionally filtered by type ('rig', 'yard', 'depot')."""
    if site_type:
        st = site_type.strip().lower()
        return sorted([k for k, v in SITES.items() if str(v.get("type", "")).lower() == st])
    return sorted(SITES.keys())

@tool
def get_site_details(site: str) -> Dict[str, Any]:
    """Return the coordinates and type of a known site."""
    s = _normalize_site_name(site)
    _assert_site_exists(s)
    return {"site": s, **SITES[s]}

@tool
def suggest_site(query: str, max_suggestions: int = 5) -> List[str]:
    """Suggest site names that best match a possibly misspelled query."""
    q = (query or "").strip().lower()
    max_suggestions = max(1, int(max_suggestions))
    scored = []
    for name in SITES.keys():
        n = name.lower()
        common = len(set(q) & set(n))
        bonus = 5 if q and q in n else 0
        scored.append((common + bonus, name))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [name for _, name in scored[:max_suggestions]]

@tool
def compute_direct_route(origin: str, destination: str, road_class: str = "arterial", traffic_multiplier: float = DEFAULT_TRAFFIC_MULTIPLIER) -> Dict[str, Any]:
    """Compute distance and ETA for a direct origin-to-destination route."""
    origin = _normalize_site_name(origin)
    destination = _normalize_site_name(destination)
    rc = (road_class or "arterial").strip().lower()
    if rc not in SPEED_PROFILES:
        raise ValueError(f"Unknown road_class '{road_class}'. Use one of: {sorted(SPEED_PROFILES.keys())}")
    speed = SPEED_PROFILES[rc]
    return compute_route_metrics([origin, destination], speed_kmph=speed, traffic_multiplier=float(traffic_multiplier))

@tool
def optimize_route(origin: str, destination: str, allowed_waypoints: Optional[List[str]] = None, max_stops: int = 2, road_class: str = "arterial", traffic_multiplier: float = DEFAULT_TRAFFIC_MULTIPLIER, objective: str = "eta", top_k: int = 3) -> Dict[str, Any]:
    """Find the best route, optionally via waypoints, optimizing ETA or distance."""
    origin = _normalize_site_name(origin)
    destination = _normalize_site_name(destination)
    rc = (road_class or "arterial").strip().lower()
    if rc not in SPEED_PROFILES:
        raise ValueError(f"Unknown road_class '{road_class}'. Use one of: {sorted(SPEED_PROFILES.keys())}")
    speed = SPEED_PROFILES[rc]
    allowed_waypoints = allowed_waypoints or []
    allowed_waypoints = [_normalize_site_name(w) for w in allowed_waypoints]
    return find_best_route(origin, destination, allowed_waypoints, int(max_stops), float(speed), float(traffic_multiplier), str(objective), int(top_k))

We expose the routing and discovery logic as callable tools for the agent. We allow the agent to list sites, inspect site details, resolve ambiguous names, and compute both direct and optimized routes. This tool layer ensures that the agent always reasons by calling verified functions rather than hallucinating results.

class RouteLeg(BaseModel):
    from_site: str
    to_site: str
    distance_km: float
    eta_minutes: float

class RoutePlan(BaseModel):
    route: List[str]
    distance_km: float
    eta_minutes: float
    legs: List[RouteLeg]
    objective: str

class RouteDecision(BaseModel):
    chosen: RoutePlan
    alternatives: List[RoutePlan] = []
    assumptions: Dict[str, Any] = {}
    notes: str = ""
    audit: List[str] = []

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

SYSTEM_PROMPT = (
    "You are the Route Optimizer Agent for a logistics dispatch center.\n"
    "You MUST use tools for any distance/ETA calculation.\n"
    "Return ONLY the structured RouteDecision."
)

route_agent = create_agent(
    model=llm,
    tools=[list_sites, get_site_details, suggest_site, compute_direct_route, optimize_route],
    system_prompt=SYSTEM_PROMPT,
    response_format=RouteDecision,
)

def get_route_decision(origin: str, destination: str, road_class: str = "arterial", traffic_multiplier: float = DEFAULT_TRAFFIC_MULTIPLIER, allowed_waypoints: Optional[List[str]] = None, max_stops: int = 2, objective: str = "eta", top_k: int = 3) -> RouteDecision:
    user_msg = {
        "role": "user",
        "content": (
            f"Optimize the route from {origin} to {destination}.\n"
            f"road_class={road_class}, traffic_multiplier={traffic_multiplier}\n"
            f"objective={objective}, top_k={top_k}\n"
            f"allowed_waypoints={allowed_waypoints}, max_stops={max_stops}\n"
            "Return the structured RouteDecision only."
        ),
    }
    result = route_agent.invoke({"messages": [user_msg]})
    return result["structured_response"]

decision1 = get_route_decision("Yard_Main", "Rig_B", road_class="arterial", traffic_multiplier=1.12)
print(decision1.model_dump())

decision2 = get_route_decision("Rig_C", "Rig_B", road_class="highway", traffic_multiplier=1.08, allowed_waypoints=["Depot_1", "Depot_2", "Yard_Main"], max_stops=2, objective="eta", top_k=3)
print(decision2.model_dump())

We define strict Pydantic schemas to enforce structured, machine-readable outputs from the agent. We initialize the language model and create the agent with a clear system prompt and response format. We then demonstrate how to invoke the agent and obtain reliable route decisions ready for real logistics workflows.

In conclusion, we have implemented a robust, extensible route optimization agent that selects the best path between sites while clearly explaining its assumptions and alternatives. We demonstrated how combining deterministic routing logic with a tool-calling LLM produces reliable, auditable decisions suitable for real logistics operations. This foundation allows us to easily extend the system with live traffic data, fleet constraints, or cost-based objectives, making the agent a practical component in a larger dispatch or fleet-management platform.

Check out the Full Codes here.
The post How to Design an Agentic Workflow for Tool-Driven Route Optimization with Deterministic Computation and Structured Outputs appeared first on MarkTechPost.

Is There a Community Edition of Palantir? Meet OpenPlanter: An Open So …

The balance of power in the digital age is shifting. While governments and large corporations have long used data to track individuals, a new open-source project called OpenPlanter is giving that power back to the public. Created by a developer known as ‘Shin Megami Boson’, OpenPlanter is a recursive-language-model investigation agent. Its goal is simple: help you keep tabs on your government, since they are almost certainly keeping tabs on you.

Solving the ‘Heterogeneous Data’ Problem

Investigative work is difficult because data is messy. Public records are often spread across 100 different formats. You might have a CSV of campaign finance records, a JSON file of government contracts, and a PDF of lobbying disclosures.

OpenPlanter ingests these disparate structured and unstructured data sources effortlessly. It uses Large Language Models (LLMs) to perform entity resolution. This is the process of identifying when different records refer to the same person or company. Once it connects these dots, the agent probabilistically looks for anomalies. It searches for patterns that a human might miss, such as a sudden spike in contract wins following a specific lobbying event.
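A toy sketch of the entity-resolution idea is below. It is purely illustrative: OpenPlanter uses LLMs for this step, while this sketch uses simple string normalization, and the record fields are invented.

```python
import re

def normalize(name):
    # Canonicalize an organization name: lowercase, strip punctuation
    # and common corporate suffixes, collapse whitespace.
    n = name.lower()
    n = re.sub(r"[.,]", "", n)
    n = re.sub(r"\b(inc|llc|ltd|corp|co)\b", "", n)
    return " ".join(n.split())

# Heterogeneous records that all refer to the same company.
records = [
    {"source": "contracts.csv", "vendor": "Acme Corp."},
    {"source": "lobbying.pdf", "client": "ACME Corp"},
    {"source": "grants.json", "org": "Acme, Inc."},
]

entities = {}
for rec in records:
    raw = rec.get("vendor") or rec.get("client") or rec.get("org")
    entities.setdefault(normalize(raw), []).append(rec["source"])
```

Once records from different sources collapse onto one canonical entity, anomaly detection (e.g., contract spikes after lobbying events) becomes a query over that merged view.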

The Architecture: Recursive Sub-Agent Delegation

What makes OpenPlanter unique is its recursive engine. Most AI agents handle 1 request at a time. OpenPlanter, however, breaks large objectives into smaller pieces. If you give it a massive task, it uses a sub-agent delegation strategy.

The agent has a default max-depth of 4. This means the main agent can spawn a sub-agent, which can spawn another, and so on. These agents work in parallel to:

Resolve entities across massive datasets.

Link datasets that have no common ID numbers.

Construct evidence chains that back up every single finding.

This recursive approach allows the system to handle investigations that are too large for a single ‘context window.’
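A minimal sketch of the delegation pattern follows. The shape is hypothetical, not OpenPlanter's actual API: a composite task is split among sub-agents until the depth cap, leaf tasks are solved directly, and findings are merged on the way back up.

```python
def run_agent(task, depth=0, max_depth=4):
    # Leaf task (string) or depth cap reached: solve directly.
    if isinstance(task, str) or depth >= max_depth:
        return [f"finding({task})"]
    # Composite task (list): delegate each piece to a sub-agent.
    findings = []
    for subtask in task:
        findings.extend(run_agent(subtask, depth + 1, max_depth))
    return findings
```

Each level of nesting consumes one unit of depth, matching the default max-depth of 4 described above.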

The 2026 AI Stack

OpenPlanter is built for the high-performance requirements of 2026. It is written in Python 3.10+ and integrates with the most advanced models available today. The technical documentation lists several supported providers:

OpenAI: It uses gpt-5.2 as the default.

Anthropic: It supports claude-opus-4-6.

OpenRouter: It defaults to anthropic/claude-sonnet-4-5.

Cerebras: It uses qwen-3-235b-a22b-instruct-2507 for high-speed tasks.

The system also uses Exa for web searches and Voyage for high-accuracy embeddings. This multi-model strategy ensures that the agent uses the best ‘brain’ for each specific sub-task.

19 Tools for Digital Forensics

The agent is equipped with 19 specialized tools. These tools allow it to interact with the real world rather than just ‘chatting.’ These are organized into 4 core areas:

File I/O and Workspace: Tools like read_file, write_file, and hashline_edit allow the agent to manage its own database of findings.

Shell Execution: The agent can use run_shell to execute actual code. It can write a Python script to analyze a dataset and then run that script to get results.

Web Retrieval: With web_search and fetch_url, it can pull live data from government registries or news sites.

Planning and Logic: The think tool lets the agent pause and strategize. It uses acceptance-criteria to verify that a sub-task was completed correctly before moving to the next step.

Deployment and Interface

OpenPlanter is designed to be accessible but powerful. It features a Terminal User Interface (TUI) built with rich and prompt_toolkit. The interface includes a splash art screen of ASCII potted plants, but the work it does is serious.

You can get started quickly using Docker. By running docker compose up, the agent starts in a container. This is a critical security feature because it isolates the agent’s run_shell commands from the user’s host operating system.

The command-line interface allows for ‘headless’ tasks. You can run a single command like:

openplanter-agent --task "Flag all vendor overlaps in lobbying data" --workspace ./data

The agent will then work autonomously until it produces a final report.

Key Takeaways

Autonomous Recursive Logic: Unlike standard agents, OpenPlanter uses a recursive sub-agent delegation strategy (default max-depth of 4). It breaks complex investigative objectives into smaller sub-tasks, parallelizing work across multiple agents to build detailed evidence chains.

Heterogeneous Data Correlation: The agent is built to ingest and resolve disparate structured and unstructured data. It can simultaneously process CSV files, JSON records, and unstructured text (like PDFs) to identify entities across fragmented datasets.

Probabilistic Anomaly Detection: By performing entity resolution, OpenPlanter automatically connects records—such as matching a corporate alias to a lobbying disclosure—and looks for probabilistic anomalies to surface hidden connections between government spending and private interests.

High-End 2026 Model Stack: The system is provider-agnostic and utilizes the latest frontier models, including OpenAI gpt-5.2, Anthropic claude-opus-4-6, and Cerebras qwen-3-235b-a22b-instruct-2507 for high-speed inference.

Integrated Toolset for Forensics: OpenPlanter features 19 distinct tools, including shell execution (run_shell), web search (Exa), and file patching (hashline_edit). This allows it to write and run its own analysis scripts while verifying results against real-world acceptance criteria.

Check out the Repo here.

Disclaimer: MarkTechPost does not endorse the OpenPlanter project and provides this technical report for informational purposes only.
The post Is There a Community Edition of Palantir? Meet OpenPlanter: An Open Source Recursive AI Agent for Your Micro Surveillance Use Cases appeared first on MarkTechPost.

A Coding Guide to High-Quality Image Generation, Control, and Editing …

In this tutorial, we design a practical image-generation workflow using the Diffusers library. We start by stabilizing the environment, then generate high-quality images from text prompts using Stable Diffusion with an optimized scheduler. We accelerate inference with a LoRA-based latent consistency approach, guide composition with ControlNet under edge conditioning, and finally perform localized edits via inpainting. Also, we focus on real-world techniques that balance image quality, speed, and controllability.

!pip -q uninstall -y pillow Pillow || true
!pip -q install --upgrade --force-reinstall "pillow<12.0"
!pip -q install --upgrade diffusers transformers accelerate safetensors huggingface_hub opencv-python

import os, math, random
import torch
import numpy as np
import cv2
from PIL import Image, ImageDraw, ImageFilter
from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionInpaintPipeline,
    ControlNetModel,
    StableDiffusionControlNetPipeline,
    UniPCMultistepScheduler,
)

We prepare a clean and compatible runtime by resolving dependency conflicts and installing all required libraries. We ensure image processing works reliably by pinning the correct Pillow version and loading the Diffusers ecosystem. We also import all core modules needed for generation, control, and inpainting workflows.

def seed_everything(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def to_grid(images, cols=2, bg=255):
    if isinstance(images, Image.Image):
        images = [images]
    w, h = images[0].size
    rows = math.ceil(len(images) / cols)
    grid = Image.new("RGB", (cols*w, rows*h), (bg, bg, bg))
    for i, im in enumerate(images):
        grid.paste(im, ((i % cols)*w, (i // cols)*h))
    return grid

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
print("device:", device, "| dtype:", dtype)

We define utility functions to ensure reproducibility and to organize visual outputs efficiently. We set global random seeds so our generations remain consistent across runs. We also detect the available hardware and configure precision to optimize performance on the GPU or CPU.

seed_everything(7)
BASE_MODEL = "runwayml/stable-diffusion-v1-5"

pipe = StableDiffusionPipeline.from_pretrained(
    BASE_MODEL,
    torch_dtype=dtype,
    safety_checker=None,
).to(device)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

if device == "cuda":
    pipe.enable_attention_slicing()
    pipe.enable_vae_slicing()

prompt = "a cinematic photo of a futuristic street market at dusk, ultra-detailed, 35mm, volumetric lighting"
negative_prompt = "blurry, low quality, deformed, watermark, text"

img_text = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,
    guidance_scale=6.5,
    width=768,
    height=512,
).images[0]

We initialize the base Stable Diffusion pipeline and switch to a more efficient UniPC scheduler. We generate a high-quality image directly from a text prompt using carefully chosen guidance and resolution settings. This establishes a strong baseline for subsequent improvements in speed and control.

LCM_LORA = "latent-consistency/lcm-lora-sdv1-5"
pipe.load_lora_weights(LCM_LORA)

try:
    pipe.fuse_lora()
    lora_fused = True
except Exception as e:
    lora_fused = False
    print("LoRA fuse skipped:", e)

fast_prompt = "a clean product photo of a minimal smartwatch on a reflective surface, studio lighting"
fast_images = []
for steps in [4, 6, 8]:
    fast_images.append(
        pipe(
            prompt=fast_prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=steps,
            guidance_scale=1.5,
            width=768,
            height=512,
        ).images[0]
    )

grid_fast = to_grid(fast_images, cols=3)
print("LoRA fused:", lora_fused)

W, H = 768, 512
layout = Image.new("RGB", (W, H), "white")
draw = ImageDraw.Draw(layout)
draw.rectangle([40, 80, 340, 460], outline="black", width=6)
draw.ellipse([430, 110, 720, 400], outline="black", width=6)
draw.line([0, 420, W, 420], fill="black", width=5)

edges = cv2.Canny(np.array(layout), 80, 160)
edges = np.stack([edges]*3, axis=-1)
canny_image = Image.fromarray(edges)

CONTROLNET = "lllyasviel/sd-controlnet-canny"
controlnet = ControlNetModel.from_pretrained(
    CONTROLNET,
    torch_dtype=dtype,
).to(device)

cn_pipe = StableDiffusionControlNetPipeline.from_pretrained(
    BASE_MODEL,
    controlnet=controlnet,
    torch_dtype=dtype,
    safety_checker=None,
).to(device)

cn_pipe.scheduler = UniPCMultistepScheduler.from_config(cn_pipe.scheduler.config)

if device == "cuda":
    cn_pipe.enable_attention_slicing()
    cn_pipe.enable_vae_slicing()

cn_prompt = "a modern cafe interior, architectural render, soft daylight, high detail"
img_controlnet = cn_pipe(
    prompt=cn_prompt,
    negative_prompt=negative_prompt,
    image=canny_image,
    num_inference_steps=25,
    guidance_scale=6.5,
    controlnet_conditioning_scale=1.0,
).images[0]

We accelerate inference by loading and fusing a LoRA adapter and demonstrate fast sampling with very few diffusion steps. We then construct a structural conditioning image and apply ControlNet to guide the layout of the generated scene. This allows us to preserve composition while still benefiting from creative text guidance.

mask = Image.new("L", img_controlnet.size, 0)
mask_draw = ImageDraw.Draw(mask)
mask_draw.rectangle([60, 90, 320, 170], fill=255)
mask = mask.filter(ImageFilter.GaussianBlur(2))

inpaint_pipe = StableDiffusionInpaintPipeline.from_pretrained(
    BASE_MODEL,
    torch_dtype=dtype,
    safety_checker=None,
).to(device)

inpaint_pipe.scheduler = UniPCMultistepScheduler.from_config(inpaint_pipe.scheduler.config)

if device == "cuda":
    inpaint_pipe.enable_attention_slicing()
    inpaint_pipe.enable_vae_slicing()

inpaint_prompt = "a glowing neon sign that says 'CAFÉ', cyberpunk style, realistic lighting"

img_inpaint = inpaint_pipe(
    prompt=inpaint_prompt,
    negative_prompt=negative_prompt,
    image=img_controlnet,
    mask_image=mask,
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]

os.makedirs("outputs", exist_ok=True)
img_text.save("outputs/text2img.png")
grid_fast.save("outputs/lora_fast_grid.png")
layout.save("outputs/layout.png")
canny_image.save("outputs/canny.png")
img_controlnet.save("outputs/controlnet.png")
mask.save("outputs/mask.png")
img_inpaint.save("outputs/inpaint.png")

print("Saved outputs:", sorted(os.listdir("outputs")))
print("Done.")

We create a mask to isolate a specific region and apply inpainting to modify only that part of the image. We refine the selected area using a targeted prompt while keeping the rest intact. Finally, we save all intermediate and final outputs to disk for inspection and reuse.

In conclusion, we demonstrated how a single Diffusers pipeline can evolve into a flexible, production-ready image generation system. We explained how to move from pure text-to-image generation to fast sampling, structural control, and targeted image editing without changing frameworks or tooling. This tutorial highlights how we can combine schedulers, LoRA adapters, ControlNet, and inpainting to create controllable and efficient generative pipelines that are easy to extend for more advanced creative or applied use cases.

Check out the Full Codes here. Also, feel free to follow us on Twitter, and join our 100k+ ML SubReddit, our Newsletter, and our Telegram channel.
The post A Coding Guide to High-Quality Image Generation, Control, and Editing Using HuggingFace Diffusers appeared first on MarkTechPost.

How to Design a Swiss Army Knife Research Agent with Tool-Using AI, We …

In this tutorial, we build a “Swiss Army Knife” research agent that goes far beyond simple chat interactions and actively solves multi-step research problems end-to-end. We combine a tool-using agent architecture with live web search, local PDF ingestion, vision-based chart analysis, and automated report generation to demonstrate how modern agents can reason, verify, and produce structured outputs. By wiring together small agents, OpenAI models, and practical data-extraction utilities, we show how a single agent can explore sources, cross-check claims, and synthesize findings into professional-grade Markdown and DOCX reports.

%pip -q install -U smolagents openai trafilatura duckduckgo-search pypdf pymupdf python-docx pillow tqdm

import os, re, json, getpass
from typing import List, Dict, Any
import requests
import trafilatura
from duckduckgo_search import DDGS
from pypdf import PdfReader
import fitz
from docx import Document
from docx.shared import Pt
from datetime import datetime

from openai import OpenAI
from smolagents import CodeAgent, OpenAIModel, tool

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Paste your OpenAI API key (hidden): ").strip()
print("OPENAI_API_KEY set:", "YES" if os.environ.get("OPENAI_API_KEY") else "NO")

if not os.environ.get("SERPER_API_KEY"):
    serper = getpass.getpass("Optional: Paste SERPER_API_KEY for Google results (press Enter to skip): ").strip()
    if serper:
        os.environ["SERPER_API_KEY"] = serper
print("SERPER_API_KEY set:", "YES" if os.environ.get("SERPER_API_KEY") else "NO")

client = OpenAI()

def _now():
    return datetime.utcnow().strftime("%Y-%m-%d %H:%M:%SZ")

def _safe_filename(s: str) -> str:
    s = re.sub(r"[^a-zA-Z0-9._-]+", "_", s).strip("_")
    return s[:180] if s else "file"

We set up the full execution environment and securely load all required credentials without hardcoding secrets. We import all dependencies required for web search, document parsing, vision analysis, and agent orchestration. We also initialize shared utilities to standardize timestamps and file naming throughout the workflow.

try:
    from google.colab import files
    os.makedirs("/content/pdfs", exist_ok=True)
    uploaded = files.upload()
    for name, data in uploaded.items():
        if name.lower().endswith(".pdf"):
            with open(f"/content/pdfs/{name}", "wb") as f:
                f.write(data)
    print("PDFs in /content/pdfs:", os.listdir("/content/pdfs"))
except Exception as e:
    print("Upload skipped:", str(e))

def web_search(query: str, k: int = 6) -> List[Dict[str, str]]:
    serper_key = os.environ.get("SERPER_API_KEY", "").strip()
    if serper_key:
        resp = requests.post(
            "https://google.serper.dev/search",
            headers={"X-API-KEY": serper_key, "Content-Type": "application/json"},
            json={"q": query, "num": k},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        out = []
        for item in (data.get("organic") or [])[:k]:
            out.append({
                "title": item.get("title", ""),
                "url": item.get("link", ""),
                "snippet": item.get("snippet", ""),
            })
        return out

    # Fallback: free DuckDuckGo search when no Serper key is configured
    out = []
    with DDGS() as ddgs:
        for r in ddgs.text(query, max_results=k):
            out.append({
                "title": r.get("title", ""),
                "url": r.get("href", ""),
                "snippet": r.get("body", ""),
            })
    return out

def fetch_url_text(url: str) -> Dict[str, Any]:
    try:
        downloaded = trafilatura.fetch_url(url, timeout=30)
        if not downloaded:
            return {"url": url, "ok": False, "error": "fetch_failed", "text": ""}
        text = trafilatura.extract(downloaded, include_comments=False, include_tables=True)
        if not text:
            return {"url": url, "ok": False, "error": "extract_failed", "text": ""}
        title_guess = next((ln.strip() for ln in text.splitlines() if ln.strip()), "")[:120]
        return {"url": url, "ok": True, "title_guess": title_guess, "text": text}
    except Exception as e:
        return {"url": url, "ok": False, "error": str(e), "text": ""}

We enable local PDF ingestion and establish a flexible web search pipeline that works with or without a paid search API. We show how we gracefully handle optional inputs while maintaining a reliable research flow. We also implement robust URL fetching and text extraction to prepare clean source material for downstream reasoning.

import base64

def read_pdf_text(pdf_path: str, max_pages: int = 30) -> Dict[str, Any]:
    reader = PdfReader(pdf_path)
    pages = min(len(reader.pages), max_pages)
    chunks = []
    for i in range(pages):
        try:
            chunks.append(reader.pages[i].extract_text() or "")
        except Exception:
            chunks.append("")
    return {"pdf_path": pdf_path, "pages_read": pages, "text": "\n\n".join(chunks).strip()}

def extract_pdf_images(pdf_path: str, out_dir: str = "/content/extracted_images", max_pages: int = 10) -> List[str]:
    os.makedirs(out_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    saved = []
    pages = min(len(doc), max_pages)
    base = _safe_filename(os.path.basename(pdf_path).rsplit(".", 1)[0])

    for p in range(pages):
        page = doc[p]
        img_list = page.get_images(full=True)
        for img_i, img in enumerate(img_list):
            xref = img[0]
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha >= 4:  # CMYK or similar: convert to RGB before saving as PNG
                pix = fitz.Pixmap(fitz.csRGB, pix)
            img_path = os.path.join(out_dir, f"{base}_p{p+1}_img{img_i+1}.png")
            pix.save(img_path)
            saved.append(img_path)

    doc.close()
    return saved

def vision_analyze_image(image_path: str, question: str, model: str = "gpt-4.1-mini") -> Dict[str, Any]:
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")

    # The Responses API expects images as data URLs (or file IDs), not raw bytes
    resp = client.responses.create(
        model=model,
        input=[{
            "role": "user",
            "content": [
                {"type": "input_text", "text": f"Answer concisely and accurately.\n\nQuestion: {question}"},
                {"type": "input_image", "image_url": f"data:image/png;base64,{img_b64}"},
            ],
        }],
    )
    return {"image_path": image_path, "answer": resp.output_text}

We focus on deep document understanding by extracting structured text and visual artifacts from PDFs. We integrate a vision-capable model to interpret charts and figures instead of treating them as opaque images. We ensure that numerical trends and visual insights can be converted into explicit, text-based evidence.

def write_markdown(path: str, content: str) -> str:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return path

def write_docx_from_markdown(docx_path: str, md: str, title: str = "Research Report") -> str:
    os.makedirs(os.path.dirname(docx_path), exist_ok=True)
    doc = Document()
    t = doc.add_paragraph()
    run = t.add_run(title)
    run.bold = True
    run.font.size = Pt(18)
    meta = doc.add_paragraph()
    meta.add_run(f"Generated: {_now()}").italic = True
    doc.add_paragraph("")
    for line in md.splitlines():
        line = line.rstrip()
        if not line:
            doc.add_paragraph("")
            continue
        if line.startswith("# "):
            doc.add_heading(line[2:].strip(), level=1)
        elif line.startswith("## "):
            doc.add_heading(line[3:].strip(), level=2)
        elif line.startswith("### "):
            doc.add_heading(line[4:].strip(), level=3)
        elif re.match(r"^\s*[-*]\s+", line):
            p = doc.add_paragraph(style="List Bullet")
            p.add_run(re.sub(r"^\s*[-*]\s+", "", line).strip())
        else:
            doc.add_paragraph(line)
    doc.save(docx_path)
    return docx_path

# smolagents requires each @tool to document its arguments in the docstring
@tool
def t_web_search(query: str, k: int = 6) -> str:
    """Search the web and return results as a JSON string.

    Args:
        query: The search query.
        k: Maximum number of results to return.
    """
    return json.dumps(web_search(query, k), ensure_ascii=False)

@tool
def t_fetch_url_text(url: str) -> str:
    """Fetch a URL and extract its main text content as JSON.

    Args:
        url: The URL to fetch.
    """
    return json.dumps(fetch_url_text(url), ensure_ascii=False)

@tool
def t_list_pdfs() -> str:
    """List the paths of uploaded PDF files as a JSON array."""
    pdf_dir = "/content/pdfs"
    if not os.path.isdir(pdf_dir):
        return json.dumps([])
    paths = [os.path.join(pdf_dir, f) for f in os.listdir(pdf_dir) if f.lower().endswith(".pdf")]
    return json.dumps(sorted(paths), ensure_ascii=False)

@tool
def t_read_pdf_text(pdf_path: str, max_pages: int = 30) -> str:
    """Extract text from a PDF and return it as JSON.

    Args:
        pdf_path: Path to the PDF file.
        max_pages: Maximum number of pages to read.
    """
    return json.dumps(read_pdf_text(pdf_path, max_pages=max_pages), ensure_ascii=False)

@tool
def t_extract_pdf_images(pdf_path: str, max_pages: int = 10) -> str:
    """Extract images from a PDF and return their paths as JSON.

    Args:
        pdf_path: Path to the PDF file.
        max_pages: Maximum number of pages to scan.
    """
    imgs = extract_pdf_images(pdf_path, max_pages=max_pages)
    return json.dumps(imgs, ensure_ascii=False)

@tool
def t_vision_analyze_image(image_path: str, question: str) -> str:
    """Analyze an image with a vision model and return the answer as JSON.

    Args:
        image_path: Path to the image file.
        question: Question to ask about the image.
    """
    return json.dumps(vision_analyze_image(image_path, question), ensure_ascii=False)

@tool
def t_write_markdown(path: str, content: str) -> str:
    """Write Markdown content to a file and return its path.

    Args:
        path: Destination file path.
        content: Markdown content to write.
    """
    return write_markdown(path, content)

@tool
def t_write_docx_from_markdown(docx_path: str, md_path: str, title: str = "Research Report") -> str:
    """Convert a Markdown file to DOCX and return the DOCX path.

    Args:
        docx_path: Destination DOCX path.
        md_path: Source Markdown file path.
        title: Report title for the DOCX header.
    """
    with open(md_path, "r", encoding="utf-8") as f:
        md = f.read()
    return write_docx_from_markdown(docx_path, md, title=title)

We implement the full output layer by generating Markdown reports and converting them into polished DOCX documents. We expose all core capabilities as explicit tools that the agent can reason about and invoke step by step. We ensure that every transformation from raw data to final report remains deterministic and inspectable.

model = OpenAIModel(model_id="gpt-5")

agent = CodeAgent(
    tools=[
        t_web_search,
        t_fetch_url_text,
        t_list_pdfs,
        t_read_pdf_text,
        t_extract_pdf_images,
        t_vision_analyze_image,
        t_write_markdown,
        t_write_docx_from_markdown,
    ],
    model=model,
    add_base_tools=False,
    additional_authorized_imports=["json", "re", "os", "math", "datetime", "time", "textwrap"],
)

SYSTEM_INSTRUCTIONS = """
You are a Swiss Army Knife Research Agent.
"""

def run_research(topic: str):
    os.makedirs("/content/report", exist_ok=True)
    prompt = f"""{SYSTEM_INSTRUCTIONS.strip()}

Research question:
{topic}

Steps:
1) List available PDFs (if any) and decide which are relevant.
2) Do web search for the topic.
3) Fetch and extract the text of the best sources.
4) If PDFs exist, extract text and images.
5) Visually analyze figures.
6) Write a Markdown report and convert to DOCX.
"""
    return agent.run(prompt)

topic = "Build a research brief on the most reliable design patterns for tool-using agents (2024-2026), focusing on evaluation, citations, and failure modes."
out = run_research(topic)
print(out[:1500] if isinstance(out, str) else out)

try:
    from google.colab import files
    files.download("/content/report/report.md")
    files.download("/content/report/report.docx")
except Exception as e:
    print("Download skipped:", str(e))

We assemble the complete research agent and define a structured execution plan for multi-step reasoning. We guide the agent to search, analyze, synthesize, and write using a single coherent prompt. We demonstrate how the agent produces a finished research artifact that can be reviewed, shared, and reused immediately.

In conclusion, we demonstrated how a well-designed tool-using agent can function as a reliable research assistant rather than a conversational toy. We showcased how explicit tools, disciplined prompting, and step-by-step execution allow the agent to search the web, analyze documents and visuals, and generate traceable, citation-aware reports. This approach offers a practical blueprint for building trustworthy research agents that emphasize evaluation, evidence, and failure awareness, capabilities increasingly essential for real-world AI systems.

The post How to Design a Swiss Army Knife Research Agent with Tool-Using AI, Web Search, PDF Analysis, Vision, and Automated Reporting appeared first on MarkTechPost.

NVIDIA Releases DreamDojo: An Open-Source Robot World Model Trained on …

Building simulators for robots has been a long term challenge. Traditional engines require manual coding of physics and perfect 3D models. NVIDIA is changing this with DreamDojo, a fully open-source, generalizable robot world model. Instead of using a physics engine, DreamDojo ‘dreams’ the results of robot actions directly in pixels.

https://arxiv.org/pdf/2602.06949

Scaling Robotics with 44k+ Hours of Human Experience

The biggest hurdle for AI in robotics is data. Collecting robot-specific data is expensive and slow. DreamDojo solves this by learning from 44k+ hours of egocentric human videos. This dataset, called DreamDojo-HV, is the largest of its kind for world model pretraining.

It features 6,015 unique tasks across 1M+ trajectories.

The data covers 9,869 unique scenes and 43,237 unique objects.

Pretraining used 100,000 NVIDIA H100 GPU hours to build 2B and 14B model variants.

Humans have already mastered complex physics, such as pouring liquids or folding clothes. DreamDojo uses this human data to give robots a ‘common sense’ understanding of how the world works.


Bridging the Gap with Latent Actions

Human videos do not have robot motor commands. To make these videos ‘robot-readable,’ NVIDIA’s research team introduced continuous latent actions. This system uses a spatiotemporal Transformer VAE to extract actions directly from pixels.

The VAE encoder takes 2 consecutive frames and outputs a 32-dimensional latent vector.

This vector represents the most critical motion between frames.

The design creates an information bottleneck that disentangles action from visual context.

This allows the model to learn physics from humans and apply them to different robot bodies.
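The paper's spatiotemporal Transformer VAE is far larger, but the bottleneck idea above can be sketched in a few lines of NumPy: project the difference between two consecutive frames down to a 32-dimensional vector, so only motion-relevant information survives. The frame size and the fixed random projection here are illustrative assumptions, not the actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two consecutive "frames" (toy 64x64 grayscale images, flattened)
frame_t = rng.random((64 * 64,))
frame_t1 = rng.random((64 * 64,))

# Illustrative encoder: a fixed linear projection of the frame difference
# down to a 32-dim latent action (the information bottleneck).
W_enc = rng.standard_normal((32, 64 * 64)) / np.sqrt(64 * 64)
latent_action = W_enc @ (frame_t1 - frame_t)

print(latent_action.shape)  # (32,)
```

In the real model, this latent serves as a hardware-agnostic action label, which is what lets human video stand in for robot demonstrations.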


Better Physics through Architecture

DreamDojo is based on the Cosmos-Predict2.5 latent video diffusion model. It uses the WAN2.2 tokenizer, which has a temporal compression ratio of 4. The team improved the architecture with 3 key features:

Relative Actions: The model uses joint deltas instead of absolute poses. This makes it easier for the model to generalize across different trajectories.

Chunked Action Injection: It injects 4 consecutive actions into each latent frame. This aligns the actions with the tokenizer’s compression ratio and fixes causality confusion.

Temporal Consistency Loss: A new loss function matches predicted frame velocities to ground-truth transitions. This reduces visual artifacts and keeps objects physically consistent.
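The exact loss is defined in the paper; the velocity-matching idea reduces to something like the following sketch, where "velocity" is simply the frame-to-frame difference. The array shapes and the mean-squared penalty are assumptions for illustration:

```python
import numpy as np

def temporal_consistency_loss(pred_frames: np.ndarray, gt_frames: np.ndarray) -> float:
    """Penalize mismatch between predicted and ground-truth frame-to-frame velocities."""
    v_pred = pred_frames[1:] - pred_frames[:-1]  # predicted velocities
    v_gt = gt_frames[1:] - gt_frames[:-1]        # ground-truth transitions
    return float(np.mean((v_pred - v_gt) ** 2))

gt = np.cumsum(np.ones((5, 8, 8)), axis=0)        # 5 frames moving at constant "speed"
print(temporal_consistency_loss(gt, gt))          # 0.0 -- identical velocities
print(temporal_consistency_loss(gt * 2, gt) > 0)  # True -- mismatched velocities
```

Matching velocities rather than raw pixels is what keeps objects moving coherently between frames instead of flickering.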

Distillation for 10.81 FPS Real-Time Interaction

A simulator is only useful if it is fast. Standard diffusion models require too many denoising steps for real-time use. NVIDIA team used a Self Forcing distillation pipeline to solve this.

The distillation training was conducted on 64 NVIDIA H100 GPUs.

The ‘student’ model reduces denoising from 35 steps down to 4 steps.

The final model achieves a real-time speed of 10.81 FPS.

It is stable for continuous rollouts of 60 seconds (600 frames).
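A quick consistency check on those numbers: a 60-second rollout of 600 frames implies a 10 FPS playback rate, which the distilled model's 10.81 FPS generation speed can just sustain in real time:

```python
rollout_seconds = 60
rollout_frames = 600
generation_fps = 10.81  # distilled model throughput reported above

playback_fps = rollout_frames / rollout_seconds
print(playback_fps)                    # 10.0
print(generation_fps >= playback_fps)  # True: generation keeps up with playback
```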

Unlocking Downstream Applications

DreamDojo’s speed and accuracy enable several advanced applications for AI engineers.

1. Reliable Policy Evaluation

Testing robots in the real world is risky. DreamDojo acts as a high-fidelity simulator for benchmarking.

Its simulated success rates show a Pearson correlation of r = 0.995 with real-world results.

The Mean Maximum Rank Violation (MMRV) is only 0.003.
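Pearson correlation between simulated and real success rates is straightforward to reproduce; the success-rate values below are made up for illustration, not taken from the paper:

```python
import math

def pearson_r(xs, ys):
    # Standard Pearson correlation: covariance over the product of standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical policy success rates: simulated vs. real-world
sim = [0.42, 0.55, 0.61, 0.73, 0.88]
real = [0.40, 0.57, 0.60, 0.75, 0.86]
print(round(pearson_r(sim, real), 3))
```

A value near 1.0 means the simulator ranks and scores policies almost exactly as the real world does, which is what makes it usable for benchmarking.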

2. Model-Based Planning

Robots can use DreamDojo to ‘look ahead.’ A robot can simulate multiple action sequences and pick the best one.

In a fruit-packing task, this improved real-world success rates by 17%.

Compared to random sampling, it provided a 2x increase in success.

3. Live Teleoperation

Developers can teleoperate virtual robots in real time. NVIDIA team demonstrated this using a PICO VR controller and a local desktop with an NVIDIA RTX 5090. This allows for safe and rapid data collection.

Summary of Model Performance

Metric               | DreamDojo-2B | DreamDojo-14B
Physics Correctness  | 62.50%       | 73.50%
Action Following     | 63.45%       | 72.55%
FPS (Distilled)      | 10.81        | N/A

NVIDIA has released all weights, training code, and evaluation benchmarks. This open-source release allows you to post-train DreamDojo on your own robot data today.

Key Takeaways

Massive Scale and Diversity: DreamDojo is pretrained on DreamDojo-HV, the largest egocentric human video dataset to date, featuring 44,711 hours of footage across 6,015 unique tasks and 9,869 scenes.

Unified Latent Action Proxy: To overcome the lack of action labels in human videos, the model uses continuous latent actions extracted via a spatiotemporal Transformer VAE, which serves as a hardware-agnostic control interface.

Optimized Training and Architecture: The model achieves high-fidelity physics and precise controllability by utilizing relative action transformations, chunked action injection, and a specialized temporal consistency loss.

Real-Time Performance via Distillation: Through a Self Forcing distillation pipeline, the model is accelerated to 10.81 FPS, enabling interactive applications like live teleoperation and stable, long-horizon simulations for over 1 minute.

Reliable for Downstream Tasks: DreamDojo functions as an accurate simulator for policy evaluation, showing a 0.995 Pearson correlation with real-world success rates, and can improve real-world performance by 17% when used for model-based planning.

Check out the Paper and Codes.
The post NVIDIA Releases DreamDojo: An Open-Source Robot World Model Trained on 44,711 Hours of Real-World Human Video Data appeared first on MarkTechPost.

Amazon SageMaker AI in 2025, a year in review part 1: Flexible Trainin …

In 2025, Amazon SageMaker AI saw dramatic improvements to core infrastructure offerings along four dimensions: capacity, price performance, observability, and usability. In this series of posts, we discuss these various improvements and their benefits. In Part 1, we discuss capacity improvements with the launch of Flexible Training Plans. We also describe improvements to price performance for inference workloads. In Part 2, we discuss enhancements made to observability, model customization, and model hosting.
Flexible Training Plans for SageMaker
SageMaker AI Training Plans now support inference endpoints, extending a powerful capacity reservation capability originally designed for training workloads to address the critical challenge of GPU availability for inference deployments. Deploying large language models (LLMs) for inference requires reliable GPU capacity, especially during critical evaluation periods, limited-duration production testing, or predictable burst workloads. Capacity constraints can delay deployments and impact application performance, particularly during peak hours when on-demand capacity becomes unpredictable. Training Plans can help solve this problem by making it possible to reserve compute capacity for specified time periods, facilitating predictable GPU availability precisely when teams need it most.
The reservation workflow is designed for simplicity and flexibility. You begin by searching for available capacity offerings that match your specific requirements—selecting instance type, quantity, duration, and desired time window. When you identify a suitable offering, you can create a reservation that generates an Amazon Resource Name (ARN), which serves as the key to your guaranteed capacity. The upfront, transparent pricing model helps support accurate budget planning while minimizing concerns about infrastructure availability, so teams can focus on their evaluation metrics and model performance rather than worrying about whether capacity will be available when they need it.
Throughout the reservation lifecycle, teams maintain operational flexibility to manage their endpoints as requirements evolve. You can update endpoints to new model versions while maintaining the same reserved capacity, using iterative testing and refinement during evaluation periods. Scaling capabilities help teams adjust instance counts within their reservation limits, supporting scenarios where initial deployments are conservative, but higher throughput testing becomes necessary. This flexibility helps make sure teams aren’t locked into rigid infrastructure decisions while still being able to benefit from the reserved capacity during critical time windows.
With support for endpoint updates, scaling capabilities, and seamless capacity management, Training Plans help give you control over both GPU availability and costs for time-bound inference workloads. Whether you’re running competitive model benchmarks to select the best-performing variant, performing limited-duration A/B tests to validate model improvements, or handling predictable traffic spikes during product launches, Training Plans for inference endpoints help provide the capacity guarantees teams need with transparent, upfront pricing. This approach is particularly valuable for data science teams conducting week-long or month-long evaluation projects, where the ability to reserve specific GPU instances in advance minimizes the uncertainty of on-demand availability and enables more predictable project timelines and budgets.
For more information, see Amazon SageMaker AI now supports Flexible Training Plans capacity for Inference.
Price performance
Enhancements made to SageMaker AI in 2025 help optimize inference economics through four key capabilities. Flexible Training Plans extend to inference endpoints with transparent upfront pricing. Inference components add Multi-AZ availability and parallel model copy placement during scaling that help accelerate deployment. EAGLE-3 speculative decoding delivers throughput improvements on inference requests. Dynamic multi-adapter inference enables on-demand loading of LoRA adapters.
Improvements to inference components
Generative models only start delivering value when they’re serving predictions in production. As applications scale, inference infrastructure must be as dynamic and reliable as the models themselves. That’s where SageMaker AI inference components come in. Inference components provide a modular way to manage model inference within an endpoint. Each inference component represents a self-contained unit of compute, memory, and model configuration that can be independently created, updated, and scaled. This design helps you operate production endpoints with greater flexibility. You can deploy multiple models, adjust capacity quickly, and roll out updates safely without redeploying the entire endpoint. For teams running real-time or high-throughput applications, inference components help bring fine-grained control to inference workflows. In the following sections, we review three major enhancements to SageMaker AI inference components that make them even more powerful in production environments. These updates add Multi-AZ high availability, controlled concurrency for multi-tenant workloads, and parallel scaling for faster response to traffic surges. Together, they help make running AI at scale more resilient, predictable, and efficient.
Building resilience with Multi-AZ high availability
Production systems face the same truth: failures happen. A single hardware fault, network issue, or Availability Zone outage can disrupt inference traffic and affect user experience. Now, SageMaker AI inference components automatically distribute workloads across multiple Availability Zones. You can run multiple inference component copies per Availability Zone, and SageMaker AI helps intelligently route traffic to instances that are healthy and have available capacity. This distribution adds fault tolerance at every layer of your deployment.
Multi-AZ high availability offers the following benefits:

Minimizes single points of failure by spreading inference workloads across Availability Zones
Automatically fails over to healthy instances when issues occur
Keeps uptime high to meet strict SLA requirements
Enables balanced cost and resilience through flexible deployment patterns

For example, a financial services company running real-time fraud detection can benefit from this feature. By deploying inference components across three Availability Zones, traffic can seamlessly redirect to the remaining Availability Zones if one goes offline, helping facilitate uninterrupted fraud detection when reliability matters most.
Parallel scaling and NVMe caching
Traffic patterns in production are rarely steady. One moment your system is quiet; the next, it’s flooded with requests. Previously, scaling inference components happened sequentially—each new model copy waited for the previous one to initialize before starting. During spikes, this sequential process could add several minutes of latency. With parallel scaling, SageMaker AI can now deploy multiple inference component copies simultaneously when an instance and the required resources are available. This helps shorten the time required to respond to traffic surges and improves responsiveness for variable workloads. For example, if an instance needs three model copies, they now deploy in parallel instead of waiting on one another. Parallel scaling helps accelerate the deployment of model copies onto inference components but does not accelerate the scaling up of models when traffic increases beyond provisioned capacity. NVMe caching helps accelerate model scaling for already provisioned inference components by caching model artifacts and images. NVMe caching’s ability to reduce scaling times helps reduce inference latency during traffic spikes, lower idle costs through faster scale-down, and provide greater elasticity for serving unpredictable or volatile workloads.
EAGLE-3
SageMaker AI has introduced Extrapolation Algorithm for Greater Language-model Efficiency (EAGLE)-based adaptive speculative decoding to help accelerate generative AI inference. This enhancement supports six model architectures and helps you optimize performance using either SageMaker-provided datasets or your own application-specific data for highly adaptive, workload-specific results. The solution streamlines the workflow from optimization job creation through deployment, making it seamless to deliver low-latency generative AI applications at scale without compromising generation quality. EAGLE works by predicting future tokens directly from the model’s hidden layers rather than relying on an external draft model, resulting in more accurate predictions and fewer rejections. SageMaker AI automatically selects between EAGLE-2 and EAGLE-3 based on the model architecture, with launch support for LlamaForCausalLM, Qwen3ForCausalLM, Qwen3MoeForCausalLM, Qwen2ForCausalLM, GptOssForCausalLM (EAGLE-3), and Qwen3NextForCausalLM (EAGLE-2). You can train EAGLE models from scratch, retrain existing models, or use pre-trained models from SageMaker JumpStart, with the flexibility to iteratively refine performance using your own curated datasets collected through features like Data Capture. The optimization workflow integrates seamlessly with existing SageMaker AI infrastructure through familiar APIs (create_model, create_endpoint_config, create_endpoint) and supports widely used training data formats, including ShareGPT and OpenAI chat and completions. Benchmark results are automatically generated during optimization jobs, providing clear visibility into performance improvements across metrics like Time to First Token (TTFT) and throughput, with trained EAGLE models showing significant gains over both base models and EAGLE models trained only on built-in datasets.
To run an EAGLE-3 optimization job, run the following command in the AWS Command Line Interface (AWS CLI):

aws sagemaker create-optimization-job \
    --region us-west-2 \
    --optimization-job-name <job-name> \
    --account-id <account-id> \
    --deployment-instance-type ml.p5.48xlarge \
    --max-instance-count 10 \
    --model-source '{
        "SageMakerModel": { "ModelName": "Created Model name" }
    }' \
    --optimization-configs '{
        "ModelSpeculativeDecodingConfig": {
            "Technique": "EAGLE",
            "TrainingDataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "Enter custom train data location"
            }
        }
    }' \
    --output-config '{
        "S3OutputLocation": "Enter optimization output location"
    }' \
    --stopping-condition '{"MaxRuntimeInSeconds": 432000}' \
    --role-arn "Enter Execution Role ARN"
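After the job is submitted, you can poll it until it reaches a terminal state. The helper below is a sketch: it assumes the boto3 SageMaker client's describe_optimization_job call, and the status strings shown should be verified against the DescribeOptimizationJob API reference.

```python
import time

# Status values assumed to indicate the job is still running -- verify
# against the DescribeOptimizationJob API reference.
IN_PROGRESS = {'STARTING', 'INPROGRESS', 'STOPPING'}

def wait_for_optimization_job(sm_client, job_name, poll_seconds=60):
    """Poll the optimization job until it leaves the in-progress states."""
    while True:
        resp = sm_client.describe_optimization_job(OptimizationJobName=job_name)
        status = resp['OptimizationJobStatus']
        if status not in IN_PROGRESS:
            return status
        time.sleep(poll_seconds)
```

You would pass the usual boto3 client, for example `wait_for_optimization_job(boto3.client('sagemaker'), '<job-name>')`.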

For more details, see Amazon SageMaker AI introduces EAGLE based adaptive speculative decoding to accelerate generative AI inference.
Dynamic multi-adapter inference on SageMaker AI Inference
SageMaker AI enhanced the efficient multi-adapter inference capability introduced at re:Invent 2024: it now supports dynamic loading and unloading of LoRA adapters during inference invocations rather than pinning them at endpoint creation. This enhancement helps optimize resource utilization for on-demand model hosting scenarios.
Previously, the adapters were downloaded to disk and loaded into memory during the CreateInferenceComponent API call. With dynamic loading, adapters are registered using a lightweight, synchronous CreateInferenceComponent API, then downloaded and loaded into memory only when first invoked. This approach supports use cases where you can register thousands of fine-tuned adapters per endpoint while maintaining low-latency inference.
The system implements intelligent memory management, evicting least popular models during resource constraints. When memory reaches capacity—controlled by the SAGEMAKER_MAX_NUMBER_OF_ADAPTERS_IN_MEMORY environment variable—the system automatically unloads inactive adapters to make room for newly requested ones. Similarly, when disk space becomes constrained, the least recently used adapters are evicted from storage. This multi-tier caching strategy facilitates optimal resource utilization across CPU, GPU memory, and disk.
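Conceptually, the eviction behavior described above is a bounded least-recently-used (LRU) cache. The following standalone sketch is illustrative only, not SageMaker code:

```python
from collections import OrderedDict

class AdapterCache:
    """Toy LRU cache illustrating least-recently-used adapter eviction."""

    def __init__(self, max_adapters):
        self.max_adapters = max_adapters
        self._cache = OrderedDict()  # adapter name -> loaded artifact

    def get(self, name, loader):
        if name in self._cache:
            self._cache.move_to_end(name)        # mark as most recently used
        else:
            if len(self._cache) >= self.max_adapters:
                self._cache.popitem(last=False)  # evict least recently used
            self._cache[name] = loader(name)     # load on first invocation
        return self._cache[name]
```

The same principle applies at each tier (GPU memory, CPU memory, disk), with `max_adapters` playing the role of limits like SAGEMAKER_MAX_NUMBER_OF_ADAPTERS_IN_MEMORY.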
For security and compliance alignment, you can explicitly delete adapters using the DeleteInferenceComponent API. Upon deletion, SageMaker unloads the adapter from the base inference component containers and removes it from disk across the instances, facilitating the complete cleanup of customer data. The deletion process completes asynchronously with automatic retries, providing you with control over your adapter lifecycle while helping meet stringent data retention requirements.
This dynamic adapter loading capability powers the SageMaker AI serverless model customization feature, which helps you fine-tune popular AI models like Amazon Nova, DeepSeek, Llama, and Qwen using techniques like supervised fine-tuning, reinforcement learning, and direct preference optimization. When you complete fine-tuning through the serverless customization interface, the output LoRA adapter weights flow seamlessly to deployment—you can deploy to SageMaker AI endpoints using multi-adapter inference components. The hosting configurations from training recipes automatically include the appropriate dynamic loading settings, helping make sure customized models can be deployed efficiently without requiring you to manage infrastructure or load the adapters at endpoint creation time.
The following steps illustrate how you can use this feature in practice:

Create a base inference component with your foundation model:

import boto3

sagemaker = boto3.client('sagemaker')

# Create base inference component with foundation model
response = sagemaker.create_inference_component(
    InferenceComponentName='llama-base-ic',
    EndpointName='my-endpoint',
    Specification={
        'Container': {
            'Image': 'your-container-image',
            'Environment': {
                'SAGEMAKER_MAX_NUMBER_OF_ADAPTERS_IN_MEMORY': '10'
            }
        },
        'ComputeResourceRequirements': {
            'NumberOfAcceleratorDevicesRequired': 2,
            'MinMemoryRequiredInMb': 16384
        }
    }
)

Register Your LoRA adapters:

# Register adapter - completes in under 1 second
response = sagemaker.create_inference_component(
    InferenceComponentName='my-custom-adapter',
    EndpointName='my-endpoint',
    Specification={
        'BaseInferenceComponentName': 'llama-base-ic',
        'Container': {
            'ArtifactUrl': 's3://amzn-s3-demo-bucket/adapters/customer-support/'
        }
    }
)

Invoke your adapter (it loads automatically on first use):

import json

runtime = boto3.client('sagemaker-runtime')

# Invoke with adapter - loads into memory on first call
response = runtime.invoke_endpoint(
    EndpointName='my-endpoint',
    InferenceComponentName='llama-base-ic',
    TargetModel='s3://amzn-s3-demo-bucket/adapters/customer-support/',
    ContentType='application/json',
    Body=json.dumps({'inputs': 'Your prompt here'})
)

Delete adapters when no longer needed:

sagemaker.delete_inference_component(
    InferenceComponentName='my-custom-adapter'
)

This dynamic loading capability integrates seamlessly with the existing inference infrastructure of SageMaker, supporting the same base models and maintaining compatibility with the standard InvokeEndpoint API. By decoupling adapter registration from resource allocation, you can now deploy and manage more LoRA adapters cost-effectively, paying only for the compute resources actively serving inference requests.
Conclusion
The 2025 SageMaker AI enhancements represent a significant leap forward in making generative AI inference more accessible, reliable, and cost-effective for production workloads. With Flexible Training Plans now supporting inference endpoints, you can gain predictable GPU capacity precisely when you need it—whether for critical model evaluations, limited-duration testing, or handling traffic spikes. The introduction of Multi-AZ high availability, controlled concurrency, and parallel scaling with NVMe caching for inference components helps make sure production deployments can scale rapidly while maintaining resilience across Availability Zones. The adaptive speculative decoding of EAGLE-3 delivers increased throughput without sacrificing output quality, and dynamic multi-adapter inference helps teams efficiently manage more fine-tuned LoRA adapters on a single endpoint. Together, these capabilities help reduce the operational complexity and infrastructure costs of running AI at scale, so teams can focus on delivering value through their models rather than managing underlying infrastructure.
These improvements directly address some of the most pressing challenges facing AI practitioners today: securing reliable compute capacity, achieving low-latency inference at scale, and managing the growing complexity of multi-model deployments. By combining transparent capacity reservations, intelligent resource management, and performance optimizations that help deliver measurable throughput gains, SageMaker AI helps organizations deploy generative AI applications with confidence. The seamless integration between model customization and deployment—where fine-tuned adapters flow directly from training to production hosting—further helps accelerate the journey from experimentation to production.
Ready to accelerate your generative AI inference workloads? Explore Flexible Training Plans for inference endpoints to secure GPU capacity for your next evaluation cycle, implement EAGLE-3 speculative decoding to help boost throughput on your existing deployments, or use dynamic multi-adapter inference to more efficiently serve customized models. Refer to the Amazon SageMaker AI Documentation to get started, and stay tuned for Part 2 of this series, where we will dive into observability and model customization improvements. Share your experiences and questions in the comments—we’d love to hear how these capabilities are transforming your AI workloads.

About the authors
Dan Ferguson is a Sr. Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.
Dmitry Soldatkin is a Senior Machine Learning Solutions Architect at AWS, helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. He has a passion for continuous innovation and using data to drive business outcomes. Prior to joining AWS, Dmitry was an architect, developer, and technology leader in data analytics and machine learning fields in the financial services industry.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Sadaf Fardeen leads the Inference Optimization charter for SageMaker. She owns optimization and development of LLM inference containers on SageMaker.
Suma Kasa is an ML Architect with the SageMaker Service team focusing on the optimization and development of LLM inference containers on SageMaker.
Ram Vegiraju is an ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.
Deepti Ragha is a Senior Software Development Engineer on the Amazon SageMaker AI team, specializing in ML inference infrastructure and model hosting optimization. She builds features that improve deployment performance, reduce inference costs, and make ML accessible to organizations of all sizes. Outside of work, she enjoys traveling, hiking, and gardening.

Amazon SageMaker AI in 2025, a year in review part 2: Improved observa …

In 2025, Amazon SageMaker AI made several improvements designed to help you train, tune, and host generative AI workloads. In Part 1 of this series, we discussed Flexible Training Plans and price performance improvements made to inference components.
In this post, we discuss enhancements made to observability, model customization, and model hosting. These improvements enable a whole new class of customer use cases to be hosted on SageMaker AI.
Observability
The observability enhancements made to SageMaker AI in 2025 help deliver enhanced visibility into model performance and infrastructure health. Enhanced metrics provide granular, instance-level and container-level tracking of CPU, memory, GPU utilization, and invocation performance with configurable publishing frequencies, so teams can diagnose latency issues and resource inefficiencies that were previously hidden by endpoint-level aggregation. Rolling updates for inference components help transform deployment safety by alleviating the need for duplicate infrastructure provisioning—updates deploy in configurable batches with integrated Amazon CloudWatch alarm monitoring that triggers automatic rollbacks if issues are detected, facilitating zero-downtime deployments while minimizing risk through gradual validation.
Enhanced Metrics
SageMaker AI introduced enhanced metrics this year, helping deliver granular visibility into endpoint performance and resource utilization at both instance and container levels. This capability addresses a critical gap in observability, facilitating customers’ diagnosis of latency issues, invocation failures, and resource inefficiencies that were previously obscured by endpoint-level aggregation. Enhanced metrics provide instance-level tracking of CPU, memory, and GPU utilization alongside invocation performance metrics (latency, errors, throughput) with InstanceId dimensions for the SageMaker endpoints. For inference components, container-level metrics offer visibility into individual model replica resource consumption with both ContainerId and InstanceId dimensions.
You can configure metric publishing frequency, supplying near real-time monitoring for critical applications requiring rapid response. The self-service enablement through a simple MetricsConfig parameter in the CreateEndpointConfig API helps reduce time-to-insight, helping you self-diagnose performance issues. Enhanced metrics help you identify which specific instance or container requires attention, diagnose uneven traffic distribution across hosts, optimize resource allocation, and correlate performance issues with specific infrastructure resources. The feature works seamlessly with CloudWatch alarms and automatic scaling policies, providing proactive monitoring and automated responses to performance anomalies.
To enable enhanced metrics, add the MetricsConfig parameter when creating your endpoint configuration:

response = sagemaker_client.create_endpoint_config(
    EndpointConfigName='my-config',
    ProductionVariants=[{...}],
    MetricsConfig={
        'EnableEnhancedMetrics': True,
        'MetricPublishFrequencyInSeconds': 60  # Supported: 10, 30, 60, 120, 180, 240, 300
    }
)

Enhanced metrics are available across the AWS Regions for both single model endpoints and inference components, providing comprehensive observability for production AI deployments at scale.
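As one example of wiring these metrics into proactive monitoring, the helper below builds parameters for CloudWatch's put_metric_alarm on a per-instance metric. Treat the namespace, metric name, and dimension names shown here as assumptions to verify against the enhanced metrics documentation.

```python
def instance_gpu_alarm_params(endpoint_name, variant_name, instance_id):
    """Build put_metric_alarm kwargs for a per-instance GPU utilization alarm.
    Namespace, metric, and dimension names are assumptions -- verify them
    against the enhanced metrics documentation for your Region."""
    return {
        'AlarmName': f'{endpoint_name}-{instance_id}-gpu-high',
        'Namespace': '/aws/sagemaker/Endpoints',
        'MetricName': 'GPUUtilization',
        'Dimensions': [
            {'Name': 'EndpointName', 'Value': endpoint_name},
            {'Name': 'VariantName', 'Value': variant_name},
            {'Name': 'InstanceId', 'Value': instance_id},
        ],
        'Statistic': 'Average',
        'Period': 60,  # matches a 60-second publish frequency
        'EvaluationPeriods': 5,
        'Threshold': 90.0,
        'ComparisonOperator': 'GreaterThanThreshold',
        'TreatMissingData': 'notBreaching',
    }

# Usage sketch:
# boto3.client('cloudwatch').put_metric_alarm(
#     **instance_gpu_alarm_params('my-endpoint', 'AllTraffic', 'i-0123456789abcdef0'))
```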
Guardrail deployment with rolling updates
SageMaker AI introduced rolling updates for inference components, helping transform how you can deploy model updates with enhanced safety and efficiency. Traditional blue/green deployments require provisioning duplicate infrastructure, creating resource constraints—particularly for GPU-heavy workloads like large language models. Rolling updates deploy new model versions in configurable batches while dynamically scaling infrastructure, with integrated CloudWatch alarms monitoring metrics to trigger automatic rollbacks if issues are detected. This approach helps alleviate the need to provision duplicate fleets, reduces deployment overhead, and enables zero-downtime updates through gradual validation that minimizes risk while maintaining availability. For more details, see Enhance deployment guardrails with inference component rolling updates for Amazon SageMaker AI inference.
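A rolling update is driven by the deployment configuration you pass when updating the inference component. The sketch below builds a DeploymentConfig in the shape described above (batched rollout plus alarm-triggered rollback); treat the exact field names as assumptions to verify against the UpdateInferenceComponent API reference.

```python
def rolling_update_deployment_config(batch_copies, wait_seconds, alarm_names):
    """DeploymentConfig sketch for an inference component rolling update:
    deploy `batch_copies` copies per batch, wait between batches, and roll
    back automatically if any named CloudWatch alarm fires. Field names are
    assumptions -- verify against the UpdateInferenceComponent reference."""
    return {
        'RollingUpdatePolicy': {
            'MaximumBatchSize': {'Type': 'COPY_COUNT', 'Value': batch_copies},
            'WaitIntervalInSeconds': wait_seconds,
        },
        'AutoRollbackConfiguration': {
            'Alarms': [{'AlarmName': name} for name in alarm_names],
        },
    }

# Usage sketch:
# sagemaker.update_inference_component(
#     InferenceComponentName='llama-base-ic',
#     Specification={...},
#     DeploymentConfig=rolling_update_deployment_config(1, 120, ['ic-5xx-errors']),
# )
```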
Usability
SageMaker AI usability improvements focus on removing complexity and accelerating time-to-value for AI teams. Serverless model customization reduces time for infrastructure planning by automatically provisioning compute resources based on model and data size, supporting advanced techniques like reinforcement learning from verifiable rewards (RLVR) and reinforcement learning from AI feedback (RLAIF) through both UI-based and code-based workflows with integrated MLflow experiment tracking. Bidirectional streaming enables real-time, multi-modal applications by maintaining persistent connections where data flows simultaneously in both directions—helping transform use cases like voice agents and live transcription from transactional exchanges into continuous conversations. Enhanced connectivity through comprehensive AWS PrivateLink support across the Regions and IPv6 compatibility helps make sure enterprise deployments can meet strict compliance alignment requirements while future-proofing network architectures.
Serverless model customization
The new SageMaker AI serverless customization capability addresses a critical challenge faced by organizations: the lengthy and complex process of fine-tuning AI models, which traditionally takes months and requires significant infrastructure management expertise. Many teams struggle with selecting appropriate compute resources, managing the technical complexity of advanced fine-tuning techniques like reinforcement learning, and navigating the end-to-end workflow from model selection through evaluation to deployment.

This serverless solution helps remove these barriers by automatically provisioning the right compute resources based on model and data size, making it possible for teams to focus on model tuning rather than infrastructure management and helping accelerate the customization process. The solution supports popular models including Amazon Nova, DeepSeek, GPT-OSS, Llama, and Qwen, providing both UI-based and code-based customization workflows that make advanced techniques accessible to teams with varying levels of technical expertise.
The solution offers multiple advanced customization techniques, including supervised fine-tuning, direct preference optimization, RLVR, and RLAIF. Each technique helps optimize models in different ways, with selection influenced by factors such as dataset size and quality, available computational resources, task requirements, desired accuracy levels, and deployment constraints. The solution includes integrated experiment tracking through serverless MLflow for automatic logging of critical metrics without code modifications, helping teams monitor and compare model performance throughout the customization process.

Deployment flexibility is a key feature, with options to deploy to either Amazon Bedrock for serverless inference or SageMaker AI endpoints for controlled resource management. The solution includes built-in model evaluation capabilities to compare customized models against base models, an interactive playground for testing with prompts or chat mode, and seamless integration with the broader Amazon SageMaker Studio environment. This end-to-end workflow—from model selection and customization through evaluation and deployment—is handled entirely within a unified interface.
Currently available in US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland) Regions, the service operates on a pay-per-token model for both training and inference. This pricing approach helps make it cost-effective for organizations of different sizes to customize AI models without upfront infrastructure investments, and the serverless architecture helps make sure teams can scale their model customization efforts based on actual usage rather than provisioned capacity. For more information on this core capability, see New serverless customization in Amazon SageMaker AI accelerates model fine-tuning.
Bidirectional streaming
SageMaker AI introduced the bidirectional streaming capability in 2025, transforming inference from transactional exchanges into continuous conversations between users and models. This feature enables data to flow simultaneously in both directions over a single persistent connection, supporting real-time multi-modal use cases ranging from audio transcription and translation to voice agents. Unlike traditional approaches where clients send complete questions and wait for complete answers, bidirectional streaming allows speech and responses to flow concurrently—users can see results as soon as models begin generating them, and models can maintain context across continuous streams without re-sending conversation history. The implementation combines HTTP/2 and WebSocket protocols, with the SageMaker infrastructure managing efficient multiplexed connections from clients through routers to model containers.
The feature supports both bring-your-own-container implementations and partner integrations, with Deepgram serving as a launch partner offering their Nova-3 speech-to-text model through AWS Marketplace. This capability addresses critical enterprise requirements for real-time voice AI applications—particularly for organizations with strict compliance needs requiring audio processing to remain within their Amazon virtual private cloud (VPC)—while removing the operational overhead traditionally associated with self-hosted real-time AI solutions. The persistent connection approach reduces infrastructure overhead from TLS handshakes and connection management, replacing short-lived connections with efficient long-running sessions.
Developers can implement bidirectional streaming through two approaches: building custom containers that implement WebSocket protocol at ws://localhost:8080/invocations-bidirectional-stream with the appropriate Docker label (com.amazonaws.sagemaker.capabilities.bidirectional-streaming=true), or deploying pre-built partner solutions like Deepgram’s Nova-3 model directly from AWS Marketplace. The feature requires containers to handle incoming WebSocket data frames and send response frames back to SageMaker, with sample implementations available in both Python and TypeScript. For more details, see Introducing bidirectional streaming for real-time inference on Amazon SageMaker AI.
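For the bring-your-own-container path, the container advertises its streaming capability through the Docker label mentioned above. A minimal Dockerfile fragment follows; the base image and server command are placeholders, and only the LABEL line and WebSocket path come from the feature description:

```dockerfile
FROM python:3.12-slim

# Advertise bidirectional streaming support to SageMaker AI
LABEL com.amazonaws.sagemaker.capabilities.bidirectional-streaming=true

# Your server must accept WebSocket connections at
# ws://localhost:8080/invocations-bidirectional-stream
COPY server.py /opt/app/server.py
CMD ["python", "/opt/app/server.py"]
```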
IPv6 and PrivateLink
Additionally, SageMaker AI expanded its connectivity capabilities in 2025 with comprehensive PrivateLink support across Regions and IPv6 compatibility for both public and private endpoints. These enhancements significantly help improve the service’s accessibility and security posture for enterprise deployments. PrivateLink integration makes it possible to access SageMaker AI endpoints privately from your VPCs without traversing the public internet, keeping the traffic within the AWS network infrastructure. This is particularly valuable for organizations with strict compliance requirements or data residency policies that mandate private connectivity for machine learning workloads.
The addition of IPv6 support for SageMaker AI endpoints addresses the growing need for modern IP addressing as organizations transition away from IPv4. You can now access SageMaker AI services using IPv6 addresses for both public endpoints and private VPC endpoints, providing flexibility in network architecture design and future-proofing infrastructure investments. The dual-stack capability (supporting both IPv4 and IPv6) facilitates backward compatibility while helping organizations adopt IPv6 at their own pace. Combined with PrivateLink, these connectivity enhancements help make SageMaker AI more accessible and secure for diverse enterprise networking environments, from traditional on-premises data centers connecting using AWS Direct Connect to modern cloud-based architectures built entirely on IPv6.
Conclusion
The 2025 enhancements to SageMaker AI represent a significant leap forward in making generative AI workloads more observable, reliable, and accessible for enterprise customers. From granular performance metrics that pinpoint infrastructure bottlenecks to serverless customization, these improvements address the real-world challenges teams face when deploying AI at scale. The combination of enhanced observability, safer deployment mechanisms, and streamlined workflows helps empower organizations to move faster while maintaining the reliability and security standards required for production systems.
These capabilities are available now across Regions, with features like enhanced metrics, rolling updates, and serverless customization ready to help transform how you can build and deploy AI applications. Whether you’re fine-tuning models for domain-specific tasks, building real-time voice agents with bidirectional streaming, or facilitating deployment safety with rolling updates and integrated monitoring, SageMaker AI helps provide the tools to accelerate your AI journey while reducing operational complexity.
Get started today by exploring the enhanced metrics documentation, trying serverless model customization, or implementing bidirectional streaming for your real-time inference workloads. For comprehensive guidance on implementing these features, refer to the Amazon SageMaker AI Documentation or reach out to your AWS account team to discuss how these capabilities can support your specific use cases.


Integrate external tools with Amazon Quick Agents using Model Context …

Amazon Quick supports Model Context Protocol (MCP) integrations for action execution, data access, and AI agent integration. You can expose your application’s capabilities as MCP tools by hosting your own MCP server and configuring an MCP integration in Amazon Quick. Amazon Quick acts as an MCP client and connects to your MCP server endpoint to access the tools you expose. After that connection is in place, Amazon Quick AI agents and automations can invoke your tools to retrieve data and run actions in your product, using the customer’s authentication, authorization, and governance controls.
With an Amazon Quick MCP integration, you can build a repeatable integration contract: you define tools once, publish a stable endpoint, and support the same model across customers. Your customers can build AI agents and automations in Amazon Quick to analyze data, search enterprise knowledge, and run workflows across their business, and they get a way to use your product inside Amazon Quick workflows without building custom connectors for every use case.
In this post, you’ll use a six-step checklist to build a new MCP server, or to validate and adjust an existing MCP server, for Amazon Quick integration. The Amazon Quick User Guide describes the MCP client behavior and constraints; this post is a how-to guide covering the implementation details that third-party (3P) partners need to integrate with Amazon Quick using MCP.
Solution overview
Amazon Quick includes an MCP client that you configure through an integration. That integration connects to a remote MCP server, discovers the tools and data sources the server exposes, and makes them available to AI agents and automations. MCP integrations in Amazon Quick support both action execution and data access, including knowledge base creation.
Figure 1 shows how customers use Amazon Quick to invoke application capabilities, exposed as MCP tools by ISVs, enterprise systems, or custom solutions, through an MCP integration.

Figure 1. Amazon Quick MCP integration with an external MCP server that exposes application capabilities as MCP tools.

Prerequisites

An Amazon Quick Professional subscription.
An Amazon Quick user with Author or higher permissions to create action connectors.
A remote MCP server endpoint that is reachable from Amazon Quick.
An authentication approach that your MCP server supports: user authentication, service authentication, or no authentication.
A small initial set of product capabilities as APIs to be exposed as MCP tools (start with the operations your customers use most).

Checklist for Amazon Quick MCP integration readiness
Now let’s walk through the six-step process to build the integration with Amazon Quick using MCP:

Step 1: Choose your MCP server deployment model.
Step 2: Implement a remote MCP server compatible with Amazon Quick.
Step 3: Implement authentication and authorization.
Step 4: Document configuration for Amazon Quick customers.
Step 5: Register the MCP integration in Amazon Quick.
Step 6: Test your actions and setup using the out-of-the-box test action APIs tool in Amazon Quick.

Use the following steps to either build an MCP server for Amazon Quick or validate an existing server before customers connect it. Steps 1–4 cover server design, implementation, and documentation. Step 5 covers the Amazon Quick integration workflow customers run. Step 6 covers testing.
Step 1: Choose your MCP server deployment model
Decide how you will host your MCP endpoint and isolate tenants. Two common patterns work well:

Shared multi-tenant endpoint: One MCP endpoint serves multiple customers. Your authentication and authorization layer maps each request to a tenant and user, and enforces tenant isolation on every tool call.
Dedicated per-tenant endpoint: Each customer gets a unique MCP endpoint or server instance. You provision and operate a stable URL and credentials for each tenant.

Choose the model that matches your SaaS architecture and support model. If you already run a multi-tenant API tier with tenant-aware authorization, a shared MCP endpoint fits. If you need stronger isolation boundaries or separate compliance controls, dedicated endpoints limit the impact of any single tenant’s issues.
Step 2: Implement a remote MCP server compatible with Amazon Quick
Your MCP server must conform to the MCP specification and align with Amazon Quick client constraints. Focus on transport, tool definitions, and operational limits.
Transport and connectivity requirements:

Expose your MCP server over a public endpoint that is reachable from Amazon Quick. Use HTTPS for production.
Support a remote transport. Amazon Quick supports Server-Sent Events (SSE) and streamable HTTP; streamable HTTP is preferred.

Tool and resource requirements:

Define MCP tools using JSON schema so the Amazon Quick MCP client can discover them and invoke them through listTools and callTool.
Keep tool names consistent and version tool behavior intentionally. Amazon Quick treats the tool list as static after registration; administrators must reestablish the connection for server-side changes to take effect.
If your integration includes data access, expose data sources and resources so that Amazon Quick can use the sources to create knowledge bases.
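For illustration, here is a tool definition in the shape the MCP specification uses for tool discovery responses; the tool name, description, and fields are hypothetical:

```json
{
  "name": "search_invoices",
  "description": "Search invoices by customer and optional date range.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "customer_id": { "type": "string", "description": "Tenant-scoped customer ID" },
      "start_date": { "type": "string", "format": "date" },
      "end_date": { "type": "string", "format": "date" }
    },
    "required": ["customer_id"]
  }
}
```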

Amazon Quick MCP client limitations:
As of this writing, consider the following constraints when you design your server.

Each MCP operation has a fixed 300-second timeout. Operations that exceed this limit fail with HTTP 424.
Connector creation can fail if the Amazon Quick callback URI is not allow-listed by your identity provider or authorization server. See Step 3 for callback URI details.

If your application doesn’t already have an MCP server, you can:

Build and host your own MCP server using an MCP SDK that supports streamable HTTP or SSE. For MCP developer guidance, refer to the Model Context Protocol documentation. For code samples showing how to host it on AWS, see the deployment guidance GitHub repository.
Run your MCP server on Amazon Bedrock AgentCore Runtime, which supports hosting MCP servers in a managed way. For details about hosting agents or tools, see Host agent or tools with Amazon Bedrock AgentCore Runtime.
Front existing REST APIs or AWS Lambda functions with Amazon Bedrock AgentCore Gateway, which can convert APIs and services into MCP-compatible tools and expose them through gateway endpoints. For an overview, see Introducing Amazon Bedrock AgentCore Gateway.

For an end-to-end Amazon Quick example that uses AgentCore Gateway as the MCP server endpoint, refer to Connect Amazon Quick to enterprise apps and agents with MCP. Similarly, refer to Build your Custom MCP Server on AgentCore Runtime for a code sample.
Step 3: Implement authentication and authorization
Amazon Quick MCP integrations support multiple authentication patterns. Choose the pattern that matches how your customers want Amazon Quick to access your product, then enforce authorization on every tool invocation.
  User authentication:

Use OAuth 2.0 authorization code flow when Amazon Quick needs to act on behalf of individual users.
Support OAuth Dynamic Client Registration (DCR) if you want Amazon Quick to register the client automatically. If you do not support DCR, document the client ID, client secret, token URL, authorization URL, and redirect URL that customers must enter during integration setup.
Issue access tokens scoped to tenant and user, and enforce user-level role-based access control (RBAC) for every tool call.
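A minimal sketch of the tenant-plus-RBAC check that should gate every tool call. The claim names (tenant, role) and the role-to-permission mapping are assumptions for illustration; map them to whatever your token issuer actually emits.

```python
# Hypothetical role map: which MCP tools each role may invoke.
ROLE_PERMISSIONS = {
    "viewer": {"query_ticket", "list_open_tickets"},
    "agent": {"query_ticket", "list_open_tickets", "create_ticket", "update_ticket"},
}

def authorize_tool_call(claims: dict, tenant: str, tool_name: str) -> bool:
    """Enforce tenant scoping plus user-level RBAC on every invocation."""
    if claims.get("tenant") != tenant:
        return False  # token was issued for a different tenant
    return tool_name in ROLE_PERMISSIONS.get(claims.get("role", ""), set())

claims = {"sub": "user-123", "tenant": "acme", "role": "viewer"}
print(authorize_tool_call(claims, "acme", "list_open_tickets"))  # True
print(authorize_tool_call(claims, "acme", "create_ticket"))      # False
```

In a real server these claims come from a validated OAuth access token; the point is that the check runs on every callTool, not once at connection time.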

  Service authentication (service-to-service):

Use service-to-service authentication when Amazon Quick should call your MCP server as a machine client (for example, shared service accounts or backend automation).
Validate client-credential tokens on every request and enforce tenant-scoped access.

  No authentication:

Use no authentication only for public or demo MCP servers. For example, the AWS Knowledge MCP Server does not require authentication (but it is subject to rate limits).

If you front your tools with Amazon Bedrock AgentCore Gateway, Gateway validates inbound requests using OAuth-based authorization aligned with the MCP authorization specification. Gateway functions as an OAuth resource server and can work with identity providers such as Amazon Cognito, Okta, or Auth0. Gateway also supports outbound authentication to downstream APIs and secure credential storage. In this pattern, Amazon Quick authenticates to the Gateway using the authentication method you configure (for example, service-to-service OAuth), and Gateway authenticates to your downstream APIs.
Allowlist requirements for OAuth redirects (required for some IdPs)
Some identity providers block OAuth redirects unless the redirect URI is explicitly allowlisted in the OAuth client configuration. If your OAuth setup fails during integration creation, confirm that your OAuth client app allowlists the Amazon Quick redirect URI for each AWS Region where your customers use Amazon Quick:

https://us-east-1.quicksight.aws.amazon.com/sn/oauthcallback
https://us-west-2.quicksight.aws.amazon.com/sn/oauthcallback
https://ap-southeast-2.quicksight.aws.amazon.com/sn/oauthcallback
https://eu-west-1.quicksight.aws.amazon.com/sn/oauthcallback
https://us-east-1-onebox.quicksight.aws.amazon.com/sn/oauthcallback
https://us-west-2-onebox.quicksight.aws.amazon.com/sn/oauthcallback
https://ap-southeast-2-onebox.quicksight.aws.amazon.com/sn/oauthcallback
https://eu-west-1-onebox.quicksight.aws.amazon.com/sn/oauthcallback

Step 4: Document configuration for Amazon Quick customers
Before connecting to Amazon Quick, verify your server’s baseline compatibility using the MCP Inspector. This standard developer tool acts as a generic MCP client, so you can test connectivity, browse your tool catalog, and simulate tool execution in a controlled sandbox. If your server works with the Inspector, it is protocol-compliant and ready for Amazon Quick integration.
Your integration succeeds when you can authenticate to your MCP server, test your actions in the Test APIs section, and invoke those tools through chat agents and automations.
Add an Amazon Quick integration section to your product documentation that covers:

MCP server endpoint: the exact URL customers enter in the Amazon Quick MCP server endpoint field.
Authentication method: Which Amazon Quick option to choose (user authentication, service authentication, or no authentication), plus the required fields and values.
OAuth details (if used): Required scopes, roles, and any prerequisites such as allowlisting the Amazon Quick callback URI.
Network and security notes: Any allow-list requirements, data residency constraints, or compliance implications.
Tool catalog: The tools you expose, what each tool does, required permissions, and error behavior.

Step 5: Register the MCP integration in Amazon Quick
After your server is ready, your customer can create an MCP integration in the Amazon Quick console. This procedure is based on Set up MCP integration in the Amazon Quick User Guide.

Sign in to the Amazon Quick console with a user that has Author permissions or higher.
Choose Integrations.
Choose Add (+), and then choose Model Context Protocol (MCP).
On the Create integration page, enter a Name, an optional Description, and your MCP server endpoint URL. Choose Next.
Select the authentication method your server supports (user authentication or service authentication), and then enter the required configuration values. If your MCP server supports DCR, you skip the Authentication step, and the client credentials exchange happens during the sign-in step.
Choose Create and continue. Review the discovered tools and data capabilities from your MCP server, and then choose Next.
If you want other users to use the integration, share it. When you are finished, choose Done.

Amazon Quick does not poll for schema changes. If you modify tool signatures or add new capabilities, you must advise your customers to re-authenticate or refresh their integration settings to enable these updates.
Step 6: Operate, monitor, and meter your MCP server
Treat your MCP server as production API surface area. Add the operational controls you already use for your SaaS APIs, and make them tenant-aware.

Logging and observability: Log each tool invocation with tenant identifier, user identifier (when available), tool name, latency, status, and error details.
Throttling and quotas: Enforce per-tenant rate limits to protect downstream systems and return clear throttling errors.
Versioning: Coordinate tool changes with your documentation and your customers’ refresh workflow. Treat tool names and schemas as a contract.
Security operations: Support credential rotation, token revocation, and audit trails for administrative actions.
Metering (optional): Record usage per tenant (for example, tool calls or data volume) to align with your SaaS pricing or AWS Marketplace metering.
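The throttling bullet above can be sketched as a per-tenant token bucket. This is a generic pattern, not an Amazon Quick API; the rate and burst values are illustrative.

```python
import time

class TenantRateLimiter:
    """Token bucket per tenant: illustrative throttle for MCP tool calls."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate, self.burst = rate_per_sec, burst
        self.state = {}  # tenant -> (tokens, last_refill_timestamp)

    def allow(self, tenant: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self.state.get(tenant, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill
        if tokens >= 1:
            self.state[tenant] = (tokens - 1, now)
            return True
        self.state[tenant] = (tokens, now)
        return False

limiter = TenantRateLimiter(rate_per_sec=1, burst=2)
print([limiter.allow("acme", now=t) for t in (0.0, 0.0, 0.0, 1.5)])  # [True, True, False, True]
```

When `allow` returns False, return a clear throttling error (with a retry-after hint) rather than silently queuing work toward the 300-second timeout.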

Clean up
If you created an Amazon Quick MCP integration for testing, delete it when you no longer need it.
To delete an integration, follow Integration workflows in the Amazon Quick User Guide. The high-level steps are:

In the Amazon Quick console, choose Integrations.
From the integrations table, select the integration you want to remove.
From the Actions menu (three-dot menu), choose Delete integration.
In the confirmation dialog, review the integration details and any dependent resources that will be affected.
Choose Delete to confirm removal.

If you used OAuth for the integration, also revoke the Amazon Quick client in your authorization server and delete any test credentials you created.
Conclusion
Amazon Quick MCP integrations give your customers a standard way to connect AI agents and automations to your product. When you expose your capabilities as MCP tools on a remote MCP server, customers can configure the connection in the Amazon Quick console and use your tools across multiple workflows.
Start with a small set of high-value tools, design each tool call to complete within the 300-second limit, and document the exact endpoint and authentication settings customers must use. After you validate the integration workflow in Amazon Quick, expand your tool catalog and add the operational controls you use for any production API.
For next steps, review the Amazon Quick MCP documentation, then use the checklist in this post to validate your server. If you want AWS options to build and host MCP servers, refer to the AgentCore documentation and Deploying model context protocol servers on AWS.

About the authors

Ebbey Thomas
Ebbey Thomas is a Senior Worldwide Generative AI Specialist Solutions Architect at AWS. He designs and implements generative AI solutions that address specific customer business problems. He is recognized for simplifying complexity and delivering measurable business outcomes for clients. Ebbey holds a BS in Computer Engineering and an MS in Information Systems from Syracuse University.

Vishnu Elangovan
Vishnu Elangovan is a Worldwide Agentic AI Solution Architect with more than nine years of experience in Applied AI/ML and Deep Learning. He loves building and tinkering with scalable AI/ML solutions and considers himself a lifelong learner. Vishnu is a trusted thought leader in the AI/ML community, regularly speaking at leading AI conferences and sharing his expertise on Agentic AI at top-tier events.

Sonali Sahu
Sonali Sahu is leading the Generative AI Specialist Solutions Architecture team at AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.

Google AI Releases Gemini 3.1 Pro with 1 Million Token Context and 77. …

Google has officially shifted the Gemini era into high gear with the release of Gemini 3.1 Pro, the first version update in the Gemini 3 series. This release is not just a minor patch; it is a targeted strike at the ‘agentic’ AI market, focusing on reasoning stability, software engineering, and tool-use reliability.

For devs, this update signals a transition. We are moving from models that simply ‘chat’ to models that ‘work.’ Gemini 3.1 Pro is designed to be the core engine for autonomous agents that can navigate file systems, execute code, and reason through scientific problems with a success rate that now rivals—and in some cases exceeds—the industry’s most elite frontier models.

Massive Context, Precise Output

One of the most immediate technical upgrades is the handling of scale. Gemini 3.1 Pro Preview maintains a massive 1M token input context window. To put this in perspective for software engineers: you can now feed the model an entire medium-sized code repository, and it will have enough ‘memory’ to understand the cross-file dependencies without losing the plot.

However, the real news is the 65k token output limit. This 65k window is a significant jump for developers building long-form generators. Whether you are generating a 100-page technical manual or a complex, multi-module Python application, the model can now finish the job in a single turn without hitting an abrupt ‘max token’ wall.

Doubling Down on Reasoning

If Gemini 3.0 was about introducing ‘Deep Thinking,’ Gemini 3.1 is about making that thinking efficient. The performance jumps on rigorous benchmarks are notable:

| Benchmark | Score | What it measures |
| --- | --- | --- |
| ARC-AGI-2 | 77.1% | Ability to solve entirely new logic patterns |
| GPQA Diamond | 94.1% | Graduate-level scientific reasoning |
| SciCode | 58.9% | Python programming for scientific computing |
| Terminal-Bench Hard | 53.8% | Agentic coding and terminal use |
| Humanity’s Last Exam (HLE) | 44.7% | Reasoning against near-human limits |

The 77.1% on ARC-AGI-2 is the headline figure here. The Google team claims this represents more than double the reasoning performance of the original Gemini 3 Pro. This means the model is much less likely to rely on pattern matching from its training data and is more capable of ‘figuring it out’ when faced with a novel edge case in a dataset.

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/

The Agentic Toolkit: Custom Tools and ‘Antigravity‘

Google is making a clear play for the developer’s terminal. Along with the main model, they launched a specialized endpoint: gemini-3.1-pro-preview-customtools.

This endpoint is optimized for developers who mix bash commands with custom functions. In previous versions, models often struggled to prioritize which tool to use, sometimes hallucinating a search when a local file read would have sufficed. The customtools variant is specifically tuned to prioritize tools like view_file or search_code, making it a more reliable backbone for autonomous coding agents.

This release also integrates deeply with Google Antigravity, the company’s new agentic development platform. Developers can now utilize a new ‘medium’ thinking level. This allows you to toggle the ‘reasoning budget’—using high-depth thinking for complex debugging while dropping to medium or low for standard API calls to save on latency and cost.

API Breaking Changes and New File Methods

For those already building on the Gemini API, there is a small but critical breaking change. In the Interactions API v1beta, the field total_reasoning_tokens has been renamed to total_thought_tokens. This change aligns with the ‘thought signatures’ introduced in the Gemini 3 family—encrypted representations of the model’s internal reasoning that must be passed back to the model to maintain context in multi-turn agentic workflows.
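A small migration-safe accessor can absorb the rename while older responses are still in flight. The field names come from the article; the flat dict shape of the usage payload here is an assumption for illustration.

```python
def get_thought_tokens(usage: dict) -> int:
    """Read the renamed v1beta field, falling back to the pre-3.1 name.
    Assumes `usage` is a dict-like usage/metadata payload."""
    return usage.get("total_thought_tokens",
                     usage.get("total_reasoning_tokens", 0))

print(get_thought_tokens({"total_thought_tokens": 128}))   # 128
print(get_thought_tokens({"total_reasoning_tokens": 64}))  # 64 (older payload)
```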

The model’s appetite for data has also grown. Key updates to file handling include:

100MB File Limit: The previous 20MB cap for API uploads has been quintupled to 100MB.

Direct YouTube Support: You can now pass a YouTube URL directly as a media source. The model ‘watches’ the video via the URL rather than requiring a manual upload.

Cloud Integration: Support for Cloud Storage buckets and private database pre-signed URLs as direct data sources.

The Economics of Intelligence

Pricing for Gemini 3.1 Pro Preview remains aggressive. For prompts under 200k tokens, input costs are $2 per 1 million tokens, and output is $12 per 1 million. For contexts exceeding 200k, the price scales to $4 input and $18 output.
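The tiered pricing quoted above is easy to turn into a back-of-envelope estimator. One assumption here: the tier is selected by prompt (input) size; verify the exact tier rules against the official pricing page before budgeting.

```python
def gemini_31_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost from the tiered preview prices quoted in the article."""
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.0, 12.0   # $ per 1M tokens, prompts under 200k
    else:
        in_rate, out_rate = 4.0, 18.0   # $ per 1M tokens, longer contexts
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# e.g. a 100k-token prompt with a 10k-token answer:
print(round(gemini_31_pro_cost(100_000, 10_000), 4))  # 0.32
```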

When compared to competitors like Claude Opus 4.6 or GPT-5.2, Google is positioning Gemini 3.1 Pro as the ‘efficiency leader.’ According to data from Artificial Analysis, Gemini 3.1 Pro now holds the top spot on their Intelligence Index while costing roughly half as much to run as its nearest frontier peers.

Key Takeaways

Massive 1M/65K Context Window: The model maintains a 1M token input window for large-scale data and repositories, while significantly upgrading the output limit to 65k tokens for long-form code and document generation.

A Leap in Logic and Reasoning: Performance on the ARC-AGI-2 benchmark reached 77.1%, representing more than double the reasoning capability of previous versions. It also achieved a 94.1% on GPQA Diamond for graduate-level science tasks.

Dedicated Agentic Endpoints: Google introduced a specialized gemini-3.1-pro-preview-customtools endpoint. It is specifically optimized to prioritize bash commands and system tools (like view_file and search_code) for more reliable autonomous agents.

API Breaking Change: Developers must update their codebases as the field total_reasoning_tokens has been renamed to total_thought_tokens in the v1beta Interactions API to better align with the model’s internal “thought” processing.

Enhanced File and Media Handling: The API file size limit has increased from 20MB to 100MB. Additionally, developers can now pass YouTube URLs directly into the prompt, allowing the model to analyze video content without needing to download or re-upload files.

Check out the Technical details and Try it here. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Google AI Releases Gemini 3.1 Pro with 1 Million Token Context and 77.1 Percent ARC-AGI-2 Reasoning for AI Agents appeared first on MarkTechPost.

A Coding Implementation to Build Bulletproof Agentic Workflows with Py …

In this tutorial, we build a production-ready agentic workflow that prioritizes reliability over best-effort generation by enforcing strict, typed outputs at every step. We use PydanticAI to define clear response schemas, wire in tools via dependency injection, and ensure the agent can safely interact with external systems, such as a database, without breaking execution. By running everything in a notebook-friendly, async-first setup, we demonstrate how to move beyond fragile chatbot patterns toward robust agentic systems suitable for real enterprise workflows.

!pip -q install "pydantic-ai-slim[openai]" pydantic

import os, json, sqlite3
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Literal, Optional, List

from pydantic import BaseModel, Field, field_validator
from pydantic_ai import Agent, RunContext, ModelRetry

if not os.environ.get("OPENAI_API_KEY"):
    try:
        from google.colab import userdata
        os.environ["OPENAI_API_KEY"] = (userdata.get("OPENAI_API_KEY") or "").strip()
    except Exception:
        pass

if not os.environ.get("OPENAI_API_KEY"):
    import getpass
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Paste your OPENAI_API_KEY: ").strip()

assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is required."

We set up the execution environment and ensure all required libraries are available for the agent to run correctly. We securely load the OpenAI API key in a Colab-friendly way so the tutorial works without manual configuration changes. We also import all core dependencies that will be shared across schemas, tools, and agent logic.

Priority = Literal["low", "medium", "high", "critical"]
ActionType = Literal["create_ticket", "update_ticket", "query_ticket", "list_open_tickets", "no_action"]
Confidence = Literal["low", "medium", "high"]

class TicketDraft(BaseModel):
    title: str = Field(..., min_length=8, max_length=120)
    customer: str = Field(..., min_length=2, max_length=60)
    priority: Priority
    category: Literal["billing", "bug", "feature_request", "security", "account", "other"]
    description: str = Field(..., min_length=20, max_length=1000)
    expected_outcome: str = Field(..., min_length=10, max_length=250)

class AgentDecision(BaseModel):
    action: ActionType
    reason: str = Field(..., min_length=20, max_length=400)
    confidence: Confidence
    ticket: Optional[TicketDraft] = None
    ticket_id: Optional[int] = None
    follow_up_questions: List[str] = Field(default_factory=list, max_length=5)

    @field_validator("follow_up_questions")
    @classmethod
    def short_questions(cls, v):
        for q in v:
            if len(q) > 140:
                raise ValueError("Each follow-up question must be <= 140 characters.")
        return v

We define the strict data models that act as the contract between the agent and the rest of the system. We use typed fields and validation rules to guarantee that every agent response follows a predictable structure. By enforcing these schemas, we prevent malformed outputs from silently propagating through the workflow.

@dataclass
class SupportDeps:
    db: sqlite3.Connection
    tenant: str
    policy: dict

def utc_now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()

def init_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("""
        CREATE TABLE tickets (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            tenant TEXT NOT NULL,
            title TEXT NOT NULL,
            customer TEXT NOT NULL,
            priority TEXT NOT NULL,
            category TEXT NOT NULL,
            description TEXT NOT NULL,
            expected_outcome TEXT NOT NULL,
            status TEXT NOT NULL,
            created_at TEXT NOT NULL,
            updated_at TEXT NOT NULL
        );
    """)
    conn.commit()
    return conn

def seed_ticket(db: sqlite3.Connection, tenant: str, ticket: TicketDraft, status: str = "open") -> int:
    now = utc_now_iso()
    cur = db.execute(
        """
        INSERT INTO tickets
        (tenant, title, customer, priority, category, description, expected_outcome, status, created_at, updated_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """,
        (
            tenant,
            ticket.title,
            ticket.customer,
            ticket.priority,
            ticket.category,
            ticket.description,
            ticket.expected_outcome,
            status,
            now,
            now,
        ),
    )
    db.commit()
    return int(cur.lastrowid)

We construct the dependency layer and initialize a lightweight SQLite database for persistence. We model real-world runtime dependencies, such as database connections and tenant policies, and make them injectable into the agent. We also define helper functions that safely insert and manage ticket data during execution.

def build_agent(model_name: str) -> Agent[SupportDeps, AgentDecision]:
    agent = Agent(
        f"openai:{model_name}",
        output_type=AgentDecision,
        output_retries=2,
        instructions=(
            "You are a production support triage agent.\n"
            "Return an output that matches the AgentDecision schema.\n"
            "Use tools when you need DB state.\n"
            "Never invent ticket IDs.\n"
            "If the user intent is unclear, ask concise follow-up questions.\n"
        ),
    )

    @agent.tool
    def create_ticket(ctx: RunContext[SupportDeps], ticket: TicketDraft) -> int:
        deps = ctx.deps
        if ticket.priority in ("critical", "high") and deps.policy.get("require_security_phrase_for_critical", False):
            if ticket.category == "security" and "incident" not in ticket.description.lower():
                raise ModelRetry("For security high/critical, include the word 'incident' in description and retry.")
        return seed_ticket(deps.db, deps.tenant, ticket, status="open")

    @agent.tool
    def update_ticket_status(
        ctx: RunContext[SupportDeps],
        ticket_id: int,
        status: Literal["open", "in_progress", "resolved", "closed"],
    ) -> dict:
        deps = ctx.deps
        now = utc_now_iso()
        cur = deps.db.execute("SELECT id FROM tickets WHERE tenant=? AND id=?", (deps.tenant, ticket_id))
        if not cur.fetchone():
            raise ModelRetry(f"Ticket {ticket_id} not found for this tenant. Ask for the correct ticket_id.")
        deps.db.execute(
            "UPDATE tickets SET status=?, updated_at=? WHERE tenant=? AND id=?",
            (status, now, deps.tenant, ticket_id),
        )
        deps.db.commit()
        return {"ticket_id": ticket_id, "status": status, "updated_at": now}

    @agent.tool
    def query_ticket(ctx: RunContext[SupportDeps], ticket_id: int) -> dict:
        deps = ctx.deps
        cur = deps.db.execute(
            """
            SELECT id, title, customer, priority, category, status, created_at, updated_at
            FROM tickets WHERE tenant=? AND id=?
            """,
            (deps.tenant, ticket_id),
        )
        row = cur.fetchone()
        if not row:
            raise ModelRetry(f"Ticket {ticket_id} not found. Ask the user for a valid ticket_id.")
        keys = ["id", "title", "customer", "priority", "category", "status", "created_at", "updated_at"]
        return dict(zip(keys, row))

    @agent.tool
    def list_open_tickets(ctx: RunContext[SupportDeps], limit: int = 5) -> list:
        deps = ctx.deps
        limit = max(1, min(int(limit), 20))
        cur = deps.db.execute(
            """
            SELECT id, title, priority, category, status, updated_at
            FROM tickets
            WHERE tenant=? AND status IN ('open', 'in_progress')
            ORDER BY updated_at DESC
            LIMIT ?
            """,
            (deps.tenant, limit),
        )
        rows = cur.fetchall()
        return [
            {"id": r[0], "title": r[1], "priority": r[2], "category": r[3], "status": r[4], "updated_at": r[5]}
            for r in rows
        ]

    @agent.output_validator
    def validate_decision(ctx: RunContext[SupportDeps], out: AgentDecision) -> AgentDecision:
        deps = ctx.deps
        if out.action == "create_ticket" and out.ticket is None:
            raise ModelRetry("You chose create_ticket but did not provide ticket. Provide ticket fields and retry.")
        if out.action in ("update_ticket", "query_ticket") and out.ticket_id is None:
            raise ModelRetry("You chose update/query but did not provide ticket_id. Ask for ticket_id and retry.")
        if out.ticket and out.ticket.priority == "critical" and not deps.policy.get("allow_critical", True):
            raise ModelRetry("This tenant does not allow 'critical'. Downgrade to 'high' and retry.")
        return out

    return agent

This block contains the core agent logic, assembling a model-agnostic PydanticAI agent. We register typed tools for creating, querying, updating, and listing tickets, allowing the agent to interact with external state in a controlled way. We also enforce output validation so the agent can self-correct whenever its decisions violate business rules.

db = init_db()
deps = SupportDeps(
    db=db,
    tenant="acme_corp",
    policy={"allow_critical": True, "require_security_phrase_for_critical": True},
)

seed_ticket(
    db,
    deps.tenant,
    TicketDraft(
        title="Double-charged on invoice 8831",
        customer="Riya",
        priority="high",
        category="billing",
        description="Customer reports they were billed twice for invoice 8831 and wants a refund and confirmation email.",
        expected_outcome="Issue a refund and confirm resolution to customer.",
    ),
)
seed_ticket(
    db,
    deps.tenant,
    TicketDraft(
        title="App crashes on login after update",
        customer="Sam",
        priority="high",
        category="bug",
        description="After latest update, the app crashes immediately on login. Reproducible on two devices; needs investigation.",
        expected_outcome="Provide a fix or workaround and restore successful logins.",
    ),
)

agent = build_agent("gpt-4o-mini")

async def run_case(prompt: str):
    res = await agent.run(prompt, deps=deps)
    out = res.output
    print(json.dumps(out.model_dump(), indent=2))
    return out

case_a = await run_case(
    "We suspect account takeover: multiple password reset emails and unauthorized logins. "
    "Customer=Leila. Priority=critical. Open a security ticket."
)

case_b = await run_case("List our open tickets and summarize what to tackle first.")

case_c = await run_case("What is the status of ticket 1? If it's open, move it to in_progress.")

agent_alt = build_agent("gpt-4o")
alt_res = await agent_alt.run(
    "Create a feature request ticket: customer=Noah wants 'export to CSV' in analytics dashboard; priority=medium.",
    deps=deps,
)

print(json.dumps(alt_res.output.model_dump(), indent=2))

We wire everything together by seeding initial data and running the agent asynchronously, in a notebook-safe manner. We execute multiple real-world scenarios to show how the agent reasons, calls tools, and returns schema-valid outputs. We also demonstrate how easily we can swap the underlying model while keeping the same workflows and guarantees intact.

In conclusion, we showed how a type-safe agent can reason, call tools, validate its own outputs, and recover from errors without manual intervention. We kept the logic model-agnostic, allowing us to swap underlying LLMs while preserving the same schemas and tools, which is critical for long-term maintainability. Overall, we demonstrated how combining strict schema enforcement, dependency injection, and async execution closes the reliability gap in agentic AI and provides a solid foundation for building dependable production systems.

Check out the Full Codes Here.
The post A Coding Implementation to Build Bulletproof Agentic Workflows with PydanticAI Using Strict Schemas, Tool Injection, and Model-Agnostic Execution appeared first on MarkTechPost.

Zyphra Releases ZUNA: A 380M-Parameter BCI Foundation Model for EEG Da …

Brain-computer interfaces (BCIs) are finally having their ‘foundation model’ moment. Zyphra, a research lab focused on large-scale models, recently released ZUNA, a 380M-parameter foundation model specifically for EEG signals. ZUNA is a masked diffusion auto-encoder designed to perform channel infilling and super-resolution for any electrode layout. This release includes weights under an Apache-2.0 license and an MNE-compatible inference stack.

The Problem with ‘Brittle’ EEG Models

For decades, researchers have struggled with the ‘Wild West’ of EEG data. Different datasets use varying numbers of channels and inconsistent electrode positions. Most deep learning models are trained on fixed channel montages, making them fail when applied to new datasets or recording conditions. Additionally, EEG measurements are often plagued by noise from electrode shifts or subject movement.

ZUNA’s 4D Architecture: Spatial Intelligence

ZUNA solves the generalizability problem by treating brain signals as spatially grounded data. Instead of assuming a fixed grid, ZUNA injects spatiotemporal structure via a 4D rotary positional encoding (4D RoPE).

The model tokenizes multichannel EEG into short temporal windows of 0.125 seconds, or 32 samples. Each token is mapped to a 4D coordinate: its 3D scalp location (x, y, z) and its coarse-time index (t). This allows the model to process arbitrary channel subsets and positions. Because it relies on positional embeddings rather than a fixed schema, ZUNA can ‘imagine’ signal data at any point on the head where a sensor might be missing.
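The tokenization arithmetic above (0.125 s windows of 32 samples at 256 Hz, each tagged with x, y, z and a coarse-time index) can be sketched directly. ZUNA's internal token layout is not public; this only illustrates the described scheme.

```python
import numpy as np

FS = 256      # sampling rate used in the post's pipeline (Hz)
WINDOW = 32   # samples per token -> 32 / 256 = 0.125 s, as described

def tokenize_eeg(signal, positions):
    """Split (channels, samples) EEG into 0.125 s tokens tagged with 4D coordinates.
    `positions` holds one (x, y, z) scalp coordinate per channel."""
    n_ch, n_samp = signal.shape
    n_win = n_samp // WINDOW
    tokens, coords = [], []
    for c in range(n_ch):
        for t in range(n_win):
            tokens.append(signal[c, t * WINDOW:(t + 1) * WINDOW])
            coords.append((*positions[c], t))  # (x, y, z, coarse-time index)
    return np.stack(tokens), coords

sig = np.random.randn(4, FS * 5)   # 4 channels, 5 s of data
pos = np.random.randn(4, 3)        # 3D scalp coordinates per channel
toks, coords = tokenize_eeg(sig, pos)
print(toks.shape)                  # (160, 32): 4 channels x 40 windows
```

Because each token carries its own spatial coordinate, nothing in this representation assumes a fixed montage, which is what lets the model accept arbitrary channel subsets.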

https://www.zyphra.com/post/zuna

Diffusion as a Generative Engine

ZUNA uses a diffusion approach because EEG signals are continuous and real-valued. The model pairs a diffusion decoder with an encoder that stores signal information in a latent bottleneck.

During training, Zyphra used a heavy channel-dropout objective. They randomly dropped 90% of channels, replacing them with zeros in the encoder input. The model was then tasked with reconstructing these ‘masked’ signals from the information in the remaining 10% of channels. This forced the model to learn deep cross-channel correlations and a powerful internal representation of brain activity.
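The channel-dropout objective described above is simple to reproduce in outline: zero out a random 90% of channels and keep the mask so a reconstruction loss can target the dropped ones. This is a sketch of the masking step only, not Zyphra's training code.

```python
import numpy as np

def channel_dropout(eeg, drop_frac=0.9, rng=None):
    """Zero out a random fraction of channels, mimicking the described objective.
    Returns the masked input and the boolean mask of kept channels."""
    rng = np.random.default_rng(rng)
    n_ch = eeg.shape[0]
    n_keep = max(1, int(round(n_ch * (1 - drop_frac))))
    keep = np.zeros(n_ch, dtype=bool)
    keep[rng.choice(n_ch, size=n_keep, replace=False)] = True
    masked = np.where(keep[:, None], eeg, 0.0)  # dropped channels become zeros
    return masked, keep

eeg = np.random.randn(64, 1280)  # 64 channels, 5 s at 256 Hz
masked, keep = channel_dropout(eeg, 0.9, rng=0)
print(int(keep.sum()))           # 6 channels kept out of 64
```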

The Massive Data Pipeline: 2 Million Hours

Data quality is the heartbeat of any foundation model. Zyphra aggregated a harmonized corpus spanning 208 public datasets. This massive collection includes:

2 million channel-hours of EEG recordings.

Over 24 million non-overlapping 5-second samples.

A wide range of channel counts from 2 to 256 per recording.

The preprocessing pipeline standardized all signals to a common sampling rate of 256 Hz. They used MNE-Python to apply high-pass filters at 0.5 Hz and an adaptive notch filter to remove line noise. Signals were then z-score normalized to ensure zero-mean and unit-variance while preserving spatial structure.
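The final z-score step of that pipeline can be sketched per channel as follows; the MNE-Python filtering stages are omitted here, and the small epsilon is an assumption for numerical safety, not a documented pipeline detail.

```python
import numpy as np

def zscore_per_channel(eeg, eps=1e-8):
    """Zero-mean, unit-variance normalization per channel, as in the described pipeline."""
    mean = eeg.mean(axis=1, keepdims=True)
    std = eeg.std(axis=1, keepdims=True)
    return (eeg - mean) / (std + eps)

eeg = np.random.randn(8, 1280) * 50 + 10  # 8 channels with offset and scale
normed = zscore_per_channel(eeg)
print(bool(np.allclose(normed.mean(axis=1), 0, atol=1e-6)))  # True
```

Normalizing per channel removes amplitude differences between electrodes while leaving the spatial structure (which channel is which, and where it sits) intact.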

Benchmarks: Killing the Spherical Spline

For years, the industry standard for filling in missing EEG data has been spherical-spline interpolation. While splines are useful for capturing local smoothness, they have no ‘learned prior’ and fail when gaps between sensors grow too large.

ZUNA consistently outperforms spherical-spline interpolation across multiple benchmarks, including the ANPHY-Sleep dataset and the BCI2000 motor-imagery dataset. The performance gap widens significantly at higher dropout rates. In extreme 90% dropout scenarios—essentially 10x upsampling—ZUNA maintains high reconstruction fidelity while spline methods degrade sharply.


Key Takeaways

Universal Generalization: ZUNA is a 380M-parameter model that works with any EEG system, regardless of the number or position of electrodes. Unlike previous AI models limited to fixed layouts, it generalizes across diverse datasets and novel channel positions.

4D Spatiotemporal Intelligence: The model uses a 4D Rotary Positional Encoding (4D RoPE) system to map brain signals across 3D space (x, y, z) and time (t). This allows it to ‘understand’ the physical geometry of the scalp and accurately predict missing data.

Superior Channel Reconstruction: By training as a masked diffusion autoencoder, ZUNA significantly outperforms traditional spherical-spline interpolation. It excels at ‘super-resolution,’ maintaining high accuracy even when up to 90% of the brain’s signals are missing or corrupted.

Massive Training Scale: The model was trained on a harmonized corpus of 208 datasets, totaling approximately 2 million channel-hours and 24 million unique 5-second samples. This scale allows it to learn deep cross-channel correlations that simpler geometric methods miss.

Check out the Paper, Technical Details, Repo and Model Weights.
The post Zyphra Releases ZUNA: A 380M-Parameter BCI Foundation Model for EEG Data, Advancing Noninvasive Thought-to-Text Development appeared first on MarkTechPost.

Build AI workflows on Amazon EKS with Union.ai and Flyte

As artificial intelligence and machine learning (AI/ML) workflows grow in scale and complexity, it becomes harder for practitioners to organize and deploy their models. AI projects often fail to move from pilot to production, not because the models are bad, but because infrastructure and processes are fragmented and brittle, and the original pilot code base bloats under these additional requirements. This makes it difficult for data scientists and engineers to move quickly from local development to production deployment (laptop to cluster) and to reproduce the exact results they saw during the pilot.
In this post, we explain how you can use the Flyte Python SDK to orchestrate and scale AI/ML workflows. We explore how the Union.ai 2.0 system enables deployment of Flyte on Amazon Elastic Kubernetes Service (Amazon EKS), integrating seamlessly with AWS services like Amazon Simple Storage Service (Amazon S3), Amazon Aurora, AWS Identity and Access Management (IAM), and Amazon CloudWatch. We explore the solution through an AI workflow example, using the new Amazon S3 Vectors service.
Common challenges running AI/ML workflows on Kubernetes
AI/ML workflows running on Kubernetes present several orchestration challenges:

Infrastructure complexity – Provisioning the right compute resources (CPUs, GPUs, memory) dynamically across Kubernetes clusters
Experiment-to-production gap – Moving from experimentation to production often requires rebuilding pipelines in different environments
Reproducibility – Tracking data lineage, model versions, and experiment parameters to facilitate reliable results
Cost management – Using spot instances and automatic scaling efficiently while avoiding over-provisioning
Reliability – Handling failures gracefully with automatic retries, checkpointing, and recovery mechanisms

Purpose-built AI/ML tooling is essential for orchestrating complex workflows, offering specialized capabilities like intelligent caching, automatic versioning, and dynamic resource allocation that streamline development and deployment cycles.
Why Flyte/Union for Amazon EKS
With Flyte on Amazon EKS, pure-Python workflows scale from laptop to cluster with dynamic execution, reproducibility, and compute-aware orchestration. Combined with Union.ai’s managed deployment, these workflows run reliably and fully utilize Amazon EKS without the infrastructure overhead. Flyte transforms how you orchestrate AI/ML workloads on Amazon EKS, making workflows simple to build. Key factors include:

Pure Python workflows – Write orchestration logic in Python with 66% less code than traditional orchestrators, alleviating the need to learn domain-specific languages and removing barriers for ML engineers and AI developers migrating existing code
Dynamic execution – Make real-time decisions at runtime with flexible branching, loops, and conditional logic, which is essential for agentic AI systems
Reproducibility by default – Every execution is versioned, cached, and tracked with complete data lineage
Compute-aware orchestration – Dynamically provision the right compute resources for each task, from CPUs for data processing to GPUs for model training
Robustness – Pipelines can quickly recover from failures, isolate errors, and manage checkpoints without manual intervention

Union.ai 2.0 is built on Flyte, the open source, Kubernetes-based workflow orchestration system originally developed at Lyft to power mission-critical ML systems like ETA prediction, pricing, and mapping. After Flyte was open sourced in 2020 and became a Linux Foundation AI & Data project, the core engineering team founded Union.ai to deliver an enterprise-grade service purpose-built for teams running AI/ML workloads on Amazon EKS. Union.ai 2.0 reduces the complexity of managing Kubernetes infrastructure through managed operations, a multi-cloud control plane, and abstracted infrastructure management, while providing ML-based capabilities that help data scientists and engineers focus on building models with enhanced scale, speed, security, and reliability.
Additional benefits of using Union.ai 2.0 include:

Enhanced scalability – Workflows respond at runtime with flexible branching, task fanout, and real-time infrastructure scaling.
Crash-proof reliability – Automatic retries, checkpointing, and failure recovery allow workflows to stay resilient without manual intervention.
Agentic AI runtime – Union.ai is designed for long-lived agentic AI systems, supporting stateful agents and truly durable orchestration.
Compliance – For regulated industries, built-in lineage, auditability, and secure execution (SOC2, RBAC, SSO) are critical. Orchestration on Amazon EKS and Union.ai helps facilitate compliance.
Resource awareness – It offers first-class support for compute provisioning, spot instances, and automatic scaling.

Flyte and Union.ai 2.0 treat the requirements of modern orchestration as first-class: dynamic execution, fault tolerance, and resource awareness are built in, providing a more developer-friendly experience compared to 1.0.
Amazon EKS provides your compute, storage, and networking backbone. Flyte (the open source project) handles workflow orchestration. Union.ai extends Flyte with infrastructure-aware orchestration, enterprise-grade security, and turnkey scalability, giving you production-ready Flyte without the DIY setup. Both Flyte and Union.ai 2.0 run on Amazon EKS, but serve different needs, as detailed in the following table.

Feature
Open Source Flyte
Union.ai 2.0

Deployment
Self-managed on your EKS cluster
Fully managed or BYOC options

Best for
Teams with Kubernetes expertise
Teams wanting managed operations

Performance
Standard scale
10–100 times greater scale, speed, task fanout, and parallelism

Infrastructure
You manage upgrades, scaling
White-glove managed infrastructure

Enterprise features
No role-based access control
Fine-grained role-based access control, single sign-on, managed secrets, cost dashboards

Support
Community-driven
Enterprise SLA with Union.ai team

Real-time serving
Build your own
Built-in real-time inference and near real-time inference with reusable containers

Enterprises like Woven by Toyota, Lockheed Martin, Spotify, and Artera orchestrate millions of dollars of compute annually with Flyte and Union, accelerating experimentation by 25 times and cutting iteration cycles by 96%.
Both options (open source Flyte and Union.ai 2.0) integrate with the open source community, facilitating rapid feature rollout and continuous improvement.
Solution overview
Although open source Flyte provides powerful orchestration capabilities, Union.ai 2.0 delivers the same core technology with enterprise-grade management, removing the operational overhead so your team can focus on building AI applications instead of managing infrastructure. This is achieved through a hybrid architecture that combines managed simplicity with complete data control. The Regional control plane handles workflow metadata and coordination, while the Union Operator deploys directly into your EKS clusters—keeping your data, code, and secrets entirely within your AWS perimeter.
The following figure illustrates the operational flow between Union’s control plane and your data plane. The Union-managed control plane (left) orchestrates workflows through Elastic Load Balancing (ELB), storing task data in Amazon S3 and execution metadata in Aurora. Within your Amazon EKS environment (right), the data plane executes workflows that pull customer code from your container registry, access secrets from AWS Secrets Manager, and read/write data to your S3 buckets—with the execution logs flowing to both CloudWatch and the Union control plane for observability.

Union.ai 2.0’s AWS integration architecture is built on six key service components that provide end-to-end workflow management:

Control plane and data plane – The control plane operates within the Union.ai AWS account and serves as the central management interface, providing users with authentication and authorization capabilities, observation and monitoring functions, and system management tools. It also orchestrates execution placement on data plane clusters and handles cluster control and management operations. Union.ai 2.0 maintains one control plane per AWS Region, managing the Regional data planes. Available Regions for data plane deployment include us-west, us-east, eu-west, and eu-central, with ongoing expansion to additional Regions.
Data plane object store – This component stores data comprising files, directories, data frames, models, and Python-pickled types, which are passed as references and read by the control plane.
Container registry – This component contains registry data that include names of workflows, tasks, launch plans, and artifacts; input and output types for workflows and tasks; execution status, start time, end time, and duration of workflows and tasks; version information for workflows, tasks, launch plans, and artifacts; and artifact definitions. With the Union.ai 2.0 architecture, you can retain full ownership of your data and compute resources while it manages the infrastructure operations. The Union.ai 2.0 operator resides in the data plane and handles management tasks with least privilege permissions. It enables cluster lifecycle operations and provides support engineers with system-level log access and change implementation capabilities—without exposing secrets or data. Security is further strengthened through unidirectional communication: the data plane operator initiates the connections to the control plane, not the reverse.
Logging and monitoring – CloudWatch provides centralized logging and monitoring through deep integration with Flyte. The system automatically builds logging links for each execution and displays them in the console, with links pointing directly to the AWS Management Console and the specific log stream for that execution—a feature that significantly accelerates troubleshooting during failures.
Security – Security is handled through IAM roles for service accounts (IRSA), which maps the identity between Kubernetes resources and the AWS services they depend on. These configurations enable more secure, fine-grained access control for backend services, and Union.ai 2.0 adds enterprise role-based access control (RBAC) for user access control on top of these AWS security features.
Storage layer – Amazon S3 serves as the durable storage layer for workflows and data. When you register a workflow with Flyte, your code is compiled into a language-independent representation that captures the workflow definition, input, and output types. This representation is packaged and stored in Amazon S3, where FlytePropeller—Flyte’s execution engine—retrieves it to instruct the respective compute framework (such as Kubernetes or Spark) to run workflows and report status. Raw input data used to train and validate models is also stored in Amazon S3. Union.ai 2.0 now includes a new integration with Amazon S3 Vectors, enabling vector storage for Retrieval Augmented Generation (RAG), semantic search, and agentic AI workflows.

With this robust infrastructure in place, Union.ai 2.0 on Amazon EKS excels at orchestrating a wide range of AI/ML workloads. It handles large-scale model training by orchestrating distributed training pipelines across GPU clusters with automatic resource provisioning and spot instance support. For data processing, it can process petabyte-scale datasets with dynamic parallelism and efficient task fanout, scaling to 100,000 task fanouts with 50,000 concurrent actions in Union.ai 2.0. By using Union.ai 2.0 and Flyte on Amazon EKS, you can build and deploy agentic AI systems—long-running, stateful AI agents that make autonomous decisions at runtime. For production deployments, it supports real-time inference with low-latency model serving, using reusable containers for sub-100 millisecond task startup times. Throughout the entire process, Union.ai 2.0 provides comprehensive MLOps and model lifecycle management, automating everything from experimentation to production deployment with built-in versioning and rollback capabilities.
These capabilities are exemplified in specialized implementations like distributed training on AWS Trainium instances, where Flyte orchestrates large-scale training workloads on Amazon EKS.
Deployment options for Union.ai 2.0 on Amazon EKS
Union.ai 2.0 and Flyte offer three flexible deployment models for Amazon EKS, each balancing managed convenience with operational control. Select the approach that best fits your team’s expertise, compliance requirements, and development velocity:

Union BYOC (fully managed) – The fastest path to production. Union.ai 2.0 manages the infrastructure, upgrades, and scaling while your workloads run in your AWS account. This option is ideal for teams that want to focus entirely on AI development rather than infrastructure operations.
Union Self Managed – You can deploy Union.ai 2.0’s managed control plane while maintaining control of your data and compute resources in your AWS account. This option combines the benefits of managed services with data sovereignty and governance requirements.
Flyte OSS on Amazon EKS – You can deploy and operate open source Flyte directly on your EKS cluster using the AWS Cloud Development Kit (AWS CDK). This option provides maximum control and is ideal for teams with strong Kubernetes expertise who want to customize their deployment.

The Amazon EKS Blueprints for AWS CDK Union add-on helps AWS customers deploy, scale, and optimize AI/ML workloads using Union on Amazon EKS. It provides modular infrastructure as code (IaC) AWS CDK templates and curated deployment blueprints for running scalable AI workloads, including:

Model training and fine-tuning pipelines
Large language model (LLM) inference and serving
Multi-model deployment and management
Agentic AI pipeline orchestration

Union.ai 2.0 and Flyte provide IaC templates for deploying on Amazon EKS:

Terraform modules – Preconfigured modules for deploying Flyte on Amazon EKS with best practices for networking, security, and observability
AWS CDK support – AWS CDK constructs for integrating Union into existing AWS infrastructure
GitOps workflows – Support for Flux and ArgoCD for declarative infrastructure management

The Union add-on is available as of this blog’s publication, and the Flyte add-on is coming soon; keep watching the GitHub repo.
These templates automate the provisioning of EKS clusters, node groups (including GPU instances), IAM roles, S3 buckets, Aurora databases, and the required Flyte components.
Prerequisites
To start using this solution, you must have the following prerequisites:

An AWS account with appropriate permissions.
An Amazon EKS cluster running a Kubernetes version on standard support.
Required IAM roles. Using IAM roles for service accounts (IRSA), Flyte can map identity between the Kubernetes resources and the AWS services it depends on. These configurations are for the backend and do not interfere with user-control plane communication.

How Union.ai 2.0 supports Amazon S3 Vectors
As AI applications increasingly rely on vector embeddings for semantic search and RAG, Union.ai 2.0 empowers teams with Amazon S3 Vectors integration, simplifying vector data management at scale. Built into Flyte 2.0, this feature is available today. Amazon S3 Vectors delivers purpose-built, cost-optimized vector storage for semantic search and AI applications. With Amazon S3 level elasticity and durability for storing vector datasets with subsecond query performance, Amazon S3 Vectors is ideal for applications that need to build and grow vector indexes at scale. Union.ai 2.0 provides support for Amazon S3 Vectors for RAG, semantic search, and multi-agent systems. If you’re using Union.ai 2.0 today with Amazon S3 as your object store, you can start using Amazon S3 Vectors immediately with minimal configuration changes.
To set it up, use Boto3’s dedicated APIs to store and query vectors. Your existing Amazon S3 IAM roles are already in place; you just need to update their permissions.
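As an illustration of that setup, the following hypothetical helpers wrap the store and query calls. The operation and parameter names (put_vectors, query_vectors, vectorBucketName, and so on) are our reading of the S3 Vectors boto3 API and should be checked against the current boto3 documentation; the client is injected so you can pass either a real boto3.client("s3vectors") or a test stub.

```python
# Hypothetical helpers around the S3 Vectors boto3 API (assumed call shapes).
# Pass client = boto3.client("s3vectors") in production, or a stub in tests.
def store_memory(client, bucket, index, key, embedding, metadata=None):
    # Writes one vector, keyed by `key`, with optional metadata tags.
    return client.put_vectors(
        vectorBucketName=bucket,
        indexName=index,
        vectors=[{"key": key,
                  "data": {"float32": embedding},
                  "metadata": metadata or {}}],
    )

def recall(client, bucket, index, embedding, top_k=3):
    # Similarity search over the index; returns the matched vectors.
    resp = client.query_vectors(
        vectorBucketName=bucket,
        indexName=index,
        queryVector={"float32": embedding},
        topK=top_k,
        returnMetadata=True,
    )
    return resp.get("vectors", [])
```

Injecting the client keeps the workflow code testable locally, which fits Flyte’s laptop-to-cluster model: the same task code runs against a stub in unit tests and against the real service on Amazon EKS.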

By combining Flyte 2.0’s orchestration with Amazon S3 Vector support, multi-agent trading simulations can scale to hundreds of agents that learn from historical data, share industry insights, and execute coordinated strategies in real time. These architectural advantages support sophisticated AI applications like multi-agent systems that require both semantic memory and real-time coordination.
To learn more, refer to the example use case of a multi-agent trading simulation using Flyte 2.0 with Amazon S3 Vectors. In this example, you build a trading simulation featuring multiple agents that represent team members in a firm, illustrating their interactions, strategic planning, and collaborative trading activities.
Consider a multi-agent trading simulation where AI agents interact, test strategies, and continuously learn from their experiences. For realistic agent behavior, each agent must retain context from previous interactions, essentially building a memory of semantic artifacts that inform future decisions. The process includes the following steps:

After each simulation round, embed the agent’s learnings into vector representations using embedding models.
Store embeddings in Amazon S3 using Amazon S3 Vectors with appropriate metadata and tags.
During subsequent executions, retrieve relevant memories using semantic search to ground agent decisions in past experience.
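The retrieval step above reduces to a nearest-neighbor search over stored embeddings. The following pure-Python sketch stands in for an Amazon S3 Vectors similarity query; the memory texts and vectors are made up for illustration.

```python
# Semantic-memory retrieval sketch: rank stored agent memories by cosine
# similarity to a query embedding (what a vector-store query does for you).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(memories, query, top_k=2):
    """memories: list of (text, embedding); returns top_k most similar texts."""
    ranked = sorted(memories, key=lambda m: cosine(m[1], query), reverse=True)
    return [text for text, _ in ranked[:top_k]]

memories = [
    ("tech stocks rallied after earnings", [0.9, 0.1, 0.0]),
    ("energy prices fell on supply news", [0.1, 0.9, 0.0]),
    ("chip makers beat revenue estimates", [0.8, 0.2, 0.1]),
]
# Most similar memories first: the two earnings-related items outrank energy.
print(retrieve(memories, [0.85, 0.15, 0.05], top_k=2))
```

In the actual architecture, the embeddings live in Amazon S3 Vectors and the similarity search runs server-side with subsecond latency; the ranking logic is the same.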

With Flyte 2.0, your agents already run in an orchestration-aware environment. Amazon S3 becomes your vector store. It’s inexpensive, fast, and fully integrated, alleviating the need for separate vector databases. For the steps and associated code to implement the multi-agent trading simulation, refer to the GitHub repo.
In summary, this architecture helps deliver measurable advantages for production AI systems:

Reduced operational complexity – Consolidate your AI/ML orchestration and vector storage on a single environment, alleviating the need to provision, maintain, and secure separate vector database infrastructure
Significant cost savings – Amazon S3 Vectors delivers significantly lower storage costs compared to purpose-built vector databases, while providing subsecond similarity search performance at scale
Zero-friction AWS integration – Use your existing Amazon S3 infrastructure, IRSA configuration, and virtual private cloud (VPC) networking—no additional authentication layers or network configurations are required
Battle-tested scalability – Build on the 99.999999999% durability and elastic scalability of Amazon S3 to support vector datasets from gigabytes to petabytes without re-architecture

Customer success: Woven by Toyota
Toyota’s autonomous driving arm, Woven by Toyota, faced challenges orchestrating complex AI workloads for their autonomous driving technology, requiring petabyte-scale data processing and GPU-intensive training pipelines. After outgrowing their open source Flyte implementation, they migrated to Union.ai’s managed service on AWS in 2023. The impact was transformative: over 20 times faster ML iteration cycles, millions of dollars in annual cost savings through spot instance optimization, and thousands of parallel workers enabling massive scale.

“Union.ai’s wealth of expertise has enabled us to focus our efforts on key ADAS-related functionalities, move fast, and rely on Union.ai to deliver data at scale,”
– Alborz Alavian, Senior Engineering Manager at Woven by Toyota.

Read the full case study about Woven by Toyota’s migration to Union.ai.
Conclusion
Union.ai and Flyte provide the foundation for reliable, scalable AI/ML workflows on Amazon EKS, whether you are building autonomous systems, training LLMs, or orchestrating complex data pipelines. To get started, choose your path:

Enterprise ready – Deploy Union.ai through AWS Marketplace (ISVA Partner)
Resource-Aware AI Orchestration – Trial v2
Open source – Try Flyte at flyte.org
Quick start – Deploy your first AI pipeline with the AI on Amazon EKS Blueprint

About the authors
ND Ngoka is a Senior Solutions Architect at AWS with a specialized focus on AI/ML and storage technologies. He guides customers through complex architectural decisions, enabling them to build resilient, scalable solutions that drive business outcomes.
Samhita Alla is a Senior Solutions Engineer for Partnerships at Union.ai, where she leads the technical execution of strategic integrations across the AI stack, from distributed training and experiment tracking to data platform integrations. She works closely with partners and cross-functional teams to evaluate feasibility, build production-ready solutions, and deliver technical content that drives real-world adoption.
Kristy Cook is Head of Partnerships at Union.ai, where she builds strategic alliances across the AI/ML ecosystem focused on sustained growth. Having forged impactful partnerships at Meta, Yahoo, and Neustar, she brings deep expertise in operationalizing AI solutions at scale.
Jim Fratantoni is a GenAI Account Manager at AWS, focused on helping AI startups scale and co-sell with AWS. He is passionate about working with founders to jointly go to market and drive enterprise customer success.
Theo Rashid is an Applied Scientist at Amazon building probabilistic machine learning and forecasting models. He is an active open source contributor, and is passionate about open source tooling across the machine learning stack, from probabilistic programming libraries to workflow orchestration. He holds a PhD in Epidemiology and Biostatistics from Imperial College London.
Alex Fabisiak is a Senior Applied Scientist at Amazon working on applied forecasting and supply chain problems. He specializes in probabilistic and causal modeling as they relate to optimal policy decisions. He holds a PhD in Finance from UCLA.

Amazon Quick now supports key pair authentication to Snowflake data so …

Modern enterprises face significant challenges connecting business intelligence platforms to cloud data warehouses while maintaining automation. Password-based authentication introduces security vulnerabilities, operational friction, and compliance gaps, which is especially critical as Snowflake is deprecating username and password authentication.
Amazon Quick Sight (a capability of Amazon Quick Suite) now supports key pair authentication for Snowflake integrations, using asymmetric cryptography where RSA key pairs replace traditional passwords. This enhancement addresses a critical need as Snowflake moves toward deprecating password-based authentication, which requires more secure authentication methods. With this new capability, Amazon Quick Suite users can establish secure, passwordless connections to Snowflake data sources using RSA key pairs, providing a seamless and secure integration experience that meets enterprise security standards.
In this blog post, we will guide you through establishing data source connectivity between Amazon Quick Sight and Snowflake through secure key pair authentication.
Prerequisites
Before configuring key pair authentication between Amazon Quick and Snowflake, ensure that you have the following:

An active Amazon Quick account with appropriate permissions – You need administrative access to create and manage data sources, configure authentication settings, and grant permissions to users. Amazon Quick Enterprise license or Author role in Amazon Quick Enterprise Sight Edition typically provide sufficient access.
A Snowflake account with ACCOUNTADMIN, SECURITYADMIN, or USERADMIN role – These elevated permissions are essential for modifying user accounts, assigning public keys using ALTER USER commands, and granting warehouse and database permissions. If you don’t have access to these roles, contact your Snowflake administrator for assistance.
OpenSSL installed (for key generation) – This cryptographic toolkit generates RSA key pairs in PKCS#8 format. Most Linux and macOS systems include OpenSSL pre-installed. Windows users can use Windows Subsystem for Linux (WSL) or download OpenSSL separately.
(Optional) AWS Secrets Manager access (for API-based setup) – Required for programmatic configurations, you will need IAM permissions to create and manage secrets, and Amazon Quick Sight API access for automated deployments and infrastructure as code (IaC) implementations.

Solution walkthrough
We will guide you through the following essential steps to establish secure key pair authentication between Amazon Quick Sight and Snowflake:

Generate RSA Key Pair – Create public and private keys using OpenSSL with proper encryption standards
Configure Snowflake User – Assign the public key to your Snowflake user account and verify the setup
Establish Data Source Connectivity – Create your connection through either the Amazon Quick UI for interactive setup or AWS Command Line Interface (AWS CLI) for programmatic deployment

Let’s explore each step in detail and secure your Amazon Quick Sight-Snowflake connection with key pair authentication!
Generate RSA key pair:

Navigate to AWS CloudShell in the AWS Management Console and run the following command to generate the RSA private key. You will be prompted to enter an encryption passphrase. Choose a strong passphrase and store it securely; you will need it later when generating the public key.

openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8

Run the following command to derive the public key from the private key. You will be prompted to enter the passphrase that you used in the previous step.

openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub

Extract the private key content (including header and footer):

cat rsa_key.p8

This displays your private key in the format:
-----BEGIN PRIVATE KEY-----
[key content]
-----END PRIVATE KEY-----
Note: Copy the entire output including the -----BEGIN PRIVATE KEY----- and -----END PRIVATE KEY----- lines. You will use this complete private key (with headers and footers) when creating your Snowflake data source connection.

Snowflake requires the public key in a specific format without headers or line breaks. Run these commands to extract and format the key properly.

grep -v KEY rsa_key.pub | tr -d '\n' | awk '{print $1}' > pub.Key
cat pub.Key

This will display your formatted public key string. Copy this output—you will use it in the next step to configure your Snowflake user account.
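If you prefer to script this step, the same reformatting can be done in Python. This is a minimal sketch mirroring the shell pipeline above (it drops any line containing "KEY", just as `grep -v KEY` does); the example PEM content is fake.

```python
# Strip the PEM header/footer and line breaks so Snowflake accepts the
# bare base64 key string (the format RSA_PUBLIC_KEY expects).
def snowflake_public_key(pem: str) -> str:
    lines = [line for line in pem.strip().splitlines() if "KEY" not in line]
    return "".join(lines)

# Fake, truncated PEM purely for illustration:
example_pem = (
    "-----BEGIN PUBLIC KEY-----\n"
    "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8A\n"
    "MIIBCgKCAQEA7K9c3Tn5\n"
    "-----END PUBLIC KEY-----\n"
)
print(snowflake_public_key(example_pem))  # one unbroken base64 string
```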
Assign public key to Snowflake user:

Log in to Snowflake and execute the following SQL commands to assign the public key to your user:

ALTER USER <username> SET RSA_PUBLIC_KEY='<public_key_content>';

Verify the key assignment: Look for the RSA_PUBLIC_KEY property to confirm if the public key is set.

DESCRIBE USER <username>;

Establish your Snowflake connection in Amazon Quick UI:

Navigate to Amazon Quick in the AWS Management Console and select Datasets. Then select the Data sources tab and choose Create data source.

In the Create data source pane, enter “snowflake” in Search datasets, select Snowflake, and then choose Next.

In the New Snowflake data source pane, enter the data source name, then select the connection type: Public Network or Private VPC Connection. If you need a VPC connection, see configure the VPC connection in Quick.
Then, enter the database server hostname, database name, and warehouse name.
Select Authentication Type as KeyPair and then enter the username of the Snowflake user.
In the Private Key field, paste the complete output from cat rsa_key.p8 (including the BEGIN and END headers). If you have configured a passphrase during key generation, provide it in the optional Passphrase field.
After all the fields are entered, select the Validate connection button.

After the connection is validated, select the Create data source button.
Then in the Data sources list, find the snowflake data source that you created.
From the Action menu, select the Create dataset option.

Establish your Snowflake Connection using the Amazon Quick Sight API:
Using AWS CLI, create the Amazon Quick data source connection to Snowflake by executing the following command:

aws quicksight create-data-source \
--aws-account-id 123456789 \
--data-source-id awsclikeypairtest \
--name "awsclikeypairtest" \
--type SNOWFLAKE \
--data-source-parameters '{
"SnowflakeParameters": {
"Host": "hostname.snowflakecomputing.com",
"Database": "DB_NAME",
"Warehouse": "WH_NAME",
"AuthenticationType": "KEYPAIR"
}
}' \
--credentials '{
"KeyPairCredentials": {
"KeyPairUsername": "SNOWFLAKE_USERNAME",
"PrivateKey": "-----BEGIN ENCRYPTED PRIVATE KEY-----\nPRIVATE_KEY\n-----END ENCRYPTED PRIVATE KEY-----",
"PrivateKeyPassphrase": "******"
}
}' \
--permissions '[
{
"Principal": "arn:aws:quicksight:us-east-1:123456789:user/default/Admin/username",
"Actions": [
"quicksight:DescribeDataSource",
"quicksight:DescribeDataSourcePermissions",
"quicksight:PassDataSource",
"quicksight:UpdateDataSource",
"quicksight:DeleteDataSource",
"quicksight:UpdateDataSourcePermissions"
]
}
]' \
--region us-east-1

Use the following command to check the status of creation:

aws quicksight describe-data-source --region us-east-1 --aws-account-id 123456789 --data-source-id awsclikeypairtest

Initially, the status returned from the describe-data-source command will be CREATION_IN_PROGRESS. The status changes to CREATION_SUCCESSFUL when the new data source is ready for use.
Alternatively, when creating the data source programmatically via CreateDataSource, you can store the username, key and passphrase in AWS Secrets Manager and reference them using the Secret ARN.
After the data source is successfully created, you can navigate to the Quick console. In the Create a Dataset page, you can view the newly created data source connection awsclikeypairtest under the data sources list. You can then continue to create the datasets.
Cleanup
To clean up your resources to avoid incurring additional charges, follow these steps:

Delete the secret created in the AWS Secrets Manager Console.
Delete the data source connection created in Amazon Quick.

Conclusion
Key pair authentication represents a transformative advancement in securing data connectivity between Amazon Quick and Snowflake. By removing password-based vulnerabilities and embracing cryptographic authentication, organizations can achieve superior security posture while maintaining seamless automated workflows. This implementation addresses critical enterprise requirements, such as enhanced security through asymmetric encryption, streamlined service account management, and compliance with evolving authentication standards as Snowflake transitions away from traditional password methods.
Whether deploying through the intuitive Amazon Quick UI or using AWS CLI for Infrastructure as Code implementations, key pair authentication provides flexibility without compromising security. The integration with AWS Secrets Manager helps protect the private keys, while the straightforward setup process enables rapid deployment across development, staging, and production environments.
As data security continues to evolve, adopting key pair authentication positions your organization at the forefront of best practices. Business intelligence teams can now focus on extracting actionable insights from Snowflake data rather than managing authentication complexities, ultimately accelerating time-to-insight and improving operational efficiency.
For further reading, see Snowflake Key-Pair Authentication.

About the authors

Vignessh Baskaran
Vignessh Baskaran is a Sr. Technical Product Manager in the structured DATA domain in Amazon Quick powering BI and GenAI initiatives. He has 9+ years of experience in developing large-scale data and analytics solutions. Prior to this role, he worked as a Sr. Analytics Lead in AWS building comprehensive BI solutions using Quick which were globally adopted across AWS Worldwide Specialist Sales teams. Outside of work, he enjoys watching Cricket, playing Racquetball and exploring different cuisines in Seattle.

Chinnakanu Sai Janakiram
Chinnakanu Sai Janakiram is a Software Development Engineer in Amazon Quick, working on cloud infrastructure automation and feature development using AWS technologies. He has 2+ years of experience building scalable systems across AWS, CI/CD pipelines, CloudFormation, React, and Spring Boot. Prior to this role, he contributed to data and analytics solutions on AWS, improving deployment reliability and scalability across regions. Outside of work, he enjoys following Formula 1 and staying up to date with emerging technologies.

Nithyashree Alwarsamy
Nithyashree Alwarsamy is a Partner Solutions Architect at Amazon Web Services, specializing in data and analytics solutions with a focus on streaming and event-driven architecture. Leveraging deep expertise in modern data architectures, Nithyashree helps organizations unlock the full potential of their data by integrating Snowflake’s cloud-native data platform with the breadth of AWS services.

Andries Engelbrecht
Andries Engelbrecht is a Principal Partner Solutions Engineer at Snowflake working with AWS. He supports product and service integrations, as well as the development of joint solutions with AWS. Andries has over 25 years of experience in the field of data and analytics.