Beyond Simple API Requests: How OpenAI’s WebSocket Mode Changes the …

In the world of Generative AI, latency is the ultimate killer of immersion. Until recently, building a voice-enabled AI agent felt like assembling a Rube Goldberg machine: you’d pipe audio to a Speech-to-Text (STT) model, send the transcript to a Large Language Model (LLM), and finally shuttle text to a Text-to-Speech (TTS) engine. Each hop added hundreds of milliseconds of lag.

OpenAI has collapsed this stack with the Realtime API. By offering a dedicated WebSocket mode, the platform provides a direct, persistent pipe into GPT-4o’s native multimodal capabilities. This represents a fundamental shift from stateless request-response cycles to stateful, event-driven streaming.

The Protocol Shift: Why WebSockets?

The industry has long relied on standard HTTP POST requests. While streaming text via Server-Sent Events (SSE) made LLMs feel faster, it remained a one-way street once initiated. The Realtime API utilizes the WebSocket protocol (wss://), providing a full-duplex communication channel.

For a developer building a voice assistant, this means the model can ‘listen’ and ‘talk’ simultaneously over a single connection. To connect, clients point to:

wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview
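As a sketch of what that handshake looks like from a client, the helper below builds the URL and headers a WebSocket client would connect with. The bearer-auth and `OpenAI-Beta: realtime=v1` header names are assumptions based on common Realtime API client setups, not details stated in this article.

```python
import os

# Endpoint taken verbatim from the article.
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

def build_handshake(api_key: str) -> tuple[str, dict]:
    """Return the URL and headers a WebSocket client would connect with."""
    headers = {
        "Authorization": f"Bearer {api_key}",  # standard bearer auth (assumed)
        "OpenAI-Beta": "realtime=v1",          # beta opt-in header (assumed)
    }
    return REALTIME_URL, headers

url, headers = build_handshake(os.environ.get("OPENAI_API_KEY", "sk-test"))
print(url)
```

Any WebSocket client library (e.g. `websockets` in Python) can then open the connection with these values and exchange JSON-encoded events over it.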

The Core Architecture: Sessions, Responses, and Items

Understanding the Realtime API requires mastering three specific entities:

The Session: The global configuration. Through a session.update event, engineers define the system prompt, voice (e.g., alloy, ash, coral), and audio formats.

The Item: Every conversation element—a user’s speech, a model’s output, or a tool call—is an item stored in the server-side conversation state.

The Response: A command to act. Sending a response.create event tells the server to examine the conversation state and generate an answer.
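The two client events above are just JSON messages sent over the socket. The sketch below shows representative payloads; the exact session fields (instructions, voice) are a subset chosen for illustration, not the full schema.

```python
import json

def session_update(instructions: str, voice: str = "alloy") -> str:
    # session.update carries the global configuration described above;
    # "instructions" and "voice" are a representative subset of the fields.
    return json.dumps({
        "type": "session.update",
        "session": {"instructions": instructions, "voice": voice},
    })

def response_create() -> str:
    # response.create tells the server to examine the conversation state
    # (the accumulated items) and generate an answer.
    return json.dumps({"type": "response.create"})

print(session_update("You are a concise voice assistant."))
print(response_create())
```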

Audio Engineering: PCM16 and G.711

OpenAI’s WebSocket mode operates on raw audio frames encoded in Base64. It supports two primary formats:

PCM16: 16-bit Pulse Code Modulation at 24kHz (ideal for high-fidelity apps).

G.711: The 8 kHz telephony standard (μ-law and A-law), perfect for VoIP and SIP integrations.

Developers stream audio in small chunks (typically 20-100 ms) via input_audio_buffer.append events. The model then streams back response.output_audio.delta events for immediate playback.
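The chunking and Base64 encoding described above can be sketched as follows, assuming PCM16 at 24 kHz (the frame duration of 40 ms is just one choice inside the typical 20-100 ms range):

```python
import base64
import json

SAMPLE_RATE = 24_000   # PCM16 at 24 kHz, per the formats above
BYTES_PER_SAMPLE = 2   # 16-bit samples

def chunk_to_event(pcm_bytes: bytes) -> str:
    # input_audio_buffer.append carries Base64-encoded raw audio frames.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def chunks(pcm: bytes, ms: int = 40):
    # Split a PCM16 byte stream into fixed-duration frames.
    step = SAMPLE_RATE * BYTES_PER_SAMPLE * ms // 1000
    for i in range(0, len(pcm), step):
        yield pcm[i:i + step]

one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)  # 1 second of silence
events = [chunk_to_event(c) for c in chunks(one_second)]
print(len(events))  # → 25 append events for one second of audio at 40 ms frames
```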

VAD: From Silence to Semantics

A major update is the expansion of Voice Activity Detection (VAD). While standard server_vad uses silence thresholds, the new semantic_vad uses a classifier to understand if a user is truly finished or just pausing for thought. This prevents the AI from awkwardly interrupting a user who is mid-sentence, a common ‘uncanny valley’ issue in earlier voice AI.
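Switching between the two detection modes is a session-level setting. In the sketch below, the "turn_detection" field name is an assumption about the session.update schema; the two mode names come from the article.

```python
import json

def set_turn_detection(mode: str = "semantic_vad") -> str:
    # Modes per the article: "server_vad" (silence thresholds) and
    # "semantic_vad" (classifier decides whether the user is done speaking).
    assert mode in ("server_vad", "semantic_vad")
    return json.dumps({
        "type": "session.update",
        "session": {"turn_detection": {"type": mode}},
    })

print(set_turn_detection())
```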

The Event-Driven Workflow

Working with WebSockets is inherently asynchronous. Instead of waiting for a single response, you listen for a cascade of server events:

input_audio_buffer.speech_started: The model hears the user.

response.output_audio.delta: Audio snippets are ready to play.

response.output_audio_transcript.delta: Text transcripts arrive in real-time.

conversation.item.truncate: Used when a user interrupts, allowing the client to tell the server exactly where to “cut” the model’s memory to match what the user actually heard.
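A client typically routes this cascade through a small dispatcher. The sketch below assumes audio and transcript deltas arrive in a "delta" field (a common shape for streaming APIs, not confirmed by the text above) and shows one plausible reaction per event:

```python
import base64
import json

def handle_event(raw: str, audio_out: bytearray, transcript: list) -> str:
    # Dispatch on the server event names listed above.
    event = json.loads(raw)
    etype = event["type"]
    if etype == "input_audio_buffer.speech_started":
        # The user began speaking; a client might discard queued playback here.
        audio_out.clear()
    elif etype == "response.output_audio.delta":
        # Base64-encoded audio snippet, ready to play.
        audio_out.extend(base64.b64decode(event["delta"]))
    elif etype == "response.output_audio_transcript.delta":
        # Real-time text transcript of the spoken response.
        transcript.append(event["delta"])
    return etype

audio_out, transcript = bytearray(), []
sample = json.dumps({"type": "response.output_audio_transcript.delta", "delta": "Hello"})
handle_event(sample, audio_out, transcript)
print(transcript)  # → ['Hello']
```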

Key Takeaways

Full-Duplex, State-Based Communication: Unlike traditional stateless REST APIs, the WebSocket protocol (wss://) enables a persistent, bidirectional connection. This allows the model to ‘listen’ and ‘speak’ simultaneously while maintaining a live Session state, eliminating the need to resend the entire conversation history with every turn.

Native Multimodal Processing: The API bypasses the STT → LLM → TTS pipeline. By processing audio natively, GPT-4o reduces latency and can perceive and generate nuanced paralinguistic features like tone, emotion, and inflection that are typically lost in text transcription.

Granular Event Control: The architecture relies on a vocabulary of specific client and server events for real-time interaction. Key events include input_audio_buffer.append for streaming audio chunks to the model and response.output_audio.delta for receiving audio snippets, allowing for immediate, low-latency playback.

Advanced Voice Activity Detection (VAD): The transition from simple silence-based server_vad to semantic_vad allows the model to distinguish between a user pausing for thought and a user finishing their sentence. This prevents awkward interruptions and creates a more natural conversational flow.

Check out the Technical details. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our Newsletter. You can also join us on Telegram.
The post Beyond Simple API Requests: How OpenAI’s WebSocket Mode Changes the Game for Low Latency Voice Powered AI Experiences appeared first on MarkTechPost.

How to Build a Production-Grade Customer Support Automation Pipeline w …

In this tutorial, we build an advanced Griptape-based customer support automation system that combines deterministic tooling with agentic reasoning to process real-world support tickets end-to-end. We design custom tools to sanitize sensitive information, categorize issues, assign priorities with clear SLA targets, and generate structured escalation payloads, all before involving the language model. We then use a Griptape Agent to synthesize these tool outputs into professional customer replies and internal support notes, demonstrating how Griptape enables controlled, auditable, and production-ready AI workflows without relying on retrieval or external knowledge bases.

!pip -q install "griptape[all]" rich schema pandas

import os, re, json
from getpass import getpass

try:
    from google.colab import userdata
    os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
except Exception:
    pass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY: ")

We set up the execution environment by installing all required Griptape dependencies and supporting libraries. We securely load the OpenAI API key using Colab secrets or a runtime prompt to keep credentials out of the code. We ensure the notebook is ready for agent execution before any logic is defined.

tool_code = r'''
import re, json
from schema import Schema, Literal, Optional
from griptape.tools import BaseTool
from griptape.utils.decorators import activity
from griptape.artifacts import TextArtifact, ErrorArtifact

def _redact(text: str) -> str:
    text = re.sub(r"[\w\.-]+@[\w\.-]+\.\w+", "[REDACTED_EMAIL]", text)
    text = re.sub(r"\+?\d[\d\-\s\(\)]{7,}\d", "[REDACTED_PHONE]", text)
    text = re.sub(r"\b(\d{4}[\s-]?){3}\d{4}\b", "[REDACTED_CARD]", text)
    return text

class TicketOpsTool(BaseTool):
    @activity(config={"description": "Redact PII", "schema": Schema({Literal("text"): str})})
    def redact_pii(self, params: dict):
        try:
            return TextArtifact(_redact(params["values"]["text"]))
        except Exception as e:
            return ErrorArtifact(str(e))

    @activity(config={"description": "Categorize ticket", "schema": Schema({Literal("text"): str})})
    def categorize(self, params: dict):
        try:
            t = params["values"]["text"].lower()
            if any(k in t for k in ["charged", "refund", "invoice", "billing", "payment"]):
                cat = "billing"
            elif any(k in t for k in ["crash", "error", "bug", "export", "0x"]):
                cat = "bug"
            elif any(k in t for k in ["locked", "password", "login attempts", "unauthorized", "security"]):
                cat = "security"
            elif any(k in t for k in ["account", "profile", "access"]):
                cat = "account"
            else:
                cat = "other"
            return TextArtifact(cat)
        except Exception as e:
            return ErrorArtifact(str(e))

    @activity(config={"description": "Priority and SLA", "schema": Schema({Literal("category"): str, Literal("text"): str, Optional(Literal("channel"), default="web"): str})})
    def priority_and_sla(self, params: dict):
        try:
            cat = params["values"]["category"].lower()
            t = params["values"]["text"].lower()
            channel = params["values"].get("channel", "web")
            if cat == "security" or "urgent" in t or "asap" in t:
                p, sla = 1, "15 minutes"
            elif cat in ["billing", "account"]:
                p, sla = 2, "2 hours"
            elif cat == "bug":
                p, sla = 3, "1 business day"
            else:
                p, sla = 4, "3 business days"
            if channel == "chat" and p > 1:
                p = max(2, p - 1)
            return TextArtifact(json.dumps({"priority": p, "sla_target": sla}))
        except Exception as e:
            return ErrorArtifact(str(e))

    @activity(config={"description": "Escalation payload", "schema": Schema({Literal("ticket_id"): str, Literal("customer"): str, Literal("category"): str, Literal("priority"): int, Literal("sanitized_text"): str})})
    def build_escalation_json(self, params: dict):
        try:
            v = params["values"]
            payload = {
                "summary": f"[{v['category'].upper()}][P{v['priority']}] Ticket {v['ticket_id']} - {v['customer']}",
                "labels": [v["category"], f"p{v['priority']}"],
                "description": v["sanitized_text"],
                "customer": v["customer"],
                "source_ticket": v["ticket_id"]
            }
            return TextArtifact(json.dumps(payload, indent=2))
        except Exception as e:
            return ErrorArtifact(str(e))
'''

with open("/content/ticket_tools.py", "w", encoding="utf-8") as f:
    f.write(tool_code)

import importlib, sys
sys.path.append("/content")
ticket_tools = importlib.import_module("ticket_tools")
TicketOpsTool = ticket_tools.TicketOpsTool
tool = TicketOpsTool()

We implement the core operational logic by defining a custom Griptape tool inside a standalone Python module. We encode deterministic rules for PII redaction, ticket categorization, priority scoring, SLA assignment, and the generation of escalation payloads. We then import and instantiate this tool so it can be safely inspected and used by Griptape.

TICKETS = [
    {"ticket_id": "TCK-1001", "customer": "Leila", "text": "I was charged twice on my card ending 4432. Please refund ASAP. email: leila@test.com", "channel": "email", "created_at": "2026-02-01T10:14:00Z"},
    {"ticket_id": "TCK-1002", "customer": "Rohan", "text": "App crashes every time I try to export. Screenshot shows error code 0x7f. My phone: +1 514-555-0188", "channel": "chat", "created_at": "2026-02-01T10:20:00Z"},
    {"ticket_id": "TCK-1003", "customer": "Mina", "text": "Need invoice for January. Also update billing address to 21 King St, Montreal.", "channel": "email", "created_at": "2026-02-01T10:33:00Z"},
    {"ticket_id": "TCK-1004", "customer": "Sam", "text": "My account got locked after password reset. I'm seeing login attempts I don't recognize. Please help urgently.", "channel": "web", "created_at": "2026-02-01T10:45:00Z"}
]

We create a realistic stream of customer support tickets that acts as our input workload. We structure each ticket with metadata such as channel, timestamp, and free-form text to reflect real operational data. We use this dataset to consistently test and demonstrate the full pipeline.

from griptape.structures import Agent
from griptape.drivers.prompt.openai import OpenAiChatPromptDriver

prompt_driver = OpenAiChatPromptDriver(model="gpt-4.1")
agent = Agent(prompt_driver=prompt_driver, tools=[tool])

def run_ticket(ticket: dict) -> dict:
    sanitized = tool.redact_pii({"values": {"text": ticket["text"]}}).to_text()
    category = tool.categorize({"values": {"text": sanitized}}).to_text().strip()
    pr_sla = json.loads(tool.priority_and_sla({"values": {"category": category, "text": sanitized, "channel": ticket["channel"]}}).to_text())
    escalation = tool.build_escalation_json({"values": {"ticket_id": ticket["ticket_id"], "customer": ticket["customer"], "category": category, "priority": int(pr_sla["priority"]), "sanitized_text": sanitized}}).to_text()
    prompt = f"""
You are a senior support lead. Produce:
1) A customer-facing reply
2) Internal notes
3) Escalation decision

Ticket:
- id: {ticket['ticket_id']}
- customer: {ticket['customer']}
- channel: {ticket['channel']}
- category: {category}
- priority: {pr_sla['priority']}
- SLA target: {pr_sla['sla_target']}
- sanitized_text: {sanitized}

Output in Markdown.
"""
    out = agent.run(prompt).to_text()
    return {"ticket_id": ticket["ticket_id"], "category": category, "priority": pr_sla["priority"], "sla_target": pr_sla["sla_target"], "escalation_payload_json": escalation, "agent_output_markdown": out}

We initialize a Griptape Agent with the custom tool and a prompt driver to enable controlled reasoning. We define a deterministic processing function that chains tool calls before invoking the agent, ensuring all sensitive handling and classification are completed first. We then ask the agent to generate customer responses and internal notes based solely on tool outputs.

results = [run_ticket(t) for t in TICKETS]

for r in results:
    print("\n" + "=" * 88)
    print(f"{r['ticket_id']} | category={r['category']} | P{r['priority']} | SLA={r['sla_target']}")
    print(r["escalation_payload_json"])
    print(r["agent_output_markdown"])

We execute the pipeline across all tickets and collect the structured results. We print escalation payloads and agent-generated Markdown outputs to verify correctness and clarity. We use this final step to validate that the workflow runs end-to-end without hidden dependencies or retrieval logic.

In conclusion, we demonstrated how Griptape can be used to orchestrate complex operational workflows in which logic, policy, and AI reasoning coexist cleanly. We relied on deterministic tools for classification, risk handling, and escalation, using the agent only where natural-language judgment is required to keep the system reliable and explainable. This pattern illustrates how we can scale AI-assisted operations safely, integrate them into existing support systems, and maintain strict control over behavior, outputs, and service guarantees using Griptape’s core abstractions.

Check out the Full Codes here.
The post How to Build a Production-Grade Customer Support Automation Pipeline with Griptape Using Deterministic Tools and Agentic Reasoning appeared first on MarkTechPost.

Taalas is replacing programmable GPUs with hardwired AI chips to achie …

In the high-stakes world of AI infrastructure, the industry has operated under a singular assumption: flexibility is king. We build general-purpose GPUs because AI models change every week, and we need programmable silicon that can adapt to the next research breakthrough.

But Taalas, a Toronto-based startup, thinks that flexibility is exactly what's holding AI back. According to the Taalas team, if we want AI to be as common and cheap as plastic, we have to stop 'simulating' intelligence on general-purpose computers and start 'casting' it directly into silicon.

The Problem: The ‘Memory Wall’ and the GPU Tax

The current cost of running a Large Language Model (LLM) is driven by a physical bottleneck: the Memory Wall.

Traditional processors (GPUs) are ‘Instruction Set Architecture’ (ISA) based. They separate compute and memory. When you run an inference pass on a model like Llama-3, the chip spends the vast majority of its time and energy shuttling weights from High Bandwidth Memory (HBM) to the processing cores. This ‘data movement tax’ accounts for nearly 90% of the power consumption in modern AI data centers.

Taalas’s solution is radical: eliminate the memory-fetch cycle. By using a proprietary automated design flow, Taalas translates the computational graph of a specific model directly into the physical layout of a chip. In their HC1 (Hardcore 1) chip, the model’s weights and architecture are literally etched into the wiring of the silicon.

The path to ubiquitous AI

Hardcore Models: 17,000 Tokens Per Second

The results of this ‘direct-to-silicon’ approach redefine the performance ceiling for inference. At their latest unveiling, Taalas demonstrated the HC1 running a Llama 3.1 8B model. While a top-tier NVIDIA H100 might serve a single user at ~150 tokens per second, the HC1 serves a staggering 16,000 to 17,000 tokens per second.

This changes the ‘unit economics’ of AI:

Performance: A single HC1 chip can outperform a small GPU data center in terms of raw throughput for a specific model.

Efficiency: Taalas claims a 1000x improvement in efficiency (performance-per-watt and performance-per-dollar) compared to conventional chips.

Infrastructure: Because the weights are hardwired, there is no need for external HBM or complex liquid cooling systems. A standard air-cooled rack can house ten of these 250W cards, delivering the power of an entire GPU cluster in a single server box.
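A quick back-of-envelope check puts these claims in perspective. The chip figures below come from the article; the ~700 W H100 board power is an outside assumption for illustration, and comparing single-user H100 throughput to aggregate HC1 throughput understates what a fully batched GPU can do, so treat the ratio as indicative only.

```python
# Figures from the article: per-chip throughput, card power, cards per rack.
hc1_tokens_per_sec = 17_000
hc1_watts = 250
cards_per_rack = 10

# Comparison point: ~150 tok/s for a single user on an H100 (from the article);
# the 700 W board power is an assumption, not a figure from Taalas.
h100_tokens_per_sec = 150
h100_watts = 700

rack_throughput = hc1_tokens_per_sec * cards_per_rack
rack_power_kw = hc1_watts * cards_per_rack / 1000

hc1_tokens_per_joule = hc1_tokens_per_sec / hc1_watts
h100_tokens_per_joule = h100_tokens_per_sec / h100_watts

print(rack_throughput)  # → 170000 tokens/s from a 2.5 kW air-cooled rack
print(round(hc1_tokens_per_joule / h100_tokens_per_joule))  # → 317 (x per-watt)
```

Even under these rough assumptions the per-watt gap lands in the hundreds, which is directionally consistent with (if smaller than) the 1000x efficiency claim above.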

Breaking the 60-Day Barrier: The Automated Foundry

The obvious ‘catch’ for an AI developer is flexibility. If you hardwire a model into a chip today, what happens when a better model comes out tomorrow? Historically, designing an ASIC (Application-Specific Integrated Circuit) took two years and tens of millions of dollars.

Taalas has solved this through automation. They have built a compiler-like foundry system that takes model weights and generates a chip design in roughly a week. By focusing on a streamlined manufacturing workflow—where they only change the top metal masks of the silicon—they have collapsed the turnaround time from ‘weights-to-silicon’ to just two months.

This allows for a ‘seasonal’ hardware cycle. A company could fine-tune a frontier model in the spring and have thousands of specialized, hyper-efficient inference chips deployed by summer.


The Market Shift: From Shovels to Stamps

This transition marks a pivotal moment in the AI hype cycle. We are moving from the ‘Research & Training’ phase—where GPUs are essential for their flexibility—to the ‘Deployment & Inference’ phase, where cost-per-token is the only metric that matters.

If Taalas succeeds, the AI market will split into two distinct tiers:

General-Purpose Training: Led by NVIDIA and AMD, providing the massive, flexible clusters needed to discover and train new architectures.

Specialized Inference: Led by ‘foundries’ like Taalas, which take those proven architectures and ‘print’ them into cheap, ubiquitous silicon for everything from smartphones to industrial sensors.

Key Takeaways

The ‘Hardwired’ Paradigm Shift: Taalas is moving from software-defined AI (running models on general-purpose GPUs) to hardware-defined AI. By ‘baking’ a specific model’s weights and architecture directly into the silicon, they eliminate the need for traditional instruction-set overhead, effectively making the model the processor itself.

Death of the Memory Wall: Traditional AI hardware wastes ~90% of its energy moving data between memory and compute. Taalas’s HC1 (Hardcore 1) chip eliminates the “Memory Wall” by physically wiring the model parameters into the chip’s metal layers, removing the need for expensive High Bandwidth Memory (HBM).

1000x Efficiency Leap: By stripping away the ‘programmability tax’, Taalas claims a 1,000x improvement in performance-per-watt and performance-per-dollar. In practice, this means an HC1 can hit 17,000 tokens per second on a Llama 3.1 8B model—massively outperforming a standard GPU rack while using far less power.

Automated ‘Direct-to-Silicon’ Foundry: To solve the problem of model obsolescence, Taalas uses a proprietary automated design flow. This reduces the time to create a custom AI chip from years to just weeks, allowing companies to ‘print’ their fine-tuned models into silicon on a seasonal basis.

The Commodity AI Future: This technology signals a shift from ‘Cloud-First’ to ‘Device-Native’ AI. As inference becomes a cheap, hardwired commodity, AI will move off centralized servers and into local, low-power hardware—ranging from smartphones to industrial sensors—with zero latency and no subscription costs.

Check out the Technical details.
The post Taalas is replacing programmable GPUs with hardwired AI chips to achieve 17,000 tokens per second for ubiquitous inference appeared first on MarkTechPost.

Scaling data annotation using vision-language models to power physical …

Critical labor shortages are constraining growth across manufacturing, logistics, construction, and agriculture. The problem is particularly acute in construction: nearly 500,000 positions remain unfilled in the United States, with 40% of the current workforce approaching retirement within the decade. These workforce limitations result in delayed projects, escalating costs, and deferred development plans. To address these constraints, organizations are developing autonomous systems that can perform tasks that fill capacity gaps, extend operational capabilities, and offer the added benefit of around-the-clock productivity.
Building autonomous systems requires large, annotated datasets to train AI models. Effective training determines whether these systems deliver business value. The bottleneck: the high cost of data preparation. Critically, the act of labeling video data—identifying information about equipment, tasks, and the environment—is required to make sure that the data is useful for model training. This step can impede model deployment, which slows down the delivery of AI-powered products and services to customers. For construction companies managing millions of hours of video, manual data preparation and annotation become impractical. Vision-language models (VLMs) help to address this by interpreting images and video, responding to natural language queries, and generating descriptions at a speed and scale that manual processes cannot match, providing a cost-effective alternative.
In this post, we examine how Bedrock Robotics tackles this challenge. By joining the AWS Physical AI Fellowship, the startup partnered with the AWS Generative AI Innovation Center to apply vision-language models that analyze construction video footage, extract operational details, and generate labeled training datasets at scale, to improve data preparation for autonomous construction equipment.
Bedrock Robotics: a case study in accelerating autonomous construction
Since 2024, Bedrock Robotics has been developing autonomous systems for construction equipment. The company’s product, Bedrock Operator, is a retrofit solution that combines hardware with AI models to enable excavators and other machinery to operate with minimal human intervention. These systems can perform tasks like digging, grading, and material handling with centimeter-level precision. Training these models requires massive volumes of video footage capturing equipment, tasks, and the surrounding environment – a highly resource-intensive process that limits scalability.
VLMs offer a solution by analyzing this image and video data and generating text descriptions. This makes them well-suited for annotation tasks, which is critical for teaching models how to associate visual patterns with human language. Bedrock Robotics used this technology to streamline data preparation for training AI models, enabling autonomous operations for equipment. Additionally, through proper model selection and prompt engineering, the company improved tool identification from 34% to 70%. This transformed a manual, time-intensive process into an automated, scalable data pipeline solution. The breakthrough accelerated deployment of autonomous equipment.
This approach provides a replicable framework for organizations facing similar data challenges and demonstrates how strategic investment in foundation models (FMs) can deliver measurable operational outcomes and a competitive advantage. Foundation models are models trained on massive amounts of data using self-supervised learning techniques that learn general representations that can be adapted to many downstream tasks. VLMs leverage these large-scale pretraining techniques to bridge visual and textual modalities, enabling them to understand, analyze, and generate content across both image and language.
In the following sections, we look at the process that Bedrock Robotics used to annotate millions of hours of video footage and accelerate innovation using a VLM-based solution.
From unstructured video data to a strategic asset using VLMs
Enabling autonomous construction equipment requires extracting useful information from millions of hours of unstructured operational footage. Specifically, Bedrock Robotics needed to identify tool attachments, tasks, and worksite conditions across diverse scenarios. The following images are example video frames from this dataset.

Construction equipment operates with multiple tool attachments, each requiring accurate classification to train reliable AI models. Working with the Innovation Center, Bedrock Robotics focused their innovation efforts by addressing a few critical tool categories: lifting hooks for material handling, hammers for concrete demolition, grading beams for surface leveling, and trenching buckets for narrow excavation.
These labels allow Bedrock Robotics to select relevant video segments and assemble training datasets that represent a variety of equipment configurations and operating conditions.
Accelerating AI deployment through strategic model optimization
Off-the-shelf VLMs (VLMs without prompt optimization) struggle with construction video data because they’re trained on web images, not operator footage from excavator cabins. They can’t handle unusual angles, equipment-specific visuals, or poor visibility from dust and weather. They also lack the domain knowledge to distinguish visually similar tools like digging buckets from trenching buckets.
Bedrock Robotics and the Innovation Center addressed this through targeted model selection and prompt optimization. The teams evaluated multiple VLMs—including open source options and FMs available in Amazon Bedrock—then refined prompts with detailed visual descriptions of each tool, guidance for commonly confused tool pairs, and step-by-step instructions for analyzing video frames.
These modifications enhanced the classification accuracy from 34% to 70% on a test set comprising 130 videos, at $10 per hour of video processing. These results demonstrate how prompt engineering adapts VLMs to specialized tasks. For Bedrock Robotics, this customization delivered faster training cycles, reduced time-to-deployment, and a cost-effective scalable annotation pipeline that evolves with operational needs.
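The prompt refinements described above (per-tool visual descriptions, disambiguation hints for confused pairs, step-by-step frame analysis) can be sketched as a simple prompt builder. This is illustrative only: the tool names come from the article, but the descriptions and hints are invented placeholders, not Bedrock Robotics' actual prompts.

```python
# Hypothetical tool descriptions; only the tool names are from the article.
TOOL_GUIDE = {
    "lifting hook": "curved hook on the bucket linkage, used for material handling",
    "hammer": "cylindrical breaker attachment used for concrete demolition",
    "grading beam": "wide flat beam used for surface leveling",
    "trenching bucket": "narrow bucket used for narrow excavation",
}

def build_classification_prompt(frame_description_hint: str = "") -> str:
    lines = [
        "You are classifying the tool attached to an excavator in a video frame.",
        "Possible tools:",
    ]
    for name, desc in TOOL_GUIDE.items():
        lines.append(f"- {name}: {desc}")
    lines += [
        # Guidance for a commonly confused pair (invented example).
        "Commonly confused pair: a digging bucket is wider than a trenching bucket.",
        # Step-by-step instruction for analyzing the frame.
        "Think step by step: describe the attachment, then answer with one tool name.",
    ]
    if frame_description_hint:
        lines.append(f"Context: {frame_description_hint}")
    return "\n".join(lines)

print(build_classification_prompt("dusty site, low visibility"))
```

In practice this text would be sent alongside the video frame to the chosen VLM, and the per-tool descriptions would be iterated against a labeled test set like the 130-video set mentioned above.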
The path forward: addressing labor shortages through automation
The Competitive Advantage. For Bedrock Robotics, vision-language systems enabled rapid identification and extraction of critical datasets, providing necessary insights from massive construction video footage. With an overall accuracy of 70%, this cost-effective approach provides a practical foundation for scaling data preparation for model training. It demonstrates how strategic AI innovation can transform workforce constraints and accelerate industry transformations. Organizations that streamline data preparation can accelerate autonomous system deployment, reduce operational costs, and explore new areas for growth in industries impacted by labor shortages. With this repeatable framework, manufacturing and industrial automation leaders facing similar challenges can apply these principles to drive competitive differentiation within their own domains.
To learn more, visit Bedrock Robotics or explore the physical AI resources on AWS.

AWS Physical AI Fellowship
Transforming the Physical World with AI
Physical AI in Practice

About the authors

Laura Kulowski
Laura Kulowski is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where she works to develop physical AI solutions. Before joining Amazon, Laura completed her PhD at Harvard’s Department of Earth and Planetary Sciences and investigated Jupiter’s deep zonal flows and magnetic field using Juno data.

Alla Simoneau
Alla Simoneau is a technology and commercial leader with over 15 years of experience, currently serving as the Emerging Technology Physical AI Lead at Amazon Web Services (AWS), where she drives global innovation at the intersection of AI and real-world applications. With over a decade at Amazon, Alla is a recognized leader in strategy, team building, and operational excellence, specializing in turning cutting-edge technologies into real-world transformations for startups and enterprise customers.

Parmida Atighehchian
Parmida Atighehchian is a Senior Data Scientist at AWS Generative AI Innovation Center. With over 10 years of experience in Deep Learning and Generative AI, Parmida brings deep expertise in AI and customer focused solutions. Parmida has led and co-authored highly impactful scientific papers focused on domains such as computer vision, explainability, video and image generation. With a strong focus on scientific practices, Parmida helps customers with practical design of systems using generative AI in robust and scalable pipelines.

Dan Volk
Dan Volk is a Senior Data Scientist at the AWS Generative AI Innovation Center. He has 10 years of experience in machine learning, deep learning, and time series analysis, and holds a Master’s in Data Science from UC Berkeley. He is passionate about transforming complex business challenges into opportunities by leveraging cutting-edge AI technologies.

Paul Amadeo
Paul Amadeo is a seasoned technology leader with over 30 years of experience spanning artificial intelligence, machine learning, IoT systems, RF design, optics, semiconductor physics, and advanced engineering. As Technical Lead for Physical AI in the AWS Generative AI Innovation Center, Paul specializes in translating AI capabilities into tangible physical systems, guiding enterprise customers through complex implementations from concept to production. His diverse background includes architecting computer vision systems for edge environments, designing robotic smart card manufacturing technologies that have produced billions of devices globally, and leading cross-functional teams in both commercial and defense sectors. Paul holds an MS in Applied Physics from the University of California, San Diego, a BS in Applied Physics from Caltech, and holds six patents spanning optical systems, communication devices, and manufacturing technologies.

Sri Elaprolu
Sri Elaprolu is Director of the AWS Generative AI Innovation Center, where he leads a global team implementing cutting-edge AI solutions for enterprise and government organizations. During his 13-year tenure at AWS, he has led ML science teams partnering with global enterprises and public sector organizations. Prior to AWS, he spent 14 years at Northrop Grumman in product development and software engineering leadership roles. Sri holds a Master’s in Engineering Science and an MBA.

How Sonrai uses Amazon SageMaker AI to accelerate precision medicine t …

In precision medicine, researchers developing diagnostic tests for early disease detection face a critical challenge: datasets containing thousands of potential biomarkers but only hundreds of patient samples. This curse of dimensionality can determine the success or failure of breakthrough discoveries.
Modern bioinformatics workflows use multiple omic modalities—genomics, lipidomics, proteomics, and metabolomics—to develop early disease detection tests. Researchers in this industry are also often challenged with datasets where features outnumber samples by orders of magnitude. As new modalities are considered, the permutations increase exponentially, making experiment tracking a significant challenge. Additionally, source control and code quality are a mission-critical aspect of the overall machine learning architecture. Without efficient machine learning operations (MLOps) processes in place, this can be overlooked, especially in the early discovery stage of the cycle.
In this post, we explore how Sonrai, a life sciences AI company, partnered with AWS to build a robust MLOps framework using Amazon SageMaker AI that addresses these challenges while maintaining the traceability and reproducibility required in regulated environments.
Overview of MLOps
MLOps combines ML, DevOps, and data engineering practices to deploy and maintain ML systems in production reliably and efficiently.
Implementing MLOps best practices from the start enables faster experiment iteration and confident, traceable model deployment, all of which are essential in healthcare technology companies where governance and validation are paramount.
Sonrai’s data challenge
Sonrai partnered with a large biotechnology company developing biomarker tests for an underserved cancer type. The project involved a rich dataset spanning multiple omic modalities: proteomics, metabolomics, and lipidomics, with the objective of identifying the optimal combination of features for an early detection biomarker with high sensitivity and specificity.
The customer faced several critical challenges. Their dataset contained over 8,000 potential biomarkers across three modalities, but only a few hundred patient samples. This extreme feature-to-sample ratio required sophisticated feature selection to avoid overfitting. The team needed to evaluate hundreds of combinations of modalities and modeling approaches, making manual experiment tracking infeasible. As a diagnostic test destined for clinical use, complete traceability from raw data through every modeling decision to the final deployed model was essential for regulatory submissions.
Solution overview
To address these MLOps challenges, Sonrai architected a comprehensive solution using SageMaker AI, a fully managed service for data scientists and developers to build, train, and deploy ML models at scale. This solution helps provide more secure data management, flexible development environments, robust experiment tracking, and streamlined model deployment with full traceability.
The following diagram illustrates the architecture and process flow.

The end-to-end MLOps workflow follows a clear path:

Customers provide sample data to the secure data repository in Amazon Simple Storage Service (Amazon S3).
ML engineers use JupyterLab and Code Editor in Amazon SageMaker Studio, connected to source control.
Pipelines read from the data repository, process data, and write results to Amazon S3.
The experiments are logged in MLflow within Amazon SageMaker Studio.
Generated reports are stored in Amazon S3 and shared with stakeholders.
Validated models are promoted to the Amazon SageMaker Model Registry.
Final models are deployed for inference or further validation.

This architecture facilitates complete traceability: each registered model can be traced back through hyperparameter selection and dataset splits to the source data and code version that produced it.
Secure data management with Amazon S3
The foundation of Sonrai’s solution is secure data management with the help of Amazon S3. Sonrai configured S3 buckets with tiered access controls for sensitive patient data. Sample and clinical data were stored in a dedicated data repository bucket with restricted access, supporting compliance with data protection requirements. A separate results repository bucket stores processed data, model outputs, and generated reports. This separation makes sure raw patient data can remain secure while enabling flexible sharing of analysis results. Seamless integration with Git repositories enables collaboration, source control, and quality assurance processes while keeping sensitive patient data secure within the AWS environment—critical for maintaining governance in regulated industries.
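The tiered-access pattern described above can be sketched with a bucket policy that denies all access to the raw-data bucket except for a single analysis role. The bucket and role names below are placeholders for illustration, not Sonrai's actual resources:

```python
import json

def restricted_bucket_policy(bucket: str, allowed_role_arn: str) -> dict:
    """Build an S3 bucket policy that denies all access except one role.

    Hypothetical sketch: resource names are placeholders.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAllButAnalysisRole",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Deny applies to the bucket itself and every object in it
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # ...unless the caller is the allowed analysis role
                "Condition": {
                    "StringNotEquals": {"aws:PrincipalArn": allowed_role_arn}
                },
            }
        ],
    }

policy = restricted_bucket_policy(
    "example-data-repository",
    "arn:aws:iam::123456789012:role/MLEngineer",
)
print(json.dumps(policy, indent=2))
```

A separate, more permissive policy would then be attached to the results bucket, keeping the raw-data and results tiers isolated.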
SageMaker AI MLOps
From project inception, Sonrai used both JupyterLab and Code Editor interfaces within their SageMaker AI environment. This environment was integrated with the customer’s Git repository for source control, establishing version control and code review workflows from day one.
SageMaker AI offers a wide range of ML-optimized compute instances that can be provisioned in minutes and stopped when not in use, optimizing cost-efficiency. For this project, Sonrai used compute instances with sufficient memory to handle large omic datasets, spinning them up for intensive modeling runs and shutting them down during analysis phases.
Code Editor served as the primary development environment for building production-quality pipelines, with its integrated debugging and Git workflow features. JupyterLab was used for data exploration and customer collaboration meetings, where its interactive notebook format facilitated real-time discussion of results.
Third-party tools such as Quarto, an open source technical publishing system, were installed within the SageMaker compute environments to enable report generation within the modeling pipeline itself. A single quarto render command executes the complete pipeline and creates stakeholder-ready reports with interactive visualizations, statistical tables, and detailed markdown annotations. Reports are automatically written to the results S3 bucket, where customers can download them within minutes of pipeline completion.
Managed MLflow
The managed MLflow capability within SageMaker AI enabled seamless experiment tracking. Experiments executed within the SageMaker AI environment are automatically tracked and recorded in MLflow, capturing a comprehensive view of the experimentation process. For this project, MLflow became the single source of truth for the modeling experiments, logging performance metrics, hyperparameters, feature importance rankings, and custom artifacts such as ROC curves and confusion matrices. The MLflow UI provided an intuitive interface for comparing experiments side-by-side, enabling the team to quickly identify promising approaches and share results during customer review sessions.
MLOps pipelines
Sonrai’s modeling pipelines are structured as reproducible, version-controlled workflows that process raw data through multiple stages to produce final models:

Raw omic data from Amazon S3 is loaded, normalized, and quality-controlled.
Domain-specific transformations are applied to create modeling-ready features.
Recursive Feature Elimination (RFE) reduces thousands of features to the most significant for disease detection.
Multiple models are trained across individual and combined modalities.
Model performance is assessed and comprehensive reports are generated.

Each pipeline execution is tracked in MLflow, capturing input data versions, code commits, hyperparameters, and performance metrics. This creates an auditable trail from raw data to final model, essential for regulatory submissions. The pipelines are executed on SageMaker training jobs, which provide scalable compute resources and automatic capture of training metadata.
The most critical pipeline stage was RFE, which iteratively removes less important features while monitoring model performance. MLflow tracked each iteration, logging which features were removed, the model’s performance at each step, and the final selected feature set. This detailed tracking enabled validation of feature selection decisions and provided documentation for regulatory review.
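The RFE stage with per-step tracking can be sketched with scikit-learn. A synthetic wide dataset stands in for the omic features here, and a plain dict stands in for MLflow logging; the feature counts are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic "wide" data: far more features than samples, like omic panels
X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=10, random_state=0)

audit_trail = {}  # stand-in for MLflow logging of each reduction step
for n_features in (100, 20, 5):
    rfe = RFE(LogisticRegression(max_iter=1000),
              n_features_to_select=n_features, step=50)
    rfe.fit(X, y)
    audit_trail[n_features] = {
        "selected": np.where(rfe.support_)[0].tolist(),  # surviving feature indices
        "score": rfe.score(X, y),                        # performance at this step
    }

print({k: round(v["score"], 3) for k, v in audit_trail.items()})
```

Logging the selected feature indices and the score at every reduction step is what produces the auditable record a regulatory reviewer can follow.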
Model deployment
Sonrai uses both MLflow and the SageMaker Model Registry in a complementary fashion to manage model artifacts and metadata throughout the development lifecycle. During active experimentation, MLflow serves as the primary tracking system, enabling rapid iteration with lightweight experiment tracking. When a model meets predetermined performance thresholds and is ready for broader validation or deployment, it is promoted to the SageMaker Model Registry.
This promotion represents a formal transition from research to development. Candidate models are evaluated against success criteria, packaged with their inference code and containers, and registered in the SageMaker Model Registry with a unique version identifier. The SageMaker Model Registry supports a formal deployment approval workflow aligned with Sonrai’s quality management system:

Pending – Newly registered models awaiting review
Approved – Models that have passed validation criteria and are ready for deployment
Rejected – Models that did not meet acceptance criteria, with documented reasons

For the cancer biomarker project, models were evaluated against stringent clinical criteria: sensitivity of at least 90%, specificity of at least 85%, and AUC-ROC of at least 0.90. For approved models, deployment options include SageMaker endpoints for real-time inference, batch transform jobs for processing large datasets, or retrieval of model artifacts for deployment in customer-specific environments.
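The approval gate can be sketched as a small function that checks a model's metrics against the clinical thresholds above and builds the request for the SageMaker `update_model_package` API. The ARN is a placeholder, and the boto3 call is shown but not executed:

```python
# Clinical acceptance thresholds from the project
CRITERIA = {"sensitivity": 0.90, "specificity": 0.85, "auc_roc": 0.90}

def approval_request(model_package_arn: str, metrics: dict) -> dict:
    """Build the UpdateModelPackage arguments for an approve/reject decision."""
    failed = [name for name, threshold in CRITERIA.items()
              if metrics.get(name, 0.0) < threshold]
    return {
        "ModelPackageArn": model_package_arn,
        "ModelApprovalStatus": "Rejected" if failed else "Approved",
        "ApprovalDescription": (
            "All clinical criteria met" if not failed
            else "Failed criteria: " + ", ".join(failed)
        ),
    }

req = approval_request(
    "arn:aws:sagemaker:eu-west-1:123456789012:model-package/biomarker/3",
    {"sensitivity": 0.94, "specificity": 0.89, "auc_roc": 0.93},
)
print(req["ModelApprovalStatus"])
# A real promotion would then call:
# boto3.client("sagemaker").update_model_package(**req)
```

Models that fail any threshold land in the Rejected state with the failing criteria recorded in the approval description, keeping the documented reasons the workflow requires.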
Results and model performance
Using ML-optimized compute instances on SageMaker AI, the entire pipeline—from raw data to final models and reports—executed in under 10 minutes. This rapid iteration cycle enabled daily model updates, real-time collaboration during customer meetings, and immediate validation of hypotheses. What previously would have taken days could now be accomplished in a single customer call.
The modeling pipeline generated 15 individual models across single-modality and multi-modality combinations. The top-performing model combined proteomic and metabolomic features, achieving 94% sensitivity and 89% specificity with an AUC-ROC of 0.93. This multi-modal approach outperformed single modalities alone, demonstrating the value of integrating different omic data types.
The winning model was promoted to the SageMaker Model Registry with complete metadata, including model artifact location, training dataset, MLflow experiment IDs, evaluation metrics, and custom metadata. This registered model underwent additional validation by the customer’s clinical team before approval for clinical validation studies. “Using SageMaker AI for the full model development process enabled the team to collaborate and rapidly iterate with full traceability and confidence in the final result. The rich set of services available in Amazon SageMaker AI make it a complete solution for robust model development, deployment, and monitoring,” says Matthew Lee, Director of AI & Medical Imaging at Sonrai.
Conclusion
Sonrai partnered with AWS to develop an MLOps solution that accelerates precision medicine trials using SageMaker AI. The solution addresses key challenges in biomarker discovery: managing datasets with thousands of features from multiple omic modalities while working with limited patient samples, tracking hundreds of complex experimental permutations, and maintaining version control and traceability for regulatory readiness.
The result is a scalable MLOps framework that reduces development iteration time from days to minutes while facilitating reproducibility and regulatory readiness. The combination of the SageMaker AI development environment, MLflow experiment tracking, and SageMaker Model Registry provides end-to-end traceability from raw data to deployed models—essential for both scientific validity and governance. Sonrai saw the following key results:

8,916 biomarkers modeled and tracked
Hundreds of experiments performed with full lineage
50% reduction in time spent curating data for biomarker reports

Building on this foundation, Sonrai is expanding its SageMaker AI MLOps capabilities. The team is developing automated retraining pipelines that trigger model updates when new patient data becomes available, using Amazon EventBridge to orchestrate SageMaker AI pipelines that monitor data drift and model performance degradation.
Sonrai is also extending the architecture to support federated learning across multiple clinical sites, enabling collaborative model development while keeping sensitive patient data at each institution. Selected models are being deployed to SageMaker endpoints for real-time predictions, supporting clinical decision support applications.
Get started today with Amazon SageMaker for MLOps to build your own MLOps pipelines, and explore our introductory Amazon SageMaker MLOps workshop.

About the Authors

Matthew Lee
Matthew Lee is Director of AI & Medical Imaging at Sonrai, bringing extensive experience as a data scientist specializing in computer vision and medical imaging. With a background as a medical physicist, he focuses on developing impactful AI solutions—from initial experimentation through proof of concept to scalable production code that addresses real business needs. Matthew has successfully built and deployed AI models in cloud environments for clients, and regularly shares his work through customer presentations, conference talks, and industry meetups.

Jonah Craig
Jonah Craig is a Startup Solutions Architect based in Dublin, Ireland. He works with startup customers across the UK and Ireland and focuses on developing AI/ML and generative AI solutions. Jonah has a master’s degree in computer science and regularly speaks on stage at AWS conferences, such as the annual AWS London Summit and the AWS Dublin Cloud Day. In his spare time, he enjoys creating music and releasing it on Spotify.

Siamak Nariman
Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies.

Accelerating AI model production at Hexagon with Amazon SageMaker Hype …

This blog post was co-authored with Johannes Maunz, Tobias Bösch Borgards, Aleksander Cisłak, and Bartłomiej Gralewicz from Hexagon.
Hexagon is the global leader in measurement technologies and provides the confidence that vital industries rely on to build, navigate, and innovate. From microns to Mars, Hexagon’s solutions drive productivity, quality, safety, and sustainability across aerospace, agriculture, automotive, construction, manufacturing, and mining.
Applications in these industries often rely on capturing the reality by recording vast amounts of highly accurate point cloud data with Hexagon measurement technology. A point cloud is a collection of data points in 3D space, typically representing the external surface of an object or a scene. Point clouds are commonly used in applications like 3D modeling, computer vision, robotics, autonomous vehicles, and geospatial analysis.
Hexagon provides specialized AI models to its customers to help them ensure productivity, quality, safety, or sustainability in their applications. These AI models are purpose built for a given domain and usually focus on understanding the built environment.
In this blog post, we demonstrate how Hexagon collaborated with Amazon Web Services to scale their AI model production by pretraining state-of-the-art segmentation models, using the model training infrastructure of Amazon SageMaker HyperPod.
AI impact and opportunity
AI models provided by Hexagon to its customers help them solve complex challenges. These challenges are solved by specialized AI models that are often more effective than large, general-purpose ones. Before using scanned point clouds in geospatial applications, it’s essential to perform preprocessing and point cloud cleaning operations. Instead of relying on a single AI model to classify an entire dataset, targeted AI models have been developed that tackle distinct operations: one efficiently removes stray points from dust or sensor noise, another helps separate land types even in complex environments, and another detects and eliminates moving objects like cars and pedestrians while keeping fixed objects in the scene. This AI approach not only improves precision and efficiency, but also reduces processing demands and leads to faster creation of more accurate 3D models.
The following figures illustrate the practical application of specialized AI models, such as the point cloud classification models that Hexagon is developing.
The first figure shows how mobile mapping road models enable the creation of digital twins of entire cities.

The second figure is a heavy construction model that enables on-site decision making.

There’s a significant opportunity to accelerate Hexagon’s AI innovation and time-to-market by implementing a robust, scalable, and high-performance infrastructure that enables efficient and fast model training and development of new, specialized AI use cases in days rather than months.
Hexagon and Amazon SageMaker HyperPod: A success story
To address Hexagon’s need for scalable compute resources, access to the latest GPUs, and streamlined training pipelines, the Hexagon team evaluated the key features of Amazon SageMaker HyperPod for their model training requirements:

Resilient architecture: SageMaker HyperPod streamlined operations through proactive node health checks and automated cluster monitoring. With built-in self-healing capabilities and automated job resumption, it enables training runs to continue for weeks or months without interruption. In the event of a node failure, it automatically detects the failure, replaces the faulty node, and resumes training from the most recent checkpoint.
Scalable infrastructure: Using single-spine node topology and pre-configured Elastic Fabric Adapter (EFA), SageMaker HyperPod delivers optimal inter-node communication. Its flexible compute capacity allocation enables seamless scaling without compromising performance, making it ideal for growing workloads spanning multiple nodes.
Versatile deployment: Compatible with a wide range of generative AI software stacks, SageMaker HyperPod simplifies deployment through lifecycle scripts and Helm customization. It supports leading Amazon Elastic Compute Cloud (Amazon EC2) instances like the P6-B200 and P6e-GB200, which are accelerated by NVIDIA Blackwell GPUs, offering versatility in implementation.
Efficient operations: Through intelligent task governance and integrated SageMaker tools, SageMaker HyperPod automatically optimizes cluster utilization. Pre-configured Deep Learning Amazon Machine Images (DLAMI) with compatible drivers and libraries, combined with quick start training recipes, help to ensure maximum operational efficiency.

Solution overview
Hexagon implemented a robust training environment using Amazon SageMaker HyperPod managed infrastructure, shown in the following figure. It includes an integrated data pipeline, compute cluster management, and MLOps monitoring stack.

Solution Architecture Diagram
Data pipeline and storage
Training data is stored in Amazon Simple Storage Service (Amazon S3) within Hexagon’s AWS account, with Amazon FSx for Lustre providing high-performance parallel file system capabilities. The Amazon FSx for Lustre file system is configured with a data repository association (DRA) that automatically synchronizes with the S3 bucket, enabling lazy loading of training data and automatic export of model checkpoints back to Amazon S3.
This configuration enables streaming of terabytes of training data directly to GPU accelerated compute nodes at multi-GBs per second throughput rates, eliminating data transfer bottlenecks during model training. The DRA helps ensure that data scientists can work with familiar Amazon S3 interfaces while benefiting from the performance advantages of a parallel file system during training.
Compute cluster management
The SageMaker HyperPod cluster is provisioned with built-in health checks and automated instance management. Through Amazon SageMaker Training Plans, Hexagon can flexibly reserve GPU capacity from 1 day to 6 months, helping to ensure resource availability for both short experimental runs and extended training campaigns. These training plans provide predictable pricing and dedicated capacity, eliminating the uncertainty of on-demand resource availability for critical model development. The cluster automatically handles node failures and job resumption, maintaining training continuity without manual intervention.
MLOps and monitoring stack
The environment integrates with a one-click observability solution from Amazon SageMaker HyperPod, which automatically publishes comprehensive metrics to Amazon Managed Service for Prometheus and visualizes them through pre-built Amazon Managed Grafana dashboards optimized for foundation model development.
This unified observability consolidates health and performance data from NVIDIA Data Center GPU Manager, Kubernetes node exporters, EFA, integrated file systems, and SageMaker HyperPod task operators, enabling per-GPU level monitoring of resource utilization, GPU memory, and FLOPs.
For experiment tracking, MLflow on Amazon SageMaker AI provides a fully managed solution that requires minimal code modifications to Hexagon’s training containers. This integration enables automatic tracking of training parameters, metrics, model artifacts, and lineage across all experiment runs, with the ability to compare model performance and reproduce results reliably.
Key outcomes from using SageMaker HyperPod at Hexagon
Hexagon’s implementation of SageMaker HyperPod delivered measurable improvements across deployment speed, training efficiency, and model performance.

Quick integration and deployment: Hexagon successfully integrated SageMaker HyperPod for training and achieved their first training deployment within hours, reflecting the ease of set-up and enhanced end-user experience for machine learning (ML) developers. Having all the services required for training models under a single ecosystem helped meet security and governance needs.
Training time reduction: Hexagon reduced their training time for a given network and configuration from 80 days on-premises to approximately 4 days on AWS, using six ml.p5.48xlarge instances, each containing eight NVIDIA H100 GPUs, with an EFA network interface that boosts distributed training efficiency through low-latency, high-throughput networking for multi-node GPU training.
Performance enhancement: SageMaker HyperPod enabled larger batch sizes during training, which led to better training performance, resulting in higher accuracy scores for the trained AI models.

AWS Enterprise Support played a crucial role in Hexagon’s successful implementation of Amazon SageMaker HyperPod. Through proactive guidance, deep technical expertise, and dedicated partnership, the AWS Enterprise Support team helped Hexagon navigate their cloud journey from initial AWS adoption to advanced generative AI implementations. The comprehensive support included best practices guidance, cost optimization strategies, and continuous architectural advice, so that Hexagon’s team could focus on innovation while maintaining operational excellence. This strategic partnership demonstrates how AWS Enterprise Support goes beyond traditional support services, becoming a trusted advisor that helps customers accelerate their business transformation and achieve their desired outcomes in the cloud.
Conclusion
Hexagon’s collaboration with Amazon Web Services delivered a remarkable 95% reduction in training time through Amazon SageMaker HyperPod. With flexible training plans, Hexagon teams can now provision the exact amount of accelerated compute capacity needed for each model training project with complete flexibility and freedom. This combination of flexibility, scalability, and performance unlocks a transformative approach to model development at Hexagon, accelerating innovation and powering the next generation of AI-enabled products that help customers build, navigate, and innovate across critical industries.

About the Authors

Johannes Maunz
Johannes Maunz joined Hexagon Geosystems’ research and development department as an electronics/software engineer in 2007. Since 2017, he has been working for the Innovation Hub, Hexagon’s central technology organization. In this role, he leads Hexagon’s central AI group and is responsible for applied research, development, deployment, and strategy of AI across sensors and solutions for all industries served by Hexagon. Furthermore, he is responsible for the AI-enabled company program, a program for Hexagon’s workforce to use AI in daily operations across departments and functions.

Tobias Bösch Borgards
Tobias Bösch Borgards is an electrical engineer by training and leads the AI engineering team at Hexagon. Together with his team, Tobias brings ML to life in Hexagon products. In his free time, he enjoys hiking and skiing.

Bartlomiej Gralewicz
Bartlomiej Gralewicz is an Expert Software Engineer for AI at Hexagon AI Hub. He focuses on productizing AI solutions that allow understanding and automating 3D geometry analysis. In his free time, he enjoys bouldering, great coffee, and watching F1.

Mohan Gowda
Mohan Gowda is a Principal Solutions Architect for AI/ML at Amazon Web Services, helping customers across Switzerland, Austria, and Central & Eastern Europe drive innovation and digital transformation using AWS generative AI and machine learning services. In his free time, he enjoys playing tennis and skiing in the Swiss Alps.

Roy Allela
Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises, train and deploy foundation models efficiently on AWS. He has a background in microprocessor engineering and is passionate about computational optimization problems and improving the performance of AI workloads.

Ankit Anand
Ankit Anand is a Principal Foundation Models Go-To-Market (GTM) Specialist at AWS. He partners with top generative AI model builders, strategic customers, and AWS service teams to enable the next generation of AI/ML workloads on AWS. Ankit’s experience includes product management expertise within the financial services industry for high-frequency and low-latency trading and business development for Amazon Alexa.

Jann Wild
Jann Wild is a Senior Solutions Architect at Amazon Web Services (AWS), where he has spent nearly 8 years helping organizations harness the power of cloud computing and artificial intelligence. With deep expertise in software architecture and AI/ML solutions, Jann specializes in guiding enterprises through complex digital transformation initiatives while ensuring robust cloud security practices.

Forget Keyword Imitation: ByteDance AI Maps Molecular Bonds in AI Reas …

ByteDance Seed recently released research that might change how we build reasoning AI. For years, devs and AI researchers have struggled to ‘cold-start’ Large Language Models (LLMs) into Long Chain-of-Thought (Long CoT) models. Most models lose their way or fail to transfer patterns during multi-step reasoning.

The ByteDance team discovered the problem: we have been looking at reasoning the wrong way. Instead of just words or nodes, effective AI reasoning has a stable, molecular-like structure.

https://arxiv.org/pdf/2601.06002

The 3 ‘Chemical Bonds’ of Thought

The researchers posit that high-quality reasoning trajectories are held together by 3 interaction types. These mirror the forces found in organic chemistry:

Deep Reasoning as Covalent Bonds: This forms the primary ‘backbone’ of the thought process. It encodes strong logical dependencies where Step A must justify Step B. Breaking this bond destabilizes the entire answer.

Self-Reflection as Hydrogen Bonds: This acts as a stabilizer. Just as proteins gain stability when chains fold, reasoning stabilizes when later steps (like Step 100) revise or reinforce earlier premises (like Step 10). In their tests, 81.72% of reflection steps successfully reconnected to previously formed clusters.

Self-Exploration as Van der Waals Forces: These are weak bridges between distant clusters of logic. They allow the model to probe new possibilities or alternative hypotheses before enforcing stronger logical constraints.

Why ‘Wait, Let Me Think’ Isn’t Enough

Most AI devs/researchers try to fix reasoning by training models to imitate keywords like ‘wait’ or ‘maybe’. The ByteDance team showed that models actually learn the underlying reasoning behavior, not the surface words.

The research team identifies a phenomenon called Semantic Isomers. These are reasoning chains that solve the same task and use the same concepts but differ in how their logical ‘bonds’ are distributed.

Key findings include:

Imitation Fails: Fine-tuning on human-annotated traces or using In-Context Learning (ICL) from weak models fails to build stable Long CoT structures.

Structural Conflict: Mixing reasoning data from different strong teachers (like DeepSeek-R1 and OpenAI-OSS) actually destabilizes the model. Even if the data is similar, the different “molecular” structures cause structural chaos and drop performance.

Information Flow: Unlike humans, who have uniform information gain, strong reasoning models exhibit metacognitive oscillation. They alternate between high-entropy exploration and stable convergent validation.

https://arxiv.org/pdf/2601.06002

MOLE-SYN: The Synthesis Method

To fix these issues, the ByteDance team introduced MOLE-SYN. This is a ‘distribution-transfer-graph’ method. Instead of directly copying a teacher’s text, it transfers the behavioral structure to the student model.

It works by estimating a behavior transition graph from strong models and guiding a cheaper model to synthesize its own effective Long CoT structures. This decoupling of structure from surface text yields consistent gains across 6 major benchmarks, including GSM8K, MATH-500, and OlymBench.
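The core idea of a behavior transition graph can be sketched in a few lines. This is an illustration only, not the paper's actual estimation procedure: behavior-labeled traces (the labels `deduce`, `reflect`, `explore` are hypothetical stand-ins for the three bond types) are turned into a normalized transition matrix that a student model could then be guided to match:

```python
from collections import Counter, defaultdict

def transition_graph(traces):
    """Estimate behavior transition probabilities from labeled step sequences."""
    counts = defaultdict(Counter)
    for trace in traces:
        # Count each adjacent pair of behaviors (bigram transitions)
        for a, b in zip(trace, trace[1:]):
            counts[a][b] += 1
    # Normalize each row into a probability distribution
    return {a: {b: n / sum(row.values()) for b, n in row.items()}
            for a, row in counts.items()}

# Toy "teacher" traces with hypothetical behavior labels
teacher_traces = [
    ["deduce", "deduce", "reflect", "deduce", "explore", "deduce"],
    ["deduce", "explore", "deduce", "reflect", "deduce"],
]
graph = transition_graph(teacher_traces)
print(graph["deduce"])
```

Because the graph captures how behaviors follow one another rather than any specific wording, it is decoupled from the teacher's surface text, which is the property the method exploits.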

Protecting the ‘Thought Molecule’

This research also sheds light on how private AI companies protect their models. Exposing full reasoning traces allows others to clone the model’s internal procedures.

ByteDance team found that summarization and reasoning compression are effective defenses. By reducing the token count—often by more than 45%—companies disrupt the reasoning bond distributions. This creates a gap between what the model outputs and its internal ‘error-bounded transitions,’ making it much harder to distill the model’s capabilities.

Key Takeaways

Reasoning as ‘Molecular’ Bonds: Effective Long Chain-of-Thought (Long CoT) is defined by three specific ‘chemical’ bonds: Deep Reasoning (covalent-like) forms the logical backbone, Self-Reflection (hydrogen-bond-like) provides global stability through logical folding, and Self-Exploration (van der Waals-like) bridges distant semantic concepts.

Behavior Over Keywords: Models internalize underlying reasoning structures and transition distributions rather than just surface-level lexical cues like ‘wait’ or ‘maybe’. Replacing keywords with synonyms does not significantly impact performance, proving that true reasoning depth comes from learned behavioral motifs.

The ‘Semantic Isomer’ Conflict: Combining heterogeneous reasoning data from different strong models (e.g., DeepSeek-R1 and OpenAI-OSS) can trigger ‘structural chaos’. Even if data sources are statistically similar, incompatible behavioral distributions can break logical coherence and degrade model performance.

MOLE-SYN Methodology: This ‘distribution-transfer-graph’ framework enables models to synthesize effective Long CoT structures from scratch using cheaper instruction LLMs. By transferring the behavioral transition graph instead of direct text, MOLE-SYN achieves performance close to expensive distillation while stabilizing Reinforcement Learning (RL).

Protection via Structural Disruption: Private LLMs can protect their internal reasoning processes through summarization and compression. Reducing token count by roughly 45% or more effectively ‘breaks’ the bond distributions, making it significantly harder for unauthorized models to clone internal reasoning procedures via distillation.

The post Forget Keyword Imitation: ByteDance AI Maps Molecular Bonds in AI Reasoning to Stabilize Long Chain-of-Thought Performance and Reinforcement Learning (RL) Training appeared first on MarkTechPost.

A New Google AI Research Proposes Deep-Thinking Ratio to Improve LLM A …

For the last few years, the AI world has followed a simple rule: if you want a Large Language Model (LLM) to solve a harder problem, make its Chain-of-Thought (CoT) longer. But new research from the University of Virginia and Google shows that ‘thinking long’ is not the same as ‘thinking hard’.

The research team reveals that simply adding more tokens to a response can actually make an AI less accurate. Instead of counting words, the Google researchers introduce a new measurement: the Deep-Thinking Ratio (DTR).

https://arxiv.org/pdf/2602.13517

The Failure of ‘Token Maxing’

Engineers often use token count as a proxy for the effort an AI puts into a task. However, the researchers found that raw token count has an average correlation of r = -0.59 with accuracy.

This negative number means that as the model generates more text, it is more likely to be wrong. This happens because of ‘overthinking,’ where the model gets stuck in loops, repeats redundant steps, or amplifies its own mistakes. Relying on length alone wastes expensive compute on uninformative tokens.

What are Deep-Thinking Tokens?

The research team argued that real ‘thinking’ happens inside the layers of the model, not just in the final output. When a model predicts a token, it processes data through a series of transformer layers (L).

Shallow Tokens: For easy words, the model’s prediction stabilizes early. The ‘guess’ doesn’t change much from layer 5 to layer 36.

Deep-Thinking Tokens: For difficult logic or math symbols, the prediction shifts significantly in the deeper layers.

How to Measure Depth

To identify these tokens, the research team uses a technique to peek at the model’s internal ‘drafts’ at every layer. They project the intermediate hidden state of token t at layer l (h_t^l) into the vocabulary space using the model’s unembedding matrix (W_U). This produces a probability distribution (p_{t,l}) for every layer.

They then calculate the Jensen-Shannon Divergence (JSD) between each intermediate layer’s distribution and the final layer’s distribution (p_{t,L}):

D_{t,l} := JSD(p_{t,L} || p_{t,l})

A token is a deep-thinking token if its prediction only settles in the ‘late regime’—defined by a depth fraction (ρ). In their tests, they set ρ = 0.85, meaning the token only stabilized in the final 15% of the layers.

The Deep-Thinking Ratio (DTR) is the percentage of these ‘hard’ tokens in a full sequence. Across models like DeepSeek-R1-70B, Qwen3-30B-Thinking, and GPT-OSS-120B, DTR showed a strong average positive correlation of r = 0.683 with accuracy.
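To make the measurement concrete, here is a small self-contained sketch of the logit-lens/DTR computation described above. The JSD implementation is standard; the settling threshold `tau` is an assumption, since the article does not state the exact convergence criterion:

```python
import numpy as np

def jsd(p, q):
    # Jensen-Shannon divergence between two probability vectors (natural log)
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def deep_thinking_ratio(layer_dists, rho=0.85, tau=0.1):
    # layer_dists: one entry per token, each a list of per-layer next-token
    # distributions (layers 0..L), obtained by projecting hidden states
    # through the unembedding matrix ("logit lens").
    # A token counts as deep-thinking if its divergence from the final
    # layer only drops below tau after depth fraction rho.
    deep = 0
    for dists in layer_dists:
        L = len(dists) - 1
        final = dists[-1]
        settle = next(l for l in range(L + 1) if jsd(final, dists[l]) < tau)
        if settle / max(L, 1) >= rho:
            deep += 1
    return deep / len(layer_dists)

# Demo: a token that settles at layer 0 vs. one that flips only at the end
shallow = [np.array([0.5, 0.5])] * 5
late = [np.array([1.0, 0.0])] * 4 + [np.array([0.0, 1.0])]
print(deep_thinking_ratio([shallow, late]))  # → 0.5
```

On a real model, `layer_dists` would come from capturing hidden states at every layer during decoding and applying softmax over the unembedding projection at each one.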

https://arxiv.org/pdf/2602.13517

Think@n: Better Accuracy at 50% the Cost

The research team used this innovative approach to create Think@n, a new way to scale AI performance during inference.

Most devs use Self-Consistency (Cons@n), where they sample 48 different answers and use majority voting to pick the best one. This is very expensive because you have to generate every single token for every answer.

Think@n changes the game by using ‘early halting’:

The model starts generating multiple candidate answers.

After just 50 prefix tokens, the system calculates the DTR for each candidate.

It immediately stops generating the ‘unpromising’ candidates with low DTR.

It only finishes the candidates with high deep-thinking scores.

The Results on AIME 2025

Method                         | Accuracy | Avg. Cost (k tokens)
Cons@n (Majority Vote)         | 92.7%    | 307.6
Think@n (DTR-based Selection)  | 94.7%    | 155.4

On the AIME 25 math benchmark, Think@n achieved higher accuracy than standard voting while reducing the inference cost by 49%.

Key Takeaways

Token count is a poor predictor of accuracy: Raw output length has an average negative correlation (r = -0.59) with performance, meaning longer reasoning traces often signal ‘overthinking’ rather than higher quality.

Deep-thinking tokens define true effort: Unlike simple tokens that stabilize in early layers, deep-thinking tokens are those whose internal predictions undergo significant revision in deeper model layers before converging.

The Deep-Thinking Ratio (DTR) is a superior metric: DTR measures the proportion of deep-thinking tokens in a sequence and exhibits a robust positive correlation with accuracy (average r = 0.683), consistently outperforming length-based or confidence-based baselines.

Think@n enables efficient test-time scaling: By prioritizing and finishing only the samples with high deep-thinking ratios, the Think@n strategy matches or exceeds the performance of standard majority voting (Cons@n).

Massive cost reduction via early halting: Because DTR can be estimated from a short prefix of just 50 tokens, unpromising generations can be rejected early, reducing total inference costs by approximately 50%.

Check out the Paper.
The post A New Google AI Research Proposes Deep-Thinking Ratio to Improve LLM Accuracy While Cutting Total Inference Costs by Half appeared first on MarkTechPost.

How to Design an Agentic Workflow for Tool-Driven Route Optimization w …

In this tutorial, we build a production-style Route Optimizer Agent for a logistics dispatch center using the latest LangChain agent APIs. We design a tool-driven workflow in which the agent reliably computes distances, ETAs, and optimal routes rather than guessing, and we enforce structured outputs to make the results directly usable in downstream systems. We integrate geographic calculations, configurable speed profiles, traffic buffers, and multi-stop route optimization, ensuring the agent behaves deterministically while still reasoning flexibly through tools.

!pip -q install -U langchain langchain-openai pydantic

import os
from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY (input hidden): ")

from typing import Dict, List, Optional, Tuple, Any
from math import radians, sin, cos, sqrt, atan2

from pydantic import BaseModel, Field, ValidationError

from langchain_openai import ChatOpenAI
from langchain.tools import tool
from langchain.agents import create_agent

We set up the execution environment and ensure all required libraries are installed and imported correctly. We securely load the OpenAI API key so the agent can interact with the language model without hardcoding credentials. We also prepare the core dependencies that power tools, agents, and structured outputs.

SITES: Dict[str, Dict[str, Any]] = {
    "Rig_A": {"lat": 23.5880, "lon": 58.3829, "type": "rig"},
    "Rig_B": {"lat": 23.6100, "lon": 58.5400, "type": "rig"},
    "Rig_C": {"lat": 23.4500, "lon": 58.3000, "type": "rig"},
    "Yard_Main": {"lat": 23.5700, "lon": 58.4100, "type": "yard"},
    "Depot_1": {"lat": 23.5200, "lon": 58.4700, "type": "depot"},
    "Depot_2": {"lat": 23.6400, "lon": 58.4300, "type": "depot"},
}

SPEED_PROFILES: Dict[str, float] = {
    "highway": 90.0,
    "arterial": 65.0,
    "local": 45.0,
}

DEFAULT_TRAFFIC_MULTIPLIER = 1.10

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    R = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))  # arc angle; this line was missing in the original listing
    return R * c

We define the core domain data representing rigs, yards, and depots along with their geographic coordinates. We establish speed profiles and a default traffic multiplier to reflect realistic driving conditions. We also implement the Haversine distance function, which serves as the mathematical backbone of all routing decisions.
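The Haversine computation is easy to sanity-check against a known geodesic. Restated self-contained here (including the intermediate arc-angle term `c`, which the distance formula requires):

```python
from math import radians, sin, cos, sqrt, atan2

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance on a sphere of Earth's mean radius (6371 km)
    R = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

# One degree of longitude at the equator is roughly 111.19 km
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 2))  # → 111.19
```

Any route-planning result downstream inherits its accuracy directly from this one function, which is why verifying it in isolation is worthwhile.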

def _normalize_site_name(name: str) -> str:
    return name.strip()

def _assert_site_exists(name: str) -> None:
    if name not in SITES:
        raise ValueError(f"Unknown site '{name}'. Use list_sites() or suggest_site().")

def _distance_between(a: str, b: str) -> float:
    _assert_site_exists(a)
    _assert_site_exists(b)
    sa, sb = SITES[a], SITES[b]
    return float(haversine_km(sa["lat"], sa["lon"], sb["lat"], sb["lon"]))

def _eta_minutes(distance_km: float, speed_kmph: float, traffic_multiplier: float) -> float:
    speed = max(float(speed_kmph), 1e-6)
    base_minutes = (distance_km / speed) * 60.0
    return float(base_minutes * max(float(traffic_multiplier), 0.0))

def compute_route_metrics(path: List[str], speed_kmph: float, traffic_multiplier: float) -> Dict[str, Any]:
    if len(path) < 2:
        raise ValueError("Route path must include at least origin and destination.")
    for s in path:
        _assert_site_exists(s)
    legs = []
    total_km = 0.0
    total_min = 0.0
    for i in range(len(path) - 1):
        a, b = path[i], path[i + 1]
        d_km = _distance_between(a, b)
        t_min = _eta_minutes(d_km, speed_kmph, traffic_multiplier)
        legs.append({"from": a, "to": b, "distance_km": d_km, "eta_minutes": t_min})
        total_km += d_km
        total_min += t_min
    return {"route": path, "distance_km": float(total_km), "eta_minutes": float(total_min), "legs": legs}

We build the low-level utility functions that validate site names and compute distances and travel times. We implement logic to calculate per-leg and total route metrics deterministically. This ensures that every ETA and distance returned by the agent is based on explicit computation rather than inference.

def _all_paths_with_waypoints(origin: str, destination: str, waypoints: List[str], max_stops: int) -> List[List[str]]:
    from itertools import permutations
    waypoints = [w for w in waypoints if w not in (origin, destination)]
    max_stops = int(max(0, max_stops))
    candidates = []
    for k in range(0, min(len(waypoints), max_stops) + 1):
        for perm in permutations(waypoints, k):
            candidates.append([origin, *perm, destination])
    if [origin, destination] not in candidates:
        candidates.insert(0, [origin, destination])
    return candidates

def find_best_route(origin: str, destination: str, allowed_waypoints: Optional[List[str]], max_stops: int, speed_kmph: float, traffic_multiplier: float, objective: str, top_k: int) -> Dict[str, Any]:
    origin = _normalize_site_name(origin)
    destination = _normalize_site_name(destination)
    _assert_site_exists(origin)
    _assert_site_exists(destination)
    allowed_waypoints = allowed_waypoints or []
    for w in allowed_waypoints:
        _assert_site_exists(_normalize_site_name(w))
    objective = (objective or "eta").strip().lower()
    if objective not in {"eta", "distance"}:
        raise ValueError("objective must be one of: 'eta', 'distance'")
    top_k = max(1, int(top_k))
    candidates = _all_paths_with_waypoints(origin, destination, allowed_waypoints, max_stops=max_stops)
    scored = []
    for path in candidates:
        metrics = compute_route_metrics(path, speed_kmph=speed_kmph, traffic_multiplier=traffic_multiplier)
        score = metrics["eta_minutes"] if objective == "eta" else metrics["distance_km"]
        scored.append((score, metrics))
    scored.sort(key=lambda x: x[0])
    best = scored[0][1]
    alternatives = [m for _, m in scored[1:top_k]]
    return {"best": best, "alternatives": alternatives, "objective": objective}

We introduce multi-stop routing logic by generating candidate paths with optional waypoints. We evaluate each candidate route against a clear optimization objective, such as ETA or distance. We then rank routes and extract the best option along with a set of strong alternatives.

@tool
def list_sites(site_type: Optional[str] = None) -> List[str]:
    """List all known site names, optionally filtered by type (rig, yard, depot)."""
    if site_type:
        st = site_type.strip().lower()
        return sorted([k for k, v in SITES.items() if str(v.get("type", "")).lower() == st])
    return sorted(SITES.keys())

@tool
def get_site_details(site: str) -> Dict[str, Any]:
    """Return coordinates and type for a single site."""
    s = _normalize_site_name(site)
    _assert_site_exists(s)
    return {"site": s, **SITES[s]}

@tool
def suggest_site(query: str, max_suggestions: int = 5) -> List[str]:
    """Suggest the closest-matching site names for a possibly misspelled query."""
    q = (query or "").strip().lower()
    max_suggestions = max(1, int(max_suggestions))
    scored = []
    for name in SITES.keys():
        n = name.lower()
        common = len(set(q) & set(n))
        bonus = 5 if q and q in n else 0
        scored.append((common + bonus, name))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [name for _, name in scored[:max_suggestions]]

@tool
def compute_direct_route(origin: str, destination: str, road_class: str = "arterial", traffic_multiplier: float = DEFAULT_TRAFFIC_MULTIPLIER) -> Dict[str, Any]:
    """Compute distance and ETA for a direct origin-to-destination route."""
    origin = _normalize_site_name(origin)
    destination = _normalize_site_name(destination)
    rc = (road_class or "arterial").strip().lower()
    if rc not in SPEED_PROFILES:
        raise ValueError(f"Unknown road_class '{road_class}'. Use one of: {sorted(SPEED_PROFILES.keys())}")
    speed = SPEED_PROFILES[rc]
    return compute_route_metrics([origin, destination], speed_kmph=speed, traffic_multiplier=float(traffic_multiplier))

@tool
def optimize_route(origin: str, destination: str, allowed_waypoints: Optional[List[str]] = None, max_stops: int = 2, road_class: str = "arterial", traffic_multiplier: float = DEFAULT_TRAFFIC_MULTIPLIER, objective: str = "eta", top_k: int = 3) -> Dict[str, Any]:
    """Find the best multi-stop route given optional waypoints and an objective ('eta' or 'distance')."""
    origin = _normalize_site_name(origin)
    destination = _normalize_site_name(destination)
    rc = (road_class or "arterial").strip().lower()
    if rc not in SPEED_PROFILES:
        raise ValueError(f"Unknown road_class '{road_class}'. Use one of: {sorted(SPEED_PROFILES.keys())}")
    speed = SPEED_PROFILES[rc]
    allowed_waypoints = allowed_waypoints or []
    allowed_waypoints = [_normalize_site_name(w) for w in allowed_waypoints]
    return find_best_route(origin, destination, allowed_waypoints, int(max_stops), float(speed), float(traffic_multiplier), str(objective), int(top_k))

We expose the routing and discovery logic as callable tools for the agent. We allow the agent to list sites, inspect site details, resolve ambiguous names, and compute both direct and optimized routes. This tool layer ensures that the agent always reasons by calling verified functions rather than hallucinating results.

class RouteLeg(BaseModel):
    from_site: str
    to_site: str
    distance_km: float
    eta_minutes: float

class RoutePlan(BaseModel):
    route: List[str]
    distance_km: float
    eta_minutes: float
    legs: List[RouteLeg]
    objective: str

class RouteDecision(BaseModel):
    chosen: RoutePlan
    alternatives: List[RoutePlan] = []
    assumptions: Dict[str, Any] = {}
    notes: str = ""
    audit: List[str] = []

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

SYSTEM_PROMPT = (
    "You are the Route Optimizer Agent for a logistics dispatch center.\n"
    "You MUST use tools for any distance/ETA calculation.\n"
    "Return ONLY the structured RouteDecision."
)

route_agent = create_agent(
    model=llm,
    tools=[list_sites, get_site_details, suggest_site, compute_direct_route, optimize_route],
    system_prompt=SYSTEM_PROMPT,
    response_format=RouteDecision,
)

def get_route_decision(origin: str, destination: str, road_class: str = "arterial", traffic_multiplier: float = DEFAULT_TRAFFIC_MULTIPLIER, allowed_waypoints: Optional[List[str]] = None, max_stops: int = 2, objective: str = "eta", top_k: int = 3) -> RouteDecision:
    user_msg = {
        "role": "user",
        "content": (
            f"Optimize the route from {origin} to {destination}.\n"
            f"road_class={road_class}, traffic_multiplier={traffic_multiplier}\n"
            f"objective={objective}, top_k={top_k}\n"
            f"allowed_waypoints={allowed_waypoints}, max_stops={max_stops}\n"
            "Return the structured RouteDecision only."
        ),
    }
    result = route_agent.invoke({"messages": [user_msg]})
    return result["structured_response"]

decision1 = get_route_decision("Yard_Main", "Rig_B", road_class="arterial", traffic_multiplier=1.12)
print(decision1.model_dump())

decision2 = get_route_decision("Rig_C", "Rig_B", road_class="highway", traffic_multiplier=1.08, allowed_waypoints=["Depot_1", "Depot_2", "Yard_Main"], max_stops=2, objective="eta", top_k=3)
print(decision2.model_dump())

We define strict Pydantic schemas to enforce structured, machine-readable outputs from the agent. We initialize the language model and create the agent with a clear system prompt and response format. We then demonstrate how to invoke the agent and obtain reliable route decisions ready for real logistics workflows.

In conclusion, we have implemented a robust, extensible route optimization agent that selects the best path between sites while clearly explaining its assumptions and alternatives. We demonstrated how combining deterministic routing logic with a tool-calling LLM produces reliable, auditable decisions suitable for real logistics operations. This foundation allows us to easily extend the system with live traffic data, fleet constraints, or cost-based objectives, making the agent a practical component in a larger dispatch or fleet-management platform.

Check out the Full Codes here.
The post How to Design an Agentic Workflow for Tool-Driven Route Optimization with Deterministic Computation and Structured Outputs appeared first on MarkTechPost.

Is There a Community Edition of Palantir? Meet OpenPlanter: An Open So …

The balance of power in the digital age is shifting. While governments and large corporations have long used data to track individuals, a new open-source project called OpenPlanter is giving that power back to the public. Created by a developer known as ‘Shin Megami Boson’, OpenPlanter is a recursive language-model investigation agent. Its goal is simple: help you keep tabs on your government, since they are almost certainly keeping tabs on you.

Solving the ‘Heterogeneous Data’ Problem

Investigative work is difficult because data is messy. Public records are often spread across 100 different formats. You might have a CSV of campaign finance records, a JSON file of government contracts, and a PDF of lobbying disclosures.

OpenPlanter ingests these disparate structured and unstructured data sources effortlessly. It uses Large Language Models (LLMs) to perform entity resolution. This is the process of identifying when different records refer to the same person or company. Once it connects these dots, the agent probabilistically looks for anomalies. It searches for patterns that a human might miss, such as a sudden spike in contract wins following a specific lobbying event.
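OpenPlanter’s actual resolution logic is LLM-driven and not reproduced in this article. As a minimal illustration, here is the deterministic core of entity resolution — normalizing names and clustering records from different sources under one canonical key. All names and source files below are invented:

```python
import re
from collections import defaultdict

SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "company", "co"}

def normalize(name):
    # Lowercase, strip punctuation and common corporate suffixes
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in SUFFIXES)

def resolve_entities(records):
    # Group records from heterogeneous sources under one canonical key
    clusters = defaultdict(list)
    for rec in records:
        clusters[normalize(rec["name"])].append(rec)
    return dict(clusters)

records = [
    {"name": "Acme, Inc.", "source": "contracts.csv"},
    {"name": "ACME INC", "source": "lobbying.pdf"},
    {"name": "Globex LLC", "source": "grants.json"},
]
merged = resolve_entities(records)
print(len(merged["acme"]))  # → 2
```

A production system would layer fuzzy matching and LLM judgment on top of this normalization step, but the clustering skeleton stays the same.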

The Architecture: Recursive Sub-Agent Delegation

What makes OpenPlanter unique is its recursive engine. Most AI agents handle 1 request at a time. OpenPlanter, however, breaks large objectives into smaller pieces. If you give it a massive task, it uses a sub-agent delegation strategy.

The agent has a default max-depth of 4. This means the main agent can spawn a sub-agent, which can spawn another, and so on. These agents work in parallel to:

Resolve entities across massive datasets.

Link datasets that have no common ID numbers.

Construct evidence chains that back up every single finding.

This recursive approach allows the system to handle investigations that are too large for a single ‘context window.’
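The article describes the delegation strategy but not its code. A toy sketch of max-depth-bounded recursive delegation with parallel sub-agents might look like the following; the `decompose`/`solve`/`merge` stubs stand in for LLM calls and are placeholder logic only:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_DEPTH = 4  # matches OpenPlanter's documented default

# Placeholder leaf logic so the sketch runs: split arithmetic strings
# and sum the leaves. A real agent would call an LLM at each step.
def decompose(task):
    return task.split("+") if "+" in task else []

def solve(task):
    return int(task)

def merge(task, results):
    return sum(results)

def run_task(task, depth=0):
    subtasks = decompose(task) if depth < MAX_DEPTH else []
    if not subtasks:
        return solve(task)
    # Sub-agents work in parallel, each one level deeper than the parent.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda t: run_task(t, depth + 1), subtasks))
    return merge(task, results)

print(run_task("1+2+3"))  # → 6
```

The depth bound is what keeps spawning from running away: once `depth` reaches `MAX_DEPTH`, a sub-agent must solve its task directly instead of delegating further.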

The 2026 AI Stack

OpenPlanter is built for the high-performance requirements of 2026. It is written in Python 3.10+ and integrates with the most advanced models available today. The technical documentation lists several supported providers:

OpenAI: It uses gpt-5.2 as the default.

Anthropic: It supports claude-opus-4-6.

OpenRouter: It defaults to anthropic/claude-sonnet-4-5.

Cerebras: It uses qwen-3-235b-a22b-instruct-2507 for high-speed tasks.

The system also uses Exa for web searches and Voyage for high-accuracy embeddings. This multi-model strategy ensures that the agent uses the best ‘brain’ for each specific sub-task.

19 Tools for Digital Forensics

The agent is equipped with 19 specialized tools. These tools allow it to interact with the real world rather than just ‘chatting.’ These are organized into 4 core areas:

File I/O and Workspace: Tools like read_file, write_file, and hashline_edit allow the agent to manage its own database of findings.

Shell Execution: The agent can use run_shell to execute actual code. It can write a Python script to analyze a dataset and then run that script to get results.

Web Retrieval: With web_search and fetch_url, it can pull live data from government registries or news sites.

Planning and Logic: The think tool lets the agent pause and strategize. It uses acceptance-criteria to verify that a sub-task was completed correctly before moving to the next step.

Deployment and Interface

OpenPlanter is designed to be accessible but powerful. It features a Terminal User Interface (TUI) built with rich and prompt_toolkit. The interface includes a splash art screen of ASCII potted plants, but the work it does is serious.

You can get started quickly using Docker. By running docker compose up, the agent starts in a container. This is a critical security feature because it isolates the agent’s run_shell commands from the user’s host operating system.

The command-line interface allows for ‘headless’ tasks. You can run a single command like:

openplanter-agent --task "Flag all vendor overlaps in lobbying data" --workspace ./data

The agent will then work autonomously until it produces a final report.

Key Takeaways

Autonomous Recursive Logic: Unlike standard agents, OpenPlanter uses a recursive sub-agent delegation strategy (default max-depth of 4). It breaks complex investigative objectives into smaller sub-tasks, parallelizing work across multiple agents to build detailed evidence chains.

Heterogeneous Data Correlation: The agent is built to ingest and resolve disparate structured and unstructured data. It can simultaneously process CSV files, JSON records, and unstructured text (like PDFs) to identify entities across fragmented datasets.

Probabilistic Anomaly Detection: By performing entity resolution, OpenPlanter automatically connects records—such as matching a corporate alias to a lobbying disclosure—and looks for probabilistic anomalies to surface hidden connections between government spending and private interests.

High-End 2026 Model Stack: The system is provider-agnostic and utilizes the latest frontier models, including OpenAI gpt-5.2, Anthropic claude-opus-4-6, and Cerebras qwen-3-235b-a22b-instruct-2507 for high-speed inference.

Integrated Toolset for Forensics: OpenPlanter features 19 distinct tools, including shell execution (run_shell), web search (Exa), and file patching (hashline_edit). This allows it to write and run its own analysis scripts while verifying results against real-world acceptance criteria.

Check out the Repo here.

Disclaimer: MarkTechPost does not endorse the OpenPlanter project and provides this technical report for informational purposes only.
The post Is There a Community Edition of Palantir? Meet OpenPlanter: An Open Source Recursive AI Agent for Your Micro Surveillance Use Cases appeared first on MarkTechPost.

A Coding Guide to High-Quality Image Generation, Control, and Editing …

In this tutorial, we design a practical image-generation workflow using the Diffusers library. We start by stabilizing the environment, then generate high-quality images from text prompts using Stable Diffusion with an optimized scheduler. We accelerate inference with a LoRA-based latent consistency approach, guide composition with ControlNet under edge conditioning, and finally perform localized edits via inpainting. Also, we focus on real-world techniques that balance image quality, speed, and controllability.

!pip -q uninstall -y pillow Pillow || true
!pip -q install --upgrade --force-reinstall "pillow<12.0"
!pip -q install --upgrade diffusers transformers accelerate safetensors huggingface_hub opencv-python

import os, math, random
import torch
import numpy as np
import cv2
from PIL import Image, ImageDraw, ImageFilter
from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionInpaintPipeline,
    ControlNetModel,
    StableDiffusionControlNetPipeline,
    UniPCMultistepScheduler,
)

We prepare a clean and compatible runtime by resolving dependency conflicts and installing all required libraries. We ensure image processing works reliably by pinning the correct Pillow version and loading the Diffusers ecosystem. We also import all core modules needed for generation, control, and inpainting workflows.

def seed_everything(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def to_grid(images, cols=2, bg=255):
    if isinstance(images, Image.Image):
        images = [images]
    w, h = images[0].size
    rows = math.ceil(len(images) / cols)
    grid = Image.new("RGB", (cols * w, rows * h), (bg, bg, bg))
    for i, im in enumerate(images):
        grid.paste(im, ((i % cols) * w, (i // cols) * h))
    return grid

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
print("device:", device, "| dtype:", dtype)

We define utility functions to ensure reproducibility and to organize visual outputs efficiently. We set global random seeds so our generations remain consistent across runs. We also detect the available hardware and configure precision to optimize performance on the GPU or CPU.

seed_everything(7)
BASE_MODEL = "runwayml/stable-diffusion-v1-5"

pipe = StableDiffusionPipeline.from_pretrained(
    BASE_MODEL,
    torch_dtype=dtype,
    safety_checker=None,
).to(device)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

if device == "cuda":
    pipe.enable_attention_slicing()
    pipe.enable_vae_slicing()

prompt = "a cinematic photo of a futuristic street market at dusk, ultra-detailed, 35mm, volumetric lighting"
negative_prompt = "blurry, low quality, deformed, watermark, text"

img_text = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,
    guidance_scale=6.5,
    width=768,
    height=512,
).images[0]

We initialize the base Stable Diffusion pipeline and switch to a more efficient UniPC scheduler. We generate a high-quality image directly from a text prompt using carefully chosen guidance and resolution settings. This establishes a strong baseline for subsequent improvements in speed and control.

LCM_LORA = "latent-consistency/lcm-lora-sdv1-5"
pipe.load_lora_weights(LCM_LORA)

try:
    pipe.fuse_lora()
    lora_fused = True
except Exception as e:
    lora_fused = False
    print("LoRA fuse skipped:", e)

fast_prompt = "a clean product photo of a minimal smartwatch on a reflective surface, studio lighting"
fast_images = []
for steps in [4, 6, 8]:
    fast_images.append(
        pipe(
            prompt=fast_prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=steps,
            guidance_scale=1.5,
            width=768,
            height=512,
        ).images[0]
    )

grid_fast = to_grid(fast_images, cols=3)
print("LoRA fused:", lora_fused)

W, H = 768, 512
layout = Image.new("RGB", (W, H), "white")
draw = ImageDraw.Draw(layout)
draw.rectangle([40, 80, 340, 460], outline="black", width=6)
draw.ellipse([430, 110, 720, 400], outline="black", width=6)
draw.line([0, 420, W, 420], fill="black", width=5)

edges = cv2.Canny(np.array(layout), 80, 160)
edges = np.stack([edges] * 3, axis=-1)
canny_image = Image.fromarray(edges)

CONTROLNET = "lllyasviel/sd-controlnet-canny"
controlnet = ControlNetModel.from_pretrained(
    CONTROLNET,
    torch_dtype=dtype,
).to(device)

cn_pipe = StableDiffusionControlNetPipeline.from_pretrained(
    BASE_MODEL,
    controlnet=controlnet,
    torch_dtype=dtype,
    safety_checker=None,
).to(device)

cn_pipe.scheduler = UniPCMultistepScheduler.from_config(cn_pipe.scheduler.config)

if device == "cuda":
    cn_pipe.enable_attention_slicing()
    cn_pipe.enable_vae_slicing()

cn_prompt = "a modern cafe interior, architectural render, soft daylight, high detail"
img_controlnet = cn_pipe(
    prompt=cn_prompt,
    negative_prompt=negative_prompt,
    image=canny_image,
    num_inference_steps=25,
    guidance_scale=6.5,
    controlnet_conditioning_scale=1.0,
).images[0]

We accelerate inference by loading and fusing a LoRA adapter and demonstrate fast sampling with very few diffusion steps. We then construct a structural conditioning image and apply ControlNet to guide the layout of the generated scene. This allows us to preserve composition while still benefiting from creative text guidance.

mask = Image.new("L", img_controlnet.size, 0)
mask_draw = ImageDraw.Draw(mask)
mask_draw.rectangle([60, 90, 320, 170], fill=255)
mask = mask.filter(ImageFilter.GaussianBlur(2))

inpaint_pipe = StableDiffusionInpaintPipeline.from_pretrained(
    BASE_MODEL,
    torch_dtype=dtype,
    safety_checker=None,
).to(device)

inpaint_pipe.scheduler = UniPCMultistepScheduler.from_config(inpaint_pipe.scheduler.config)

if device == "cuda":
    inpaint_pipe.enable_attention_slicing()
    inpaint_pipe.enable_vae_slicing()

inpaint_prompt = "a glowing neon sign that says 'CAFÉ', cyberpunk style, realistic lighting"

img_inpaint = inpaint_pipe(
    prompt=inpaint_prompt,
    negative_prompt=negative_prompt,
    image=img_controlnet,
    mask_image=mask,
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]

os.makedirs("outputs", exist_ok=True)
img_text.save("outputs/text2img.png")
grid_fast.save("outputs/lora_fast_grid.png")
layout.save("outputs/layout.png")
canny_image.save("outputs/canny.png")
img_controlnet.save("outputs/controlnet.png")
mask.save("outputs/mask.png")
img_inpaint.save("outputs/inpaint.png")

print("Saved outputs:", sorted(os.listdir("outputs")))
print("Done.")
print(“Done.”)

We create a mask to isolate a specific region and apply inpainting to modify only that part of the image. We refine the selected area using a targeted prompt while keeping the rest intact. Finally, we save all intermediate and final outputs to disk for inspection and reuse.

In conclusion, we demonstrated how a single Diffusers pipeline can evolve into a flexible, production-ready image generation system. We explained how to move from pure text-to-image generation to fast sampling, structural control, and targeted image editing without changing frameworks or tooling. This tutorial highlights how we can combine schedulers, LoRA adapters, ControlNet, and inpainting to create controllable and efficient generative pipelines that are easy to extend for more advanced creative or applied use cases.

Check out the Full Codes here.
The post A Coding Guide to High-Quality Image Generation, Control, and Editing Using HuggingFace Diffusers appeared first on MarkTechPost.

How to Design a Swiss Army Knife Research Agent with Tool-Using AI, We …

In this tutorial, we build a “Swiss Army Knife” research agent that goes far beyond simple chat interactions and actively solves multi-step research problems end-to-end. We combine a tool-using agent architecture with live web search, local PDF ingestion, vision-based chart analysis, and automated report generation to demonstrate how modern agents can reason, verify, and produce structured outputs. By wiring together small agents, OpenAI models, and practical data-extraction utilities, we show how a single agent can explore sources, cross-check claims, and synthesize findings into professional-grade Markdown and DOCX reports.

%pip -q install -U smolagents openai trafilatura duckduckgo-search pypdf pymupdf python-docx pillow tqdm

import os, re, json, getpass
from typing import List, Dict, Any
import requests
import trafilatura
from duckduckgo_search import DDGS
from pypdf import PdfReader
import fitz
from docx import Document
from docx.shared import Pt
from datetime import datetime

from openai import OpenAI
from smolagents import CodeAgent, OpenAIModel, tool

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Paste your OpenAI API key (hidden): ").strip()
print("OPENAI_API_KEY set:", "YES" if os.environ.get("OPENAI_API_KEY") else "NO")

if not os.environ.get("SERPER_API_KEY"):
    serper = getpass.getpass("Optional: Paste SERPER_API_KEY for Google results (press Enter to skip): ").strip()
    if serper:
        os.environ["SERPER_API_KEY"] = serper
print("SERPER_API_KEY set:", "YES" if os.environ.get("SERPER_API_KEY") else "NO")

client = OpenAI()

def _now():
    return datetime.utcnow().strftime("%Y-%m-%d %H:%M:%SZ")

def _safe_filename(s: str) -> str:
    s = re.sub(r"[^a-zA-Z0-9._-]+", "_", s).strip("_")
    return s[:180] if s else "file"

We set up the full execution environment and securely load all required credentials without hardcoding secrets. We import all dependencies required for web search, document parsing, vision analysis, and agent orchestration. We also initialize shared utilities to standardize timestamps and file naming throughout the workflow.

try:
    from google.colab import files
    os.makedirs("/content/pdfs", exist_ok=True)
    uploaded = files.upload()
    for name, data in uploaded.items():
        if name.lower().endswith(".pdf"):
            with open(f"/content/pdfs/{name}", "wb") as f:
                f.write(data)
    print("PDFs in /content/pdfs:", os.listdir("/content/pdfs"))
except Exception as e:
    print("Upload skipped:", str(e))

def web_search(query: str, k: int = 6) -> List[Dict[str, str]]:
    serper_key = os.environ.get("SERPER_API_KEY", "").strip()
    if serper_key:
        resp = requests.post(
            "https://google.serper.dev/search",
            headers={"X-API-KEY": serper_key, "Content-Type": "application/json"},
            json={"q": query, "num": k},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        out = []
        for item in (data.get("organic") or [])[:k]:
            out.append({
                "title": item.get("title", ""),
                "url": item.get("link", ""),
                "snippet": item.get("snippet", ""),
            })
        return out

    out = []
    with DDGS() as ddgs:
        for r in ddgs.text(query, max_results=k):
            out.append({
                "title": r.get("title", ""),
                "url": r.get("href", ""),
                "snippet": r.get("body", ""),
            })
    return out

def fetch_url_text(url: str) -> Dict[str, Any]:
    try:
        downloaded = trafilatura.fetch_url(url)
        if not downloaded:
            return {"url": url, "ok": False, "error": "fetch_failed", "text": ""}
        text = trafilatura.extract(downloaded, include_comments=False, include_tables=True)
        if not text:
            return {"url": url, "ok": False, "error": "extract_failed", "text": ""}
        title_guess = next((ln.strip() for ln in text.splitlines() if ln.strip()), "")[:120]
        return {"url": url, "ok": True, "title_guess": title_guess, "text": text}
    except Exception as e:
        return {"url": url, "ok": False, "error": str(e), "text": ""}

We enable local PDF ingestion and establish a flexible web search pipeline that works with or without a paid search API. We show how we gracefully handle optional inputs while maintaining a reliable research flow. We also implement robust URL fetching and text extraction to prepare clean source material for downstream reasoning.

import base64  # needed for the vision call below (not in the original import list)

def read_pdf_text(pdf_path: str, max_pages: int = 30) -> Dict[str, Any]:
    reader = PdfReader(pdf_path)
    pages = min(len(reader.pages), max_pages)
    chunks = []
    for i in range(pages):
        try:
            chunks.append(reader.pages[i].extract_text() or "")
        except Exception:
            chunks.append("")
    return {"pdf_path": pdf_path, "pages_read": pages, "text": "\n\n".join(chunks).strip()}

def extract_pdf_images(pdf_path: str, out_dir: str = "/content/extracted_images", max_pages: int = 10) -> List[str]:
    os.makedirs(out_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    saved = []
    pages = min(len(doc), max_pages)
    base = _safe_filename(os.path.basename(pdf_path).rsplit(".", 1)[0])

    for p in range(pages):
        page = doc[p]
        img_list = page.get_images(full=True)
        for img_i, img in enumerate(img_list):
            xref = img[0]
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha >= 4:  # CMYK or similar: convert to RGB before saving
                pix = fitz.Pixmap(fitz.csRGB, pix)
            img_path = os.path.join(out_dir, f"{base}_p{p+1}_img{img_i+1}.png")
            pix.save(img_path)
            saved.append(img_path)

    doc.close()
    return saved

def vision_analyze_image(image_path: str, question: str, model: str = "gpt-4.1-mini") -> Dict[str, Any]:
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")

    resp = client.responses.create(
        model=model,
        input=[{
            "role": "user",
            "content": [
                {"type": "input_text", "text": f"Answer concisely and accurately.\n\nQuestion: {question}"},
                {"type": "input_image", "image_url": f"data:image/png;base64,{img_b64}"},
            ],
        }],
    )
    return {"image_path": image_path, "answer": resp.output_text}

We focus on deep document understanding by extracting structured text and visual artifacts from PDFs. We integrate a vision-capable model to interpret charts and figures instead of treating them as opaque images. We ensure that numerical trends and visual insights can be converted into explicit, text-based evidence.

def write_markdown(path: str, content: str) -> str:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return path

def write_docx_from_markdown(docx_path: str, md: str, title: str = "Research Report") -> str:
    os.makedirs(os.path.dirname(docx_path), exist_ok=True)
    doc = Document()
    t = doc.add_paragraph()
    run = t.add_run(title)
    run.bold = True
    run.font.size = Pt(18)
    meta = doc.add_paragraph()
    meta.add_run(f"Generated: {_now()}").italic = True
    doc.add_paragraph("")
    for line in md.splitlines():
        line = line.rstrip()
        if not line:
            doc.add_paragraph("")
            continue
        if line.startswith("# "):
            doc.add_heading(line[2:].strip(), level=1)
        elif line.startswith("## "):
            doc.add_heading(line[3:].strip(), level=2)
        elif line.startswith("### "):
            doc.add_heading(line[4:].strip(), level=3)
        elif re.match(r"^\s*[-*]\s+", line):
            p = doc.add_paragraph(style="List Bullet")
            p.add_run(re.sub(r"^\s*[-*]\s+", "", line).strip())
        else:
            doc.add_paragraph(line)
    doc.save(docx_path)
    return docx_path

# smolagents requires each @tool to carry a docstring describing its arguments.
@tool
def t_web_search(query: str, k: int = 6) -> str:
    """Search the web and return results as a JSON string.

    Args:
        query: The search query.
        k: Maximum number of results to return.
    """
    return json.dumps(web_search(query, k), ensure_ascii=False)

@tool
def t_fetch_url_text(url: str) -> str:
    """Fetch a URL and return its extracted main text as a JSON string.

    Args:
        url: The URL to fetch.
    """
    return json.dumps(fetch_url_text(url), ensure_ascii=False)

@tool
def t_list_pdfs() -> str:
    """List the paths of uploaded PDFs as a JSON array."""
    pdf_dir = "/content/pdfs"
    if not os.path.isdir(pdf_dir):
        return json.dumps([])
    paths = [os.path.join(pdf_dir, f) for f in os.listdir(pdf_dir) if f.lower().endswith(".pdf")]
    return json.dumps(sorted(paths), ensure_ascii=False)

@tool
def t_read_pdf_text(pdf_path: str, max_pages: int = 30) -> str:
    """Extract text from a PDF and return it as a JSON string.

    Args:
        pdf_path: Path to the PDF file.
        max_pages: Maximum number of pages to read.
    """
    return json.dumps(read_pdf_text(pdf_path, max_pages=max_pages), ensure_ascii=False)

@tool
def t_extract_pdf_images(pdf_path: str, max_pages: int = 10) -> str:
    """Extract embedded images from a PDF and return their paths as a JSON array.

    Args:
        pdf_path: Path to the PDF file.
        max_pages: Maximum number of pages to scan.
    """
    imgs = extract_pdf_images(pdf_path, max_pages=max_pages)
    return json.dumps(imgs, ensure_ascii=False)

@tool
def t_vision_analyze_image(image_path: str, question: str) -> str:
    """Answer a question about an image using a vision model, returned as a JSON string.

    Args:
        image_path: Path to the image file.
        question: The question to answer about the image.
    """
    return json.dumps(vision_analyze_image(image_path, question), ensure_ascii=False)

@tool
def t_write_markdown(path: str, content: str) -> str:
    """Write Markdown content to a file and return the path.

    Args:
        path: Destination file path.
        content: Markdown content to write.
    """
    return write_markdown(path, content)

@tool
def t_write_docx_from_markdown(docx_path: str, md_path: str, title: str = "Research Report") -> str:
    """Convert a Markdown file to a DOCX report and return the DOCX path.

    Args:
        docx_path: Destination DOCX path.
        md_path: Path to the source Markdown file.
        title: Report title.
    """
    with open(md_path, "r", encoding="utf-8") as f:
        md = f.read()
    return write_docx_from_markdown(docx_path, md, title=title)

We implement the full output layer by generating Markdown reports and converting them into polished DOCX documents. We expose all core capabilities as explicit tools that the agent can reason about and invoke step by step. We ensure that every transformation from raw data to final report remains deterministic and inspectable.

model = OpenAIModel(model_id="gpt-5")

agent = CodeAgent(
    tools=[
        t_web_search,
        t_fetch_url_text,
        t_list_pdfs,
        t_read_pdf_text,
        t_extract_pdf_images,
        t_vision_analyze_image,
        t_write_markdown,
        t_write_docx_from_markdown,
    ],
    model=model,
    add_base_tools=False,
    additional_authorized_imports=["json", "re", "os", "math", "datetime", "time", "textwrap"],
)

SYSTEM_INSTRUCTIONS = """
You are a Swiss Army Knife Research Agent.
"""

def run_research(topic: str):
    os.makedirs("/content/report", exist_ok=True)
    prompt = f"""{SYSTEM_INSTRUCTIONS.strip()}

Research question:
{topic}

Steps:
1) List available PDFs (if any) and decide which are relevant.
2) Do web search for the topic.
3) Fetch and extract the text of the best sources.
4) If PDFs exist, extract text and images.
5) Visually analyze figures.
6) Write a Markdown report and convert to DOCX.
"""
    return agent.run(prompt)

topic = "Build a research brief on the most reliable design patterns for tool-using agents (2024-2026), focusing on evaluation, citations, and failure modes."
out = run_research(topic)
print(out[:1500] if isinstance(out, str) else out)

try:
    from google.colab import files
    files.download("/content/report/report.md")
    files.download("/content/report/report.docx")
except Exception as e:
    print("Download skipped:", str(e))

We assemble the complete research agent and define a structured execution plan for multi-step reasoning. We guide the agent to search, analyze, synthesize, and write using a single coherent prompt. We demonstrate how the agent produces a finished research artifact that can be reviewed, shared, and reused immediately.

In conclusion, we demonstrated how a well-designed tool-using agent can function as a reliable research assistant rather than a conversational toy. We showcased how explicit tools, disciplined prompting, and step-by-step execution allow the agent to search the web, analyze documents and visuals, and generate traceable, citation-aware reports. This approach offers a practical blueprint for building trustworthy research agents that emphasize evaluation, evidence, and failure awareness, capabilities increasingly essential for real-world AI systems.

The post How to Design a Swiss Army Knife Research Agent with Tool-Using AI, Web Search, PDF Analysis, Vision, and Automated Reporting appeared first on MarkTechPost.

NVIDIA Releases DreamDojo: An Open-Source Robot World Model Trained on 44,711 Hours of Real-World Human Video Data

Building simulators for robots has been a long-standing challenge. Traditional engines require manual coding of physics and perfect 3D models. NVIDIA is changing this with DreamDojo, a fully open-source, generalizable robot world model. Instead of using a physics engine, DreamDojo ‘dreams’ the results of robot actions directly in pixels.

https://arxiv.org/pdf/2602.06949

Scaling Robotics with 44k+ Hours of Human Experience

The biggest hurdle for AI in robotics is data. Collecting robot-specific data is expensive and slow. DreamDojo solves this by learning from 44k+ hours of egocentric human videos. This dataset, called DreamDojo-HV, is the largest of its kind for world model pretraining.

It features 6,015 unique tasks across 1M+ trajectories.

The data covers 9,869 unique scenes and 43,237 unique objects.

Pretraining used 100,000 NVIDIA H100 GPU hours to build 2B and 14B model variants.

Humans have already mastered complex physics, such as pouring liquids or folding clothes. DreamDojo uses this human data to give robots a ‘common sense’ understanding of how the world works.

https://arxiv.org/pdf/2602.06949

Bridging the Gap with Latent Actions

Human videos do not have robot motor commands. To make these videos ‘robot-readable,’ NVIDIA’s research team introduced continuous latent actions. This system uses a spatiotemporal Transformer VAE to extract actions directly from pixels.

The VAE encoder takes 2 consecutive frames and outputs a 32-dimensional latent vector.

This vector represents the most critical motion between frames.

The design creates an information bottleneck that disentangles action from visual context.

This allows the model to learn physics from human video and transfer that understanding to different robot bodies.
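The bottleneck idea can be pictured with a toy stand-in. The real encoder is a spatiotemporal Transformer VAE; the frame resolution, linear projection, and noise scale below are illustrative assumptions, with only the 32-dimensional latent taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_SHAPE = (32, 32, 3)  # hypothetical frame resolution, not the paper's
LATENT_DIM = 32            # the 32-dimensional latent action from the paper

# A single random linear map stands in for the Transformer VAE encoder:
# it forces the motion between two frames through a narrow bottleneck.
W = rng.standard_normal((2 * int(np.prod(FRAME_SHAPE)), LATENT_DIM)) * 0.01

def encode_latent_action(frame_t, frame_t1):
    """Compress two consecutive frames into one latent action vector."""
    x = np.concatenate([frame_t.ravel(), frame_t1.ravel()])
    mu = x @ W                                          # latent mean
    return mu + 0.1 * rng.standard_normal(LATENT_DIM)  # reparameterized sample

z = encode_latent_action(rng.random(FRAME_SHAPE), rng.random(FRAME_SHAPE))
print(z.shape)
```

Because the latent is far smaller than the frame pair, it can only retain what changed between frames, which is exactly the action signal.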

https://arxiv.org/pdf/2602.06949

Better Physics through Architecture

DreamDojo is based on the Cosmos-Predict2.5 latent video diffusion model. It uses the WAN2.2 tokenizer, which has a temporal compression ratio of 4. The team improved the architecture with 3 key features:

Relative Actions: The model uses joint deltas instead of absolute poses. This makes it easier for the model to generalize across different trajectories.

Chunked Action Injection: It injects 4 consecutive actions into each latent frame. This aligns the actions with the tokenizer’s compression ratio and fixes causality confusion.

Temporal Consistency Loss: A new loss function matches predicted frame velocities to ground-truth transitions. This reduces visual artifacts and keeps objects physically consistent.
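The temporal consistency loss can be read as a velocity-matching objective. The exact formulation in the paper may differ; this is a minimal mean-squared sketch of "match predicted frame velocities to ground-truth transitions":

```python
import numpy as np

def temporal_consistency_loss(pred_frames, true_frames):
    """Penalize mismatch between predicted and ground-truth frame velocities."""
    pred_vel = np.diff(pred_frames, axis=0)  # predicted frame-to-frame change
    true_vel = np.diff(true_frames, axis=0)  # ground-truth transitions
    return float(np.mean((pred_vel - true_vel) ** 2))

# A prediction that moves at half the true speed incurs a nonzero loss.
loss = temporal_consistency_loss(
    np.array([[0.0], [1.0], [2.0]]),  # predicted frames
    np.array([[0.0], [2.0], [4.0]]),  # ground-truth frames
)
print(loss)  # 1.0
```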

Distillation for 10.81 FPS Real-Time Interaction

A simulator is only useful if it is fast. Standard diffusion models require too many denoising steps for real-time use. The NVIDIA team used a Self-Forcing distillation pipeline to solve this.

The distillation training was conducted on 64 NVIDIA H100 GPUs.

The ‘student’ model reduces denoising from 35 steps down to 4 steps.

The final model achieves a real-time speed of 10.81 FPS.

It is stable for continuous rollouts of 60 seconds (600 frames).

Unlocking Downstream Applications

DreamDojo’s speed and accuracy enable several advanced applications for AI engineers.

1. Reliable Policy Evaluation

Testing robots in the real world is risky. DreamDojo acts as a high-fidelity simulator for benchmarking.

Its simulated success rates show a Pearson correlation of r = 0.995 with real-world results.

The Mean Maximum Rank Violation (MMRV) is only 0.003.
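As a concrete illustration of the headline metric, the Pearson correlation between simulated and real success rates can be computed directly; the per-policy numbers below are made up for the example, and only the metric itself comes from the article:

```python
import numpy as np

# Hypothetical success rates for five policies, in the simulator and in the real world.
sim_success  = np.array([0.10, 0.35, 0.52, 0.71, 0.90])
real_success = np.array([0.12, 0.33, 0.55, 0.69, 0.88])

# A high r means ranking policies in the world model mirrors their real-world ranking.
r = np.corrcoef(sim_success, real_success)[0, 1]
print(round(r, 3))
```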

2. Model-Based Planning

Robots can use DreamDojo to ‘look ahead.’ A robot can simulate multiple action sequences and pick the best one.

In a fruit-packing task, this improved real-world success rates by 17%.

Compared to random sampling, it provided a 2x increase in success.
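The look-ahead loop is simple to sketch: sample candidate action sequences, roll each through the world model, and execute the best. The additive dynamics and distance-to-goal cost below are stand-ins for DreamDojo’s dreamed pixel rollouts and a task-specific scorer:

```python
import numpy as np

rng = np.random.default_rng(7)

def world_model_rollout(state, actions):
    """Stand-in for a dreamed rollout: trivial additive dynamics."""
    for a in actions:
        state = state + a
    return state

def plan(state, goal, n_candidates=16, horizon=5):
    """Sample action sequences, simulate each, and keep the lowest-cost one."""
    best_seq, best_cost = None, float("inf")
    for _ in range(n_candidates):
        seq = rng.normal(size=horizon)
        cost = abs(world_model_rollout(state, seq) - goal)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost

seq, cost = plan(state=0.0, goal=3.0)
print(len(seq))
```

With a fast world model, widening `n_candidates` trades compute for better action selection, which is where the reported gain over random sampling comes from.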

3. Live Teleoperation

Developers can teleoperate virtual robots in real time. The NVIDIA team demonstrated this using a PICO VR controller and a local desktop with an NVIDIA RTX 5090. This allows for safe and rapid data collection.

Summary of Model Performance

Metric               | DreamDojo-2B | DreamDojo-14B
Physics Correctness  | 62.50%       | 73.50%
Action Following     | 63.45%       | 72.55%
FPS (Distilled)      | 10.81        | N/A

NVIDIA has released all weights, training code, and evaluation benchmarks. This open-source release allows you to post-train DreamDojo on your own robot data today.

Key Takeaways

Massive Scale and Diversity: DreamDojo is pretrained on DreamDojo-HV, the largest egocentric human video dataset to date, featuring 44,711 hours of footage across 6,015 unique tasks and 9,869 scenes.

Unified Latent Action Proxy: To overcome the lack of action labels in human videos, the model uses continuous latent actions extracted via a spatiotemporal Transformer VAE, which serves as a hardware-agnostic control interface.

Optimized Training and Architecture: The model achieves high-fidelity physics and precise controllability by utilizing relative action transformations, chunked action injection, and a specialized temporal consistency loss.

Real-Time Performance via Distillation: Through a Self Forcing distillation pipeline, the model is accelerated to 10.81 FPS, enabling interactive applications like live teleoperation and stable, long-horizon simulations for over 1 minute.

Reliable for Downstream Tasks: DreamDojo functions as an accurate simulator for policy evaluation, showing a 0.995 Pearson correlation with real-world success rates, and can improve real-world performance by 17% when used for model-based planning.

Check out the Paper and Codes.
The post NVIDIA Releases DreamDojo: An Open-Source Robot World Model Trained on 44,711 Hours of Real-World Human Video Data appeared first on MarkTechPost.

Amazon SageMaker AI in 2025, a year in review part 1: Flexible Training Plans and price performance

In 2025, Amazon SageMaker AI saw dramatic improvements to core infrastructure offerings along four dimensions: capacity, price performance, observability, and usability. In this series of posts, we discuss these various improvements and their benefits. In Part 1, we discuss capacity improvements with the launch of Flexible Training Plans. We also describe improvements to price performance for inference workloads. In Part 2, we discuss enhancements made to observability, model customization, and model hosting.
Flexible Training Plans for SageMaker
SageMaker AI Training Plans now support inference endpoints, extending a powerful capacity reservation capability originally designed for training workloads to address the critical challenge of GPU availability for inference deployments. Deploying large language models (LLMs) for inference requires reliable GPU capacity, especially during critical evaluation periods, limited-duration production testing, or predictable burst workloads. Capacity constraints can delay deployments and impact application performance, particularly during peak hours when on-demand capacity becomes unpredictable. Training Plans can help solve this problem by making it possible to reserve compute capacity for specified time periods, facilitating predictable GPU availability precisely when teams need it most.
The reservation workflow is designed for simplicity and flexibility. You begin by searching for available capacity offerings that match your specific requirements—selecting instance type, quantity, duration, and desired time window. When you identify a suitable offering, you can create a reservation that generates an Amazon Resource Name (ARN), which serves as the key to your guaranteed capacity. The upfront, transparent pricing model helps support accurate budget planning while minimizing concerns about infrastructure availability, so teams can focus on their evaluation metrics and model performance rather than worrying about whether capacity will be available when they need it.
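Under stated assumptions, the search-then-reserve workflow can be sketched with boto3. The operation names (`search_training_plan_offerings`, `create_training_plan`) come from the Training Plans API, but the exact parameter names and the target-resource values for inference endpoints should be verified against the current SageMaker API reference; the client is injected so the sketch can be exercised offline:

```python
def reserve_capacity(sm, instance_type="ml.p5.48xlarge", count=1, hours=168):
    """Search for a matching capacity offering and reserve it, returning the plan ARN.

    `sm` is a SageMaker client (e.g. boto3.client("sagemaker")); the parameter
    names below are approximations of the Training Plans API, not a verified
    signature.
    """
    offerings = sm.search_training_plan_offerings(
        InstanceType=instance_type,
        InstanceCount=count,
        DurationHours=hours,
    )["TrainingPlanOfferings"]
    if not offerings:
        raise RuntimeError("No offering matches the requested window")
    plan = sm.create_training_plan(
        TrainingPlanName="inference-eval-plan",
        TrainingPlanOfferingId=offerings[0]["TrainingPlanOfferingId"],
    )
    return plan["TrainingPlanArn"]  # this ARN keys the guaranteed capacity
```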
Throughout the reservation lifecycle, teams maintain operational flexibility to manage their endpoints as requirements evolve. You can update endpoints to new model versions while maintaining the same reserved capacity, using iterative testing and refinement during evaluation periods. Scaling capabilities help teams adjust instance counts within their reservation limits, supporting scenarios where initial deployments are conservative, but higher throughput testing becomes necessary. This flexibility helps make sure teams aren’t locked into rigid infrastructure decisions while still being able to benefit from the reserved capacity during critical time windows.
With support for endpoint updates, scaling capabilities, and seamless capacity management, Training Plans help give you control over both GPU availability and costs for time-bound inference workloads. Whether you’re running competitive model benchmarks to select the best-performing variant, performing limited-duration A/B tests to validate model improvements, or handling predictable traffic spikes during product launches, Training Plans for inference endpoints help provide the capacity guarantees teams need with transparent, upfront pricing. This approach is particularly valuable for data science teams conducting week-long or month-long evaluation projects, where the ability to reserve specific GPU instances in advance minimizes the uncertainty of on-demand availability and enables more predictable project timelines and budgets.
For more information, see Amazon SageMaker AI now supports Flexible Training Plans capacity for Inference.
Price performance
Enhancements made to SageMaker AI in 2025 help optimize inference economics through four key capabilities. Flexible Training Plans extend to inference endpoints with transparent upfront pricing. Inference components add Multi-AZ availability and parallel model copy placement during scaling to help accelerate deployment. EAGLE-3 speculative decoding delivers throughput improvements on inference requests. Dynamic multi-adapter inference enables on-demand loading of LoRA adapters.
Improvements to inference components
Generative models only start delivering value when they’re serving predictions in production. As applications scale, inference infrastructure must be as dynamic and reliable as the models themselves. That’s where SageMaker AI inference components come in. Inference components provide a modular way to manage model inference within an endpoint. Each inference component represents a self-contained unit of compute, memory, and model configuration that can be independently created, updated, and scaled. This design helps you operate production endpoints with greater flexibility. You can deploy multiple models, adjust capacity quickly, and roll out updates safely without redeploying the entire endpoint. For teams running real-time or high-throughput applications, inference components help bring fine-grained control to inference workflows. In the following sections, we review three major enhancements to SageMaker AI inference components that make them even more powerful in production environments. These updates add Multi-AZ high availability, controlled concurrency for multi-tenant workloads, and parallel scaling for faster response to traffic surges. Together, they help make running AI at scale more resilient, predictable, and efficient.
Building resilience with Multi-AZ high availability
Every production system faces the same truth: failures happen. A single hardware fault, network issue, or Availability Zone outage can disrupt inference traffic and affect user experience. Now, SageMaker AI inference components automatically distribute workloads across multiple Availability Zones. You can run multiple inference component copies per Availability Zone, and SageMaker AI helps intelligently route traffic to instances that are healthy and have available capacity. This distribution adds fault tolerance at every layer of your deployment.
Multi-AZ high availability offers the following benefits:

Minimizes single points of failure by spreading inference workloads across Availability Zones
Automatically fails over to healthy instances when issues occur
Keeps uptime high to meet strict SLA requirements
Enables balanced cost and resilience through flexible deployment patterns

For example, a financial services company running real-time fraud detection can benefit from this feature. By deploying inference components across three Availability Zones, traffic can seamlessly redirect to the remaining Availability Zones if one goes offline, helping facilitate uninterrupted fraud detection when reliability matters most.
Parallel scaling and NVMe caching
Traffic patterns in production are rarely steady. One moment your system is quiet; the next, it’s flooded with requests. Previously, scaling inference components happened sequentially—each new model copy waited for the previous one to initialize before starting. During spikes, this sequential process could add several minutes of latency. With parallel scaling, SageMaker AI can now deploy multiple inference component copies simultaneously when an instance and the required resources are available. This helps shorten the time required to respond to traffic surges and improves responsiveness for variable workloads. For example, if an instance needs three model copies, they now deploy in parallel instead of waiting on one another. Parallel scaling helps accelerate the deployment of model copies onto inference components but does not accelerate the scaling up of models when traffic increases beyond provisioned capacity. NVMe caching helps accelerate model scaling for already provisioned inference components by caching model artifacts and images. NVMe caching’s ability to reduce scaling times helps reduce inference latency during traffic spikes, lower idle costs through faster scale-down, and provide greater elasticity for serving unpredictable or volatile workloads.
EAGLE-3
SageMaker AI has introduced Extrapolation Algorithm for Greater Language-model Efficiency (EAGLE)-based adaptive speculative decoding to help accelerate generative AI inference. This enhancement supports six model architectures and helps you optimize performance using either SageMaker-provided datasets or your own application-specific data for highly adaptive, workload-specific results. The solution streamlines the workflow from optimization job creation through deployment, making it seamless to deliver low-latency generative AI applications at scale without compromising generation quality.
EAGLE works by predicting future tokens directly from the model’s hidden layers rather than relying on an external draft model, resulting in more accurate predictions and fewer rejections. SageMaker AI automatically selects between EAGLE-2 and EAGLE-3 based on the model architecture, with launch support for LlamaForCausalLM, Qwen3ForCausalLM, Qwen3MoeForCausalLM, Qwen2ForCausalLM, GptOssForCausalLM (EAGLE-3), and Qwen3NextForCausalLM (EAGLE-2). You can train EAGLE models from scratch, retrain existing models, or use pre-trained models from SageMaker JumpStart, with the flexibility to iteratively refine performance using your own curated datasets collected through features like Data Capture.
The optimization workflow integrates seamlessly with existing SageMaker AI infrastructure through familiar APIs (create_model, create_endpoint_config, create_endpoint) and supports widely used training data formats, including ShareGPT and OpenAI chat and completions. Benchmark results are automatically generated during optimization jobs, providing clear visibility into performance improvements across metrics like Time to First Token (TTFT) and throughput, with trained EAGLE models showing significant gains over both base models and EAGLE models trained only on built-in datasets.
To run an EAGLE-3 optimization job, run the following command in the AWS Command Line Interface (AWS CLI):

aws sagemaker create-optimization-job --region us-west-2 \
  --optimization-job-name <job-name> \
  --account-id <account-id> \
  --deployment-instance-type ml.p5.48xlarge \
  --max-instance-count 10 \
  --model-source '{
    "SageMakerModel": { "ModelName": "Created Model name" }
  }' \
  --optimization-configs '{
    "ModelSpeculativeDecodingConfig": {
      "Technique": "EAGLE",
      "TrainingDataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": "Enter custom train data location"
      }
    }
  }' \
  --output-config '{
    "S3OutputLocation": "Enter optimization output location"
  }' \
  --stopping-condition '{"MaxRuntimeInSeconds": 432000}' \
  --role-arn "Enter Execution Role ARN"

For more details, see Amazon SageMaker AI introduces EAGLE based adaptive speculative decoding to accelerate generative AI inference.
Dynamic multi-adapter inference on SageMaker AI Inference
SageMaker AI has enhanced the efficient multi-adapter inference capability introduced at re:Invent 2024, which now supports dynamic loading and unloading of LoRA adapters during inference invocations rather than pinning them at endpoint creation. This enhancement helps optimize resource utilization for on-demand model hosting scenarios.
Previously, the adapters were downloaded to disk and loaded into memory during the CreateInferenceComponent API call. With dynamic loading, adapters are registered using a lightweight, synchronous CreateInferenceComponent API, then downloaded and loaded into memory only when first invoked. This approach supports use cases where you can register thousands of fine-tuned adapters per endpoint while maintaining low-latency inference.
The system implements intelligent memory management, evicting least popular models during resource constraints. When memory reaches capacity—controlled by the SAGEMAKER_MAX_NUMBER_OF_ADAPTERS_IN_MEMORY environment variable—the system automatically unloads inactive adapters to make room for newly requested ones. Similarly, when disk space becomes constrained, the least recently used adapters are evicted from storage. This multi-tier caching strategy facilitates optimal resource utilization across CPU, GPU memory, and disk.
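The eviction behavior described above can be pictured with a toy LRU cache. This is an illustration of the policy, not SageMaker’s internals; the class and method names are hypothetical, and only the role of SAGEMAKER_MAX_NUMBER_OF_ADAPTERS_IN_MEMORY comes from the text:

```python
from collections import OrderedDict

class AdapterCache:
    """Toy model of an in-memory adapter cache with LRU eviction."""

    def __init__(self, max_in_memory=10):
        # Plays the role of SAGEMAKER_MAX_NUMBER_OF_ADAPTERS_IN_MEMORY.
        self.max_in_memory = max_in_memory
        self._mem = OrderedDict()  # adapter name -> loaded weights

    def get(self, name, load_fn):
        if name in self._mem:
            self._mem.move_to_end(name)        # mark as most recently used
            return self._mem[name]
        if len(self._mem) >= self.max_in_memory:
            self._mem.popitem(last=False)      # evict least recently used
        self._mem[name] = load_fn(name)        # lazy load on first invocation
        return self._mem[name]

cache = AdapterCache(max_in_memory=2)
cache.get("support", str.upper)
cache.get("billing", str.upper)
cache.get("support", str.upper)   # refreshes "support"
cache.get("legal", str.upper)     # evicts "billing", the LRU entry
print(list(cache._mem))
```

The same two-tier idea extends to disk: a larger, slower cache behind the in-memory one, with its own recency-based eviction.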
For security and compliance alignment, you can explicitly delete adapters using the DeleteInferenceComponent API. Upon deletion, SageMaker unloads the adapter from the base inference component containers and removes it from disk across the instances, facilitating the complete cleanup of customer data. The deletion process completes asynchronously with automatic retries, providing you with control over your adapter lifecycle while helping meet stringent data retention requirements.
This dynamic adapter loading capability powers the SageMaker AI serverless model customization feature, which helps you fine-tune popular AI models like Amazon Nova, DeepSeek, Llama, and Qwen using techniques like supervised fine-tuning, reinforcement learning, and direct preference optimization. When you complete fine-tuning through the serverless customization interface, the output LoRA adapter weights flow seamlessly to deployment—you can deploy to SageMaker AI endpoints using multi-adapter inference components. The hosting configurations from training recipes automatically include the appropriate dynamic loading settings, helping make sure customized models can be deployed efficiently without requiring you to manage infrastructure or load the adapters at endpoint creation time.
The following steps illustrate how you can use this feature in practice:

Create a base inference component with your foundation model:

import boto3

sagemaker = boto3.client('sagemaker')

# Create base inference component with foundation model
response = sagemaker.create_inference_component(
    InferenceComponentName='llama-base-ic',
    EndpointName='my-endpoint',
    Specification={
        'Container': {
            'Image': 'your-container-image',
            'Environment': {
                'SAGEMAKER_MAX_NUMBER_OF_ADAPTERS_IN_MEMORY': '10'
            }
        },
        'ComputeResourceRequirements': {
            'NumberOfAcceleratorDevicesRequired': 2,
            'MinMemoryRequiredInMb': 16384
        }
    }
)

Register your LoRA adapters:

# Register adapter - completes in < 1 second
response = sagemaker.create_inference_component(
    InferenceComponentName='my-custom-adapter',
    EndpointName='my-endpoint',
    Specification={
        'BaseInferenceComponentName': 'llama-base-ic',
        'Container': {
            'ArtifactUrl': 's3://amzn-s3-demo-bucket/adapters/customer-support/'
        }
    }
)

Invoke your adapter (it loads automatically on first use):

import json

runtime = boto3.client('sagemaker-runtime')

# Invoke with adapter - loads into memory on first call
response = runtime.invoke_endpoint(
    EndpointName='my-endpoint',
    InferenceComponentName='llama-base-ic',
    TargetModel='s3://amzn-s3-demo-bucket/adapters/customer-support/',
    ContentType='application/json',
    Body=json.dumps({'inputs': 'Your prompt here'})
)

Delete adapters when no longer needed:

sagemaker.delete_inference_component(
    InferenceComponentName='my-custom-adapter'
)

This dynamic loading capability integrates seamlessly with the existing inference infrastructure of SageMaker, supporting the same base models and maintaining compatibility with the standard InvokeEndpoint API. By decoupling adapter registration from resource allocation, you can now deploy and manage more LoRA adapters cost-effectively, paying only for the compute resources actively serving inference requests.
Conclusion
The 2025 SageMaker AI enhancements represent a significant leap forward in making generative AI inference more accessible, reliable, and cost-effective for production workloads. With Flexible Training Plans now supporting inference endpoints, you can gain predictable GPU capacity precisely when you need it—whether for critical model evaluations, limited-duration testing, or handling traffic spikes. The introduction of Multi-AZ high availability, controlled concurrency, and parallel scaling with NVMe caching for inference components helps make sure production deployments can scale rapidly while maintaining resilience across Availability Zones. The adaptive speculative decoding of EAGLE-3 delivers increased throughput without sacrificing output quality, and dynamic multi-adapter inference helps teams efficiently manage more fine-tuned LoRA adapters on a single endpoint. Together, these capabilities help reduce the operational complexity and infrastructure costs of running AI at scale, so teams can focus on delivering value through their models rather than managing underlying infrastructure.
These improvements directly address some of the most pressing challenges facing AI practitioners today: securing reliable compute capacity, achieving low-latency inference at scale, and managing the growing complexity of multi-model deployments. By combining transparent capacity reservations, intelligent resource management, and performance optimizations that help deliver measurable throughput gains, SageMaker AI helps organizations deploy generative AI applications with confidence. The seamless integration between model customization and deployment—where fine-tuned adapters flow directly from training to production hosting—further helps accelerate the journey from experimentation to production.
Ready to accelerate your generative AI inference workloads? Explore Flexible Training Plans for inference endpoints to secure GPU capacity for your next evaluation cycle, implement EAGLE-3 speculative decoding to help boost throughput on your existing deployments, or use dynamic multi-adapter inference to more efficiently serve customized models. Refer to the Amazon SageMaker AI Documentation to get started, and stay tuned for Part 2 of this series, where we will dive into observability and model customization improvements. Share your experiences and questions in the comments—we’d love to hear how these capabilities are transforming your AI workloads.

About the authors
Dan Ferguson is a Sr. Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.
Dmitry Soldatkin is a Senior Machine Learning Solutions Architect at AWS, helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. He has a passion for continuous innovation and using data to drive business outcomes. Prior to joining AWS, Dmitry was an architect, developer, and technology leader in data analytics and machine learning fields in the financial services industry.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Sadaf Fardeen leads the Inference Optimization charter for SageMaker. She owns the optimization and development of LLM inference containers on SageMaker.
Suma Kasa is an ML Architect with the SageMaker Service team focusing on the optimization and development of LLM inference containers on SageMaker.
Ram Vegiraju is an ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.
Deepti Ragha is a Senior Software Development Engineer on the Amazon SageMaker AI team, specializing in ML inference infrastructure and model hosting optimization. She builds features that improve deployment performance, reduce inference costs, and make ML accessible to organizations of all sizes. Outside of work, she enjoys traveling, hiking, and gardening.

Amazon SageMaker AI in 2025, a year in review part 2: Improved observa …

In 2025, Amazon SageMaker AI made several improvements designed to help you train, tune, and host generative AI workloads. In Part 1 of this series, we discussed Flexible Training Plans and price performance improvements made to inference components.
In this post, we discuss enhancements made to observability, model customization, and model hosting. These improvements make it possible to host a whole new class of customer use cases on SageMaker AI.
Observability
The observability enhancements made to SageMaker AI in 2025 help deliver enhanced visibility into model performance and infrastructure health. Enhanced metrics provide granular, instance-level and container-level tracking of CPU, memory, GPU utilization, and invocation performance with configurable publishing frequencies, so teams can diagnose latency issues and resource inefficiencies that were previously hidden by endpoint-level aggregation. Rolling updates for inference components improve deployment safety by removing the need for duplicate infrastructure provisioning—updates deploy in configurable batches with integrated Amazon CloudWatch alarm monitoring that triggers automatic rollbacks if issues are detected, facilitating zero-downtime deployments while minimizing risk through gradual validation.
Enhanced Metrics
SageMaker AI introduced enhanced metrics this year, helping deliver granular visibility into endpoint performance and resource utilization at both instance and container levels. This capability addresses a critical gap in observability, making it easier for customers to diagnose latency issues, invocation failures, and resource inefficiencies that were previously obscured by endpoint-level aggregation. Enhanced metrics provide instance-level tracking of CPU, memory, and GPU utilization alongside invocation performance metrics (latency, errors, throughput) with InstanceId dimensions for SageMaker endpoints. For inference components, container-level metrics offer visibility into individual model replica resource consumption with both ContainerId and InstanceId dimensions.
You can configure the metric publishing frequency, enabling near real-time monitoring for critical applications that require rapid response. Self-service enablement through a simple MetricsConfig parameter in the CreateEndpointConfig API helps reduce time-to-insight so you can self-diagnose performance issues. Enhanced metrics help you identify which specific instance or container requires attention, diagnose uneven traffic distribution across hosts, optimize resource allocation, and correlate performance issues with specific infrastructure resources. The feature works seamlessly with CloudWatch alarms and automatic scaling policies, providing proactive monitoring and automated responses to performance anomalies.
To enable enhanced metrics, add the MetricsConfig parameter when creating your endpoint configuration:

response = sagemaker_client.create_endpoint_config(
    EndpointConfigName='my-config',
    ProductionVariants=[{…}],
    MetricsConfig={
        'EnableEnhancedMetrics': True,
        'MetricPublishFrequencyInSeconds': 60  # Supported: 10, 30, 60, 120, 180, 240, 300
    }
)

Enhanced metrics are available across AWS Regions for both single-model endpoints and inference components, providing comprehensive observability for production AI deployments at scale.
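Because the new metrics carry per-instance dimensions, they can drive CloudWatch alarms that target a single hot host. The sketch below shows the idea; the namespace and dimension names follow SageMaker's standard endpoint metrics and are assumptions here, so verify the exact names your endpoint emits before relying on them.

```python
# Sketch: a CloudWatch alarm on per-instance GPU utilization for an endpoint
# with enhanced metrics enabled. Namespace, metric, and dimension names are
# assumed from SageMaker's standard endpoint metrics; IDs are placeholders.
alarm_params = {
    "AlarmName": "my-endpoint-gpu-hot-instance",  # hypothetical alarm name
    "Namespace": "/aws/sagemaker/Endpoints",      # assumed namespace
    "MetricName": "GPUUtilization",
    "Dimensions": [
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
        # The instance-level dimension added by enhanced metrics
        {"Name": "InstanceId", "Value": "i-0123456789abcdef0"},
    ],
    "Statistic": "Average",
    "Period": 60,  # align with MetricPublishFrequencyInSeconds
    "EvaluationPeriods": 3,
    "Threshold": 90.0,
    "ComparisonOperator": "GreaterThanThreshold",
}

# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_alarm(**alarm_params)
```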
Deployment guardrails with rolling updates
SageMaker AI introduced rolling updates for inference components, helping transform how you can deploy model updates with enhanced safety and efficiency. Traditional blue/green deployments require provisioning duplicate infrastructure, creating resource constraints—particularly for GPU-heavy workloads like large language models. Rolling updates deploy new model versions in configurable batches while dynamically scaling infrastructure, with integrated CloudWatch alarms monitoring metrics to trigger automatic rollbacks if issues are detected. This approach helps alleviate the need to provision duplicate fleets, reduces deployment overhead, and enables zero-downtime updates through gradual validation that minimizes risk while maintaining availability. For more details, see Enhance deployment guardrails with inference component rolling updates for Amazon SageMaker AI inference.
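A rolling update is configured on the UpdateInferenceComponent call. The request shape below follows that API's DeploymentConfig as we understand it; the batch sizes, wait interval, and alarm name are illustrative values, not recommendations.

```python
# Sketch: rolling-update policy for an inference component. New model copies
# roll out in small batches, with a CloudWatch alarm gating each step and
# triggering automatic rollback if it fires.
rolling_update_config = {
    "InferenceComponentName": "llama-base-ic",
    "DeploymentConfig": {
        "RollingUpdatePolicy": {
            # Update one model copy at a time
            "MaximumBatchSize": {"Type": "COPY_COUNT", "Value": 1},
            # Wait between batches so alarms have time to react
            "WaitIntervalInSeconds": 120,
            # Roll back faster than we rolled forward
            "RollbackMaximumBatchSize": {"Type": "COPY_COUNT", "Value": 2},
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "my-endpoint-5xx-errors"}]  # hypothetical alarm
        },
    },
}

# sagemaker = boto3.client("sagemaker")
# sagemaker.update_inference_component(Specification=new_spec, **rolling_update_config)
```

Only the batch currently being updated needs spare capacity, which is what removes the duplicate-fleet requirement of blue/green deployments.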
Usability
SageMaker AI usability improvements focus on removing complexity and accelerating time-to-value for AI teams. Serverless model customization reduces time for infrastructure planning by automatically provisioning compute resources based on model and data size, supporting advanced techniques like reinforcement learning from verifiable rewards (RLVR) and reinforcement learning from AI feedback (RLAIF) through both UI-based and code-based workflows with integrated MLflow experiment tracking. Bidirectional streaming enables real-time, multi-modal applications by maintaining persistent connections where data flows simultaneously in both directions—helping transform use cases like voice agents and live transcription from transactional exchanges into continuous conversations. Enhanced connectivity through comprehensive AWS PrivateLink support across AWS Regions and IPv6 compatibility helps make sure enterprise deployments can meet strict compliance requirements while future-proofing network architectures.
Serverless model customization
The new SageMaker AI serverless customization capability addresses a critical challenge faced by organizations: the lengthy and complex process of fine-tuning AI models, which traditionally takes months and requires significant infrastructure management expertise. Many teams struggle with selecting appropriate compute resources, managing the technical complexity of advanced fine-tuning techniques like reinforcement learning, and navigating the end-to-end workflow from model selection through evaluation to deployment.

This serverless solution helps remove these barriers by automatically provisioning the right compute resources based on model and data size, making it possible for teams to focus on model tuning rather than infrastructure management and helping accelerate the customization process. The solution supports popular models including Amazon Nova, DeepSeek, GPT-OSS, Llama, and Qwen, providing both UI-based and code-based customization workflows that make advanced techniques accessible to teams with varying levels of technical expertise.
The solution offers multiple advanced customization techniques, including supervised fine-tuning, direct preference optimization, RLVR, and RLAIF. Each technique helps optimize models in different ways, with selection influenced by factors such as dataset size and quality, available computational resources, task requirements, desired accuracy levels, and deployment constraints. The solution includes integrated experiment tracking through serverless MLflow for automatic logging of critical metrics without code modifications, helping teams monitor and compare model performance throughout the customization process.

Deployment flexibility is a key feature, with options to deploy to either Amazon Bedrock for serverless inference or SageMaker AI endpoints for controlled resource management. The solution includes built-in model evaluation capabilities to compare customized models against base models, an interactive playground for testing with prompts or chat mode, and seamless integration with the broader Amazon SageMaker Studio environment. This end-to-end workflow—from model selection and customization through evaluation and deployment—is handled entirely within a unified interface.
Currently available in US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland) Regions, the service operates on a pay-per-token model for both training and inference. This pricing approach helps make it cost-effective for organizations of different sizes to customize AI models without upfront infrastructure investments, and the serverless architecture helps make sure teams can scale their model customization efforts based on actual usage rather than provisioned capacity. For more information on this core capability, see New serverless customization in Amazon SageMaker AI accelerates model fine-tuning.
Bidirectional streaming
SageMaker AI introduced the bidirectional streaming capability in 2025, transforming inference from transactional exchanges into continuous conversations between users and models. This feature enables data to flow simultaneously in both directions over a single persistent connection, supporting real-time multi-modal use cases ranging from audio transcription and translation to voice agents. Unlike traditional approaches where clients send complete questions and wait for complete answers, bidirectional streaming allows speech and responses to flow concurrently—users can see results as soon as models begin generating them, and models can maintain context across continuous streams without re-sending conversation history. The implementation combines HTTP/2 and WebSocket protocols, with the SageMaker infrastructure managing efficient multiplexed connections from clients through routers to model containers.
The feature supports both bring-your-own-container implementations and partner integrations, with Deepgram serving as a launch partner offering their Nova-3 speech-to-text model through AWS Marketplace. This capability addresses critical enterprise requirements for real-time voice AI applications—particularly for organizations with strict compliance needs requiring audio processing to remain within their Amazon Virtual Private Cloud (Amazon VPC)—while removing the operational overhead traditionally associated with self-hosted real-time AI solutions. The persistent connection approach reduces infrastructure overhead from TLS handshakes and connection management, replacing short-lived connections with efficient long-running sessions.
Developers can implement bidirectional streaming through two approaches: building custom containers that implement the WebSocket protocol at ws://localhost:8080/invocations-bidirectional-stream with the appropriate Docker label (com.amazonaws.sagemaker.capabilities.bidirectional-streaming=true), or deploying pre-built partner solutions like Deepgram’s Nova-3 model directly from AWS Marketplace. The feature requires containers to handle incoming WebSocket data frames and send response frames back to SageMaker, with sample implementations available in both Python and TypeScript. For more details, see Introducing bidirectional streaming for real-time inference on Amazon SageMaker AI.
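Stripped of the WebSocket plumbing, the contract a custom container must satisfy is simple: consume incoming data frames as they arrive and emit response frames without waiting for the full input. The sketch below illustrates only that shape—the handler class, the message format, and the chunk sizes are all hypothetical, and a real container would wrap this in a WebSocket server behind the path and Docker label described above.

```python
# Sketch of incremental frame handling for a bidirectional-streaming
# container. Each incoming frame produces a partial response immediately,
# rather than buffering the whole request. Message format is illustrative.
class StreamingHandler:
    """Accumulates audio-like byte chunks and emits one partial result per
    chunk; a real container would feed frames to a streaming model."""

    def __init__(self) -> None:
        self.bytes_seen = 0

    def on_frame(self, frame: bytes) -> bytes:
        # A real implementation would return whatever partial model output
        # is ready after processing this frame.
        self.bytes_seen += len(frame)
        return f'{{"partial": true, "bytes_processed": {self.bytes_seen}}}'.encode()


handler = StreamingHandler()
# Two 20 ms chunks of 16 kHz / 16-bit mono audio would be 320 bytes each
responses = [handler.on_frame(chunk) for chunk in (b"\x00" * 320, b"\x00" * 320)]
```

The key property is that `responses` grows frame by frame—the client sees output while it is still sending input, which is what distinguishes this mode from request-response inference.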
IPv6 and PrivateLink
Additionally, SageMaker AI expanded its connectivity capabilities in 2025 with comprehensive PrivateLink support across Regions and IPv6 compatibility for both public and private endpoints. These enhancements significantly help improve the service’s accessibility and security posture for enterprise deployments. PrivateLink integration makes it possible to access SageMaker AI endpoints privately from your VPCs without traversing the public internet, keeping the traffic within the AWS network infrastructure. This is particularly valuable for organizations with strict compliance requirements or data residency policies that mandate private connectivity for machine learning workloads.
The addition of IPv6 support for SageMaker AI endpoints addresses the growing need for modern IP addressing as organizations transition away from IPv4. You can now access SageMaker AI services using IPv6 addresses for both public endpoints and private VPC endpoints, providing flexibility in network architecture design and future-proofing infrastructure investments. The dual-stack capability (supporting both IPv4 and IPv6) facilitates backward compatibility while helping organizations adopt IPv6 at their own pace. Combined with PrivateLink, these connectivity enhancements help make SageMaker AI more accessible and secure for diverse enterprise networking environments, from traditional on-premises data centers connecting using AWS Direct Connect to modern cloud-based architectures built entirely on IPv6.
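For the private-connectivity path, a dual-stack interface endpoint for the SageMaker Runtime service might be created as sketched below. The service-name pattern and IpAddressType values follow the EC2 CreateVpcEndpoint API; the Region and all resource IDs are placeholders.

```python
# Sketch: a dual-stack interface VPC endpoint for SageMaker Runtime, so
# invocations stay on the AWS network and are reachable over IPv4 and IPv6.
# Region and IDs are placeholders; adjust for your environment.
endpoint_params = {
    "VpcEndpointType": "Interface",
    "VpcId": "vpc-0123456789abcdef0",
    "ServiceName": "com.amazonaws.us-east-1.sagemaker.runtime",
    "SubnetIds": ["subnet-0123456789abcdef0"],
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
    "PrivateDnsEnabled": True,
    # Dual-stack keeps existing IPv4 clients working while enabling IPv6
    "IpAddressType": "dualstack",
    "DnsOptions": {"DnsRecordIpType": "dualstack"},
}

# ec2 = boto3.client("ec2")
# ec2.create_vpc_endpoint(**endpoint_params)
```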
Conclusion
The 2025 enhancements to SageMaker AI represent a significant leap forward in making generative AI workloads more observable, reliable, and accessible for enterprise customers. From granular performance metrics that pinpoint infrastructure bottlenecks to serverless customization, these improvements address the real-world challenges teams face when deploying AI at scale. The combination of enhanced observability, safer deployment mechanisms, and streamlined workflows helps empower organizations to move faster while maintaining the reliability and security standards required for production systems.
These capabilities are available now across Regions, with features like enhanced metrics, rolling updates, and serverless customization ready to help transform how you can build and deploy AI applications. Whether you’re fine-tuning models for domain-specific tasks, building real-time voice agents with bidirectional streaming, or improving deployment safety with rolling updates and integrated monitoring, SageMaker AI helps provide the tools to accelerate your AI journey while reducing operational complexity.
Get started today by exploring the enhanced metrics documentation, trying serverless model customization, or implementing bidirectional streaming for your real-time inference workloads. For comprehensive guidance on implementing these features, refer to the Amazon SageMaker AI Documentation or reach out to your AWS account team to discuss how these capabilities can support your specific use cases.
