Swann provides Generative AI to millions of IoT Devices using Amazon B …

If you’re managing Internet of Things (IoT) devices at scale, alert fatigue is probably undermining your system’s effectiveness. This post shows you how to implement intelligent notification filtering using the generative AI capabilities of Amazon Bedrock. You’ll learn model selection strategies, cost optimization techniques, and architectural patterns for deploying generative AI at IoT scale, based on Swann Communications’ deployment across millions of devices.
Smart home security customers now expect systems that can tell the difference between a delivery person and a potential intruder—not just detect motion. Customers were overwhelmed by a large number of daily notifications and false positives, with many alerts triggered by events that were irrelevant to them, such as passing cars or pets moving around. Users became frustrated with constant false alerts and started ignoring notifications entirely, including real security threats.
As a pioneer in do-it-yourself (DIY) security solutions, Swann Communications has built a global network of more than 11.74 million connected devices, serving homeowners and businesses across multiple continents. Swann partnered with Amazon Web Services (AWS) to develop a multi-model generative AI notification system to evolve their notification system from a basic, reactive alert mechanism into an intelligent, context-aware security assistant.
Business challenges driving the solution
Before implementing the new solution, Swann faced several critical challenges that required a fundamentally different approach to security notifications.
Swann’s previous system had basic detection that could only identify human or pet events without contextual awareness—treating a delivery person the same as a potential intruder—while offering no customization options for users to define what constituted a meaningful alert for their unique security needs. These technical constraints, compounded by scalability challenges in managing notifications cost-efficiently across tens of millions of devices, made it clear that incremental improvements wouldn’t suffice—Swann needed a fundamentally smarter approach.
Approximately 20 daily notifications per camera—most of them irrelevant—caused customers to miss critical security events, with many users disabling notifications within the first few months. This significantly reduced system effectiveness, demonstrating the need for intelligent filtering that delivered only meaningful alerts. Rather than managing multiple vendors and custom integrations, Swann used different AWS cloud services that work together. By using AWS integrated services, Swann’s engineering team could concentrate on creating new security features.
Why AWS and Amazon Bedrock were selected
When evaluating AI partners, Swann prioritized enterprise-grade capabilities that could reliably scale. AWS stood out for several key reasons:
Enterprise-grade AI capabilities
Swann chose AWS for its comprehensive, integrated approach to deploying generative AI at scale. Amazon Bedrock, a fully managed service, provided access to multiple foundation models through a single API and handled GPU provisioning, model deployment, and scaling automatically. As a result, Swann could test and compare different model families (such as Claude and Nova) without infrastructure changes and optimize for either speed or accuracy in each scenario, such as high-volume routine screening, threat verification requiring detailed analysis, time-sensitive alerts, and complex behavioral assessment. With approximately 275 million monthly inferences, the AWS pay-per-use pricing model and the ability to use cost-effective models such as Nova Lite for routine analysis kept costs optimized. AWS services delivered low-latency inference across North America, Europe, and Asia-Pacific while providing data residency compliance and high availability for mission-critical security applications.
The AWS environment used by Swann included AWS IoT Core for device connectivity, Amazon Simple Storage Service (Amazon S3) for scalable storage of video feeds, and AWS Lambda to run code in response to events without managing servers, scaling from zero to thousands of executions and charging only for compute time used. Amazon Cognito manages user authentication and authorization with secure sign-in, multi-factor authentication, social identity integration, and temporary AWS credentials. Amazon Simple Queue Service (Amazon SQS) manages message queuing, buffering requests during traffic spikes and helping to ensure reliable processing even when thousands of cameras trigger simultaneously.
By using these capabilities to remove the effort of managing multiple vendors and custom integrations, Swann could focus on innovation rather than infrastructure. This cloud-centered integration accelerated time to market by 2 months while reducing operational overhead and enabled the cost-effective deployment of sophisticated AI capabilities across millions of devices.
Scalability and performance requirements
Swann’s solution needed to handle millions of concurrent devices (more than 11.74 million cameras generating frames 24/7), variable workload patterns with peak activity during evening hours and weekends, real-time processing to provide sub-second latency for critical security events, global distribution with consistent performance across multiple geographic regions, and cost predictability through transparent pricing that scales linearly with usage. Swann found that Amazon Bedrock and AWS services gave them the best of both worlds: a global network that could handle their massive scale, plus smart cost controls that let them pick exactly the right model for each situation.
Solution architecture overview and implementation
Swann’s dynamic notifications system uses Amazon Bedrock, strategically using four foundation models (Nova Lite, Nova Pro, Claude Haiku, and Claude Sonnet) across two key features to balance performance, cost, and accuracy. This architecture, shown in the following figure, demonstrates how AWS services can be combined to create a scalable, intelligent video analysis solution using generative AI capabilities while optimizing for both performance and cost:

Edge device integration: Smart cameras and doorbells connect through the AWS IoT Device Gateway, providing real-time video feeds for analysis.
Data pipeline: Video content flows through Amazon EventBridge, Amazon S3, and Amazon SQS for reliable storage and message queuing.
Intelligent frame processing: Amazon Elastic Compute Cloud (Amazon EC2) instances (G3 and G4 family) use computer vision libraries to segment videos into frames and handle frame selection and filtering to optimize processing efficiency. G3 and G4 instances are GPU-powered virtual servers designed for parallel processing workloads such as video analysis and AI inference. Unlike traditional CPUs that process tasks sequentially, GPUs contain thousands of cores that can analyze multiple video frames simultaneously. This means that Swann can process frames from thousands of cameras concurrently without latency bottlenecks, providing near real-time security monitoring.
Serverless processing: Lambda functions invoke Amazon Bedrock and implement model selection logic based on use case requirements.
Tiered model strategy: A cost-effective approach using multiple models with varying capabilities: Amazon Nova Lite for speed and cost efficiency in routine high-volume screening, Nova Pro for balanced performance in threat verification, Claude Haiku for ultra-low latency in time-critical alerts, and Claude Sonnet for complex behavioral analysis requiring nuanced reasoning (see the sketch after this list).
Dynamic notifications: The custom notification service delivers real-time alerts to mobile applications based on detection results.
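The following sketch illustrates how such tiered routing might look inside a Lambda handler that calls Amazon Bedrock through the Converse API. It is a simplified illustration rather than Swann’s production code: the activity categories are invented for the example, and the Claude model identifiers are placeholders that you would replace with the model versions enabled in your own account.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative category-to-model mapping. The Nova model IDs are standard
# Amazon Bedrock identifiers; the Claude entries are placeholders and must be
# replaced with the exact model versions enabled in your account.
MODEL_BY_CATEGORY = {
    "routine_screening": "amazon.nova-lite-v1:0",
    "threat_verification": "amazon.nova-pro-v1:0",
    "custom_alert": "<claude-haiku-model-id>",
    "complex_behavior": "<claude-sonnet-model-id>",
}

def lambda_handler(event, context):
    """Route a pre-filtered camera event to the model tier that matches its category."""
    category = event.get("category", "routine_screening")
    prompt = event["prompt"]  # concise, structured prompt built upstream

    response = bedrock.converse(
        modelId=MODEL_BY_CATEGORY[category],
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    # The Converse API returns a model-agnostic message structure.
    answer = response["output"]["message"]["content"][0]["text"]
    return {"statusCode": 200, "body": json.dumps({"category": category, "result": answer})}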

Best practices for generative AI implementation
The following best practices can help organizations optimize cost, performance, and accuracy when implementing similar generative AI solutions at scale:

Understanding RPM and token limits: Requests per minute (RPM) limits define the number of API calls allowed per minute, requiring applications to implement queuing or retry logic to handle high-volume workloads. Tokens are the basic units AI models use to process text and images with costs calculated per thousand tokens, making concise prompts essential for reducing expenses at scale.
Business logic optimization: Swann reduced API calls by 88% (from 17,000 to 2,000 RPM) by implementing intelligent pre-filtering (motion detection, zone-based analysis, and duplicate frame elimination) before invoking AI models.
Prompt engineering and token optimization: Swann achieved 88% token reduction (from 150 to 18 tokens per request) through three key strategies:

Optimizing image resolution to reduce input tokens while preserving visual quality.
Deploying a custom pre-filtering model on GPU-based EC2 instances to eliminate 65% of false detections (swaying branches, passing cars) before reaching Amazon Bedrock.
Engineering ultra-concise prompts with structured response formats that replaced verbose natural language with machine-parseable key-value pairs (for example, threat:LOW|type:person|action:delivery). Swann’s customer surveys revealed that these optimizations not only reduced latency and cost but also improved threat detection accuracy from 89% to 95%.
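As a rough illustration of this pattern (not Swann’s actual prompt), the snippet below shows a compact instruction and a parser for the pipe-delimited key-value format mentioned above:

# Illustrative only: a compact instruction and a parser for responses such as
# "threat:LOW|type:person|action:delivery".
PROMPT = (
    "Classify the security frame. Respond ONLY with key:value pairs "
    "separated by '|' using the keys threat, type, and action."
)

def parse_structured_response(text: str) -> dict:
    """Turn 'threat:LOW|type:person|action:delivery' into a dictionary."""
    pairs = (field.split(":", 1) for field in text.strip().split("|"))
    return {key.strip(): value.strip() for key, value in pairs}

result = parse_structured_response("threat:LOW|type:person|action:delivery")
assert result == {"threat": "LOW", "type": "person", "action": "delivery"}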

Prompt versioning, optimization, and testing: Swann versioned prompts with performance metadata (accuracy, cost, and latency) and A/B tested on 5–10% of traffic before rollout. Swann also uses Amazon Bedrock prompt optimization.
Model selection and tiered strategy: Swann selected models based on activity type.

Nova Lite (87% of requests): Handles fast screening of routine activity, such as passing cars, pets, and delivery personnel. Its low cost, high throughput, and sub-millisecond latency make it essential for high-volume, real-time analysis where speed and efficiency matter more than precision.
Nova Pro (8% of requests): Escalates from Nova Lite when potential threats require verification with higher accuracy. Distinguishes delivery personnel from intruders and identifies suspicious behavior patterns.
Claude Haiku (2% of requests): Powers the Notify Me When feature for immediate notification of user-defined criteria. Provides ultra-low latency for time-sensitive custom alerts.
Claude Sonnet (3% of requests): Handles complex edge cases requiring sophisticated reasoning. Analyzes multi-person interactions, ambiguous scenarios, and provides nuanced behavioral assessment.
Results: This intelligent routing achieves 95% overall accuracy while reducing costs by 99.7% compared to using Claude Sonnet for all requests, from a projected $2.1 million to about $6,000 monthly. The key insight was that matching model capabilities to task complexity enables cost-effective generative AI deployment at scale, with business logic pre-filtering and tiered model selection delivering far greater savings than model choice alone.

Model distillation strategy: Model distillation teaches smaller, faster models to mimic the intelligence of larger ones, creating a lightweight version that is almost as capable but runs faster and costs less. For new features, Swann is exploring Amazon Nova model distillation, which transfers knowledge from larger, advanced models to smaller, efficient ones and helps optimize model performance for particular use cases without requiring extensive labeled training data.
Implement comprehensive monitoring: Use Amazon CloudWatch to track critical performance metrics including latency percentiles—p50 (median response time), p95 (95th percentile, capturing worst-case for most users), and p99 (99th percentile, identifying outliers and system stress)—alongside token consumption, cost per inference, accuracy rates, and throttling events. These percentile metrics are crucial because average latency can mask performance issues; for example, a 200 ms average might hide that 5% of requests take more than 2 seconds, directly impacting customer experience.
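A minimal example of retrieving these percentiles with boto3 follows. It assumes the AWS/Bedrock CloudWatch namespace with its InvocationLatency metric and ModelId dimension; verify the metric names your account actually emits before relying on them.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

# Assumes the AWS/Bedrock namespace and its InvocationLatency metric (milliseconds),
# keyed by the ModelId dimension; adjust names to what your account emits.
now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InvocationLatency",
    Dimensions=[{"Name": "ModelId", "Value": "amazon.nova-lite-v1:0"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5-minute buckets
    ExtendedStatistics=["p50", "p95", "p99"],
)

for point in sorted(response["Datapoints"], key=lambda d: d["Timestamp"]):
    stats = point["ExtendedStatistics"]
    print(point["Timestamp"], stats["p50"], stats["p95"], stats["p99"])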

Conclusion
After implementing Amazon Bedrock, Swann saw immediate improvements—customers received fewer but more relevant alerts. Alert volume dropped 25% while notification relevance increased 89%, and customer satisfaction increased by 3%. The system scales across 11.74 million devices with sub-300 ms p95 latency, demonstrating that sophisticated generative AI capabilities can be deployed cost-effectively in consumer IoT products. Dynamic notifications (shown in the following image) deliver context-aware security alerts.

The Notify Me When feature (shown in the following video) demonstrates intelligent customization. Users define what matters to them using natural language, such as “notify me if a dog enters the backyard” or “notify me if a child is near the swimming pool,” enabling truly personalized security monitoring.

Next steps
Organizations considering generative AI at scale should start with a clear, measurable business problem and pilot with a subset of devices before full deployment, optimizing for cost from day one through intelligent business logic and tiered model selection. Invest in comprehensive monitoring to enable continuous optimization, and design the architecture for graceful degradation to maintain reliability even during service disruptions. Focus on prompt engineering and token optimization early to deliver performance and cost improvements. Use managed services like Amazon Bedrock to handle infrastructure complexity and build a flexible architecture that supports future model improvements and evolving AI capabilities.
Explore additional resources

Get Started with Amazon Bedrock
Amazon Bedrock Nova Models
Amazon Bedrock pricing
Prompt engineering concepts
Submit a model distillation job in Amazon Bedrock

About the authors
Aman Sharma is an Enterprise Solutions Architect at AWS, where he works with enterprise retail and supply chain customers across ANZ. With more than 21 years of experience in consulting, architecting, and solution design, he is passionate about democratizing AI and ML and helping customers design data and ML strategies. Outside of work, he enjoys exploring nature and wildlife photography.
Surjit Reghunathan is the Chief Technology Officer at Swann Communications, where he leads technology innovation and strategic direction for the company’s global IoT security platform. With expertise in scaling connected device solutions, Surjit drives the integration of AI and machine learning capabilities across Swann’s product portfolio. Outside of work, he enjoys long motorcycle rides and playing guitar.
Suraj Padinjarute is a Technical Account Manager at AWS, helping retail and supply chain customers maximize the value of their cloud investments. With over 20 years of IT experience in database administration, application support, and cloud transformation, he is passionate about enabling customers on their cloud journey. Outside of work, Suraj enjoys long-distance cycling and exploring the outdoors.

How to Design Complex Deep Learning Tensor Pipelines Using Einops with …

In this tutorial, we walk through advanced usage of Einops to express complex tensor transformations in a clear, readable, and mathematically precise way. We demonstrate how rearrange, reduce, repeat, einsum, and pack/unpack let us reshape, aggregate, and combine tensors without relying on error-prone manual dimension handling. We focus on real deep-learning patterns, such as vision patchification, multi-head attention, and multimodal token mixing, and show how einops serves as a compact tensor manipulation language that integrates naturally with PyTorch. Check out the FULL CODES here.

import sys, subprocess, textwrap, math, time

def pip_install(pkg: str):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

pip_install("einops")
pip_install("torch")

import torch
import torch.nn as nn
import torch.nn.functional as F

from einops import rearrange, reduce, repeat, einsum, pack, unpack
from einops.layers.torch import Rearrange, Reduce

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

def section(title: str):
    print("\n" + "=" * 90)
    print(title)
    print("=" * 90)

def show_shape(name, x):
    print(f"{name:>18} shape = {tuple(x.shape)} dtype={x.dtype} device={x.device}")

We set up the execution environment and ensure all required dependencies are installed dynamically. We initialize PyTorch, einops, and utility helpers that standardize device selection and shape inspection. We also establish reusable printing utilities that help us track tensor shapes throughout the tutorial.

section("1) rearrange")
x = torch.randn(2, 3, 4, 5, device=device)
show_shape("x", x)

x_bhwc = rearrange(x, "b c h w -> b h w c")
show_shape("x_bhwc", x_bhwc)

x_split = rearrange(x, "b (g cg) h w -> b g cg h w", g=3)
show_shape("x_split", x_split)

x_tokens = rearrange(x, "b c h w -> b (h w) c")
show_shape("x_tokens", x_tokens)

y = torch.randn(2, 7, 11, 13, 17, device=device)
y2 = rearrange(y, "b ... c -> b c ...")
show_shape("y", y)
show_shape("y2", y2)

try:
    _ = rearrange(torch.randn(2, 10, device=device), "b (h w) -> b h w", h=3)
except Exception as e:
    print("Expected error (shape mismatch):", type(e).__name__, "-", str(e)[:140])

We demonstrate how we use rearrange to express complex reshaping and axis-reordering operations in a readable, declarative way. We show how to split, merge, and permute dimensions while preserving semantic clarity. We also intentionally trigger a shape error to illustrate how Einops enforces shape safety at runtime.

section("2) reduce")
imgs = torch.randn(8, 3, 64, 64, device=device)
show_shape("imgs", imgs)

gap = reduce(imgs, "b c h w -> b c", "mean")
show_shape("gap", gap)

pooled = reduce(imgs, "b c (h ph) (w pw) -> b c h w", "mean", ph=2, pw=2)
show_shape("pooled", pooled)

chmax = reduce(imgs, "b c h w -> b c", "max")
show_shape("chmax", chmax)

section("3) repeat")
vec = torch.randn(5, device=device)
show_shape("vec", vec)

vec_batched = repeat(vec, "d -> b d", b=4)
show_shape("vec_batched", vec_batched)

q = torch.randn(2, 32, device=device)
q_heads = repeat(q, "b d -> b heads d", heads=8)
show_shape("q_heads", q_heads)

We apply reduce and repeat to perform pooling, aggregation, and broadcasting operations without manual dimension handling. We compute global and local reductions directly within the transformation expression. We also show how repeating tensors across new dimensions simplifies batch and multi-head constructions.

section("4) patchify")
B, C, H, W = 4, 3, 32, 32
P = 8
img = torch.randn(B, C, H, W, device=device)
show_shape("img", img)

patches = rearrange(img, "b c (h p1) (w p2) -> b (h w) (p1 p2 c)", p1=P, p2=P)
show_shape("patches", patches)

img_rec = rearrange(
    patches,
    "b (h w) (p1 p2 c) -> b c (h p1) (w p2)",
    h=H // P,
    w=W // P,
    p1=P,
    p2=P,
    c=C,
)
show_shape("img_rec", img_rec)

max_err = (img - img_rec).abs().max().item()
print("Reconstruction max abs error:", max_err)
assert max_err < 1e-6

section("5) attention")
B, T, D = 2, 64, 256
Hh = 8
Dh = D // Hh
x = torch.randn(B, T, D, device=device)
show_shape("x", x)

proj = nn.Linear(D, 3 * D, bias=False).to(device)
qkv = proj(x)
show_shape("qkv", qkv)

q, k, v = rearrange(qkv, "b t (three heads dh) -> three b heads t dh", three=3, heads=Hh, dh=Dh)
show_shape("q", q)
show_shape("k", k)
show_shape("v", v)

scale = Dh ** -0.5
attn_logits = einsum(q, k, "b h t dh, b h s dh -> b h t s") * scale
show_shape("attn_logits", attn_logits)

attn = attn_logits.softmax(dim=-1)
show_shape("attn", attn)

out = einsum(attn, v, "b h t s, b h s dh -> b h t dh")
show_shape("out (per-head)", out)

out_merged = rearrange(out, "b h t dh -> b t (h dh)")
show_shape("out_merged", out_merged)

We implement vision and attention mechanisms that are commonly found in modern deep learning models. We convert images into patch sequences and reconstruct them to verify reversibility and correctness. We then reshape projected tensors into a multi-head attention format and compute attention using einops.einsum for clarity and correctness.

section("6) pack unpack")
B, Cemb = 2, 128

class_token = torch.randn(B, 1, Cemb, device=device)
image_tokens = torch.randn(B, 196, Cemb, device=device)
text_tokens = torch.randn(B, 32, Cemb, device=device)
show_shape("class_token", class_token)
show_shape("image_tokens", image_tokens)
show_shape("text_tokens", text_tokens)

packed, ps = pack([class_token, image_tokens, text_tokens], "b * c")
show_shape("packed", packed)
print("packed_shapes (ps):", ps)

mixer = nn.Sequential(
    nn.LayerNorm(Cemb),
    nn.Linear(Cemb, 4 * Cemb),
    nn.GELU(),
    nn.Linear(4 * Cemb, Cemb),
).to(device)

mixed = mixer(packed)
show_shape("mixed", mixed)

class_out, image_out, text_out = unpack(mixed, ps, "b * c")
show_shape("class_out", class_out)
show_shape("image_out", image_out)
show_shape("text_out", text_out)
assert class_out.shape == class_token.shape
assert image_out.shape == image_tokens.shape
assert text_out.shape == text_tokens.shape

section("7) layers")
class PatchEmbed(nn.Module):
    def __init__(self, in_channels=3, emb_dim=192, patch=8):
        super().__init__()
        self.patch = patch
        self.to_patches = Rearrange("b c (h p1) (w p2) -> b (h w) (p1 p2 c)", p1=patch, p2=patch)
        self.proj = nn.Linear(in_channels * patch * patch, emb_dim)

    def forward(self, x):
        x = self.to_patches(x)
        return self.proj(x)

class SimpleVisionHead(nn.Module):
    def __init__(self, emb_dim=192, num_classes=10):
        super().__init__()
        self.pool = Reduce("b t c -> b c", reduction="mean")
        self.classifier = nn.Linear(emb_dim, num_classes)

    def forward(self, tokens):
        x = self.pool(tokens)
        return self.classifier(x)

patch_embed = PatchEmbed(in_channels=3, emb_dim=192, patch=8).to(device)
head = SimpleVisionHead(emb_dim=192, num_classes=10).to(device)

imgs = torch.randn(4, 3, 32, 32, device=device)
tokens = patch_embed(imgs)
logits = head(tokens)
show_shape("tokens", tokens)
show_shape("logits", logits)

section("8) practical")
x = torch.randn(2, 32, 16, 16, device=device)
g = 8
xg = rearrange(x, "b (g cg) h w -> (b g) cg h w", g=g)
show_shape("x", x)
show_shape("xg", xg)

mean = reduce(xg, "bg cg h w -> bg 1 1 1", "mean")
var = reduce((xg - mean) ** 2, "bg cg h w -> bg 1 1 1", "mean")
xg_norm = (xg - mean) / torch.sqrt(var + 1e-5)
x_norm = rearrange(xg_norm, "(b g) cg h w -> b (g cg) h w", b=2, g=g)
show_shape("x_norm", x_norm)

z = torch.randn(3, 64, 20, 30, device=device)
z_flat = rearrange(z, "b c h w -> b c (h w)")
z_unflat = rearrange(z_flat, "b c (h w) -> b c h w", h=20, w=30)
assert (z - z_unflat).abs().max().item() < 1e-6
show_shape("z_flat", z_flat)

section("9) views")
a = torch.randn(2, 3, 4, 5, device=device)
b = rearrange(a, "b c h w -> b h w c")
print("a.is_contiguous():", a.is_contiguous())
print("b.is_contiguous():", b.is_contiguous())
print("b._base is a:", getattr(b, "_base", None) is a)

section("Done. You now have reusable einops patterns for vision, attention, and multimodal token packing")

We demonstrate reversible token packing and unpacking for multimodal and transformer-style workflows. We integrate Einops layers directly into PyTorch modules to build clean, composable model components. We conclude by applying practical tensor grouping and normalization patterns that reinforce how einops simplifies real-world model engineering.

In conclusion, we established Einops as a practical and expressive foundation for modern deep-learning code. We showed that complex operations like attention reshaping, reversible token packing, and spatial pooling can be written in a way that is both safer and more readable than traditional tensor operations. With these patterns, we reduced cognitive overhead and minimized shape bugs. We wrote models that are easier to extend, debug, and reason about while remaining fully compatible with high-performance PyTorch workflows.

Check out the FULL CODES here.
The post How to Design Complex Deep Learning Tensor Pipelines Using Einops with Vision, Attention, and Multimodal Examples appeared first on MarkTechPost.

Google AI Introduces Natively Adaptive Interfaces (NAI): An Agentic Mu …

Google Research is proposing a new way to build accessible software with Natively Adaptive Interfaces (NAI), an agentic framework where a multimodal AI agent becomes the primary user interface and adapts the application in real time to each user’s abilities and context.

Instead of shipping a fixed UI and adding accessibility as a separate layer, NAI pushes accessibility into the core architecture. The agent observes, reasons, and then modifies the interface itself, moving from one-size-fits-all design to context-informed decisions.

What Natively Adaptive Interfaces (NAI) Change in the Stack?

NAI starts from a simple premise: if an interface is mediated by a multimodal agent, accessibility can be handled by that agent instead of by static menus and settings.

Key properties include:

The multimodal AI agent is the primary UI surface. It can see text, images, and layouts, listen to speech, and output text, speech, or other modalities.

Accessibility is integrated into this agent from the beginning, not bolted on later. The agent is responsible for adapting navigation, content density, and presentation style to each user.

The design process is explicitly user-centered, with people with disabilities treated as edge users who define requirements for everyone, not as an afterthought.

The framework targets what the Google team calls the ‘accessibility gap’: the lag between adding new product features and making them usable for people with disabilities. Embedding agents into the interface is meant to reduce this gap by letting the system adapt without waiting for custom add-ons.

Agent Architecture: Orchestrator and Specialized Tools

Under NAI, the UI is backed by a multi-agent system. The core pattern is:

An Orchestrator agent maintains shared context about the user, the task, and the app state.

Specialized sub-agents implement focused capabilities, such as summarization or settings adaptation.

A set of configuration patterns defines how to detect user intent, add relevant context, adjust settings, and correct flawed queries.

For example, in NAI case studies around accessible video, the Google team outlines core agent capabilities such as:

Understand user intent.

Refine queries and manage context across turns.

Engineer prompts and tool calls in a consistent way.

From a systems point of view, this replaces static navigation trees with dynamic, agent-driven modules. The ‘navigation model’ is effectively a policy over which sub-agent to run, with what context, and how to render its result back into the UI.

Multimodal Gemini and RAG for Video and Environments

NAI is explicitly built on multimodal models like Gemini and Gemma that can process voice, text, and images in a single context.

In the case of accessible video, Google describes a 2-stage pipeline:

Offline indexing

The system generates dense visual and semantic descriptors over the video timeline.

These descriptors are stored in an index keyed by time and content.

Online retrieval-augmented generation (RAG)

At playback time, when a user asks a question such as “What is the character wearing right now?”, the system retrieves relevant descriptors.

A multimodal model conditions on these descriptors plus the question to generate a concise, descriptive answer.

This design supports interactive queries during playback, not just pre-recorded audio description tracks. The same pattern generalizes to physical navigation scenarios where the agent needs to reason over a sequence of observations and user queries.
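The pattern can be sketched in a few lines of Python. This is an illustration of time-keyed retrieval followed by generation, not Google’s implementation; the Descriptor structure and the answer_with_model stand-in are assumptions made for the example.

from dataclasses import dataclass

# Hypothetical sketch of the 2-stage pattern described above; the index layout
# and answer_with_model() are assumptions, not Google's actual NAI internals.

@dataclass
class Descriptor:
    t_start: float          # seconds into the video
    t_end: float
    text: str               # dense visual/semantic description for this span

def retrieve(index: list[Descriptor], playback_t: float, window: float = 15.0) -> list[Descriptor]:
    """Return descriptors whose time span overlaps the current playback window."""
    lo, hi = playback_t - window, playback_t + window
    return [d for d in index if d.t_end >= lo and d.t_start <= hi]

def answer_with_model(prompt: str) -> str:
    # Stand-in for an actual multimodal model invocation (e.g., Gemini).
    return f"(model answer based on {prompt.count('[')} retrieved descriptors)"

def answer(index: list[Descriptor], playback_t: float, question: str) -> str:
    context = "\n".join(f"[{d.t_start:.0f}-{d.t_end:.0f}s] {d.text}" for d in retrieve(index, playback_t))
    prompt = f"Scene descriptions:\n{context}\n\nViewer question: {question}\nAnswer concisely."
    return answer_with_model(prompt)

index = [Descriptor(0, 10, "A character in a red coat enters a kitchen."),
         Descriptor(10, 25, "She is now wearing a blue apron and stirring a pot.")]
print(answer(index, playback_t=18.0, question="What is the character wearing right now?"))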

Concrete NAI Prototypes

Google’s NAI research work is grounded in several deployed or piloted prototypes built with partner organizations such as RIT/NTID, The Arc of the United States, RNID, and Team Gleason.

StreetReaderAI

Built for blind and low-vision users navigating urban environments.

Combines an AI Describer that processes camera and geospatial data with an AI Chat interface for natural language queries.

Maintains a temporal model of the environment, which allows queries like ‘Where was that bus stop?’ and replies such as ‘It is behind you, about 12 meters away.’

Multimodal Agent Video Player (MAVP)

Focused on online video accessibility.

Uses the Gemini-based RAG pipeline above to provide adaptive audio descriptions.

Lets users control descriptive density, interrupt playback with questions, and receive answers grounded in indexed visual content.

Grammar Laboratory

A bilingual (American Sign Language and English) learning platform created by RIT/NTID with support from Google.org and Google.

Uses Gemini to generate individualized multiple-choice questions.

Presents content through ASL video, English captions, spoken narration, and transcripts, adapting modality and difficulty to each learner.

Design Process and Curb-Cut Effects

The NAI documentation describes a structured process: investigate, build and refine, then iterate based on feedback. In one case study on video accessibility, the team:

Defined target users across a spectrum from fully blind to sighted.

Ran co-design and user test sessions with about 20 participants.

Went through more than 40 iterations informed by 45 feedback sessions.

The resulting interfaces are expected to produce a curb-cut effect. Features built for users with disabilities – such as better navigation, voice interactions, and adaptive summarization – often improve usability for a much wider population, including non-disabled users who face time pressure, cognitive load, or environmental constraints.

Key Takeaways

Agent is the UI, not an add-on: Natively Adaptive Interfaces (NAI) treat a multimodal AI agent as the primary interaction layer, so accessibility is handled by the agent directly in the core UI, not as a separate overlay or post-hoc feature.

Orchestrator + sub-agents architecture: NAI uses a central Orchestrator that maintains shared context and routes work to specialized sub-agents (for example, summarization or settings adaptation), turning static navigation trees into dynamic, agent-driven modules.

Multimodal Gemini + RAG for adaptive experiences: Prototypes such as the Multimodal Agent Video Player build dense visual indexes and use retrieval-augmented generation with Gemini to support interactive, grounded Q&A during video playback and other rich media scenarios.

Real systems: StreetReaderAI, MAVP, Grammar Laboratory: NAI is instantiated in concrete tools: StreetReaderAI for navigation, MAVP for video accessibility, and Grammar Laboratory for ASL/English learning, all powered by multimodal agents.

Accessibility as a core design constraint: The framework encodes accessibility into configuration patterns (detect intent, add context, adjust settings) and leverages the curb-cut effect, where solving for disabled users improves robustness and usability for the broader user base.

Check out the Technical details here.
The post Google AI Introduces Natively Adaptive Interfaces (NAI): An Agentic Multimodal Accessibility Framework Built on Gemini for Adaptive UI Design appeared first on MarkTechPost.

Alibaba Open-Sources Zvec: An Embedded Vector Database Bringing SQLite …

The Alibaba Tongyi Lab research team released Zvec, an open-source, in-process vector database that targets edge and on-device retrieval workloads. It is positioned as ‘the SQLite of vector databases’ because it runs as a library inside your application and does not require any external service or daemon. It is designed for retrieval-augmented generation (RAG), semantic search, and agent workloads that must run locally on laptops, mobile devices, and other constrained hardware or edge devices.

The core idea is simple. Many applications now need vector search and metadata filtering but do not want to run a separate vector database service. Traditional server style systems are heavy for desktop tools, mobile apps, or command line utilities. An embedded engine that behaves like SQLite but for embeddings fits this gap.

https://zvec.org/en/blog/introduction/

Why embedded vector search matters for RAG?

RAG and semantic search pipelines need more than a bare index. They need vectors, scalar fields, full CRUD, and safe persistence. Local knowledge bases change as files, notes, and project states change.

Index libraries such as Faiss provide approximate nearest neighbor search but do not handle scalar storage, crash recovery, or hybrid queries. You end up building your own storage and consistency layer. Embedded extensions such as DuckDB-VSS add vector search to DuckDB but expose fewer index and quantization options and weaker resource control for edge scenarios. Service-based systems such as Milvus or managed vector clouds require network calls and separate deployment, which is often overkill for on-device tools.

Zvec is aimed squarely at these local scenarios. It gives you a vector-native engine with persistence, resource governance, and RAG-oriented features, packaged as a lightweight library.

Core architecture: in-process and vector-native

Zvec is implemented as an embedded library. You install it with pip install zvec and open collections directly in your Python process. There is no external server or RPC layer. You define schemas, insert documents, and run queries through the Python API.

The engine is built on Proxima, Alibaba Group’s high-performance, production-grade, battle-tested vector search engine. Zvec wraps Proxima with a simpler API and embedded runtime. The project is released under the Apache 2.0 license.

Current support covers Python 3.10 to 3.12 on Linux x86_64, Linux ARM64, and macOS ARM64.

The design goals are explicit:

Embedded execution in process

Vector native indexing and storage

Production ready persistence and crash safety

This makes it suitable for edge devices, desktop applications, and zero-ops deployments.

Developer workflow: from install to semantic search

The quickstart documentation shows a short path from install to query.

Install the package with pip install zvec.

Define a CollectionSchema with one or more vector fields and optional scalar fields.

Call create_and_open to create or open the collection on disk.

Insert Doc objects that contain an ID, vectors, and scalar attributes.

Build an index and run a VectorQuery to retrieve nearest neighbors.

pip install zvec

Example:

import zvec

# Define collection schema
schema = zvec.CollectionSchema(
    name="example",
    vectors=zvec.VectorSchema("embedding", zvec.DataType.VECTOR_FP32, 4),
)

# Create collection
collection = zvec.create_and_open(path="./zvec_example", schema=schema)

# Insert documents
collection.insert([
    zvec.Doc(id="doc_1", vectors={"embedding": [0.1, 0.2, 0.3, 0.4]}),
    zvec.Doc(id="doc_2", vectors={"embedding": [0.2, 0.3, 0.4, 0.1]}),
])

# Search by vector similarity
results = collection.query(
    zvec.VectorQuery("embedding", vector=[0.4, 0.3, 0.3, 0.1]),
    topk=10,
)

# Results: list of {'id': str, 'score': float, ...}, sorted by relevance
print(results)

Results come back as dictionaries that include IDs and similarity scores. This is enough to build a local semantic search or RAG retrieval layer on top of any embedding model.
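As a sketch of that idea, the snippet below reuses only the calls shown in the quickstart above and plugs in a placeholder embed() function; in practice you would replace embed() with a real embedding model and a matching vector dimension.

import zvec

# Sketch of a tiny local semantic-search layer on top of the quickstart API
# shown above. embed() is a stand-in for whatever embedding model you use;
# here it only has to return a fixed-length list of floats (dim=4 to match
# the example schema).

def embed(text: str, dim: int = 4) -> list[float]:
    # Placeholder embedding: replace with a real sentence-embedding model.
    values = [float((hash(text) >> (8 * i)) % 100) for i in range(dim)]
    norm = sum(v * v for v in values) ** 0.5 or 1.0
    return [v / norm for v in values]

schema = zvec.CollectionSchema(
    name="notes",
    vectors=zvec.VectorSchema("embedding", zvec.DataType.VECTOR_FP32, 4),
)
collection = zvec.create_and_open(path="./zvec_notes", schema=schema)

docs = {"note_1": "Renew the TLS certificate before Friday",
        "note_2": "Draft the quarterly RAG evaluation report"}
collection.insert([zvec.Doc(id=k, vectors={"embedding": embed(v)}) for k, v in docs.items()])

results = collection.query(zvec.VectorQuery("embedding", vector=embed("certificate renewal")), topk=2)
print(results)  # list of {'id': ..., 'score': ...} dicts, most similar first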

Performance: VectorDBBench and 8,000+ QPS

Zvec is optimized for high throughput and low latency on CPUs. It uses multithreading, cache friendly memory layouts, SIMD instructions, and CPU prefetching.

In VectorDBBench on the Cohere 10M dataset, with comparable hardware and matched recall, Zvec reports more than 8,000 QPS. This is more than 2× the previous leaderboard #1, ZillizCloud, while also substantially reducing index build time in the same setup.

https://zvec.org/en/blog/introduction/

These metrics show that an embedded library can reach cloud level performance for high volume similarity search, as long as the workload resembles the benchmark conditions.

RAG capabilities: CRUD, hybrid search, fusion, reranking

The feature set is tuned for RAG and agentic retrieval.

Zvec supports:

Full CRUD on documents so the local knowledge base can change over time.

Schema evolution to adjust index strategies and fields.

Multi vector retrieval for queries that combine several embedding channels.

A built in reranker that supports weighted fusion and Reciprocal Rank Fusion.

Scalar vector hybrid search that pushes scalar filters into the index execution path, with optional inverted indexes for scalar attributes.

This allows you to build on device assistants that mix semantic retrieval, filters such as user, time, or type, and multiple embedding models, all within one embedded engine.

Key Takeaways

Zvec is an embedded, in-process vector database positioned as the ‘SQLite of vector databases’ for on-device and edge RAG workloads.

It is built on Proxima, Alibaba’s high-performance, production-grade, battle-tested vector search engine, and is released under Apache 2.0 with Python support on Linux x86_64, Linux ARM64, and macOS ARM64.

Zvec delivers >8,000 QPS on VectorDBBench with the Cohere 10M dataset, achieving more than 2× the previous leaderboard #1 (ZillizCloud) while also reducing index build time.

The engine provides explicit resource governance via 64 MB streaming writes, optional mmap mode, experimental memory_limit_mb, and configurable concurrency, optimize_threads, and query_threads for CPU control.

Zvec is RAG ready with full CRUD, schema evolution, multi vector retrieval, built in reranking (weighted fusion and RRF), and scalar vector hybrid search with optional inverted indexes, plus an ecosystem roadmap targeting LangChain, LlamaIndex, DuckDB, PostgreSQL, and real device deployments.

Check out the Technical details and Repo.
The post Alibaba Open-Sources Zvec: An Embedded Vector Database Bringing SQLite-like Simplicity and High-Performance On-Device RAG to Edge Applications appeared first on MarkTechPost.

How Amazon uses Amazon Nova models to automate operational readiness t …

Amazon is a global ecommerce and technology company that operates a vast network of fulfillment centers to store, process, and ship products to customers worldwide. The Amazon Global Engineering Services (GES) team is responsible for facilitating operational readiness across the company’s rapidly expanding network of fulfillment centers. When launching new fulfillment centers, Amazon must verify that each facility is properly equipped and ready for operations. This process is called operational readiness testing (ORT) and typically requires 2,000 hours of manual effort per facility to verify over 200,000 components across 10,500 workstations. Using Amazon Nova models, we’ve developed an automated solution that significantly reduces verification time while improving accuracy.
In this post, we discuss how Amazon Nova in Amazon Bedrock can be used to implement an AI-powered image recognition solution that automates the detection and validation of module components, significantly reducing manual verification efforts and improving accuracy.
Understanding the ORT Process
ORT is a comprehensive verification process that makes sure components are properly installed before a fulfillment center is ready for launch. The bill of materials (BOM) serves as the master checklist, detailing every component that should be present in each module of the facility. Each component or item in the fulfillment center is assigned a unique identification number (UIN) that serves as its distinct identifier. These identifiers are essential for accurate tracking, verification, and inventory management throughout the ORT process and beyond. In this post we will refer to UINs and components interchangeably.
The ORT workflow has five components:

Testing plan: Testers receive a testing plan, which includes a BOM that details the exact components and quantities required
Walk through: Testers walk through the fulfillment center and stop at each module to review the setup against the BOM. A module is a physical workstation or operational area
Verify: They verify proper installation and configuration of each UIN
Test: They perform functional testing (for example, power and connectivity) on each component
Document: They document results for each UIN and move to the next module

Finding the Right Approach
We evaluated multiple approaches to address the ORT automation challenge, with a focus on using image recognition capabilities from foundation models (FMs). Key factors in the decision-making process include:
Image Detection Capability: We selected Amazon Nova Pro for image detection after testing multiple AI models including Anthropic Claude Sonnet, Amazon Nova Pro, Amazon Nova Lite and Meta AI Segment Anything Model (SAM). Nova Pro met the criteria for production implementation.
Amazon Nova Pro Features:
Object Detection Capabilities

Purpose-built for object detection
Provides precise bounding box coordinates
Consistent detection results with bounding boxes

Image Processing

Built-in image resizing to a fixed aspect ratio
No manual resizing needed

Performance

Higher Request per Minute (RPM) quota on Amazon Bedrock
Higher Tokens per Minute (TPM) throughput
Cost-effective for large-scale detection

Serverless Architecture: We used AWS Lambda and Amazon Bedrock to maintain a cost-effective, scalable solution that didn’t require complex infrastructure management or model hosting.
Additional contextual understanding: To improve detection and reduce false positives, we used Anthropic Claude Sonnet 4.0 to generate text descriptions for each UIN and create detection parameters.
Solution Overview
The Intelligent Operational Readiness (IORA) solution includes several key services and is depicted in the architecture diagram that follows:

API Gateway: Amazon API Gateway handles user requests and routes to the appropriate Lambda functions
Synchronous Image Processing: Amazon Bedrock Nova Pro analyzes images with 2-5 second response times
Progress Tracking: The system tracks UIN detection progress (% UINs detected per module)
Data Storage: Amazon Simple Storage Service (S3) is used to store module images, UIN reference pictures, and results. Amazon DynamoDB is used for storing structured verification data
Compute: AWS Lambda is used for image analysis and data operations
Model inference: Amazon Bedrock is used for real-time inference for object detection as well as batch inference for description generation

Description Generation Pipeline
The description generation pipeline is the first of the key systems that work together to automate the ORT process. It creates a standardized knowledge base for component identification and runs as a batch process when new modules are introduced. Images taken at the fulfillment center have different lighting conditions and camera angles, which can impact the model’s ability to consistently detect the right component. By using high-quality reference images, we can generate standardized descriptions for each UIN. We then generate detection rules using the BOM, which lists the required UINs in each module along with their quantities and specifications. This process makes sure that each UIN has a standardized description and appropriate detection rules, creating a robust foundation for the subsequent detection and evaluation processes.
The workflow is as follows:

Admin uploads UIN images and BOM data
Lambda function triggers two parallel processes:

Path A: UIN description generation

Process each UIN’s reference images through Claude Sonnet 4.0
Generate detailed UIN descriptions
Consolidate multiple descriptions into one description per UIN
Store consolidated descriptions in DynamoDB

Path B: Detection rule creation

Combine UIN descriptions with BOM data
Generate module-specific detection rules
Create false positive detection patterns
Store rules in DynamoDB

# UIN Description Generation Process
def generate_uin_descriptions(uin_images, bedrock_client):
    """
    Generate enhanced UIN descriptions using Claude Sonnet
    """
    for uin_id, image_set in uin_images.items():
        # First pass: Generate initial descriptions from multiple angles
        initial_descriptions = []
        for image in image_set:
            response = bedrock_client.invoke_model(
                modelId='anthropic.claude-4-sonnet-20240229-v1:0',
                body=json.dumps({
                    'messages': [
                        {
                            'role': 'user',
                            'content': [
                                {'type': 'image', 'source': {'type': 'base64', 'data': image}},
                                {'type': 'text', 'text': 'Describe this UIN component in detail, including physical characteristics, typical installation context, and identifying features.'}
                            ]
                        }
                    ]
                })
            )
            initial_descriptions.append(response['content'][0]['text'])

        # Second pass: Consolidate and enrich descriptions
        consolidated_description = consolidate_descriptions(initial_descriptions, bedrock_client)

        # Store in DynamoDB for quick retrieval
        store_uin_description(uin_id, consolidated_description)

False positive detection patterns
To improve output consistency, we optimized the prompt by adding additional rules for common false positives. This helps filter out objects that are not relevant for detection. For instance, gate triangle signs should have a gate number and a red arrow, and generic triangle signs should not be detected.

- generic_object: "Any triangular sign or warning marker"
  confused_with: "SIGN.GATE.TRIANGLE"
  distinguishing_features:
    - "Gate number text in black at top (e.g., 'GATE 2350')"
    - "Red downward-pointing arrow at bottom"
    - "Red border with white background"
    - "Black mounting system with suspension hardware"
  trap_description: "Generic triangle sign ≠ SIGN.GATE.TRIANGLE without gate number and red arrow"

UIN Detection Evaluation Pipeline
This pipeline handles real-time component verification. We input the images taken by the tester, module-specific detection rules, and the UIN descriptions to Nova Pro using Amazon Bedrock. The outputs are the detected UINs with bounding boxes, along with installation status, defect identification, and confidence scores.

# UIN Detection Configuration
detection_config = {
    'model_selection': 'nova-pro',  # or 'claude-sonnet'
    'module_config': module_id,
    'prompt_engineering': {
        'system_prompt': system_prompt_template,
        'agent_prompt': agent_prompt_template
    },
    'data_sources': {
        's3_images_path': f's3://amzn-s3-demo-bucket/images/{module_id}/',
        'descriptions_table': 'uin-descriptions',
        'ground_truth_path': f's3://amzn-s3-demo-bucket/ground-truth/{module_id}/'
    }
}

The Lambda function processes each module image using the selected configuration:

def detect_uins_in_module(image_data, module_bom, uin_descriptions):
    """
    Detect UINs in module images using Nova Pro
    """
    # Retrieve relevant UIN descriptions for the module
    relevant_descriptions = get_descriptions_for_module(module_bom, uin_descriptions)

    # Construct detection prompt with descriptions
    detection_prompt = f"""
    Analyze this module image to detect the following components:
    {format_uin_descriptions(relevant_descriptions)}
    For each UIN, provide:
    - Detection status (True/False)
    - Bounding box coordinates if detected
    - Confidence score
    - Installation status verification
    - Any visible defects
    """

    # Process with Amazon Bedrock Nova Pro
    response = bedrock_client.invoke_model(
        modelId='amazon.nova-pro-v1:0',
        body=json.dumps({
            'messages': [
                {
                    'role': 'user',
                    'content': [
                        {'type': 'image', 'source': {'type': 'base64', 'data': image_data}},
                        {'type': 'text', 'text': detection_prompt}
                    ]
                }
            ]
        })
    )
    return parse_detection_results(response)

End-to-End Application Pipeline
The application brings everything together and provides testers in the fulfillment center with a production-ready user interface. It also provides comprehensive analysis including precise UIN identification, bounding box coordinates, installation status verification, and defect detection with confidence scoring.
The workflow, which is reflected in the UI, is as follows:

A tester securely uploads the images to Amazon S3 from the frontend—either by taking a photo or uploading it manually. Images are automatically encrypted at rest in S3 using AWS Key Management Service (AWS KMS).
This triggers the verification, which calls the API endpoint for UIN verification. API calls between services use AWS Identity and Access Management (IAM) role-based authentication.
A Lambda function retrieves the images from S3.
Amazon Nova Pro detects required UINs from each image.
The results of the UIN detection are stored in DynamoDB with encryption enabled.

The following figure shows the UI after an image has been uploaded and processed. The information includes the UIN name, a description, when it was last updated, and so on.

The following image is of a dashboard in the UI that the user can use to review the results and manually override any inputs if necessary.
Results & Learnings
After building the prototype, we tested the solution in multiple fulfillment centers using Amazon Kindle tablets. We achieved 92% precision on a representative set of test modules with 2–5 seconds latency per image. Compared to manual operational readiness testing, IORA reduces the total testing time by 60%. Amazon Nova Pro was also able to identify missing labels from the ground truth data, which gave us an opportunity to improve the quality of the dataset.

“The precision results directly translate to time savings – 40% coverage equals 40% time reduction for our field teams. When the solution detects a UIN, our fulfillment center teams can confidently focus only on finding missing components.”
– Wayne Jones, Sr Program Manager, Amazon General Engineering Services

Key learnings:

Amazon Nova Pro excels at visual recognition tasks when provided with rich contextual descriptions and is more accurate than standalone image comparison.
Ground truth data quality significantly impacts model performance. The solution identified missing labels in the original dataset and helps improve human-labeled data.
Modules with fewer than 20 UINs performed best, and we saw performance degradation for modules with 40 or more UINs. Hierarchical processing is needed for modules with over 40 components.
The serverless architecture using Lambda and Amazon Bedrock provides cost-effective scalability without infrastructure complexity.

Conclusion
This post demonstrates how to use Amazon Nova and Anthropic Claude Sonnet in Amazon Bedrock to build an automated image recognition solution for operational readiness testing. We showed you how to:

Process and analyze images at scale using Amazon Nova models
Generate and enrich component descriptions to improve detection accuracy
Build a reliable pipeline for real-time component verification
Store and manage results efficiently using managed storage services

This approach can be adapted for similar use cases that require automated visual inspection and verification across various industries including manufacturing, logistics, and quality assurance. Moving forward, we plan to enhance the system’s capabilities, conduct pilot implementations, and explore broader applications across Amazon operations.
For more information about Amazon Nova and other foundation models in Amazon Bedrock, visit the Amazon Bedrock documentation page.

About the Authors
Bishesh Adhikari is a Senior ML Prototyping Architect at AWS with over a decade of experience in software engineering and AI/ML. Specializing in generative AI, LLMs, NLP, CV, and GeoSpatial ML, he collaborates with AWS customers to build solutions for challenging problems through co-development. His expertise accelerates customers’ journey from concept to production, tackling complex use cases across various industries. In his free time, he enjoys hiking, traveling, and spending time with family and friends.
Hin Yee Liu is a Senior GenAI Engagement Manager at AWS. She leads AI prototyping engagements on complex technical challenges, working closely with customers to deliver production-ready solutions leveraging Generative AI, AI/ML, Big Data, and Serverless technologies through agile methodologies. Outside of work, she enjoys pottery, travelling, and trying out new restaurants around London.
Akhil Anand is a Program Manager at Amazon, passionate about using technology and data to solve critical business problems and drive innovation. He focuses on using data as a core foundation and AI as a powerful layer to accelerate business growth. Akhil collaborates closely with tech and business teams at Amazon to translate ideas into scalable solutions, facilitating a strong user-first approach and rapid product development. Outside of work, Akhil enjoys continuous learning, collaborating with friends to build new solutions, and watching Formula 1.
Zakaria Fanna is a Senior AI Prototyping Engineer at Amazon with over 15 years of experience across diverse IT domains, including Networking, DevOps, Automation, and AI/ML. He specializes in rapidly developing Minimum Viable Products (MVPs) for internal users. Zakaria enjoys tackling challenging technical problems and helping customers scale their solutions by leveraging cutting-edge technologies. In his free time, Zakaria enjoys continuous learning, sports, and cherishes time spent with his children and family.
Elad Dwek is a Senior AI Business Developer at Amazon, working within Global Engineering, Maintenance, and Sustainability. He partners with stakeholders from business and tech side to identify opportunities where AI can enhance business challenges or completely transform processes, driving innovation from prototyping to production. With a background in construction and physical engineering, he focuses on change management, technology adoption, and building scalable, transferable solutions that deliver continuous improvement across industries. Outside of work, he enjoys traveling around the world with his family.
Palash Choudhury is a Software Development Engineer at AWS Corporate FP&A with over 10 years of experience across frontend, backend, and DevOps technologies. He specializes in developing scalable solutions for corporate financial allocation challenges and actively leverages AI/ML technologies to automate workflows and solve complex business problems. Passionate about innovation, Palash enjoys experimenting with emerging technologies to transform traditional business processes.

Iberdrola enhances IT operations using Amazon Bedrock AgentCore

Iberdrola, one of the world’s largest utility companies, has embraced cutting-edge AI technology to revolutionize its IT operations in ServiceNow. By using different agentic architectures, Iberdrola has transformed the way thousands of change requests and incident tickets are managed, streamlining processes and enhancing productivity across departments.
Through its partnership with AWS, Iberdrola implemented these agents in a groundbreaking solution using Amazon Bedrock AgentCore, targeting three key areas: optimizing change request validation in the draft phase, enriching incident management with contextual intelligence, and simplifying change model selection using conversational AI. These innovations reduce bottlenecks, help teams accelerate ticket resolution, and deliver consistent, high-quality data handling throughout the organization.
Amazon Bedrock AgentCore helps Iberdrola deploy production-ready AI agents seamlessly. With serverless compute capabilities, robust security, and integrated observability, the platform helps Iberdrola scale solutions across departments while adhering to enterprise-grade reliability and compliance standards.
Challenges with change and incident management
Iberdrola has simplified the multi-phase process of change management using AI-powered validation. A group of orchestrated agents makes sure requests align with intended modifications while formatting and verifying mandatory fields in real time. This optimized approach avoids manual resubmissions and drastically reduces processing times, helping teams focus on driving impactful outcomes.
Using a swarm of agents to perform contextual enrichment, Iberdrola’s networking department now processes incidents faster and with greater precision. This enrichment lets technicians access configuration item details, review related historical incidents, and categorize tickets by environment and alert types, enhancing response times and enabling teams to swiftly address critical issues.
Solution overview
Iberdrola establishes its agentic AI practice through a layered architecture that separates operational concerns while enabling seamless integration across IT workflows. ServiceNow serves as the primary input source, and a MicroGateway provides intelligent routing to direct requests to relevant agents. A dedicated data layer maintains enterprise information, processing raw ServiceNow data through extract, transform, and load (ETL) pipelines for agent consumption.

The architecture comprises three layers:

Agentic AI resources – This layer encompasses all agent deployments, Model Context Protocol (MCP) servers for standardized data access, authentication mechanisms, and memory objects that maintain contextual information. The design enables domain-specific agent development while sharing common infrastructure services.
Inference layer – A streamlined abstraction provides large language model (LLM) inference capabilities from the organization’s portfolio of integrated models. This layer provides consistent model access patterns while supporting experimentation without requiring agent modifications.
Data layer – A comprehensive information foundation contains operational data, analytical datasets, and transactional records. This layer enriches agent capabilities by providing access to historical patterns, real-time operational status, and contextual information necessary for intelligent decision-making.

This design enables three distinct use cases that address different operational challenges:

Enhanced change management validation – The first implementation supports the draft phase of Iberdrola’s change management process through a deterministic agentic workflow. Multiple specialized agents work in sequence to validate change model appropriateness and verify that mandatory fields contain correctly formatted information. When validation errors are detected, the system provides clear feedback to requesters before allowing progression to subsequent phases.
Intelligent incident enrichment – The incident management solution demonstrates multi-agent orchestration for Iberdrola’s Networking department. A master agent receives each incident and selectively engages specialized agents for tagging, contextual enrichment, similarity detection, and change impact analysis. This adaptive approach assists technicians by categorizing incidents, identifying related historical cases, and extracting configuration item details.
Conversational change model assistant – The third use case addresses the complexity of selecting appropriate change models through a conversational AI assistant. The agent collects information about technology families, change objectives, and deployment environments to recommend suitable change models. The system provides clickable recommendations that open pre-filled change forms, streamlining the change request process.

The conceptual architecture translates into a production-ready implementation through Amazon Bedrock AgentCore, which provides managed primitives for building and deploying enterprise AI agents. The serverless approach of Amazon Bedrock AgentCore enables Iberdrola to focus on agent logic rather than infrastructure management while providing scalability and operational reliability.

Amazon Bedrock AgentCore components
AgentCore Runtime serves as the foundation for agent deployment, accepting containerized agents built with any framework—in Iberdrola’s case, LangGraph—and deploying them through Amazon Elastic Container Registry (Amazon ECR) repositories. AgentCore Runtime maintains serverless characteristics, scaling based on request volume while providing session isolation. Each agent session can run up to 8 hours for complex workflows. Logs and metrics generated by AgentCore Runtime are automatically captured by AgentCore Observability. In addition, Iberdrola has configured explicit logging to their self-hosted Langfuse instance for centralized monitoring.
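To make the deployment model concrete, the following is a minimal sketch of a containerized agent entrypoint. It assumes the bedrock-agentcore Python SDK and its BedrockAgentCoreApp entrypoint decorator, and it stubs out the agent logic; the handler name and payload fields are illustrative, not Iberdrola's actual implementation.

# Minimal sketch; assumes the bedrock-agentcore SDK is installed and a hypothetical payload schema.
from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

@app.entrypoint
def handle_request(payload):
    # In Iberdrola's case this would invoke a LangGraph agent; here we only echo the input.
    ticket = payload.get("ticket", {})
    return {"status": "processed", "ticket_id": ticket.get("id")}

if __name__ == "__main__":
    # Serves the agent locally; AgentCore Runtime invokes the same entrypoint in production.
    app.run()
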
AgentCore Memory provides contextual continuity across agent interactions by maintaining memory objects per agent session. Using the memory object, agents can store and retrieve session state, conversation history, and intermediate processing results. This capability is essential for Iberdrola’s multi-step workflows where agents must maintain context across validation phases or incident enrichment processes.
AgentCore Gateway simplifies tool integration by acting as an MCP server that “MCPifies” external tools and services. Rather than requiring custom integration code for each data source, AgentCore Gateway provides standardized interfaces that agents can consume consistently. This approach is particularly valuable for Iberdrola’s ServiceNow endpoint connections.
AgentCore Identity manages both inbound and outbound authentication flows, integrating with Entra ID through OAuth 2.0 protocols. For inbound requests, AgentCore Identity validates bearer tokens and authorizes access to underlying resources. For outbound operations, it handles token acquisition and manages secure communication with downstream tools.
AgentCore Observability captures telemetry data from agents using OpenTelemetry standards and surfaces this information through Amazon CloudWatch. This integration provides comprehensive monitoring of operational metrics without requiring additional instrumentation.
Technical implementation
The implementation uses LiteLLM as a proxy layer for consistent access to Amazon Nova and Anthropic Claude models through Amazon Bedrock and various other models. This abstraction enables agents to interact with different model variants using standardized API calls while Amazon Bedrock Guardrails provides safety controls for model outputs.
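As a rough illustration of that proxy pattern, the snippet below calls a Bedrock-hosted model through LiteLLM's unified completion API. The model identifier and prompt are example placeholders; Amazon Bedrock Guardrails would be layered on top of this call in the actual system.

# Illustrative only: LiteLLM routes the same completion() call to different providers,
# so agents never hard-code a specific model SDK.
from litellm import completion

response = completion(
    model="bedrock/anthropic.claude-3-haiku-20240307-v1:0",  # example Bedrock model ID
    messages=[{"role": "user", "content": "Summarize this ServiceNow change request ..."}],
)
print(response.choices[0].message.content)
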
The architecture addresses Iberdrola’s enterprise security requirements through a virtual private cloud (VPC) configuration within AgentCore Runtime, so agents can securely access internal resources while maintaining network isolation. VPC endpoints provide secure communication with internal data sources without exposing traffic to the public internet.
Users initiate requests through ServiceNow, which communicates through a REST API to the MicroGateway that routes requests to appropriate use case agents. The data architecture implements a hybrid approach combining real-time operational access with enriched analytical datasets. Raw ServiceNow data flows through ETL processes into Amazon Simple Storage Service (Amazon S3) storage, then into Amazon Relational Database Service (Amazon RDS) databases enhanced with pgvector extensions for semantic search.
The logs and metrics generated by the agents deployed in AgentCore Runtime can be monitored using AgentCore Observability. In addition, Iberdrola uses self-hosted Langfuse on Amazon Elastic Kubernetes Service (Amazon EKS) for a holistic view of spans and traces generated by the LLMs and the agents.
Use case details
In this section, we discuss the implementation of two use cases mentioned earlier: enhanced change management and intelligent incident management.
Enhanced change management
The first use case demonstrates an agentic workflow that supports the draft phase of Iberdrola’s change management process through sequential agent execution within a single AgentCore Runtime. The workflow processes change requests through four specialized agents—Rule Extractor, Content Validator, AIM Model Analyst, and Phase Transition—with each agent receiving context from the previous step.
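As a rough sketch of how such a sequential pipeline can be expressed in LangGraph (the framework named later in this post), the example below wires four stub nodes in order. The state fields and the node logic are placeholders for illustration, not Iberdrola's production code.

# Minimal sketch of the sequential validation pipeline, with each specialized agent stubbed out.
from typing import List, TypedDict
from langgraph.graph import StateGraph, END

class ChangeState(TypedDict):
    change_request: dict
    findings: List[str]

def rule_extractor(state: ChangeState) -> ChangeState:
    # Hypothetical stub: pull mandatory-field rules for the selected change model.
    state["findings"].append("rules extracted")
    return state

def content_validator(state: ChangeState) -> ChangeState:
    # Hypothetical stub: check that mandatory fields are present and well formatted.
    state["findings"].append("content validated")
    return state

def aim_model_analyst(state: ChangeState) -> ChangeState:
    # Hypothetical stub: semantically match the request against available change models.
    state["findings"].append("change model checked")
    return state

def phase_transition(state: ChangeState) -> ChangeState:
    # Hypothetical stub: decide whether the draft can progress to the next phase.
    state["findings"].append("ready for next phase")
    return state

graph = StateGraph(ChangeState)
graph.add_node("rule_extractor", rule_extractor)
graph.add_node("content_validator", content_validator)
graph.add_node("aim_model_analyst", aim_model_analyst)
graph.add_node("phase_transition", phase_transition)

graph.set_entry_point("rule_extractor")
graph.add_edge("rule_extractor", "content_validator")
graph.add_edge("content_validator", "aim_model_analyst")
graph.add_edge("aim_model_analyst", "phase_transition")
graph.add_edge("phase_transition", END)

pipeline = graph.compile()
result = pipeline.invoke({"change_request": {"summary": "Patch router firmware"}, "findings": []})
print(result["findings"])
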
The implementation consists of the following key components:

Single runtime context flow – Agents operate within one AgentCore Runtime instance, maintaining seamless context and session state across the entire validation pipeline
LangGraph orchestration – Agents are defined as a graph structure, enabling visual workflow representation, conditional branching based on validation results, and comprehensive audit trails
Vector-enhanced validation – Pgvector-enabled PostgreSQL supports semantic similarity searches, enabling the AIM Model Analyst agent to match change models based on technical descriptions rather than keyword matching
Consistent processing – Change requests follow identical validation steps, meeting compliance requirements and quality standards

Intelligent incident management
The second use case demonstrates intelligent multi-agent orchestration for incident management, where a Smart Solver Agent analyzes incoming incidents and selectively engages specialized agents based on contextual needs. This implementation adapts processing steps to each incident’s unique characteristics, optimizing resource utilization while providing comprehensive enrichment when needed.
The implementation consists of the following key components:

Intelligent orchestration – The Smart Solver Agent analyzes incident content and determines which specialized agents to invoke based on missing context and potential value-add
Specialized agent engagement – Five specialized agents (Tag Classifier, Incident Similarity, Incident Associator, Change Associator, Context Retriever) are available to provide enrichment based on the detail and complexity of the incident
Adaptive processing – The system adjusts enrichment activities based on incident complexity—simple incidents might only require tagging, whereas complex issues receive full contextual analysis

Lessons learned
The implementation of AI agents at Iberdrola demonstrates how the managed primitives of Amazon Bedrock AgentCore significantly accelerate enterprise AI deployment. Amazon Bedrock AgentCore minimized the infrastructure complexity typically required for agentic AI, helping teams focus on agent logic while achieving scalable and secure cloud resources.
“At Iberdrola, we’re extending our production AI platform with a new agentic capability powered by Amazon Bedrock AgentCore,” says Iñigo Gutierrez, AI Global Expert Engineer at Iberdrola. “By using a managed serverless runtime with built-in identity, memory, and observability, we can ship LangGraph-based agents that plan, call tools through MCP-style gateways, and operate securely inside our VPC. This feature moves us from point automations to reusable, production-grade agents—reducing engineering cognitive load and accelerating safe delivery across IT operations.”
Key success factors
The solution offers the following key benefits:

Purpose-built runtime – AgentCore Runtime provides fully managed, quick-start environments to host AI agents with complete session isolation. Additionally, out-of-the-box streaming, MCP, and A2A support from AgentCore Runtime alleviates the need to develop custom solutions and build support for these protocols.
Managed infrastructure – The serverless compute runtimes, identity, and memory services of Amazon Bedrock AgentCore minimize custom development overhead for enterprise-grade capabilities.
Enterprise security – VPC support and comprehensive tagging align with stringent IT requirements, accelerating development without compromising security standards.
Open and framework-agnostic – Amazon Bedrock AgentCore fits well with development guidelines because you can choose the development framework, such as LangGraph, by adding a simple decorator. Furthermore, it has no restrictions on using third-party or open-source solutions like Langfuse.
Scalable tool discovery – AgentCore Gateway automatically indexes tools and provides serverless semantic search, scaling from tens to hundreds of targets in a fully managed way.

Future roadmap
Iberdrola is considering the following future enhancements to the solution:

Agent catalog – Improve governance and discovery of agents seamlessly integrated into the Amazon Bedrock AgentCore ecosystem
New supported protocols and standards – Evolve Iberdrola’s agent development to use new protocols supported by AgentCore Runtime and other managed services (such as A2A)
Managed orchestration and real-time flow monitoring – Build platform-provided dashboards that automatically manage and monitor complex interactions between multiple AI agents, tools, or workflows

Conclusion
Iberdrola’s innovative implementation showcases its leadership and vision in using advanced AI technologies to transform its operational workflows. By adopting Amazon Bedrock AgentCore, Iberdrola has demonstrated how organizations can deploy production-ready AI agents with remarkable efficiency while meeting robust enterprise security and scalability standards. Through its strategic use of Amazon Bedrock AgentCore managed primitives, Iberdrola has realized substantial productivity gains and unparalleled improvements in data quality across its change and incident management processes. This successful transformation underscores Iberdrola’s commitment to excellence in using intelligent solutions to solve complex operational challenges. It also highlights the unique value proposition of Amazon Bedrock AgentCore: industry-first serverless compute for AI agents, integrated enterprise-grade security, and adaptable deployment patterns that accommodate diverse processing requirements. The platform’s ability to streamline infrastructure complexity while supporting specialized workflows makes it an ideal foundation for enterprise AI initiatives.
Organizations looking to implement AI agents in production environments can draw inspiration from Iberdrola’s architectural patterns and its effective execution of AI-driven solutions. Iberdrola’s success serves as a blueprint for accelerating deployments and achieving operational excellence with an Amazon Bedrock AgentCore managed approach, which reduces time-to-value and supports the scale and reliability demanded by enterprise AI systems.

About the authors
Talha Chattha is a Sr. Agentic AI Specialist SA at AWS, based in Stockholm. With 10+ years of experience working with AI, Talha now helps establish practices to ease the path to production for Agentic AI workloads. Talha is an expert in AgentCore and supports customers across EMEA. He is passionate about meta-agents, async patterns, advanced hierarchical solutions, and optimized context engineering for agents. When not shaping the future of AI, he explores scenic European landscapes and delicious cuisines. Connect with Talha on LinkedIn.
Unai Bermejo is a Global Expert AI Engineer at Iberdrola. With 10 years of experience in applied AI, AI research, and software engineering, Unai now helps Iberdrola establish best practices and frameworks in AI and agentic initiatives, aligned with corporate platforms and business needs. He acts as a technical bridge between AI technology, Cloud engineering teams, and business developers, driving the adoption of scalable, responsible, and high‑impact AI solutions across the organization.
Xabier Muruaga is the Global Head of AI and Data at Iberdrola. With over 15 years of experience in AI/ML and data‑driven architectures, he leads the company’s strategy and governance for secure, cloud‑native, and production‑ready AI platforms. His background across architecture, digital transformation, and energy technologies enables him to drive responsible, high‑impact AI and agentic initiatives across the organization.
Iñigo Gutierrez is a Global Cloud AI Engineer at Iberdrola with five years of experience in Cloud architecture, platform engineering, and AI enablement. Based in Bilbao, he is responsible for the design, evolution, and governance of the company’s corporate Cloud platforms, ensuring they provide a secure and scalable foundation for AI and digital transformation initiatives. Iñigo acts as a technical enabler between Cloud engineering teams, AI projects, and business units, promoting standardized practices, operational excellence, and the adoption of responsible, high‑impact AI solutions across the organization.

Building real-time voice assistants with Amazon Nova Sonic compared to …

Voice AI agents are reshaping how we interact with technology. From customer service and healthcare assistance to home automation and personal productivity, these intelligent virtual assistants are rapidly gaining popularity across industries. Their natural language capabilities, constant availability, and increasing sophistication make them valuable tools for businesses seeking efficiency and individuals desiring seamless digital experiences.
Amazon Nova Sonic delivers real-time, human-like voice conversations through the bidirectional streaming interface. It understands different speaking styles and generates expressive responses that adapt to both the words spoken and the way they are spoken. The model supports multiple languages and offers both masculine and feminine voices, making it ideal for customer support, marketing calls, voice assistants, and educational applications.
When compared with newer architectures such as Amazon Nova Sonic—which combines speech understanding and generation into a single end-to-end model—classic AI voice chat systems use cascading architectures with sequential processing. These systems process a user’s speech through a pipeline of separate components (a minimal skeleton of this loop follows the list):

Voice activity detection (VAD): A pre-processing VAD is required to detect when the user pauses or stops speaking.
Speech-to-text (STT): The user’s spoken words are converted into a written text format by an automatic speech recognition (ASR) model.
Large language model (LLM) processing: The transcribed text is then fed to an LLM or dialogue manager, which analyzes the input and generates a relevant textual response based on the conversation’s context.
Text-to-speech (TTS): The AI’s text-based reply is then converted back into natural-sounding spoken audio by a TTS model, which is then played to the user.
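The following framework-free sketch shows only the control flow of that cascade; every component is a hypothetical stub standing in for a real VAD, ASR, LLM, and TTS service, which makes the sequential latency of the four stages easy to see.

# A minimal, framework-free sketch of the cascaded pipeline above.
# Every component is a hypothetical stub; only the sequential control flow is the point.

def voice_activity_detector(audio: bytes) -> bool:
    return len(audio) == 0                        # stub: silence marks the end of the utterance

def speech_to_text(audio: bytes) -> str:
    return "what is the weather in seattle"       # stub ASR output

def llm_respond(text: str, history: list) -> str:
    history.append(text)
    return "It is currently raining in Seattle."  # stub LLM response

def text_to_speech(text: str) -> bytes:
    return text.encode()                          # stub: pretend this is synthesized audio

def handle_turn(audio_frames, history):
    buffered = b""
    for frame in audio_frames:
        if voice_activity_detector(frame):        # step 1: VAD detects the pause
            break
        buffered += frame
    transcript = speech_to_text(buffered)         # step 2: STT
    reply_text = llm_respond(transcript, history) # step 3: LLM
    return text_to_speech(reply_text)             # step 4: TTS

print(handle_turn([b"\x01\x02", b""], history=[]))
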

The following diagram illustrates the conceptual flow of how users interact with Nova Sonic for real-time voice conversations compared to a cascading voice assistant solution.

The core challenges of cascading architecture
While a cascading architecture offers benefits such as modular design, specialized components, and debuggability, cumulative latency and reduced interactivity are its drawbacks.
The cascade effect
Consider a voice assistant handling a simple weather query. In cascading pipelines, each processing step introduces latency and potential errors. Customer implementations showed how initial misinterpretations can compound through the pipeline, often resulting in irrelevant responses. This cascading effect complicated troubleshooting and negatively impacted overall user experience.
Time is everything
Real conversations require natural timing. Sequential processing can create noticeable delays in response times. These interruptions in conversational flow can lead to user friction.
The integration challenge
Voice AI demands more than just speech processing—it requires natural interaction patterns. Customer feedback highlighted how orchestrating multiple components made it difficult to handle dynamic conversation elements like interruptions or rapid exchanges. Engineering resources often focused more on pipeline management.
Resource reality
Cascading architectures require independent computing resources, monitoring, and maintenance for each component. This architectural complexity impacts both development velocity and operational efficiency. Scaling challenges intensify as conversation volumes increase, affecting system reliability and cost optimization.
Impact on voice assistant development
These insights drove key architectural decisions in Nova Sonic development, addressing the fundamental need for unified speech-to-speech processing that enables natural, responsive voice experiences without the complexity of multi-component management.
Comparing the two approaches
To compare the speech-to-speech and cascaded approach to building voice AI agents, consider the following:

Latency
Speech-to-speech (Nova Sonic): Optimized latency and time to first audio (TTFA). The latency of the Nova Sonic model is evaluated using the TTFA metric (reported TTFA of 1.09), which measures the elapsed time from the completion of a user’s spoken query until the first byte of response audio is received. See the technical report and model card.
Cascaded models: Potential added latency and errors. Cascaded pipelines combine multiple models across speech recognition, language understanding, and voice generation, and are challenged by added latency and potential error propagation between stages. Modern asynchronous orchestration frameworks such as Pipecat and LiveKit can minimize latency, and streaming components and text-to-speech fillers help maintain natural conversational flow and reduce delays.

Architecture and development complexity
Speech-to-speech (Nova Sonic): Simplified architecture. Nova Sonic combines speech-to-text, natural language understanding, and text-to-speech in one model with built-in tool use and barge-in detection, providing an event-driven architecture for key input and output events and a bidirectional streaming API for a simplified developer experience.
Cascaded models: Potential complexity in architecture. Developers need to select best-in-class models for each stage of the pipeline while orchestrating additional components such as asynchronous pipelines for delegated agents and tool use, TTS fillers, and voice activity detection (VAD).

Model selection and customization
Speech-to-speech (Nova Sonic): Less control over individual components. Amazon Nova Sonic allows customization of voices, built-in tool use, and integrations with Amazon Bedrock Knowledge Bases and Amazon Bedrock AgentCore. However, it offers less granular control over individual model components compared to fully modular cascaded systems.
Cascaded models: Granular control over each step. Cascaded pipelines allow individual tuning, replacement, and optimization of each model component, such as STT, language understanding, and TTS, independently. This includes models from Amazon Bedrock Marketplace, Amazon SageMaker AI, and fine-tuned models. This modularity enables flexible model selection, making it ideal for complex or specialized capabilities requiring tailored performance.

Cost structure
Speech-to-speech (Nova Sonic): Simplified cost structure through an integrated approach. Amazon Nova Sonic is priced on a token-based consumption model.
Cascaded models: Potential complexity in costs because each component’s cost must be estimated separately. This is especially important at scale and high volumes.

Language and accent support
Speech-to-speech (Nova Sonic): Languages supported by Nova Sonic.
Cascaded models: Potentially broader language support through specialized models, including the ability to switch languages mid-conversation.

Region availability
Speech-to-speech (Nova Sonic): Regions supported by Nova Sonic.
Cascaded models: Potentially broader Region support because of the broad selection of models and the ability to self-host models on Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon SageMaker.

The two approaches also have some shared traits.

Telephony and transport options
Both cascaded and speech-to-speech approaches support a variety of telephony and transport protocols such as WebRTC and WebSocket, enabling real-time, low-latency audio streaming over the web and phone networks. These protocols facilitate seamless, bidirectional audio exchange crucial for natural conversational experiences, allowing voice AI systems to integrate easily with existing communication infrastructures while maintaining responsiveness and audio quality.

Evaluations, observability, and testing
Both cascaded and speech-to-speech voice AI approaches can be systematically evaluated, observed, and tested for reliable comparison. Investing in a voice AI evaluation and observability system is recommended to gain confidence in production accuracy and performance. Such a system should be capable of tracing the entire input-to-output pipeline, capturing metrics and conversation data end-to-end to comprehensively assess quality, latency, and conversational robustness over time.

Developer frameworks
Both cascaded and speech-to-speech approaches are well supported by leading open-source voice AI frameworks like Pipecat and LiveKit. These frameworks provide modular, flexible pipelines and real-time processing capabilities that developers can use to build, customize, and orchestrate voice AI models efficiently across different components and interaction styles.

When to use each approach
The following diagram shows a practical framework to guide your architecture decision:

Use speech-to-speech when:

Simplicity of implementation is important
The use case fits within Nova Sonic’s capabilities
You’re looking for a real-time chat experience that feels human-like and delivers low latency

Use cascaded models when:

Customization of individual components is required
You need to use specialized models from the Amazon Bedrock Marketplace, Amazon SageMaker AI, or fine-tuned models for your specific domain
You need support for languages or accents not covered by Nova Sonic
The use case requires specialized processing at specific stages

Conclusion
In this post, you learned how Amazon Nova Sonic is designed to solve some of the challenges faced by cascaded approaches, simplify building voice AI agents, and provide natural conversational capabilities. We also provided guidance on when to choose each approach to help you make informed decisions for your voice AI projects. If you’re looking to enhance your cascaded voice system, you now have the basics of how to migrate to Nova Sonic so you can offer seamless, real-time conversational experiences with a simplified architecture.
To learn more, see Amazon Nova Sonic and contact your account team to explore how you can accelerate your voice AI initiatives.
Resources

Amazon Nova Sonic Technical Report and Model Card
Amazon Nova Sonic User Guide
Amazon Nova Sonic and Amazon Bedrock AgentCore
Amazon Nova Sonic Telephony Integration Guide
Amazon Nova Sonic and Pipecat
Amazon Nova Sonic and LiveKit

About the authors
Daniel Wirjo is a Solutions Architect at AWS, focused on AI and SaaS startups. As a former startup CTO, he enjoys collaborating with founders and engineering leaders to drive growth and innovation on AWS. Outside of work, Daniel enjoys taking walks with a coffee in hand, appreciating nature, and learning new ideas.
Ravi Thakur is a Sr Solutions Architect at AWS based in Charlotte, NC. He has cross‑industry experience across retail, financial services, healthcare, and energy & utilities, and specializes in solving complex business challenges using well‑architected cloud patterns. His expertise spans microservices, cloud‑native architectures, and generative AI. Outside of work, Ravi enjoys motorcycle rides and family getaways.
Lana Zhang is a Senior Specialist Solutions Architect for Generative AI at AWS within the Worldwide Specialist Organization. She specializes in AI/ML, with a focus on use cases such as AI voice assistants and multimodal understanding. She works closely with customers across diverse industries, including media and entertainment, gaming, sports, advertising, financial services, and healthcare, to help them transform their business solutions through AI.

Microsoft AI Proposes OrbitalBrain: Enabling Distributed Machine Learn …

Earth observation (EO) constellations capture huge volumes of high-resolution imagery every day, but most of it never reaches the ground in time for model training. Downlink bandwidth is the main bottleneck. Images can sit in orbit for days while ground models train on partial and delayed data.

Microsoft researchers introduced the OrbitalBrain framework as a different approach. Instead of using satellites only as sensors that relay data to Earth, it turns a nanosatellite constellation into a distributed training system. Models are trained, aggregated, and updated directly in space, using onboard compute, inter-satellite links, and predictive scheduling of power and bandwidth.

OrbitalBrain: A Distributed Framework For Training ML Models in Space

The BentPipe Bottleneck

Most commercial constellations use the BentPipe model. Satellites collect images, store them locally, and dump them to ground stations whenever they pass overhead.

The research team evaluates a Planet-like constellation with 207 satellites and 12 ground stations. At maximum imaging rate, the system captures 363,563 images per day. With 300 MB per image and realistic downlink constraints, only 42,384 images can be transmitted in that period, around 11.7% of what was captured. Even if images are compressed to 100 MB, only 111,737 images, about 30.7%, reach the ground within 24 hours.
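A quick back-of-the-envelope check reproduces those percentages from the reported counts (the daily downlink capacity itself is implied by the paper's numbers rather than restated here):

# Reproduce the downlink fractions reported for the Planet-like constellation.
captured_per_day = 363_563      # images captured at maximum imaging rate

downlinked_300mb = 42_384       # images that fit the daily downlink at 300 MB/image
downlinked_100mb = 111_737      # images that fit at 100 MB/image after compression

print(f"300 MB images reaching ground: {downlinked_300mb / captured_per_day:.1%}")  # ~11.7%
print(f"100 MB images reaching ground: {downlinked_100mb / captured_per_day:.1%}")  # ~30.7%
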

Limited onboard storage adds another constraint. Old images must be deleted to make room for new ones, which means many potentially useful samples are never available for ground-based training.

Why Conventional Federated Learning is not Enough

Federated learning (FL) seems like an obvious fit for satellites. Each satellite could train locally and send model updates to a ground server for aggregation. The research team evaluates several FL baselines adapted to this setting:

AsyncFL

SyncFL

FedBuff

FedSpace

However, these methods assume more stable communication and more flexible power than satellites can provide. When the research team simulates realistic orbital dynamics, intermittent ground contact, limited power, and non-i.i.d. data across satellites, these baselines show unstable convergence and large accuracy drops, in the range of 10%–40% compared to idealized conditions.

The time-to-accuracy curves flatten and oscillate, especially when satellites are isolated from ground stations for long periods. Many local updates become stale before they can be aggregated.

OrbitalBrain: Constellation-Centric Training in Space

OrbitalBrain starts from 3 observations:

Constellations are usually operated by a single commercial entity, so raw data can be shared across satellites.

Orbits, ground station visibility, and solar power are predictable from orbital elements and power models.

Inter-satellite links (ISLs) and onboard accelerators are now practical on nano-satellites.

The framework exposes 3 actions for each satellite in a scheduling window:

Local Compute (LC): train the local model on stored images.

Model Aggregation (MA): exchange and aggregate model parameters over ISLs.

Data Transfer (DT): exchange raw images between satellites to reduce data skew.

A controller running in the cloud, reachable via ground stations, computes a predictive schedule for each satellite. The schedule decides which action to prioritize in each future window, based on forecasts of energy, storage, orbital visibility, and link opportunities.
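The paper's exact utility functions are not reproduced here, but the shape of the controller's decision can be sketched as a per-window argmax over the three actions, with a compute utility driven by staleness and loss and a data-transfer utility driven by label-distribution divergence. The scoring rules below are simplified assumptions for illustration only.

# Simplified, illustrative scheduler: for each satellite and window, pick the action
# (LC, MA, DT) with the highest utility, subject to an energy budget.
# The utility formulas are stand-ins, not the paper's actual formulation.
from scipy.spatial.distance import jensenshannon

def compute_utility(staleness: float, local_loss: float) -> float:
    return staleness * local_loss              # stale, high-loss models benefit most from local training

def aggregation_utility(isl_available: bool, staleness: float) -> float:
    return staleness if isl_available else 0.0

def transfer_utility(isl_available: bool, my_labels, peer_labels) -> float:
    if not isl_available:
        return 0.0
    return float(jensenshannon(my_labels, peer_labels))  # more label skew -> more value in sharing data

def schedule_window(sat: dict, energy_budget: float) -> str:
    utilities = {
        "LC": compute_utility(sat["staleness"], sat["loss"]),
        "MA": aggregation_utility(sat["isl"], sat["staleness"]),
        "DT": transfer_utility(sat["isl"], sat["labels"], sat["peer_labels"]),
    }
    # Drop actions the satellite cannot power in this window.
    feasible = {a: u for a, u in utilities.items() if sat["energy_cost"][a] <= energy_budget}
    return max(feasible, key=feasible.get) if feasible else "idle"

sat = {
    "staleness": 3.0, "loss": 0.8, "isl": True,
    "labels": [0.7, 0.2, 0.1], "peer_labels": [0.3, 0.3, 0.4],
    "energy_cost": {"LC": 5.0, "MA": 2.0, "DT": 3.0},
}
print(schedule_window(sat, energy_budget=4.0))
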

Core Components: Profiler, MA, DT, Executor

Guided performance profiler

Model aggregation over ISLs

Data transferrer for label rebalancing

Executor

Experimental setup

OrbitalBrain is implemented in Python on top of the CosmicBeats orbital simulator and the FLUTE federated learning framework. Onboard compute is modeled as an NVIDIA Jetson Orin Nano 4GB GPU, with power and communication parameters calibrated from public satellite and radio specifications.

The research team simulates 24-hour traces for 2 real constellations:

Planet: 207 satellites with 12 ground stations.

Spire: 117 satellites.

They evaluate 2 EO classification tasks:

fMoW: around 360k RGB images, 62 classes, DenseNet-161 with the last 5 layers trainable.

So2Sat: around 400k multispectral images, 17 classes, ResNet-50 with the last 5 layers trainable.

Results: faster time-to-accuracy and higher accuracy

OrbitalBrain is compared with BentPipe, AsyncFL, SyncFL, FedBuff, and FedSpace under full physical constraints.

For fMoW, after 24 hours:

Planet: OrbitalBrain reaches 52.8% top-1 accuracy.

Spire: OrbitalBrain reaches 59.2% top-1 accuracy.

For So2Sat:

Planet: 47.9% top-1 accuracy.

Spire: 47.1% top-1 accuracy.

These results improve over the best baseline by 5.5%–49.5%, depending on dataset and constellation.

In terms of time-to-accuracy, OrbitalBrain achieves 1.52×–12.4× speedup compared to state-of-the-art ground-based or federated learning approaches. These gains come from letting satellites that cannot currently reach a ground station keep contributing by aggregating models over ISLs, and from rebalancing data distributions via DT.

Ablation studies show that disabling MA or DT significantly degrades both convergence speed and final accuracy. Additional experiments indicate that OrbitalBrain remains robust when cloud cover hides part of the imagery, when only a subset of satellites participate, and when image sizes and resolutions vary.

Implications for satellite AI workloads

OrbitalBrain demonstrates that model training can move into space and that satellite constellations can act as distributed ML systems, not just data sources. By coordinating local training, model aggregation, and data transfer under strict bandwidth, power, and storage constraints, the framework enables fresher models for tasks like forest fire detection, flood monitoring, and climate analytics, without waiting days for data to reach terrestrial data centers.

Key Takeaways

BentPipe downlink is the core bottleneck: Planet-like EO constellations can only downlink about 11.7% of captured 300 MB images per day, and about 30.7% even with 100 MB compression, which severely limits ground-based model training.

Standard federated learning fails under real satellite constraints: AsyncFL, SyncFL, FedBuff, and FedSpace degrade by 10%–40% in accuracy when realistic orbital dynamics, intermittent links, power limits, and non-i.i.d. data are applied, leading to unstable convergence.

OrbitalBrain co-schedules compute, aggregation, and data transfer in orbit: A cloud controller uses forecasts of orbit, power, storage, and link opportunities to select Local Compute, Model Aggregation via ISLs, or Data Transfer per satellite, maximizing a utility function per action.

Label rebalancing and model staleness are handled explicitly: A guided profiler tracks model staleness and loss to define compute utility, while the data transferrer uses Jensen–Shannon divergence on label histograms to drive raw-image exchanges that reduce non-i.i.d. effects.

OrbitalBrain delivers higher accuracy and up to 12.4× faster time-to-accuracy: In simulations on Planet and Spire constellations with fMoW and So2Sat, OrbitalBrain improves final accuracy by 5.5%–49.5% over BentPipe and FL baselines and achieves 1.52×–12.4× speedups in time-to-accuracy.

Check out the Paper.
The post Microsoft AI Proposes OrbitalBrain: Enabling Distributed Machine Learning in Space with Inter-Satellite Links and Constellation-Aware Resource Optimization Strategies appeared first on MarkTechPost.

A Coding Implementation to Establish Rigorous Prompt Versioning and Re …

In this tutorial, we show how we treat prompts as first-class, versioned artifacts and apply rigorous regression testing to large language model behavior using MLflow. We design an evaluation pipeline that logs prompt versions, prompt diffs, model outputs, and multiple quality metrics in a fully reproducible manner. By combining classical text metrics with semantic similarity and automated regression flags, we demonstrate how we can systematically detect performance drift caused by seemingly small prompt changes. Throughout the tutorial, we focus on building a workflow that mirrors real software engineering practices, applied to prompt engineering and LLM evaluation. Check out the FULL CODES here.

!pip -q install -U "openai>=1.0.0" mlflow rouge-score nltk sentence-transformers scikit-learn pandas

import os, json, time, difflib, re
from typing import List, Dict, Any, Tuple

import mlflow
import pandas as pd
import numpy as np

from openai import OpenAI
from rouge_score import rouge_scorer
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

if not os.getenv("OPENAI_API_KEY"):
    try:
        from google.colab import userdata  # type: ignore
        k = userdata.get("OPENAI_API_KEY")
        if k:
            os.environ["OPENAI_API_KEY"] = k
    except Exception:
        pass

if not os.getenv("OPENAI_API_KEY"):
    import getpass
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OPENAI_API_KEY (input hidden): ").strip()

assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is required."

We set up the execution environment by installing all required dependencies and importing the core libraries used throughout the tutorial. We securely load the OpenAI API key at runtime, ensuring credentials are never hard-coded in the notebook. We also initialize essential NLP resources to ensure the evaluation pipeline runs reliably across different environments.

MODEL = "gpt-4o-mini"
TEMPERATURE = 0.2
MAX_OUTPUT_TOKENS = 250

ABS_SEM_SIM_MIN = 0.78
DELTA_SEM_SIM_MAX_DROP = 0.05
DELTA_ROUGE_L_MAX_DROP = 0.08
DELTA_BLEU_MAX_DROP = 0.10

mlflow.set_tracking_uri("file:/content/mlruns")
mlflow.set_experiment("prompt_versioning_llm_regression")

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

EVAL_SET = [
    {
        "id": "q1",
        "input": "Summarize in one sentence: MLflow tracks experiments, runs, parameters, metrics, and artifacts.",
        "reference": "MLflow helps track machine learning experiments by logging runs with parameters, metrics, and artifacts."
    },
    {
        "id": "q2",
        "input": "Rewrite professionally: 'this model is kinda slow but it works ok.'",
        "reference": "The model is somewhat slow, but it performs reliably."
    },
    {
        "id": "q3",
        "input": "Extract key fields as JSON: 'Order 5531 by Alice costs $42.50 and ships to Toronto.'",
        "reference": '{"order_id":"5531","customer":"Alice","amount_usd":42.50,"city":"Toronto"}'
    },
    {
        "id": "q4",
        "input": "Answer briefly: What is prompt regression testing?",
        "reference": "Prompt regression testing checks whether prompt changes degrade model outputs compared to a baseline."
    },
]

PROMPTS = [
    {
        "version": "v1_baseline",
        "prompt": (
            "You are a precise assistant.\n"
            "Follow the user request carefully.\n"
            "If asked for JSON, output valid JSON only.\n"
            "User: {user_input}"
        )
    },
    {
        "version": "v2_formatting",
        "prompt": (
            "You are a helpful, structured assistant.\n"
            "Respond clearly and concisely.\n"
            "Prefer clean formatting.\n"
            "User request: {user_input}"
        )
    },
    {
        "version": "v3_guardrailed",
        "prompt": (
            "You are a rigorous assistant.\n"
            "Rules:\n"
            "1) If user asks for JSON, output ONLY valid minified JSON.\n"
            "2) Otherwise, keep the answer short and factual.\n"
            "User: {user_input}"
        )
    },
]

We define all experimental configurations, including model parameters, regression thresholds, and MLflow tracking settings. We construct the evaluation dataset and explicitly declare multiple prompt versions to compare and test for regressions. By centralizing these definitions, we ensure that prompt changes and evaluation logic remain controlled and reproducible.

def call_llm(formatted_prompt: str) -> str:
    resp = client.responses.create(
        model=MODEL,
        input=formatted_prompt,
        temperature=TEMPERATURE,
        max_output_tokens=MAX_OUTPUT_TOKENS,
    )
    out = getattr(resp, "output_text", None)
    if out:
        return out.strip()
    try:
        texts = []
        for item in resp.output:
            if getattr(item, "type", "") == "message":
                for c in item.content:
                    if getattr(c, "type", "") in ("output_text", "text"):
                        texts.append(getattr(c, "text", ""))
        return "\n".join(texts).strip()
    except Exception:
        return ""

smooth = SmoothingFunction().method3
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def safe_tokenize(s: str) -> List[str]:
    s = (s or "").strip().lower()
    if not s:
        return []
    try:
        return nltk.word_tokenize(s)
    except LookupError:
        return re.findall(r"\b\w+\b", s)

def bleu_score(ref: str, hyp: str) -> float:
    r = safe_tokenize(ref)
    h = safe_tokenize(hyp)
    if len(h) == 0 or len(r) == 0:
        return 0.0
    return float(sentence_bleu([r], h, smoothing_function=smooth))

def rougeL_f1(ref: str, hyp: str) -> float:
    scores = rouge.score(ref or "", hyp or "")
    return float(scores["rougeL"].fmeasure)

def semantic_sim(ref: str, hyp: str) -> float:
    embs = embedder.encode([ref or "", hyp or ""], normalize_embeddings=True)
    return float(cosine_similarity([embs[0]], [embs[1]])[0][0])

We implement the core LLM invocation and evaluation metrics used to assess prompt quality. We compute BLEU, ROUGE-L, and semantic similarity scores to capture both surface-level and semantic differences in model outputs. It allows us to evaluate prompt changes from multiple complementary perspectives rather than relying on a single metric.

def evaluate_prompt(prompt_template: str) -> Tuple[pd.DataFrame, Dict[str, float], str]:
    rows = []
    for ex in EVAL_SET:
        p = prompt_template.format(user_input=ex["input"])
        y = call_llm(p)
        ref = ex["reference"]

        rows.append({
            "id": ex["id"],
            "input": ex["input"],
            "reference": ref,
            "output": y,
            "bleu": bleu_score(ref, y),
            "rougeL_f1": rougeL_f1(ref, y),
            "semantic_sim": semantic_sim(ref, y),
        })

    df = pd.DataFrame(rows)
    agg = {
        "bleu_mean": float(df["bleu"].mean()),
        "rougeL_f1_mean": float(df["rougeL_f1"].mean()),
        "semantic_sim_mean": float(df["semantic_sim"].mean()),
    }
    outputs_jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)
    return df, agg, outputs_jsonl

def log_text_artifact(text: str, artifact_path: str):
    mlflow.log_text(text, artifact_path)

def prompt_diff(old: str, new: str) -> str:
    a = old.splitlines(keepends=True)
    b = new.splitlines(keepends=True)
    return "".join(difflib.unified_diff(a, b, fromfile="previous_prompt", tofile="current_prompt"))

def compute_regression_flags(baseline: Dict[str, float], current: Dict[str, float]) -> Dict[str, Any]:
    d_sem = baseline["semantic_sim_mean"] - current["semantic_sim_mean"]
    d_rouge = baseline["rougeL_f1_mean"] - current["rougeL_f1_mean"]
    d_bleu = baseline["bleu_mean"] - current["bleu_mean"]

    flags = {
        "abs_semantic_fail": current["semantic_sim_mean"] < ABS_SEM_SIM_MIN,
        "drop_semantic_fail": d_sem > DELTA_SEM_SIM_MAX_DROP,
        "drop_rouge_fail": d_rouge > DELTA_ROUGE_L_MAX_DROP,
        "drop_bleu_fail": d_bleu > DELTA_BLEU_MAX_DROP,
        "delta_semantic": float(d_sem),
        "delta_rougeL": float(d_rouge),
        "delta_bleu": float(d_bleu),
    }
    flags["regression"] = any([flags["abs_semantic_fail"], flags["drop_semantic_fail"], flags["drop_rouge_fail"], flags["drop_bleu_fail"]])
    return flags

We build the evaluation and regression logic that runs each prompt against the evaluation set and aggregates results. We log prompt artifacts, prompt diffs, and evaluation outputs to MLflow, ensuring every experiment remains auditable. We also compute regression flags that automatically identify whether a prompt version degrades performance relative to the baseline. Check out the FULL CODES here.

print("Running prompt versioning + regression testing with MLflow...")
print(f"Tracking URI: {mlflow.get_tracking_uri()}")
print(f"Experiment: {mlflow.get_experiment_by_name('prompt_versioning_llm_regression').name}")

run_summary = []
baseline_metrics = None
baseline_prompt = None
baseline_df = None
baseline_metrics_name = None

with mlflow.start_run(run_name=f"prompt_regression_suite_{int(time.time())}") as parent_run:
    mlflow.set_tag("task", "prompt_versioning_regression_testing")
    mlflow.log_param("model", MODEL)
    mlflow.log_param("temperature", TEMPERATURE)
    mlflow.log_param("max_output_tokens", MAX_OUTPUT_TOKENS)
    mlflow.log_param("eval_set_size", len(EVAL_SET))

    for pv in PROMPTS:
        ver = pv["version"]
        prompt_t = pv["prompt"]

        with mlflow.start_run(run_name=ver, nested=True) as child_run:
            mlflow.log_param("prompt_version", ver)
            log_text_artifact(prompt_t, f"prompts/{ver}.txt")

            if baseline_prompt is not None and baseline_metrics_name is not None:
                diff = prompt_diff(baseline_prompt, prompt_t)
                log_text_artifact(diff, f"prompt_diffs/{baseline_metrics_name}_to_{ver}.diff")
            else:
                log_text_artifact("BASELINE_PROMPT (no diff)", f"prompt_diffs/{ver}.diff")

            df, agg, outputs_jsonl = evaluate_prompt(prompt_t)

            mlflow.log_dict(agg, f"metrics/{ver}_agg.json")
            log_text_artifact(outputs_jsonl, f"outputs/{ver}_outputs.jsonl")

            mlflow.log_metric("bleu_mean", agg["bleu_mean"])
            mlflow.log_metric("rougeL_f1_mean", agg["rougeL_f1_mean"])
            mlflow.log_metric("semantic_sim_mean", agg["semantic_sim_mean"])

            if baseline_metrics is None:
                baseline_metrics = agg
                baseline_prompt = prompt_t
                baseline_df = df
                baseline_metrics_name = ver
                flags = {"regression": False, "delta_bleu": 0.0, "delta_rougeL": 0.0, "delta_semantic": 0.0}
                mlflow.set_tag("regression", "false")
            else:
                flags = compute_regression_flags(baseline_metrics, agg)
                mlflow.log_metric("delta_bleu", flags["delta_bleu"])
                mlflow.log_metric("delta_rougeL", flags["delta_rougeL"])
                mlflow.log_metric("delta_semantic", flags["delta_semantic"])
                mlflow.set_tag("regression", str(flags["regression"]).lower())
                for k in ["abs_semantic_fail", "drop_semantic_fail", "drop_rouge_fail", "drop_bleu_fail"]:
                    mlflow.set_tag(k, str(flags[k]).lower())

            run_summary.append({
                "prompt_version": ver,
                "bleu_mean": agg["bleu_mean"],
                "rougeL_f1_mean": agg["rougeL_f1_mean"],
                "semantic_sim_mean": agg["semantic_sim_mean"],
                "delta_bleu_vs_baseline": float(flags.get("delta_bleu", 0.0)),
                "delta_rougeL_vs_baseline": float(flags.get("delta_rougeL", 0.0)),
                "delta_semantic_vs_baseline": float(flags.get("delta_semantic", 0.0)),
                "regression_flag": bool(flags["regression"]),
                "mlflow_run_id": child_run.info.run_id,
            })

summary_df = pd.DataFrame(run_summary).sort_values("prompt_version")
print("\n=== Aggregated Results (higher is better) ===")
display(summary_df)

regressed = summary_df[summary_df["regression_flag"] == True]
if len(regressed) > 0:
    print("\nRegressions detected:")
    display(regressed[["prompt_version", "delta_bleu_vs_baseline", "delta_rougeL_vs_baseline", "delta_semantic_vs_baseline", "mlflow_run_id"]])
else:
    print("\nNo regressions detected under current thresholds.")

if len(regressed) > 0 and baseline_df is not None:
    worst_ver = regressed.sort_values("delta_semantic_vs_baseline", ascending=False).iloc[0]["prompt_version"]
    worst_prompt = next(p["prompt"] for p in PROMPTS if p["version"] == worst_ver)
    worst_df, _, _ = evaluate_prompt(worst_prompt)

    merged = baseline_df[["id", "output", "bleu", "rougeL_f1", "semantic_sim"]].merge(
        worst_df[["id", "output", "bleu", "rougeL_f1", "semantic_sim"]],
        on="id",
        suffixes=("_baseline", f"_{worst_ver}")
    )
    merged["delta_semantic"] = merged["semantic_sim_baseline"] - merged[f"semantic_sim_{worst_ver}"]
    merged["delta_rougeL"] = merged["rougeL_f1_baseline"] - merged[f"rougeL_f1_{worst_ver}"]
    merged["delta_bleu"] = merged["bleu_baseline"] - merged[f"bleu_{worst_ver}"]
    print(f"\n=== Per-example deltas: baseline vs {worst_ver} (positive delta = worse) ===")
    display(
        merged[["id", "delta_semantic", "delta_rougeL", "delta_bleu", "output_baseline", f"output_{worst_ver}"]]
        .sort_values("delta_semantic", ascending=False)
    )

print("\nOpen MLflow UI (optional) by running:")
print("!mlflow ui --backend-store-uri file:/content/mlruns --host 0.0.0.0 --port 5000")

We orchestrate the full prompt regression testing workflow using nested MLflow runs. We compare each prompt version against the baseline, log metric deltas, and record regression outcomes in a structured summary table. This completes a repeatable, engineering-grade pipeline for prompt versioning and regression testing that we can extend to larger datasets and real-world applications.

In conclusion, we established a practical, research-oriented framework for prompt versioning and regression testing that enables us to evaluate LLM behavior with discipline and transparency. We showed how MLflow enables us to track prompt evolution, compare outputs across versions, and automatically flag regressions based on well-defined thresholds. This approach helps us move away from ad hoc prompt tuning and toward measurable, repeatable experimentation. By adopting this workflow, we ensured that prompt updates improve model behavior intentionally rather than introducing hidden performance regressions.

Check out the FULL CODES here.
The post A Coding Implementation to Establish Rigorous Prompt Versioning and Regression Testing Workflows for Large Language Models using MLflow appeared first on MarkTechPost.

Meet OAT: The New Action Tokenizer Bringing LLM-Style Scaling and Flex …

Robots are entering their GPT-3 era. For years, researchers have tried to train robots using the same autoregressive (AR) models that power large language models (LLMs). If a model can predict the next word in a sentence, it should be able to predict the next move for a robotic arm. However, a technical wall has blocked this progress: continuous robot movements are difficult to turn into discrete tokens.

A team of researchers from Harvard University and Stanford University have released a new framework called Ordered Action Tokenization (OAT) to bridge this gap.

https://arxiv.org/pdf/2602.04215

The Messy Reality of Robot Actions

Tokenization turns complex data into a sequence of discrete numbers (tokens). For robots, these actions are continuous signals like joint angles. Previous strategies had fatal flaws:

Binning: Turns every action dimension into a discrete ‘bin.’ While simple, it creates massive token sequences that make training and inference slow (a toy example follows this list).

FAST (Frequency-space Action Sequence Tokenization): Uses math to compress movements into frequency coefficients. It is fast but often produces ‘undecodable’ sequences where small errors cause the robot to halt or move unpredictably.

Learned Latent Tokenizers: These use a learned ‘dictionary’ of movements. They are safe but lack a specific order, meaning the model treats early and late tokens as equally important.
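To make the binning baseline concrete, here is a toy uniform-binning tokenizer for a short action chunk; the bin count and action range are arbitrary choices for illustration.

import numpy as np

NUM_BINS = 256          # arbitrary choice for this toy example
LOW, HIGH = -1.0, 1.0   # assumed normalized action range

def binning_tokenize(actions: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to a discrete bin index."""
    clipped = np.clip(actions, LOW, HIGH)
    bins = np.floor((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1))
    return bins.astype(int).flatten()   # one token per (timestep, dimension)

chunk = np.random.uniform(LOW, HIGH, size=(16, 7))   # 16 timesteps x 7 joint dimensions
tokens = binning_tokenize(chunk)
print(tokens.shape)   # (112,) -- the long sequences are exactly the drawback noted above
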


The Three Golden Rules of OAT

The research team identified 3 essential properties—desiderata—for a functional robot tokenizer:

High Compression (P.1): Token sequences must be short to keep models efficient.

Total Decodability (P.2): The decoder must be a total function, ensuring every possible token sequence maps to a valid movement.

Causal Ordering (P.3): Tokens must have a left-to-right structure where early tokens capture global motion and later tokens refine details.

The Secret Sauce: Nested Dropout and Registers

OAT uses a transformer encoder with register tokens to summarize action chunks. To force the model to learn the most important information first, the research team used an approach called Nested Dropout.
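The idea behind nested dropout can be shown in a few lines: during training, sample a prefix length and zero out every register token beyond it, so the decoder can only rely on the earlier tokens and is forced to pack coarse, global information into them. This is a toy illustration of the trick, not the paper's exact training code.

import numpy as np

def nested_dropout(tokens: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Keep a random prefix of the ordered tokens and zero out the rest."""
    num_tokens = tokens.shape[0]
    k = rng.integers(1, num_tokens + 1)   # prefix length sampled each training step
    mask = np.zeros((num_tokens, 1))
    mask[:k] = 1.0
    return tokens * mask

rng = np.random.default_rng(0)
register_tokens = rng.normal(size=(8, 16))   # 8 ordered action tokens, 16-dim each
truncated = nested_dropout(register_tokens, rng)
print((truncated != 0).any(axis=1))          # True only for the surviving prefix
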


Breaking the Benchmarks

The research team tested OAT across 20+ tasks in 4 major simulation benchmarks. OAT consistently outperformed the industry-standard Diffusion Policy (DP) and previous tokenizers.

Performance Results

Benchmark    OAT Success Rate    DP Success Rate    Bin Token Count    OAT Token Count
LIBERO       56.3%               36.6%              224                8
RoboMimic    73.1%               67.1%              224                8
MetaWorld    24.4%               19.3%              128                8
RoboCasa     54.6%               54.0%              384                8

‘Anytime’ Inference: Speed vs. Precision

The most practical benefit of OAT is prefix-based detokenization. Since the tokens are ordered by importance, you can stop the model early.

Coarse Actions: Decoding just 1 or 2 tokens gives the robot a general direction quickly, which is useful for low-latency tasks.

Fine Actions: Generating all 8 tokens provides the high-precision details needed for complex insertions.

This allows for a smooth trade-off between computation cost and action fidelity that previous fixed-length tokenizers could not offer.

Key Takeaways

Solving the Tokenization Gap: OAT addresses a fundamental limitation in applying autoregressive models to robotics by introducing a learned tokenizer that simultaneously achieves high compression, total decodability, and causal ordering.

Ordered Representation via Nested Dropout: By utilizing nested dropout during training, OAT forces the model to prioritize global, coarse motion patterns in early tokens while reserving later tokens for fine-grained refinements.

Total Decodability and Reliability: Unlike prior frequency-domain methods like FAST, OAT ensures the detokenizer is a total function, meaning every possible token sequence generates a valid action chunk, preventing runtime execution failures.

Flexible ‘Anytime’ Inference: The ordered structure enables prefix-based decoding, allowing robots to execute coarse actions from just one or two tokens to save computation or full eight-token sequences for high-precision tasks.

Superior Performance Across Benchmarks: Autoregressive policies equipped with OAT consistently outperform diffusion-based baselines and other tokenization schemes, achieving a 52.3% aggregate success rate and superior results in real-world ‘Pick & Place’ and ‘Stack Cups’ tasks.

Check out the Paper, Repo and Project Page.
The post Meet OAT: The New Action Tokenizer Bringing LLM-Style Scaling and Flexible, Anytime Inference to the Robotics World appeared first on MarkTechPost.

Automated Reasoning checks rewriting chatbot reference implementation

Today, we are publishing a new open source sample chatbot that shows how to use feedback from Automated Reasoning checks to iterate on the generated content, ask clarifying questions, and prove the correctness of an answer.
The chatbot implementation also produces an audit log that includes mathematically verifiable explanations for the answer validity and a user interface that shows developers the iterative, rewriting process happening behind the scenes. Automated Reasoning checks use logical deduction to automatically demonstrate that a statement is correct. Unlike large language models, Automated Reasoning tools are not guessing or predicting accuracy. Instead, they rely on mathematical proofs to verify compliance with policies. This blog post dives deeper into the implementation architecture for the Automated Reasoning checks rewriting chatbot.
Improve accuracy and transparency with Automated Reasoning checks
LLMs can sometimes generate responses that sound convincing but contain factual errors—a phenomenon known as hallucination. Automated Reasoning checks validate a user’s question and an LLM-generated answer, giving rewriting feedback that points out ambiguous statements, assertions that are too broad, and factually incorrect claims based on ground truth knowledge encoded in Automated Reasoning policies.
A chatbot that uses Automated Reasoning checks to iterate on its answers before presenting them to users improves accuracy, because it can make precise statements that explicitly answer users’ yes/no questions without leaving room for ambiguity. It also improves transparency, because it can provide mathematically verifiable proofs of why its statements are correct, making generative AI applications auditable and explainable even in regulated environments.
Now that you understand the benefits, let’s explore how you can implement this in your own applications.
Chatbot reference implementation
The chatbot is a Flask application that exposes APIs to submit questions and check the status of an answer. To show the inner workings of the system, the APIs also let you retrieve information about the status of each iteration, the feedback from Automated Reasoning checks, and the rewriting prompt sent to the LLM.
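A stripped-down version of such a Flask backend might look like the following; the route names, request fields, and in-memory store are hypothetical stand-ins for the reference implementation's actual APIs.

# Hypothetical skeleton of the chatbot backend; not the repository's actual code.
import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)
ANSWERS = {}  # in-memory stand-in for the real job store

@app.post("/questions")
def submit_question():
    question = request.get_json()["question"]
    job_id = str(uuid.uuid4())
    # The real implementation kicks off the generate -> validate -> rewrite loop here.
    ANSWERS[job_id] = {"status": "in_progress", "question": question, "iterations": []}
    return jsonify({"job_id": job_id}), 202

@app.get("/questions/<job_id>")
def get_answer(job_id):
    # Exposes the answer status plus per-iteration findings and rewrite prompts.
    return jsonify(ANSWERS.get(job_id, {"status": "not_found"}))

if __name__ == "__main__":
    app.run(port=8080)
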
You can use the frontend NodeJS application to configure an LLM from Amazon Bedrock to generate answers, select an Automated Reasoning policy for validation, and set the maximum number of iterations to correct an answer. Selecting a chat thread in the user interface opens a debug panel on the right that displays each iteration on the content and the validation output.

Figure 1 – Chat interface with debug panel

Once Automated Reasoning checks confirm that a response is valid, the verifiable explanation for its validity is displayed.

Figure 2 – Automated Reasoning checks validity proof

How the iterative rewriting loop works
The open source reference implementation automatically helps improve chatbot answers by iterating on the feedback from Automated Reasoning checks and rewriting the response. When asked to validate a chatbot question and answer (Q&A), Automated Reasoning checks return a list of findings. Each finding represents an independent logical statement identified in the input Q&A. For example, for the Q&A “How much does S3 storage cost? In US East (N. Virginia), S3 costs $0.023/GB for the first 50 TB; in Asia Pacific (Sydney), S3 costs $0.025/GB for the first 50 TB” Automated Reasoning checks would produce two findings, one that validates the price for S3 in us-east-1 is $0.023, and one for ap-southeast-2.
When parsing a finding for a Q&A, Automated Reasoning checks separate the input into a list of factual premises and claims made against those premises. A premise can be a factual statement in the user question, like “I’m an S3 user in Virginia,” or an assumption laid out in the answer, like “For requests sent to us-east-1…” A claim represents a statement being verified. In our S3 pricing example from the previous paragraph, the Region would be a premise, and the price point would be a claim.
Each finding includes a validation result (VALID, INVALID, SATISFIABLE, TRANSLATION_AMBIGUOUS, IMPOSSIBLE) as well as the feedback necessary to rewrite the answer so that it is VALID. The feedback changes depending on the validation result. For example, ambiguous findings include two interpretations of the input text, and satisfiable findings include two scenarios that show how the claims could be true in some cases and false in others. You can see the possible finding types in our API documentation.
With this context out of the way, we can dive deeper into how the reference implementation works:
Initial response and validation
When the user submits a question through the UI, the application first calls the configured Bedrock LLM to generate an answer, then calls the ApplyGuardrail API to validate the Q&A.
Using the output from Automated Reasoning checks in the ApplyGuardrail response, the application enters a loop where each iteration checks the Automated Reasoning checks feedback, performs an action like asking the LLM to rewrite an answer based on the feedback, and then calls ApplyGuardrail to validate the updated content again.
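To make the validation step concrete, the following is a minimal sketch of an ApplyGuardrail call with a question and a generated answer, assuming a guardrail that has an Automated Reasoning policy attached. The exact location of the findings in the response is an assumption for illustration; the reference implementation and the API documentation are the authoritative sources.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def validate_qa(guardrail_id, guardrail_version, question, answer):
    # Send the user question as query context and the generated answer as guarded content
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="OUTPUT",
        content=[
            {"text": {"text": question, "qualifiers": ["query"]}},
            {"text": {"text": answer, "qualifiers": ["guard_content"]}},
        ],
    )
    # The Automated Reasoning findings are nested in the assessments; the key names
    # below are an assumption for illustration -- inspect your own response to confirm them.
    findings = []
    for assessment in response.get("assessments", []):
        policy_result = assessment.get("automatedReasoningPolicy", {})
        findings.extend(policy_result.get("findings", []))
    return findings
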
The rewriting loop (The heart of the system)
After the initial validation, the system uses the output from the Automated Reasoning checks to decide the next step. First, it sorts the findings based on their priority – addressing the most important first: TRANSLATION_AMBIGUOUS, IMPOSSIBLE, INVALID, SATISFIABLE, VALID. Then, it selects the highest priority finding and addresses it with the logic below. Since VALID is last in the prioritized list, the system will only accept something as VALID after addressing the other findings.

For TRANSLATION_AMBIGUOUS findings, the Automated Reasoning checks return two interpretations of the input text. For SATISFIABLE findings, the Automated Reasoning checks return two scenarios that prove and disprove the claims. Using the feedback, the application asks the LLM to decide whether to rewrite the answer to clarify ambiguities or to ask the user follow-up questions to gather additional information. For example, the SATISFIABLE feedback may say that the price of $0.023 is valid only if the Region is US East (N. Virginia). The LLM can use this information to ask about the application Region. When the LLM decides to ask follow-up questions, the loop pauses and waits for the user to answer the questions, then the LLM regenerates the answer based on the clarifications and the loop restarts.
For IMPOSSIBLE findings, the Automated Reasoning checks return a list of the rules that contradict the premises – accepted facts in the input content. Using the feedback, the application asks the LLM to rewrite the answer to avoid logical inconsistencies.
For INVALID findings, the Automated Reasoning checks return the rules from the Automated Reasoning policy that make the claims invalid based on the premises and policy rules. Using the feedback, the application asks the LLM to rewrite its answer so that it is consistent with the rules.
For VALID findings, the application exits the loop and returns the answer to the user.

After each answer rewrite, the system sends the Q&A to the ApplyGuardrail API for validation; the next iteration of the loop starts with the feedback from this call. Each iteration stores the findings and prompts with full context in the thread data structure, creating an audit trail of how the system arrived at the definitive answer.
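To make the control flow concrete, here is a small, hypothetical sketch of the prioritization and dispatch logic described above. The names are illustrative; the repository's ThreadProcessor contains the actual implementation.

# Priority order: the most important finding types are handled first; VALID only
# wins once no higher-priority finding remains.
PRIORITY = ["TRANSLATION_AMBIGUOUS", "IMPOSSIBLE", "INVALID", "SATISFIABLE", "VALID"]

def next_action(findings):
    """Pick the highest-priority finding and map it to the next loop action."""
    if not findings:
        return "ACCEPT", None
    top = min(findings, key=lambda f: PRIORITY.index(f["result"]))
    if top["result"] == "VALID":
        return "ACCEPT", top  # exit the loop and return the answer to the user
    if top["result"] in ("TRANSLATION_AMBIGUOUS", "SATISFIABLE"):
        return "REWRITE_OR_ASK", top  # the LLM decides: rewrite, or ask a clarifying question
    return "REWRITE", top  # IMPOSSIBLE / INVALID: rewrite using the returned rules
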
Getting Started with the Automated Reasoning checks rewriting chatbot
To try our reference implementation, the first step is to create an Automated Reasoning policy:

Navigate to Amazon Bedrock in the AWS Management Console in one of the supported Regions in the United States or European Regions.
From the left navigation, open the Automated Reasoning page in the Build category.
Using the dropdown menu of the Create policy button, choose Create sample policy.
Enter a name for the policy and then choose Create policy at the bottom of the page.

Once you have created a policy, you can proceed to download and run the reference implementation:

Clone the Amazon Bedrock Samples repository.
Follow the instructions in the README file to install dependencies, build the frontend, and start the application.
Using your preferred browser, navigate to http://localhost:8080 and start testing.

Backend implementation details
If you’re planning to adapt this implementation for production use, this section goes over the key components in the backend architecture. You will find these components in the backend directory of the repository.

ThreadManager: Orchestrates conversation lifecycle management. It handles the creation, retrieval, and status tracking of conversation threads, maintaining proper state throughout the rewriting process. The ThreadManager implements thread-safe operations using a lock to help prevent race conditions when multiple operations attempt to modify the same conversation simultaneously. It also tracks threads awaiting user input and can identify stale threads that have exceeded a configurable timeout.
ThreadProcessor: Handles the rewriting loop using a state machine pattern for clear, maintainable control flow. The processor manages state transitions between phases like GENERATE_INITIAL, VALIDATE, CHECK_QUESTIONS, HANDLE_RESULT, and REWRITING_LOOP, progressing the conversation correctly through each stage.
ValidationService: Integrates with Amazon Bedrock Guardrails. This service takes each LLM-generated response and submits it for validation using the ApplyGuardrail API. It handles the communication with AWS, manages retry logic with exponential backoff for transient failures, and parses the validation results into structured findings.
LLMResponseParser: Interprets the LLM’s intentions during the rewriting loop. When the system asks the LLM to fix an invalid response, the model must decide whether to attempt a rewrite (REWRITE), ask clarifying questions (ASK_QUESTIONS), or declare the task impossible due to contradictory premises (IMPOSSIBLE). The parser examines the LLM’s response for specific markers like “DECISION:”, “ANSWER:”, and “QUESTION:”, extracting structured information from natural language output. It handles markdown formatting gracefully and enforces limits on the number of questions (maximum 5).
AuditLogger: Writes structured JSON logs to a dedicated audit log file, recording two key event types: VALID_RESPONSE when a response passes validation, and MAX_ITERATIONS_REACHED when the system exhausts the set number of retry attempts. Each audit entry captures the timestamp, thread ID, prompt, response, model ID, and validation findings. The logger also extracts and records Q&A exchanges from clarification iterations, including whether the user answered or skipped the questions.

Together, these components help create a robust foundation for building trustworthy AI applications that combine the flexibility of large language models with the rigor of mathematical verification.
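For orientation, the state machine pattern used by the ThreadProcessor can be expressed in a few lines. The sketch below is hypothetical and only mirrors the phase names listed above, not the repository's actual transition logic.

from enum import Enum, auto

class Phase(Enum):
    GENERATE_INITIAL = auto()
    VALIDATE = auto()
    CHECK_QUESTIONS = auto()
    HANDLE_RESULT = auto()
    REWRITING_LOOP = auto()
    DONE = auto()

def step(phase):
    # Hypothetical happy-path transitions; the real ThreadProcessor also covers
    # clarifying questions, max-iteration limits, and error handling.
    return {
        Phase.GENERATE_INITIAL: Phase.VALIDATE,
        Phase.VALIDATE: Phase.CHECK_QUESTIONS,
        Phase.CHECK_QUESTIONS: Phase.HANDLE_RESULT,
        Phase.HANDLE_RESULT: Phase.REWRITING_LOOP,
        Phase.REWRITING_LOOP: Phase.VALIDATE,
    }.get(phase, Phase.DONE)
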
For detailed guidance on implementing Automated Reasoning checks in production:

Workshop: Generative AI Reliability with Automated Reasoning checks
Technical Blog: Minimize generative AI hallucinations with Amazon Bedrock Automated Reasoning checks
Use Case Blog: Build verifiable explainability into financial services workflows with Automated Reasoning checks for Amazon Bedrock Guardrails
Documentation: Amazon Bedrock Guardrails User Guide

About the authors

Stefano Buliani
Stefano is a Product Manager in the Automated Reasoning team at AWS.  With over 10 years at AWS, he has worked on serverless technologies, including open source projects like Serverless Java Container and has helped customers deploy hundreds of applications to production.

Scale LLM fine-tuning with Hugging Face and Amazon SageMaker AI

Enterprises are increasingly shifting from relying solely on large, general-purpose language models to developing specialized large language models (LLMs) fine-tuned on their own proprietary data. Although foundation models (FMs) offer impressive general capabilities, they often fall short when applied to the complexities of enterprise environments—where accuracy, security, compliance, and domain-specific knowledge are non-negotiable.
To meet these demands, organizations are adopting cost-efficient models tailored to their internal data and workflows. By fine-tuning on proprietary documents and domain-specific terminology, enterprises are building models that understand their unique context—resulting in more relevant outputs, tighter data governance, and simpler deployment across internal tools.
This shift is also a strategic move to reduce operational costs, improve inference latency, and maintain greater control over data privacy. As a result, enterprises are redefining their AI strategy as customized, right-sized models aligned to their business needs.
Scaling LLM fine-tuning for enterprise use cases presents real technical and operational hurdles, which are being overcome through the powerful partnership between Hugging Face and Amazon SageMaker AI.
Many organizations face fragmented toolchains and rising complexity when adopting advanced fine-tuning techniques like Low-Rank Adaptation (LoRA), QLoRA, and Reinforcement Learning with Human Feedback (RLHF). Additionally, the resource demands of large model training—including memory limitations and distributed infrastructure challenges—often slow down innovation and strain internal teams.
To overcome this, SageMaker AI and Hugging Face have joined forces to simplify and scale model customization. By integrating the Hugging Face Transformers libraries into SageMaker’s fully managed infrastructure, enterprises can now:

Run distributed fine-tuning jobs out of the box, with built-in support for parameter-efficient tuning methods
Use optimized compute and storage configurations that reduce training costs and improve GPU utilization
Accelerate time to value by using familiar open source libraries in a production-grade environment

This collaboration helps businesses focus on building domain-specific, right-sized LLMs, unlocking AI value faster while maintaining full control over their data and models.
In this post, we show how this integrated approach transforms enterprise LLM fine-tuning from a complex, resource-intensive challenge into a streamlined, scalable solution for achieving better model performance in domain-specific applications. We use the meta-llama/Llama-3.1-8B model, and execute a Supervised Fine-Tuning (SFT) job to improve the model’s reasoning capabilities on the MedReason dataset by using distributed training and optimization techniques, such as Fully-Sharded Data Parallel (FSDP) and LoRA with the Hugging Face Transformers library, executed with Amazon SageMaker Training Jobs.
Understanding the core concepts
The Hugging Face Transformers library is an open-source toolkit designed to fine-tune LLMs by enabling seamless experimentation and deployment with popular transformer models.
The Transformers library provides a broad set of capabilities for adapting LLMs to specific objectives, including:

Thousands of pre-trained models – Access to a vast collection of models like BERT, Meta Llama, Qwen, T5, and more, which can be used for tasks such as text classification, translation, summarization, question answering, object detection, and speech recognition.
Pipelines API – Simplifies common tasks (such as sentiment analysis, summarization, and image segmentation) by handling tokenization, inference, and output formatting in a single call.
Trainer API – Provides a high-level interface for training and fine-tuning models, supporting features like mixed precision, distributed training, and integration with popular hardware accelerators.
Tokenization tools – Efficient and flexible tokenizers for converting raw text into model-ready inputs, supporting multiple languages and formats.
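To give a feel for the library, the following short snippet uses the Pipelines API mentioned above for a one-line inference call; the model name is just an illustrative default and any compatible Hub model would work.

from transformers import pipeline

# Downloads a small illustrative sentiment model the first time it runs
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("Fine-tuning on SageMaker was easier than expected."))
# Example output: [{'label': 'POSITIVE', 'score': 0.99...}]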

SageMaker Training Jobs is a fully managed, on-demand machine learning (ML) service that runs remotely on AWS infrastructure to train a model using your data, code, and chosen compute resources. This service abstracts away the complexities of provisioning and managing the underlying infrastructure, so you can focus on developing and fine-tuning your ML and foundation models. Key capabilities offered by SageMaker training jobs are:

Fully managed – SageMaker handles resource provisioning, scaling, and management for your training jobs, so you don’t need to manually set up servers or clusters.
Flexible input – You can use built-in algorithms, pre-built containers, or bring your own custom training scripts and Docker containers, to execute training workloads with most popular frameworks such as the Hugging Face Transformers library.
Scalable – It supports single-node or distributed training across multiple instances, making it suitable for both small and large-scale ML workloads.
Integration with multiple data sources – Training data can be stored in Amazon Simple Storage Service (Amazon S3), Amazon FSx, and Amazon Elastic Block Store (Amazon EBS), and output model artifacts are saved back to Amazon S3 after training is complete.
Customizable – You can specify hyperparameters, resource types (such as GPU or CPU instances), and other settings for each training job.
Cost-efficient options – Features like managed Spot Instances, flexible training plans, and heterogeneous clusters help optimize training costs.

Solution overview
The following diagram illustrates the solution workflow of using the Hugging Face Transformers library with a SageMaker Training job.

The workflow consists of the following steps:

The user prepares the dataset by formatting it with the specific prompt style used for the selected model.
The user prepares the training script using the Hugging Face Transformers library, specifying the configuration for the selected distribution option, such as Distributed Data Parallel (DDP) or Fully-Sharded Data Parallel (FSDP).
The user submits an API request to SageMaker AI, passing the location of the training script, the Hugging Face Training container URI, and the training configurations required, such as distribution algorithm, instance type, and instance count.
SageMaker AI uses the training job launcher script to run the training workload on a managed compute cluster. Based on the selected configuration, SageMaker AI provisions the required infrastructure, orchestrates distributed training, and upon completion, automatically decommissions the cluster.

This streamlined architecture delivers a fully managed user experience, helping you quickly develop your training code, define training parameters, and select your preferred infrastructure. SageMaker AI handles the end-to-end infrastructure management with a pay-as-you-go pricing model that bills only for the net training time in seconds.
Prerequisites
You must complete the following prerequisites before you can run the Meta Llama 3.1 8B fine-tuning notebook:

Make the following quota increase requests for SageMaker AI. For this use case, you will need to request a minimum of 1 p4d.24xlarge instance (with 8 x NVIDIA A100 GPUs) and scale to more p4d.24xlarge instances (depending on time-to-train and cost-to-train trade-offs for your use case). To help determine the right cluster size for the fine-tuning workload, you can use tools like VRAM Calculator or “Can it run LLM”. On the Service Quotas console, request the following SageMaker AI quotas:

P4D instances (ml.p4d.24xlarge) for training job usage: 1

Create an AWS Identity and Access Management (IAM) role with managed policies AmazonSageMakerFullAccess and AmazonS3FullAccess to give required access to SageMaker AI to run the examples.
Assign the following policy as a trust relationship to your IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

(Optional) Create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. You can also use JupyterLab in your local setup.

These permissions grant broad access and are not recommended for use in production environments. See the SageMaker Developer Guide for guidance on defining more fine-grained permissions.
Prepare the dataset
To prepare the dataset, you must load the UCSC-VLAA/MedReason dataset. MedReason is a large-scale, high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in LLMs. The following table shows an example of the data.

dataset_name
id_in_dataset
question
answer
reasoning
options

medmcqa
7131
Urogenital Diaphragm is made up of the following…
Colle’s fascia. Explanation: Colle’s fascia do…
Finding reasoning paths:\n1. Urogenital diaphr…
Answer Choices:\nA. Deep transverse Perineus\n…

medmcqa
7133
Child with Type I Diabetes. What is the advise…
After 5 years. Explanation: Screening for diab…
**Finding reasoning paths:**\n\n1. Type 1 Diab…
Answer Choices:\nA. After 5 years\nB. After 2 …

medmcqa
7134
Most sensitive test for H pylori is-
Biopsy urease test. Explanation: Davidson&…
**Finding reasoning paths:**\n\n1. Consider th…
Answer Choices:\nA. Fecal antigen test\nB. Bio…

We want to use the following columns for preparing our dataset:

question – The question being posed
answer – The correct answer to the question
reasoning – A detailed, step-by-step logical explanation of how to arrive at the correct answer

We can use the following steps to format the input in the proper style used for Meta Llama 3.1, and configure the data channels for SageMaker training jobs on Amazon S3:

Load the UCSC-VLAA/MedReason dataset, using the first 10,000 rows of the original dataset:

from datasets import load_dataset

dataset = load_dataset("UCSC-VLAA/MedReason", split="train[:10000]")
Apply the proper chat template to the dataset by using the apply_chat_template method of the Tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

def prepare_dataset(sample):

    system_text = (
        "You are a deep-thinking AI assistant.\n\n"
        "For every user question, first write your thoughts and reasoning inside <think>…</think> tags, then provide your answer."
    )

    messages = []

    messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": sample["question"]})
    messages.append(
        {
            "role": "assistant",
            "content": f"<think>\n{sample['reasoning']}\n</think>\n\n{sample['answer']}",
        }
    )

    # Apply chat template
    sample["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False
    )

    return sample

The prepare_dataset function is mapped over the elements of the dataset and uses the apply_chat_template function to produce a prompt in the following form:

system
{{SYSTEM_PROMPT}}
user
{{QUESTION}}
assistant
<think>
{{REASONING}}
</think>

{{FINAL_ANSWER}}

The following code is an example of the formatted prompt:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a deep-thinking AI assistant.
For every user question, first write your thoughts and reasoning inside <think>…</think> tags, then provide your answer.
<|eot_id|><|start_header_id|>user<|end_header_id|>
A 66-year-old man presents to the emergency room with blurred vision, lightheadedness, and chest pain that started 30 minutes ago. The patient is awake and alert.
His history is significant for uncontrolled hypertension, coronary artery disease, and he previously underwent percutaneous coronary intervention.
He is afebrile. The heart rate is 102/min, the blood pressure is 240/135 mm Hg, and the O2 saturation is 100% on room air.
An ECG is performed and shows no acute changes. A rapid intravenous infusion of a drug that increases peripheral venous capacitance is started.
This drug has an onset of action that is less than 1 minute with rapid serum clearance that necessitates a continuous infusion. What is the most severe side effect of this medication?
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
<think>
### Finding Reasoning Paths:
1. **Blurred vision, lightheadedness, and chest pain** → Malignant hypertension → Rapid IV antihypertensive therapy.
2. **Uncontrolled hypertension and coronary artery disease** → Malignant hypertension → Rapid IV antihypertensive therapy.
3. **Severe hypertension (BP 240/135 mm Hg)** → Risk of end-organ damage → Malignant hypertension → Rapid IV antihypertensive therapy.
4. **Chest pain and history of coronary artery disease** → Risk of myocardial ischemia → Malignant hypertension → Rapid IV antihypertensive therapy.

### Reasoning Process:
1. **Clinical Presentation and Diagnosis**: – The patient presents with blurred vision…
</think>

Cyanide poisoning
<|eot_id|><|end_of_text|>

Split the dataset into train, validation, and test datasets:

from datasets import Dataset, DatasetDict
from random import randint

# train, val, and test are assumed to be pandas DataFrames produced by an earlier split of the loaded dataset
train_dataset = Dataset.from_pandas(train)
val_dataset = Dataset.from_pandas(val)
test_dataset = Dataset.from_pandas(test)

dataset = DatasetDict({"train": train_dataset, "val": val_dataset})
train_dataset = dataset["train"].map(
    prepare_dataset, remove_columns=list(train_dataset.features)
)

val_dataset = dataset["val"].map(
    prepare_dataset, remove_columns=list(val_dataset.features)
)

Prepare the training and validation datasets for the SageMaker training job by saving them as JSON Lines files and constructing the S3 paths where these files will be uploaded:

train_dataset.to_json("./data/train/dataset.jsonl")
val_dataset.to_json("./data/val/dataset.jsonl")

s3_client.upload_file(
    "./data/train/dataset.jsonl", bucket_name, f"{input_path}/train/dataset.jsonl"
)
s3_client.upload_file(
    "./data/val/dataset.jsonl", bucket_name, f"{input_path}/val/dataset.jsonl"
)

Prepare the training script
To fine-tune meta-llama/Llama-3.1-8B with a SageMaker Training job, we prepared the train.py file, which serves as the entry point of the training job to execute the fine-tuning workload.
The training process can use the Trainer or SFTTrainer classes to fine-tune our model. These classes simplify both continued pre-training and supervised fine-tuning of LLMs, making it efficient to adapt pre-trained models to specific tasks or domains.
The Trainer and SFTTrainer classes both facilitate model training with Hugging Face transformers. The Trainer class is the standard high-level API for training and evaluating transformer models on a wide range of tasks, including text classification, sequence labeling, and text generation. The SFTTrainer is a subclass built specifically for supervised fine-tuning of LLMs, particularly for instruction-following or conversational tasks.
To accelerate the model fine-tuning, we distribute the training workload by using the FSDP technique. It is an advanced parallelism technique designed to train large models that might not fit in the memory of a single GPU, with the following benefits:

Parameter sharding – Instead of replicating the entire model on each GPU, FSDP splits (shards) model parameters, optimizer states, and gradients across GPUs
Memory efficiency – By sharding, FSDP drastically reduces the memory footprint on each device, enabling training of larger models or larger batch sizes
Synchronization – During training, FSDP gathers only the necessary parameters for each computation step, then releases memory immediately after, further saving resources
CPU offload – Optionally, FSDP can offload some data to CPUs to save even more GPU memory

In our example, we use the Trainer class and define the required TrainingArguments to execute the FSDP distributed workload:

import transformers
from transformers import (
    Trainer,
    TrainingArguments,
)

trainer = Trainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=test_ds if test_ds is not None else None,
    args=TrainingArguments(
        **training_args,
    ),
    callbacks=callbacks,
    data_collator=transformers.DataCollatorForLanguageModeling(
        tokenizer, mlm=False
    ),
)

To further optimize the fine-tuning workload, we use the QLoRA technique, which quantizes a pre-trained language model to 4 bits and attaches small Low-Rank Adapters, which are fine-tuned:

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(script_args.model_id)

# Define PAD token
tokenizer.pad_token = tokenizer.eos_token

# Configure quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    script_args.model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    use_cache=not training_args.gradient_checkpointing,
    cache_dir="/tmp/.cache",
    **model_configs,
)
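
The quantized base model is then wrapped with the low-rank adapters. The reference train.py handles this internally; the following is a minimal sketch, assuming the peft library and that the lora_r, lora_alpha, and lora_dropout recipe values are exposed on script_args, of how the adapter attachment typically looks:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit quantized model for training
model = prepare_model_for_kbit_training(model)

# Adapter configuration built from the recipe values (field names are an assumption here)
peft_config = LoraConfig(
    r=script_args.lora_r,
    lora_alpha=script_args.lora_alpha,
    lora_dropout=script_args.lora_dropout,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # a common choice for Llama-style models
)

# Only the low-rank adapter weights remain trainable
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()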

The script_args and training_args are provided as hyperparameters for the SageMaker Training job in a configuration recipe .yaml file and parsed in the train.py file by using the TrlParser class provided by Hugging Face TRL:

model_id: "meta-llama/Llama-3.1-8B-Instruct" # Hugging Face model id
# sagemaker specific parameters
output_dir: "/opt/ml/model" # path to where SageMaker will upload the model
checkpoint_dir: "/opt/ml/checkpoints/" # path to where SageMaker will upload the model checkpoints
train_dataset_path: "/opt/ml/input/data/train/" # path where SageMaker mounts the train dataset from S3
val_dataset_path: "/opt/ml/input/data/val/" # path where SageMaker mounts the validation dataset from S3
save_steps: 100 # save a checkpoint every this many steps
token: ""
# training parameters
lora_r: 32
lora_alpha: 64
lora_dropout: 0.1
learning_rate: 2e-4 # learning rate scheduler
num_train_epochs: 2 # number of training epochs
per_device_train_batch_size: 4 # batch size per device during training
per_device_eval_batch_size: 2 # batch size for evaluation
gradient_accumulation_steps: 4 # number of steps before performing a backward/update pass
gradient_checkpointing: true # use gradient checkpointing
bf16: true # use bfloat16 precision
tf32: false # use tf32 precision
fsdp: "full_shard auto_wrap offload" # FSDP configuration
fsdp_config:
  backward_prefetch: "backward_pre"
  cpu_ram_efficient_loading: true
  offload_params: true
  forward_prefetch: false
  use_orig_params: true
warmup_steps: 100
weight_decay: 0.01
merge_weights: true # merge the adapter weights into the base model

For the implemented use case, we decided to fine-tune the adapter with the following values:

lora_r: 32 – Allows the adapter to capture more complex reasoning transformations.
lora_alpha: 64 – Given the reasoning task we are trying to improve, this value gives the adapter a significant impact on the base model.
lora_dropout: 0.05 – We want to preserve reasoning connections by avoiding breaking important ones.
warmup_steps: 100 – Gradually increases the learning rate to the specified value. For this reasoning task, we want the model to learn a new structure without forgetting the previous knowledge.
weight_decay: 0.01 – Maintains model generalization.

Prepare the configuration file for the SageMaker Training job by saving it as a YAML file and constructing the S3 path where the file will be uploaded:

import os

if default_prefix:
    input_path = f"{default_prefix}/datasets/llm-fine-tuning-modeltrainer-sft"
else:
    input_path = "datasets/llm-fine-tuning-modeltrainer-sft"

train_config_s3_path = f"s3://{bucket_name}/{input_path}/config/args.yaml"

# upload the model yaml file to s3
model_yaml = "args.yaml"
s3_client.upload_file(model_yaml, bucket_name, f"{input_path}/config/args.yaml")
os.remove("./args.yaml")

print("Training config uploaded to:")
print(train_config_s3_path)

SFT training using a SageMaker Training job
To run a fine-tuning workload using the SFT training script and SageMaker Training jobs, we use the ModelTrainer class.
The ModelTrainer class is a more intuitive approach to model training that significantly enhances the user experience and supports distributed training, Build Your Own Container (BYOC), and recipes. For additional information, refer to the SageMaker Python SDK documentation.
Set up the fine-tuning workload with the following steps:

Specify the instance type and retrieve the container image for the training job (the checkpoint path where the model is stored is configured later on the ModelTrainer):

from sagemaker import image_uris

instance_type = "ml.p4d.24xlarge"
instance_count = 1

image_uri = image_uris.retrieve(
    framework="huggingface",
    region=sagemaker_session.boto_session.region_name,
    version="4.56.2",
    base_framework_version="pytorch2.8.0",
    instance_type=instance_type,
    image_scope="training",
)

Define the source code configuration by pointing to the created train.py:

from sagemaker.train.configs import SourceCode

source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="train.py",
)

Configure the training compute by optionally providing the parameter keep_alive_period_in_seconds to use managed warm pools, to retain and reuse the cluster during the experimentation phase:

from sagemaker.train.configs import Compute

compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=0,
)

Create the ModelTrainer by providing the required training setup, and define the argument distributed=Torchrun() to use torchrun as a launcher to execute the training job in a distributed manner across the available GPUs in the selected instance:

from sagemaker.train.configs import (
    CheckpointConfig,
    OutputDataConfig,
    StoppingCondition,
)
from sagemaker.train.distributed import Torchrun
from sagemaker.train.model_trainer import ModelTrainer

# define Training Job Name
job_name = f"train-{model_id.split('/')[-1].replace('.', '-')}-sft"

# define OutputDataConfig path
output_path = f"s3://{bucket_name}/{job_name}"

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    distributed=Torchrun(),
    stopping_condition=StoppingCondition(max_runtime_in_seconds=18000),
    hyperparameters={
        "config": "/opt/ml/input/data/config/args.yaml"  # path to the TRL config that was uploaded to S3
    },
    output_data_config=OutputDataConfig(s3_output_path=output_path),
    checkpoint_config=CheckpointConfig(
        s3_uri=output_path + "/checkpoint", local_path="/opt/ml/checkpoints"
    ),
)

Set up the input channels for the ModelTrainer by creating InputData objects from the provided S3 bucket paths for the training and validation dataset, and for the configuration parameters:

from sagemaker.train.configs import InputData

# Pass the input data
train_input = InputData(
    channel_name="train",
    data_source=train_dataset_s3_path,  # S3 path where training data is stored
)
val_input = InputData(
    channel_name="val",
    data_source=val_dataset_s3_path,  # S3 path where validation data is stored
)
config_input = InputData(
    channel_name="config",
    data_source=train_config_s3_path,  # S3 path where configurations are stored
)
# Check input channels configured
data = [train_input, val_input, config_input]

Submit the training job:

model_trainer.train(input_data_config=data, wait=False)

The training job with Flash Attention 2 for one epoch with a dataset of 10,000 samples takes approximately 18 minutes to complete.
Deploy and test fine-tuned Meta Llama 3.1 8B on SageMaker AI
To evaluate your fine-tuned model, you have several options. You can use an additional SageMaker Training job to evaluate the model with Hugging Face Lighteval on SageMaker AI, or you can deploy the model to a SageMaker real-time endpoint and interactively test the model by using techniques like LLM as judge to compare generated content with ground truth content. For a more comprehensive evaluation that demonstrates the impact of fine-tuning on model performance, you can use the MedReason evaluation script to compare the base meta-llama/Llama-3.1-8B model with your fine-tuned version.
In this example, we use the deployment approach, iterating over the test dataset and evaluating the model on those samples using a simple loop.

Select the instance type and the container image for the endpoint:

import boto3

sm_client = boto3.client("sagemaker", region_name=sess.boto_region_name)

image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/vllm:0.13-gpu-py312"

Create the SageMaker Model using the container URI for vLLM and the S3 path to your model. Set your vLLM configuration, including the number of GPUs and max input tokens. For a full list of configuration options, see vLLM engine arguments.

import json

env = {
    "SM_VLLM_MODEL": "/opt/ml/model",
    "SM_VLLM_DTYPE": "bfloat16",
    "SM_VLLM_GPU_MEMORY_UTILIZATION": "0.8",
    "SM_VLLM_MAX_MODEL_LEN": json.dumps(1024 * 16),
    "SM_VLLM_MAX_NUM_SEQS": "1",
    "SM_VLLM_ENABLE_CHUNKED_PREFILL": "true",
    "SM_VLLM_KV_CACHE_DTYPE": "auto",
    "SM_VLLM_TENSOR_PARALLEL_SIZE": "4",
}

model_response = sm_client.create_model(
    ModelName=f"{model_id.split('/')[-1].replace('.', '-')}-model",
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": image_uri,
        "Environment": env,
        "ModelDataSource": {
            "S3DataSource": {
                "S3Uri": f"s3://{bucket_name}/{job_prefix}/{job_name}/output/model.tar.gz",
                "S3DataType": "S3Prefix",
                "CompressionType": "Gzip",
            }
        },
    },
)

Create the endpoint configuration by specifying the type and number of instances:

instance_count = 1
instance_type = "ml.g5.12xlarge"
health_check_timeout = 700

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=f"{model_id.split('/')[-1].replace('.', '-')}-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": f"{model_id.split('/')[-1].replace('.', '-')}-model",
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
            "ModelDataDownloadTimeoutInSeconds": health_check_timeout,
            "ContainerStartupHealthCheckTimeoutInSeconds": health_check_timeout,
            "InferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-3-1",
        }
    ],
)

Deploy the model:

endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{model_id.split('/')[-1].replace('.', '-')}-sft",
    EndpointConfigName=f"{model_id.split('/')[-1].replace('.', '-')}-config",
)

SageMaker AI will now create the endpoint and deploy the model to it. This can take 5–10 minutes. Afterwards, you can test the model by sending some example inputs to the endpoint. You can use the invoke_endpoint method of the sagemaker-runtime client to send the input to the model and get the output:

import json
import pandas as pd

eval_dataset = []

for index, el in enumerate(test_dataset, 1):
    print("Processing item ", index)

    payload = {
        "messages": [
            {
                "role": "system",
                "content": "You are a deep-thinking AI assistant.\n\nFor every user question, first write your thoughts and reasoning inside <think>…</think> tags, then provide your answer.",
            },
            {"role": "user", "content": el["question"]},
        ],
        "max_tokens": 4096,
        "stop": ["<|eot_id|>", "<|end_of_text|>"],
        "temperature": 0.4,
        "top_p": 0.9,
        "repetition_penalty": 1.15,
        "no_repeat_ngram_size": 3,
        "do_sample": True,
    }

    # predictor is a boto3 SageMaker Runtime client (for example, boto3.client("sagemaker-runtime"))
    # and endpoint_name is the name of the endpoint created above
    response = predictor.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )

    result = json.loads(response["Body"].read().decode())
    eval_dataset.append([el["question"], result["choices"][0]["message"]["content"]])

    print("**********************************************")

eval_dataset_df = pd.DataFrame(
    eval_dataset, columns=["question", "answer"]
)

eval_dataset_df.to_json(
    "./eval_dataset_results.jsonl", orient="records", lines=True
)

The following are some examples of generated answers:

Question: "Perl's stain or prussion blue test is for:"

Answer Fine-tuned: """
<think>
The Perl's stain or Prussian blue test is used to detect the presence of iron in biological samples.
It involves adding potassium ferrocyanide (K4[Fe(CN)6]) to the sample,
which reacts with the iron ions present in it to form a dark blue-colored compound known as ferric ferrocyanide.
This reaction can be observed visually, allowing researchers to determine if iron is present in the sample.
</think>

In simpler terms, the Perl's stain or Prussian blue test is used to identify iron in biological samples.
"""

The fine-tuned model shows strong reasoning capabilities by providing structured, detailed explanations with clear thought processes, breaking down the concepts step-by-step before arriving at the final answer. This example showcases the effectiveness of our fine-tuning approach using Hugging Face Transformers and a SageMaker Training job.
Clean up
To clean up your resources to avoid incurring additional charges, follow these steps:

Delete any unused SageMaker Studio resources.
(Optional) Delete the SageMaker Studio domain.
Verify that your training job isn’t running anymore. To do so, on the SageMaker console, under Training in the navigation pane, choose Training jobs.
Delete the SageMaker endpoint.

Conclusion
In this post, we demonstrated how enterprises can efficiently scale fine-tuning of both small and large language models by using the integration between the Hugging Face Transformers library and SageMaker Training jobs. This powerful combination transforms traditionally complex and resource-intensive processes into streamlined, scalable, and production-ready workflows.
Using a practical example with the meta-llama/Llama-3.1-8B model and the MedReason dataset, we demonstrated how to apply advanced techniques like FSDP and LoRA to reduce training time and cost—without compromising model quality.
This solution highlights how enterprises can effectively address common LLM fine-tuning challenges such as fragmented toolchains, high memory and compute requirements, and multi-node scaling inefficiencies and GPU underutilization.
By using the integrated Hugging Face and SageMaker architecture, businesses can now build and deploy customized, domain-specific models faster—with greater control, cost-efficiency, and scalability.
To get started with your own LLM fine-tuning project, explore the code samples provided in our GitHub repository.

About the Authors
Florent Gbelidji is a Machine Learning Engineer for Customer Success at Hugging Face. Based in Paris, France, Florent joined Hugging Face 3.5 years ago as an ML Engineer in the Expert Acceleration Program, helping companies build solutions with open source AI. He is now the Cloud Partnership Tech Lead for the AWS account, driving integrations between the Hugging Face environment and AWS services.
Bruno Pistone is a Senior Worldwide Generative AI/ML Specialist Solutions Architect at AWS based in Milan, Italy. He works with AWS product teams and large customers to help them fully understand their technical needs and design AI and machine learning solutions that take full advantage of the AWS cloud and Amazon ML stack. His expertise includes distributed training and inference workloads, model customization, generative AI, and end-to-end ML. He enjoys spending time with friends, exploring new places, and traveling to new destinations.
Louise Ping is a Senior Worldwide GenAI Specialist, where she helps partners build go-to-market strategies and leads cross-functional initiatives to expand opportunities and drive adoption. Drawing from her diverse AWS experience across Storage, APN Partner Marketing, and AWS Marketplace, she works closely with strategic partners like Hugging Face to drive technical collaborations. When not working at AWS, she attempts home improvement projects—ideally with limited mishaps.
Safir Alvi is a Worldwide GenAI/ML Go-To-Market Specialist at AWS based in New York. He focuses on advising strategic global customers on scaling their model training and inference workloads on AWS, and driving adoption of Amazon SageMaker AI Training Jobs and Amazon SageMaker HyperPod. He specializes in optimizing and fine-tuning generative AI and machine learning models across diverse industries, including financial services, healthcare, automotive, and manufacturing.

New Relic transforms productivity with generative AI on AWS

New Relic Inc. is a San Francisco-based technology company that pioneered application performance monitoring (APM) and provides comprehensive observability solutions. Serving leading customers worldwide, including major brands like Ryanair, New Relic helps organizations monitor and optimize their digital systems to deliver better customer experiences.
New Relic faced a challenge common to many rapidly growing enterprises. Their engineers were spending valuable time searching through fragmented documentation across multiple systems, with time-consuming internal system queries sometimes taking more than a day. As a leading observability platform supporting thousands of customers worldwide, New Relic knew it needed a more efficient way to access and use organizational knowledge.
This challenge led to the creation of New Relic NOVA (New Relic Omnipresence Virtual Assistant): an innovative artificial intelligence (AI) tool built on Amazon Web Services (AWS). New Relic NOVA has transformed how New Relic employees access and interact with company knowledge and systems.
Working with the Generative AI Innovation Center, New Relic evolved NOVA from a knowledge assistant into a comprehensive productivity engine. New Relic NOVA is built on AWS services including Amazon Bedrock, Amazon Kendra, Amazon Simple Storage Service (Amazon S3), and Amazon DynamoDB. Through Strands Agents, New Relic NOVA provides intelligent code reviews, AI governance, and managed Model Context Protocol (MCP) services.
Amazon Bedrock is a fully managed service that provides access to leading foundation models for building generative AI applications, eliminating the need to manage infrastructure while enabling teams to customize models for their specific use cases. Through a single API, developers can experiment with and evaluate different foundation models, integrate them with enterprise systems, and build secure AI applications at scale.
The solution has reduced information search time while automating complex operational workflows. Through collaboration with the Generative AI Innovation Center, New Relic NOVA was developed into a solution that now processes over 1,000 daily queries across their organization. New Relic NOVA integrates seamlessly with Confluence, GitHub, Salesforce, Slack, and various internal systems, maintaining 80% accuracy in its responses for both knowledge-based queries and transactional tasks.
We will show how New Relic NOVA is architected using AWS services to create a scalable, intelligent assistant that goes beyond document retrieval to handle complex tasks like automated team permission requests and rate limit management. We explore the technical architecture, development journey, and key lessons learned in building an enterprise-grade AI solution that delivers measurable productivity gains at scale.
Solution overview
In designing New Relic NOVA, New Relic established several critical objectives beyond the initial goal of improving documentation search. These included maintaining data security during knowledge retrieval and achieving consistent response quality across different data sources. As shown in Figure 1, New Relic NOVA’s AWS architecture enables seamless interaction between users and various AWS services while maintaining security and scalability. The solution required a flexible framework that could evolve with the organization’s needs for both knowledge retrieval and transactional tasks. A key challenge was balancing these requirements while keeping response times under 20 seconds to maintain user engagement.

Figure 1 – Solution architecture of New Relic NOVA framework
The development team identified several potential risks early in the project. These included the possibility of exposing sensitive information through AI responses, maintaining accuracy when retrieving from multiple data sources, and ensuring system reliability at enterprise scale. Figure 2 illustrates New Relic NOVA’s detailed agent workflow, demonstrating how queries are processed and routed through various specialized agents to address user intentions. Additionally, the team implemented comprehensive security controls, which included personally identifiable information (PII) detection and masking, along with a robust evaluation framework to monitor and maintain response quality.

Figure 2 – New Relic NOVA agent workflow architecture
The project also revealed opportunities for future optimization. These include expanding an agent hierarchy architecture to support additional automated workflows and developing more sophisticated analytics for tracking user interaction patterns. The team’s experience suggests that organizations undertaking similar projects should focus on establishing clear evaluation metrics early and building flexible architectures that can accommodate evolving business needs.
Solution
New Relic NOVA was developed over an eight-week period, involving a collaborative effort between internal engineering, security, legal, and compliance teams and the AWS Generative AI Innovation Center. This partnership enabled rapid development and iteration, leveraging AWS expertise in large-scale AI implementations.
Agent architecture
The New Relic NOVA architecture consists of three key layers:

Main agent layer – This acts as a controllable orchestrator for executing different workflows, identifying the user intent and delegating work to the following downstream layers:

Retrieval Augmented Generation (RAG) with customized ingested knowledge from Amazon Bedrock Knowledge Bases or Amazon Kendra.
Agents for direct interaction with third-party platforms.
Customized agents for handling internal New Relic tasks.
Fallback handling if users’ responses cannot be determined.

Data source layers (vector DB, enrich, data sources) – These layers represent resources where internal knowledge (for example, New Relic standards documentation and code repository documentation) is ingested for retrieval or RAG purposes. The benefit of these custom resources is enhanced information retrieval and search performance for user information requests.
Agents layer – Comprises two distinct agent types:

Strands Agents with MCP: Handle multi-step processes for third-party services, leveraging MCP for standardized service interactions.
Custom action agents: Execute New Relic-specific tasks such as permission requests and service limit modifications, providing precise control over internal systems.

A central agent acts as an orchestrator, routing queries to specialized sub-agents in a delegation model where responses flow directly back to the user rather than requiring inter-agent reasoning or adjustments. Meanwhile, Strands Agents are used to efficiently manage third-party service integrations using MCP. This approach gives New Relic NOVA the best of both worlds: the orchestration model maintains flexibility for internal processes while standardizing external services through MCP, creating a scalable foundation for New Relic regarding future automation needs.
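Conceptually, the delegation model reduces to an intent classifier in front of a registry of sub-agents. The sketch below is a simplified, hypothetical illustration of that routing pattern, not New Relic's implementation (which uses Strands Agents with MCP and custom action agents); the classifier and sub-agents are placeholders.

# Placeholder sub-agent implementations; in the real system these would call
# Bedrock Knowledge Bases / Kendra, Strands Agents with MCP, or internal APIs.
def rag_answer(query):
    return f"[knowledge answer for: {query}]"

def mcp_agent(query):
    return f"[third-party action for: {query}]"

def action_agent(query):
    return f"[internal action for: {query}]"

def classify_intent(query):
    # In production this would be an LLM call; a trivial keyword heuristic stands in here.
    lowered = query.lower()
    if "permission" in lowered or "rate limit" in lowered:
        return "internal_action"
    if "jira" in lowered or "salesforce" in lowered:
        return "third_party"
    return "knowledge"

SUB_AGENTS = {
    "knowledge": rag_answer,
    "third_party": mcp_agent,
    "internal_action": action_agent,
}

def orchestrate(query):
    handler = SUB_AGENTS.get(classify_intent(query))
    if handler is None:
        return "I couldn't determine how to help with that."  # fallback handling
    return handler(query)  # the sub-agent's response flows directly back to the user

print(orchestrate("Please grant my team permission to the billing dashboard"))
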
Data integration strategy
The power lies in the ability of New Relic NOVA to seamlessly integrate multiple data sources, providing a unified interface for knowledge retrieval. This approach includes:

Amazon Bedrock Knowledge Bases for Confluence: Provides direct synchronization with Confluence spaces and maintains up-to-date information.
Amazon Kendra for GitHub Enterprise: Indexes and searches GitHub repositories, providing quick access to code documentation.
Strands Agents for Salesforce and Jira: Custom agents execute SOQL and JQL queries, respectively, to fetch relevant data from their respective platforms (Salesforce and Jira).
Amazon Q Index for Slack: Uses Amazon Q Index capabilities to implement a RAG solution for Slack channel history, chosen for its rapid development potential.

A unique aspect of the data integration of New Relic NOVA is the custom document enrichment process. During ingestion, documents are enhanced with metadata, keywords, and summaries, significantly improving retrieval relevance and accuracy.
Using Amazon Nova models
Amazon Nova is AWS’s new generation of foundation models designed to deliver frontier intelligence with industry-leading price performance for enterprise use cases. The Amazon Nova family of models can process diverse inputs including text, images, and video, excelling in tasks from interactive chat to document analysis, while supporting advanced capabilities like RAG systems and AI agent workflows.
To optimize performance and cost-efficiency, New Relic NOVA utilizes Amazon Nova Lite and Pro models through Amazon Bedrock. These models were carefully selected to balance response quality with latency, enabling New Relic NOVA to maintain sub-20 second response times while processing complex queries. Amazon Bedrock provides access to diverse foundation model families, and its standardized framework and prompt optimization support seamless switching between models without code changes. This allows New Relic NOVA to optimize for speed with Amazon Nova Lite or, for more complex queries, switch to Amazon Nova Pro while maintaining consistent performance and cost efficiency.
Advanced RAG implementation
New Relic NOVA employs a sophisticated RAG approach, utilizing Amazon Bedrock Knowledge Bases, Amazon Kendra, and Amazon Q Index. To maximize retrieval accuracy, New Relic NOVA implements several key optimization techniques:

Hierarchical chunking: Amazon Bedrock Knowledge Bases employs hierarchical chunking, a method proven most effective through extensive experimentation with various chunking methodologies.
Context enrichment: A custom AWS Lambda function enhances chunks during knowledge base ingestion, incorporating relevant keywords and contextual information. This process is particularly valuable for code-related content, where structural and semantic cues significantly impact retrieval performance.
Metadata integration: During knowledge base document ingestion, additional context, such as summaries, titles, authors, creation dates, and last modified dates, is appended as document metadata. This enriched metadata enhances the quality and relevance of retrieved information.
Custom document processing: For specific data sources like GitHub repositories, tailored document processing techniques are applied to preserve code structure and improve search relevance.

These techniques work in concert to optimize the RAG system within New Relic NOVA, delivering highly accurate retrieval across varied document types while minimizing development effort through existing connectors. The combination of hierarchical chunking, context enrichment, metadata integration, and custom document processing enables New Relic NOVA to provide precise, context-aware responses regardless of the data source or document format.
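As a rough illustration of the first technique, hierarchical chunking is configured when a knowledge base data source is created. The sketch below assumes a boto3 bedrock-agent client and uses placeholder identifiers and token sizes; consult the Amazon Bedrock documentation for the exact fields supported in your Region.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Placeholder knowledge base ID, bucket, and chunk sizes for illustration only
response = bedrock_agent.create_data_source(
    knowledgeBaseId="KB_ID",
    name="confluence-export",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::example-kb-bucket"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # Parent chunks preserve broad context; child chunks are embedded for retrieval
                "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
                "overlapTokens": 60,
            },
        }
        # A custom transformation (Lambda) for document enrichment can also be attached here
    },
)
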
Evaluation framework
New Relic NOVA implements a comprehensive evaluation framework, leveraging Amazon Bedrock foundation models for its large language model (LLM)-as-a-judge approach, along with validation datasets that combine questions, ground truth answers, and source document URLs. This evaluation framework, which can be executed on-demand in development environments, encompasses three critical metrics for system validation:

Answer accuracy measurement utilizes a 1–5 discrete scale rating system, where the LLM evaluates the generated response’s factual alignment with the established ground truth data.
Context relevance assessment uses a 1–5 scale, analyzing the retrieved context’s relevance to the user query.
Response latency tracking measures workflow performance, from initial query input to final answer generation, ensuring optimal user experience through comprehensive timing analysis.

This triple-metric evaluation approach supports detailed performance optimization across the New Relic NOVA solution core functionalities.
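A minimal sketch of the answer-accuracy judge follows, using the Amazon Bedrock Converse API with an illustrative model ID and a simplified prompt; the production rubric and output parsing are more elaborate.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def judge_accuracy(question, ground_truth, candidate, model_id="amazon.nova-pro-v1:0"):
    # Ask the judge model for a 1-5 factual-alignment score (illustrative prompt)
    prompt = (
        "Rate how factually aligned the candidate answer is with the ground truth on a 1-5 scale.\n"
        f"Question: {question}\nGround truth: {ground_truth}\nCandidate: {candidate}\n"
        "Respond with only the number."
    )
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    # The judge's score comes back as plain text
    return response["output"]["message"]["content"][0]["text"].strip()
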
Observability and continuous improvements
The solution includes a comprehensive observability framework that collects metrics and analyzes user feedback. Metric and feedback collection is implemented through New Relic AI monitoring solutions. Feedback is gathered through the Slack reaction feature (emoji responses), so users can quickly provide feedback on New Relic NOVA responses. These reactions are captured by a New Relic Python agent and sent to the https://one.newrelic.com/ domain. The feedback collection system provides valuable insights for:

Measuring user satisfaction with responses.
Identifying areas where accuracy can be improved.
Understanding usage patterns across different teams.
Tracking the effectiveness of different types of queries.
Monitoring the performance of various data sources.
Tracing each LLM call and latency.

The collected feedback data can be analyzed using AWS analytics services such as AWS Glue for ETL processing, Amazon Athena for querying, and Amazon QuickSight for visualization. This data-driven approach enables continuous improvement of New Relic NOVA and helps prioritize future enhancements based on actual user interactions.
Internal teams are already experiencing the advantages of New Relic NOVA. Figure 3 showcases some of the responses captured by the Slack feedback process.

Figure 3 – Users Slack message exchanges about New Relic NOVA experience
Considerations and next steps
The success of New Relic NOVA highlights several key learnings for organizations looking to implement similar solutions:

Start with a clear understanding of user pain points and measurable success criteria.
Implement robust data integration strategies with custom document enrichment.
Use the generative AI services and foundation models that best fit your use cases to achieve optimal results.
Build in feedback mechanisms from the start to enable continuous improvement.
Focus on both speed and accuracy to ensure user adoption.

In terms of next steps, New Relic NOVA is evolving from a standalone solution into a comprehensive enterprise AI platform by integrating cutting-edge AWS technologies and open-source frameworks. In the future, New Relic anticipates leveraging Amazon S3 Vectors. It offers up to 90% cost reduction for vector storage and querying compared to conventional approaches, enabling the handling of massive-scale AI workloads more efficiently. New Relic is looking to explore Amazon Bedrock AgentCore for enterprise-grade security, memory management, and scalable AI agent deployment, supporting robust production capabilities.
Additionally, New Relic is exploring Strands Agent Workflows, an open-source SDK that streamlines building AI agents from simple conversational assistants to complex autonomous workflows. This technology stack positions New Relic NOVA to deliver enterprise-ready AI solutions that scale seamlessly while maintaining cost efficiency and developer productivity.
Conclusion
The journey of creating New Relic NOVA demonstrates how enterprises can use the generative AI services of AWS to transform organizational productivity. Through the integration of Amazon Bedrock, Amazon Kendra, and other AWS services, New Relic created an AI assistant that transforms their internal operations. Working with the Generative AI Innovation Center of AWS, New Relic achieved a 95% reduction in information search time across their organization while automating complex operational workflows.
Learn more about transforming your business with generative AI by visiting the Generative AI Innovation Center or speak with an AWS Partner Specialist or AWS Representative to know how we can help accelerate your business.
Further reading

Building generative AI applications on AWS – AWS Classroom Training
Generative AI Lens – AWS Well-Architected Framework – Gain a deep understanding of how to design, deploy, and operate generative AI applications on AWS effectively
Build an end-to-end RAG solution using Amazon Bedrock Knowledge Bases and AWS CloudFormation
Open Protocols for Agent Interoperability Part 1: Inter-Agent Communication on MCP

About the authors
Yicheng Shen is a lead software engineer for New Relic NOVA, where he focuses on developing gen AI and agentic solutions that transform how businesses understand their application performance. When he’s not building intelligent systems, you’ll find him exploring the outdoors with his family and their dog.
Sarathy Varadarajan, Senior Director of Engineering at New Relic, drives AI-first transformation and developer productivity, aiming for tenfold gains via intelligent automation and enterprise AI. He scaled engineering teams from 15 to over 350 in Bangalore and Hyderabad. He enjoys family time and volleyball.
Joe King is an AWS Senior Data Scientist at the Generative AI Innovation Center, where he helps organizations architect and implement cutting-edge generative AI solutions. With deep expertise in science, engineering, and AI/ML architecture, he specializes in transforming complex generative AI use cases into scalable solutions on AWS.
Priyashree Roy is an AWS data scientist at the Generative AI Innovation Center, where she applies her deep expertise in machine learning and generative AI to build cutting-edge solutions for AWS strategic customers. With a PhD in experimental particle physics, she brings a rigorous scientific approach to solving complex real-world problems through advanced AI technologies.
Gene Su is an AWS Data Scientist at the Generative AI Innovation Center, specializing in generative AI solutions for finance, retail, and other industries. He uses his expertise in large language models (LLMs) to deliver generative AI applications on AWS.
Dipanshu Jain is a generative AI Strategist at AWS, helping unlock the potential of gen AI through strategic advisory and tailored solution development. He specializes in identifying high-impact generative AI use cases, shaping execution roadmaps, and guiding cross-functional teams through proofs of concept, from discovery to production.
Ameer Hakme is an AWS Solutions Architect who collaborates with Independent Software Vendors (ISVs) in the Northeast region, helping them design and build scalable, modern platforms on the AWS Cloud. An expert in AI/ML and generative AI, Ameer helps customers unlock the potential of these cutting-edge technologies. In his leisure time, he enjoys riding his motorcycle and spending quality time with his family.

How to Design Production-Grade Mock Data Pipelines Using Polyfactory w …

In this tutorial, we walk through an advanced, end-to-end exploration of Polyfactory, focusing on how we can generate rich, realistic mock data directly from Python type hints. We start by setting up the environment and progressively build factories for data classes, Pydantic models, and attrs-based classes, while demonstrating customization, overrides, calculated fields, and the generation of nested objects. As we move through each snippet, we show how we can control randomness, enforce constraints, and model real-world structures, making this tutorial directly applicable to testing, prototyping, and data-driven development workflows. Check out the FULL CODES here.

import subprocess
import sys

def install_package(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

packages = [
    "polyfactory",
    "pydantic",
    "email-validator",
    "faker",
    "msgspec",
    "attrs",
]

for package in packages:
    try:
        install_package(package)
        print(f"✓ Installed {package}")
    except Exception as e:
        print(f"✗ Failed to install {package}: {e}")

print("\n")

print("=" * 80)
print("SECTION 2: Basic Dataclass Factories")
print("=" * 80)

from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime, date
from uuid import UUID
from polyfactory.factories import DataclassFactory

@dataclass
class Address:
    street: str
    city: str
    country: str
    zip_code: str

@dataclass
class Person:
    id: UUID
    name: str
    email: str
    age: int
    birth_date: date
    is_active: bool
    address: Address
    phone_numbers: List[str]
    bio: Optional[str] = None

# A factory needs nothing more than the target type: Polyfactory infers
# generators for every field (including the nested Address) from type hints.
class PersonFactory(DataclassFactory[Person]):
    pass

person = PersonFactory.build()
print("Generated Person:")
print(f"  ID: {person.id}")
print(f"  Name: {person.name}")
print(f"  Email: {person.email}")
print(f"  Age: {person.age}")
print(f"  Address: {person.address.city}, {person.address.country}")
print(f"  Phone Numbers: {person.phone_numbers[:2]}")
print()

people = PersonFactory.batch(5)
print(f"Generated {len(people)} people:")
for i, p in enumerate(people, 1):
    print(f"  {i}. {p.name} - {p.email}")
print("\n")

We set up the environment and ensure all required dependencies are installed. We also introduce the core idea of using Polyfactory to generate mock data from type hints. By initializing the basic dataclass factories, we establish the foundation for all subsequent examples.

print("=" * 80)
print("SECTION 3: Customizing Factory Behavior")
print("=" * 80)

from faker import Faker
from polyfactory.fields import Use, Ignore

@dataclass
class Employee:
    employee_id: str
    full_name: str
    department: str
    salary: float
    hire_date: date
    is_manager: bool
    email: str
    internal_notes: Optional[str] = None

class EmployeeFactory(DataclassFactory[Employee]):
    __faker__ = Faker(locale="en_US")
    __random_seed__ = 42  # seed randomness for reproducible output across runs

    @classmethod
    def employee_id(cls) -> str:
        return f"EMP-{cls.__random__.randint(10000, 99999)}"

    @classmethod
    def full_name(cls) -> str:
        return cls.__faker__.name()

    @classmethod
    def department(cls) -> str:
        departments = ["Engineering", "Marketing", "Sales", "HR", "Finance"]
        return cls.__random__.choice(departments)

    @classmethod
    def salary(cls) -> float:
        return round(cls.__random__.uniform(50000, 150000), 2)

    @classmethod
    def email(cls) -> str:
        return cls.__faker__.company_email()

employees = EmployeeFactory.batch(3)
print("Generated Employees:")
for emp in employees:
    print(f"  {emp.employee_id}: {emp.full_name}")
    print(f"    Department: {emp.department}")
    print(f"    Salary: ${emp.salary:,.2f}")
    print(f"    Email: {emp.email}")
    print()
print()

print("=" * 80)
print("SECTION 4: Field Constraints and Calculated Fields")
print("=" * 80)

@dataclass
class Product:
    product_id: str
    name: str
    description: str
    price: float
    discount_percentage: float
    stock_quantity: int
    final_price: Optional[float] = None
    sku: Optional[str] = None

class ProductFactory(DataclassFactory[Product]):
    @classmethod
    def product_id(cls) -> str:
        return f"PROD-{cls.__random__.randint(1000, 9999)}"

    @classmethod
    def name(cls) -> str:
        adjectives = ["Premium", "Deluxe", "Classic", "Modern", "Eco"]
        nouns = ["Widget", "Gadget", "Device", "Tool", "Appliance"]
        return f"{cls.__random__.choice(adjectives)} {cls.__random__.choice(nouns)}"

    @classmethod
    def price(cls) -> float:
        return round(cls.__random__.uniform(10.0, 1000.0), 2)

    @classmethod
    def discount_percentage(cls) -> float:
        return round(cls.__random__.uniform(0, 30), 2)

    @classmethod
    def stock_quantity(cls) -> int:
        return cls.__random__.randint(0, 500)

    @classmethod
    def build(cls, **kwargs):
        # Post-process the generated instance to derive dependent fields.
        instance = super().build(**kwargs)
        if instance.final_price is None:
            instance.final_price = round(
                instance.price * (1 - instance.discount_percentage / 100), 2
            )
        if instance.sku is None:
            name_part = instance.name.replace(" ", "-").upper()[:10]
            instance.sku = f"{instance.product_id}-{name_part}"
        return instance

products = ProductFactory.batch(3)
print("Generated Products:")
for prod in products:
    print(f"  {prod.sku}")
    print(f"    Name: {prod.name}")
    print(f"    Price: ${prod.price:.2f}")
    print(f"    Discount: {prod.discount_percentage}%")
    print(f"    Final Price: ${prod.final_price:.2f}")
    print(f"    Stock: {prod.stock_quantity} units")
    print()
print()

We then customize factory behavior by attaching a Faker instance, seeding randomness for reproducibility, and defining classmethod generators for fields such as employee IDs, departments, and salaries. Building on that, we add constrained and calculated fields, overriding build() so that final prices and SKUs are derived from the values Polyfactory has already generated.

print("=" * 80)
print("SECTION 6: Complex Nested Structures")
print("=" * 80)

from enum import Enum

class OrderStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    CANCELLED = "cancelled"

@dataclass
class OrderItem:
    product_name: str
    quantity: int
    unit_price: float
    total_price: Optional[float] = None

@dataclass
class ShippingInfo:
    carrier: str
    tracking_number: str
    estimated_delivery: date

@dataclass
class Order:
    order_id: str
    customer_name: str
    customer_email: str
    status: OrderStatus
    items: List[OrderItem]
    order_date: datetime
    shipping_info: Optional[ShippingInfo] = None
    total_amount: Optional[float] = None
    notes: Optional[str] = None

class OrderItemFactory(DataclassFactory[OrderItem]):
    @classmethod
    def product_name(cls) -> str:
        products = ["Laptop", "Mouse", "Keyboard", "Monitor", "Headphones",
                    "Webcam", "USB Cable", "Phone Case", "Charger", "Tablet"]
        return cls.__random__.choice(products)

    @classmethod
    def quantity(cls) -> int:
        return cls.__random__.randint(1, 5)

    @classmethod
    def unit_price(cls) -> float:
        return round(cls.__random__.uniform(5.0, 500.0), 2)

    @classmethod
    def build(cls, **kwargs):
        instance = super().build(**kwargs)
        if instance.total_price is None:
            instance.total_price = round(instance.quantity * instance.unit_price, 2)
        return instance

class ShippingInfoFactory(DataclassFactory[ShippingInfo]):
    @classmethod
    def carrier(cls) -> str:
        carriers = ["FedEx", "UPS", "DHL", "USPS"]
        return cls.__random__.choice(carriers)

    @classmethod
    def tracking_number(cls) -> str:
        return "".join(cls.__random__.choices("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=12))

class OrderFactory(DataclassFactory[Order]):
    @classmethod
    def order_id(cls) -> str:
        return f"ORD-{datetime.now().year}-{cls.__random__.randint(100000, 999999)}"

    @classmethod
    def items(cls) -> List[OrderItem]:
        # Delegate nested object creation to the OrderItem factory.
        return OrderItemFactory.batch(cls.__random__.randint(1, 5))

    @classmethod
    def build(cls, **kwargs):
        instance = super().build(**kwargs)
        if instance.total_amount is None:
            instance.total_amount = round(sum(item.total_price for item in instance.items), 2)
        # Only shipped or delivered orders get shipping information.
        if instance.shipping_info is None and instance.status in [OrderStatus.SHIPPED, OrderStatus.DELIVERED]:
            instance.shipping_info = ShippingInfoFactory.build()
        return instance

orders = OrderFactory.batch(2)
print("Generated Orders:")
for order in orders:
    print(f"\n  Order {order.order_id}")
    print(f"  Customer: {order.customer_name} ({order.customer_email})")
    print(f"  Status: {order.status.value}")
    print(f"  Items ({len(order.items)}):")
    for item in order.items:
        print(f"    - {item.quantity}x {item.product_name} @ ${item.unit_price:.2f} = ${item.total_price:.2f}")
    print(f"  Total: ${order.total_amount:.2f}")
    if order.shipping_info:
        print(f"  Shipping: {order.shipping_info.carrier} - {order.shipping_info.tracking_number}")
print("\n")

We compose factories into complex nested structures: the order factory delegates line items to their own factory, derives the order total from the generated items, and attaches shipping details only when the order status calls for them. This lets us model realistic business rules directly inside our test data generators.

print("=" * 80)
print("SECTION 7: Attrs Integration")
print("=" * 80)

import attrs
from polyfactory.factories.attrs_factory import AttrsFactory

@attrs.define
class BlogPost:
    title: str
    author: str
    content: str
    views: int = 0
    likes: int = 0
    published: bool = False
    published_at: Optional[datetime] = None
    tags: List[str] = attrs.field(factory=list)

class BlogPostFactory(AttrsFactory[BlogPost]):
    @classmethod
    def title(cls) -> str:
        templates = [
            "10 Tips for {}",
            "Understanding {}",
            "The Complete Guide to {}",
            "Why {} Matters",
            "Getting Started with {}",
        ]
        topics = ["Python", "Data Science", "Machine Learning", "Web Development", "DevOps"]
        template = cls.__random__.choice(templates)
        topic = cls.__random__.choice(topics)
        return template.format(topic)

    @classmethod
    def content(cls) -> str:
        return " ".join(Faker().sentences(nb=cls.__random__.randint(3, 8)))

    @classmethod
    def views(cls) -> int:
        return cls.__random__.randint(0, 10000)

    @classmethod
    def likes(cls) -> int:
        return cls.__random__.randint(0, 1000)

    @classmethod
    def tags(cls) -> List[str]:
        all_tags = ["python", "tutorial", "beginner", "advanced", "guide",
                    "tips", "best-practices", "2024"]
        return cls.__random__.sample(all_tags, k=cls.__random__.randint(2, 5))

posts = BlogPostFactory.batch(3)
print("Generated Blog Posts:")
for post in posts:
    print(f"\n  '{post.title}'")
    print(f"  Author: {post.author}")
    print(f"  Views: {post.views:,} | Likes: {post.likes:,}")
    print(f"  Published: {post.published}")
    print(f"  Tags: {', '.join(post.tags)}")
    print(f"  Preview: {post.content[:100]}...")
print("\n")

print("=" * 80)
print("SECTION 8: Building with Specific Overrides")
print("=" * 80)

# Keyword arguments passed to build() pin specific fields while the rest
# remain randomly generated.
custom_person = PersonFactory.build(
    name="Alice Johnson",
    age=30,
    email="alice@example.com"
)
print("Custom Person:")
print(f"  Name: {custom_person.name}")
print(f"  Age: {custom_person.age}")
print(f"  Email: {custom_person.email}")
print(f"  ID (auto-generated): {custom_person.id}")
print()

vip_customers = PersonFactory.batch(
    3,
    bio="VIP Customer"
)
print("VIP Customers:")
for customer in vip_customers:
    print(f"  {customer.name}: {customer.bio}")
print("\n")

We extend Polyfactory to attrs-based classes with AttrsFactory, generating titles, tags, and content through the same classmethod pattern, and then show how build() and batch() accept keyword overrides so that specific fields stay fixed while everything else remains random. This keeps our mock data compatible with real application schemas while letting us pin the values a given test cares about.

print("=" * 80)
print("SECTION 9: Field-Level Control with Use and Ignore")
print("=" * 80)

from polyfactory.fields import Use, Ignore

@dataclass
class Configuration:
    app_name: str
    version: str
    debug: bool
    created_at: datetime
    api_key: str
    secret_key: str

class ConfigFactory(DataclassFactory[Configuration]):
    # Use() pins a field to the value returned by the given callable;
    # Ignore() (imported above) would instead skip generation for a field
    # that already has a default.
    app_name = Use(lambda: "MyAwesomeApp")
    version = Use(lambda: "1.0.0")
    debug = Use(lambda: False)

    @classmethod
    def api_key(cls) -> str:
        return f"api_key_{''.join(cls.__random__.choices('0123456789abcdef', k=32))}"

    @classmethod
    def secret_key(cls) -> str:
        return f"secret_{''.join(cls.__random__.choices('0123456789abcdef', k=64))}"

configs = ConfigFactory.batch(2)
print("Generated Configurations:")
for config in configs:
    print(f"  App: {config.app_name} v{config.version}")
    print(f"  Debug: {config.debug}")
    print(f"  API Key: {config.api_key[:20]}...")
    print(f"  Created: {config.created_at}")
    print()
print()

print("=" * 80)
print("SECTION 10: Model Coverage Testing")
print("=" * 80)

from pydantic import BaseModel, ConfigDict
from polyfactory.factories.pydantic_factory import ModelFactory

class PaymentMethod(BaseModel):
    model_config = ConfigDict(use_enum_values=True)
    type: str
    card_number: Optional[str] = None
    bank_name: Optional[str] = None
    verified: bool = False

class PaymentMethodFactory(ModelFactory[PaymentMethod]):
    __model__ = PaymentMethod

# Build one instance per variant we want covered in tests.
payment_methods = [
    PaymentMethodFactory.build(type="card", card_number="4111111111111111"),
    PaymentMethodFactory.build(type="bank", bank_name="Chase Bank"),
    PaymentMethodFactory.build(verified=True),
]

print("Payment Method Coverage:")
for i, pm in enumerate(payment_methods, 1):
    print(f"  {i}. Type: {pm.type}")
    if pm.card_number:
        print(f"     Card: {pm.card_number}")
    if pm.bank_name:
        print(f"     Bank: {pm.bank_name}")
    print(f"     Verified: {pm.verified}")
print("\n")

print("=" * 80)
print("TUTORIAL SUMMARY")
print("=" * 80)
print("""
This tutorial covered:

1. ✓ Basic Dataclass Factories - Simple mock data generation
2. ✓ Custom Field Generators - Controlling individual field values
3. ✓ Field Constraints - Using PostGenerated for calculated fields
4. ✓ Pydantic Integration - Working with validated models
5. ✓ Complex Nested Structures - Building related objects
6. ✓ Attrs Support - Alternative to dataclasses
7. ✓ Build Overrides - Customizing specific instances
8. ✓ Use and Ignore - Explicit field control
9. ✓ Coverage Testing - Ensuring comprehensive test data

Key Takeaways:
- Polyfactory automatically generates mock data from type hints
- Customize generation with classmethods and decorators
- Supports multiple libraries: dataclasses, Pydantic, attrs, msgspec
- Use PostGenerated for calculated/dependent fields
- Override specific values while keeping others random
- Perfect for testing, development, and prototyping

For more information:
- Documentation: https://polyfactory.litestar.dev/
- GitHub: https://github.com/litestar-org/polyfactory
""")
print("=" * 80)

We cover advanced usage patterns such as explicit overrides, constant field values, and coverage testing scenarios. We show how we can intentionally construct edge cases and variant instances for robust testing. This final step ties everything together by demonstrating how Polyfactory supports comprehensive and production-grade test data strategies.
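One additional pattern is worth a quick sketch, since the Pydantic integration is only touched on in Section 10: ModelFactory also honors Pydantic field constraints, so generated instances pass validation without any custom field generators. The Review model and its bounds below are illustrative assumptions for this example, not part of the original tutorial code.

# Sketch: Polyfactory respects Pydantic Field constraints, so every generated
# instance is valid out of the box. The model below is illustrative only.
from pydantic import BaseModel, Field
from polyfactory.factories.pydantic_factory import ModelFactory

class Review(BaseModel):
    rating: int = Field(ge=1, le=5)                    # bounded integer
    helpful_votes: int = Field(ge=0)                   # non-negative count
    title: str = Field(min_length=3, max_length=80)    # length-constrained string

class ReviewFactory(ModelFactory[Review]):
    __model__ = Review

reviews = ReviewFactory.batch(5)
assert all(1 <= r.rating <= 5 for r in reviews)
print([(r.rating, r.helpful_votes) for r in reviews])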

In conclusion, we demonstrated how Polyfactory enables us to create comprehensive, flexible test data with minimal boilerplate while still retaining fine-grained control over every field. We showed how to handle simple entities, complex nested structures, and Pydantic model validation, as well as explicit field overrides, within a single, consistent factory-based approach. Overall, we found that Polyfactory enables us to move faster and test more confidently, as it reliably generates realistic datasets that closely mirror production-like scenarios without sacrificing clarity or maintainability.

Check out the FULL CODES here.
The post How to Design Production-Grade Mock Data Pipelines Using Polyfactory with Dataclasses, Pydantic, Attrs, and Nested Models appeared first on MarkTechPost.

ByteDance Releases Protenix-v1: A New Open-Source Model Achieving AF3- …

How close can an open model get to AlphaFold3-level accuracy when it matches training data, model scale and inference budget? ByteDance has introduced Protenix-v1, a comprehensive AlphaFold3 (AF3) reproduction for biomolecular structure prediction, released with code and model parameters under Apache 2.0. The model targets AF3-level performance across protein, DNA, RNA and ligand structures while keeping the entire stack open and extensible for research and production.

The core release also ships with PXMeter v1.0.0, an evaluation toolkit and dataset suite for transparent benchmarking on more than 6k complexes with time-split and domain-specific subsets.

What is Protenix-v1?

Protenix is described as 'Protenix: Protein + X', a foundation model for high-accuracy biomolecular structure prediction. It predicts all-atom 3D structures for complexes that can include:

Proteins

Nucleic acids (DNA and RNA)

Small-molecule ligands

The research team defines Protenix as a comprehensive AF3 reproduction. It re-implements the AF3-style diffusion architecture for all-atom complexes and exposes it in a trainable PyTorch codebase.

The project is released as a full stack:

Training and inference code

Pre-trained model weights

Data and MSA pipelines

A browser-based Protenix Web Server for interactive use

AF3-level performance under matched constraints

According to the research team, Protenix-v1 (protenix_base_default_v1.0.0) is 'the first fully open-source model that outperforms AlphaFold3 across diverse benchmark sets while adhering to the same training data cutoff, model scale, and inference budget as AlphaFold3.'

The important constraints are:

Training data cutoff: 2021-09-30, aligned with AF3’s PDB cutoff.

Model scale: Protenix-v1 has 368M parameters; the comparison is made at a comparable model scale to AF3, whose exact parameter count has not been disclosed.

Inference budget: comparisons use similar sampling budgets and runtime constraints.

Source: https://github.com/bytedance/Protenix

On challenging targets such as antigen–antibody complexes, increasing the number of sampled candidates from several to hundreds yields consistent log-linear improvements in accuracy. This gives a clear and documented inference-time scaling behavior rather than a single fixed operating point.

PXMeter v1.0.0: Evaluation for 6k+ complexes

To support these claims, the research team released PXMeter v1.0.0, an open-source toolkit for reproducible structure prediction benchmarks.

PXMeter provides:

A manually curated benchmark dataset, with non-biological artifacts and problematic entries removed

Time-split and domain-specific subsets (for example, antibody–antigen, protein–RNA, ligand complexes)

A unified evaluation framework that computes metrics such as complex LDDT and DockQ across models

The associated PXMeter research paper, 'Revisiting Structure Prediction Benchmarks with PXMeter', evaluates Protenix, AlphaFold3, Boltz-1 and Chai-1 on the same curated tasks, and shows how different dataset designs affect model ranking and perceived performance.

How does Protenix fit into the broader stack?

Protenix is part of a small ecosystem of related projects:

PXDesign: a binder design suite built on the Protenix foundation model. It reports 20–73% experimental hit rates and 2–6× higher success than methods such as AlphaProteo and RFdiffusion, and is accessible via the Protenix Server.

Protenix-Dock: a classical protein–ligand docking framework that uses empirical scoring functions rather than deep nets, tuned for rigid docking tasks.

Protenix-Mini and follow-on work such as Protenix-Mini+: lightweight variants that reduce inference cost using architectural compression and few-step diffusion samplers, while keeping accuracy within a few percent of the full model on standard benchmarks.

Together, these components cover structure prediction, docking, and design, and share interfaces and formats, which simplifies integration into downstream pipelines.

Key Takeaways

AF3-class, fully open model: Protenix-v1 is an AF3-style all-atom biomolecular structure predictor with open code and weights under Apache 2.0, targeting proteins, DNA, RNA and ligands.

Strict AF3 alignment for fair comparison: Protenix-v1 matches AlphaFold3 on the critical axes of training data cutoff (2021-09-30), model scale class, and inference budget, enabling fair AF3-level performance claims.

Transparent benchmarking with PXMeter v1.0.0: PXMeter provides a curated benchmark suite over 6k+ complexes with time-split and domain-specific subsets plus unified metrics (for example, complex LDDT, DockQ) for reproducible evaluation.

Verified inference-time scaling behavior: Protenix-v1 shows log-linear accuracy gains as the number of sampled candidates increases, giving a documented latency–accuracy trade-off rather than a single fixed operating point.

Check out the Repo and Try it here.
The post ByteDance Releases Protenix-v1: A New Open-Source Model Achieving AF3-Level Performance in Biomolecular Structure Prediction appeared first on MarkTechPost.