Tencent Hunyuan Releases HunyuanOCR: a 1B Parameter End to End OCR Expert VLM

Tencent Hunyuan has released HunyuanOCR, a 1B parameter vision language model that is specialized for OCR and document understanding. The model is built on Hunyuan’s native multimodal architecture and runs spotting, parsing, information extraction, visual question answering, and text image translation through a single end to end pipeline.

HunyuanOCR is a lightweight alternative to general VLMs such as Gemini 2.5 and Qwen3 VL that still matches or surpasses them on OCR centric tasks. It targets production use cases like document parsing, card and receipt extraction, video subtitle extraction, and multilingual document translation.

https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/HunyuanOCR_Technical_Report.pdf

Architecture, Native Resolution ViT plus Lightweight LLM

HunyuanOCR uses three main modules: a Native Resolution Visual Encoder called Hunyuan ViT, an Adaptive MLP Connector, and a Lightweight Language Model. The encoder is based on SigLIP-v2-400M and is extended to support arbitrary input resolutions through adaptive patching that preserves the original aspect ratio. Images are split into patches according to their native proportions and processed with global attention, which improves recognition on long text lines, long documents, and low quality scans.
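To make the native resolution idea concrete, here is a minimal, illustrative sketch of how an aspect ratio preserving patch grid can be chosen under a token budget. The patch size, budget, and rounding rule are assumptions for illustration, not values from the HunyuanOCR report.

import math

def native_resolution_grid(height, width, patch=14, max_patches=4096):
    # Round each side to a multiple of the patch size, keeping the aspect ratio.
    h = max(patch, round(height / patch) * patch)
    w = max(patch, round(width / patch) * patch)
    # If the grid exceeds the token budget, shrink both sides by the same factor.
    if (h // patch) * (w // patch) > max_patches:
        scale = math.sqrt(max_patches * patch * patch / (h * w))
        h = max(patch, int(h * scale) // patch * patch)
        w = max(patch, int(w * scale) // patch * patch)
    return h // patch, w // patch  # patch grid fed to the ViT with global attention

print(native_resolution_grid(896, 2304))  # a wide receipt keeps an elongated grid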

The Adaptive MLP Connector performs learnable pooling on the spatial dimension. It compresses the dense visual tokens into a shorter sequence, while keeping information from text dense regions. This reduces sequence length passed to the language model and lowers compute, while preserving OCR relevant details.
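The following sketch shows one way such a connector could pool a grid of visual tokens. The 2x2 window, dimensions, and random weights are assumptions used only to illustrate the token reduction, not the actual HunyuanOCR configuration.

import numpy as np

def adaptive_mlp_connector(vis_tokens, window=2, out_dim=1024, rng=np.random.default_rng(0)):
    # Merge each window x window group of visual tokens into one token via a
    # learnable linear map (weights are random here, trained in practice).
    H, W, C = vis_tokens.shape
    W_proj = rng.standard_normal((window * window * C, out_dim)) * 0.02  # learnable weights
    h, w = H // window, W // window
    grouped = vis_tokens[:h * window, :w * window].reshape(h, window, w, window, C)
    grouped = grouped.transpose(0, 2, 1, 3, 4).reshape(h * w, window * window * C)
    return grouped @ W_proj  # (h*w, out_dim): 4x fewer tokens for the language model

tokens = np.random.randn(64, 48, 1152)        # assumed ViT output grid, not the real size
print(adaptive_mlp_connector(tokens).shape)   # (768, 1024)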

The language model is based on the dense Hunyuan 0.5B model and uses XD RoPE. XD RoPE splits rotary position embeddings into 4 subspaces for text, height, width, and time. This gives the model a native way to align 1D token order with 2D layout and 3D spatiotemporal structure. As a result, the same stack can handle multi column pages, cross page flows, and sequences of video frames.
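As a rough illustration of the idea, the sketch below partitions rotary channels into text, height, width, and time groups, each rotated by its own position index. The split fractions and frequency base are assumptions, since the report is not quoted here on exact values.

import numpy as np

def xd_rope_angles(head_dim, pos, splits=(0.5, 0.2, 0.2, 0.1), base=10000.0):
    # pos = (text_idx, h_idx, w_idx, t_idx); each channel group gets its own rotation.
    pairs = head_dim // 2
    sizes = [int(pairs * s) for s in splits]
    sizes[0] += pairs - sum(sizes)            # absorb rounding into the text group
    angles, start = np.zeros(pairs), 0
    for size, p in zip(sizes, pos):
        inv_freq = base ** (-np.arange(size) / max(size, 1))
        angles[start:start + size] = p * inv_freq
        start += size
    return np.cos(angles), np.sin(angles)     # applied to query/key channel pairs

cos, sin = xd_rope_angles(head_dim=128, pos=(42, 7, 13, 0))
print(cos.shape, sin.shape)  # (64,) (64,)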

Training and inference follow a fully end to end paradigm. There is no external layout analysis or post processing model in the loop. All tasks are expressed as natural language prompts and handled in a single forward pass. This design removes error propagation across pipeline stages and simplifies deployment.

Data and Pre Training Recipe

The data pipeline builds more than 200M image text pairs across 9 real world scenarios, including street views, documents, advertisements, handwritten text, screenshots, cards/certificates/invoices, game interfaces, video frames, and artistic typography. The corpus covers more than 130 languages.

Synthetic data comes from a multilingual generator that supports right to left scripts and paragraph level rendering. The pipeline controls font, language, rotation, and RGB values, and applies warping, blur, and local lighting changes to simulate mobile captures and other hard conditions.


Pre training follows 4 stages. Stage-1 performs vision language alignment with pure text, synthetic parsing and recognition data, and general caption data, using 50B tokens and 8k context. Stage-2 runs multimodal pre training on 300B tokens that mix pure text with synthetic spotting, parsing, translation, and VQA samples. Stage-3 extends context length to 32k with 80B tokens focused on long documents and long text. Stage-4 is application oriented supervised fine tuning on 24B tokens of human annotated and hard negative data, keeping 32k context and unified instruction templates.

Reinforcement Learning with Verifiable Rewards

After supervised training, HunyuanOCR is further optimized with reinforcement learning. The research team uses Group Relative Policy Optimization (GRPO) in a Reinforcement Learning with Verifiable Rewards setup for structured tasks. For text spotting, the reward is based on intersection over union matching of boxes combined with normalized edit distance over text. For document parsing, the reward uses normalized edit distance between the generated structure and the reference.

For VQA and translation, the system uses an LLM as a judge. VQA uses a binary reward that checks semantic match. Translation uses a COMET style scoring LLM with scores in [0, 5], normalized to [0, 1]. The training framework enforces length limits and strict formats, and assigns zero reward when outputs overflow or break schema, which stabilizes optimization and encourages valid JSON or structured outputs.
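The snippet below sketches what such verifiable rewards could look like in code, following the description above: a spotting reward that combines IoU matching with text similarity, a zero reward for malformed JSON, and a translation reward normalized from [0, 5] to [0, 1]. The greedy matching, 0.5 IoU threshold, and use of difflib as an edit distance stand-in are assumptions, not the paper's exact recipe.

import json
from difflib import SequenceMatcher

def norm_edit_sim(a, b):
    # 1 minus normalized edit distance, approximated here with difflib's ratio
    return SequenceMatcher(None, a, b).ratio()

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a; bx1, by1, bx2, by2 = box_b
    ix = max(0, min(ax2, bx2) - max(ax1, bx1)); iy = max(0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def spotting_reward(pred_json, ref_items, iou_thr=0.5):
    # Zero reward for malformed output; otherwise average IoU-matched text similarity.
    try:
        preds = json.loads(pred_json)
    except json.JSONDecodeError:
        return 0.0                      # schema violation -> zero reward
    scores = []
    for ref in ref_items:
        best = max(preds, key=lambda p: iou(p["box"], ref["box"]), default=None)
        if best and iou(best["box"], ref["box"]) >= iou_thr:
            scores.append(norm_edit_sim(best["text"], ref["text"]))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0

def translation_reward(judge_score_0_to_5):
    return max(0.0, min(judge_score_0_to_5, 5.0)) / 5.0   # normalize [0, 5] to [0, 1]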

Benchmark Results, a 1B Model Competing with Larger VLMs

On the internal text spotting benchmark of 900 images across 9 categories, HunyuanOCR reaches an overall score of 70.92. It outperforms traditional pipeline methods like PaddleOCR and BaiduOCR and also general VLMs such as Gemini 2.5 Pro, Qwen3 VL 2B, Qwen3 VL 235B, and Seed 1.6 Vision, despite using far fewer parameters.

On OmniDocBench, HunyuanOCR achieves 94.10 overall, with 94.73 on formulas and 91.81 on tables. On the Wild OmniDocBench variant, which prints and recaptures documents under folds and lighting changes, it scores 85.21 overall. On DocML, a multilingual parsing benchmark across 14 non Chinese and non English languages, it reaches 91.03, and the paper reports state of the art results across all 14 languages.

For information extraction and VQA, HunyuanOCR reaches 92.29 accuracy on cards, 92.53 on receipts, and 92.87 on video subtitles. On OCRBench, it scores 860, higher than DeepSeek OCR at similar scale and close to larger general VLMs like Qwen3 VL 2B Instruct and Gemini 2.5 Pro.

In text image translation, HunyuanOCR is evaluated on the DoTA benchmark and a DocML based internal set. It achieves a strong COMET score on DoTA for English to Chinese document translation, and the model won first place in Track 2.2, OCR free Small Model, of the ICDAR 2025 DIMT competition.


Key Takeaways

Compact end to end OCR VLM: HunyuanOCR is a 1B parameter OCR focused vision language model that connects a 0.4B native resolution ViT to a 0.5B Hunyuan language model through an MLP adapter, and runs spotting, parsing, information extraction, VQA and translation in one end to end instruction driven pipeline without external layout or detection modules.

Unified support for diverse OCR scenarios: The model is trained on more than 200M image text pairs across 9 scenarios, including documents, street views, advertisements, handwritten content, screenshots, cards and invoices, game interfaces and video frames, with coverage of over 130 languages in training and support for more than 100 languages in deployment.

Data pipeline plus reinforcement learning: Training uses a 4 stage recipe: vision language alignment, multimodal pre training, long context pre training, and application oriented supervised fine tuning, followed by reinforcement learning with Group Relative Policy Optimization and verifiable rewards for spotting, parsing, VQA, and translation.

Strong benchmark results for sub 3B models: HunyuanOCR reaches 94.1 on OmniDocBench for document understanding and achieves 860 on OCRBench, which is reported as state of the art among vision language models with fewer than 3B parameters, while also outperforming several commercial OCR APIs and larger open models such as Qwen3 VL 4B on core OCR benchmarks.

Editorial Notes

HunyuanOCR is a strong signal that OCR specific VLMs are maturing into practical infrastructure, not just benchmarks. Tencent combines a 1B parameter end to end architecture, a native resolution Vision Transformer, an Adaptive MLP Connector, and RL with verifiable rewards to deliver a single model that covers spotting, parsing, IE, VQA, and translation across more than 100 languages, while reaching leading OCRBench scores among sub 3B models and 94.1 on OmniDocBench. Overall, HunyuanOCR marks an important shift toward compact, instruction driven OCR engines that are realistic for production deployment.

Check out the Paper, Model weights, and Repo.

How to Implement Functional Components of Transformer and Mini-GPT Model from Scratch Using Tinygrad to Understand Deep Learning Internals

In this tutorial, we explore how to build neural networks from scratch using Tinygrad while remaining fully hands-on with tensors, autograd, attention mechanisms, and transformer architectures. We progressively build every component ourselves, from basic tensor operations to multi-head attention, transformer blocks, and, finally, a working mini-GPT model. Through each stage, we observe how Tinygrad’s simplicity helps us understand what happens under the hood when models train, optimize, and fuse kernels for performance. Check out the FULL CODES here.

import subprocess, sys, os
print("Installing dependencies…")
subprocess.check_call(["apt-get", "install", "-qq", "clang"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "git+https://github.com/tinygrad/tinygrad.git"])

import numpy as np
from tinygrad import Tensor, nn, Device
from tinygrad.nn import optim
import time

print(f"Using device: {Device.DEFAULT}")
print("=" * 60)

print("\n PART 1: Tensor Operations & Autograd")
print("-" * 60)

x = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = Tensor([[2.0, 0.0], [1.0, 2.0]], requires_grad=True)

z = (x @ y).sum() + (x ** 2).mean()
z.backward()

print(f"x:\n{x.numpy()}")
print(f"y:\n{y.numpy()}")
print(f"z (scalar): {z.numpy()}")
print(f"∂z/∂x:\n{x.grad.numpy()}")
print(f"∂z/∂y:\n{y.grad.numpy()}")

We set up Tinygrad in our Colab environment and immediately begin experimenting with tensors and automatic differentiation. We create a small computation graph and observe how gradients flow through matrix operations. As we print the outputs, we gain an intuitive understanding of how Tinygrad handles backpropagation under the hood. Check out the FULL CODES here.

print("\n\n PART 2: Building Custom Layers")
print("-" * 60)

class MultiHeadAttention:
    def __init__(self, dim, num_heads):
        self.num_heads = num_heads
        self.dim = dim
        self.head_dim = dim // num_heads
        self.qkv = Tensor.glorot_uniform(dim, 3 * dim)
        self.out = Tensor.glorot_uniform(dim, dim)

    def __call__(self, x):
        B, T, C = x.shape[0], x.shape[1], x.shape[2]
        qkv = x.reshape(B * T, C).dot(self.qkv).reshape(B, T, 3, self.num_heads, self.head_dim)
        q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]
        # move heads in front of the sequence dimension: (B, T, H, D) -> (B, H, T, D)
        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
        scale = (self.head_dim ** -0.5)
        attn = (q @ k.transpose(-2, -1)) * scale   # (B, H, T, T) attention scores
        attn = attn.softmax(axis=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return out.reshape(B * T, C).dot(self.out).reshape(B, T, C)

class TransformerBlock:
    def __init__(self, dim, num_heads):
        self.attn = MultiHeadAttention(dim, num_heads)
        self.ff1 = Tensor.glorot_uniform(dim, 4 * dim)
        self.ff2 = Tensor.glorot_uniform(4 * dim, dim)
        self.ln1_w = Tensor.ones(dim)
        self.ln2_w = Tensor.ones(dim)

    def __call__(self, x):
        x = x + self.attn(self._layernorm(x, self.ln1_w))
        ff = x.reshape(-1, x.shape[-1])
        ff = ff.dot(self.ff1).gelu().dot(self.ff2)
        x = x + ff.reshape(x.shape)
        return self._layernorm(x, self.ln2_w)

    def _layernorm(self, x, w):
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        return w * (x - mean) / (var + 1e-5).sqrt()

We design our own multi-head attention module and a transformer block entirely from scratch. We implement the projections, attention scores, softmax, feedforward layers, and layer normalization manually. As we run this code, we see how each component contributes to a transformer layer’s overall behavior. Check out the FULL CODES here.

print("\n PART 3: Mini-GPT Architecture")
print("-" * 60)

class MiniGPT:
    def __init__(self, vocab_size=256, dim=128, num_heads=4, num_layers=2, max_len=32):
        self.vocab_size = vocab_size
        self.dim = dim
        self.tok_emb = Tensor.glorot_uniform(vocab_size, dim)
        self.pos_emb = Tensor.glorot_uniform(max_len, dim)
        self.blocks = [TransformerBlock(dim, num_heads) for _ in range(num_layers)]
        self.ln_f = Tensor.ones(dim)
        self.head = Tensor.glorot_uniform(dim, vocab_size)

    def __call__(self, idx):
        B, T = idx.shape[0], idx.shape[1]
        tok_emb = self.tok_emb[idx.flatten()].reshape(B, T, self.dim)
        pos_emb = self.pos_emb[:T].reshape(1, T, self.dim)
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        x = self.ln_f * (x - mean) / (var + 1e-5).sqrt()
        return x.reshape(B * T, self.dim).dot(self.head).reshape(B, T, self.vocab_size)

    def get_params(self):
        params = [self.tok_emb, self.pos_emb, self.ln_f, self.head]
        for block in self.blocks:
            params.extend([block.attn.qkv, block.attn.out, block.ff1, block.ff2, block.ln1_w, block.ln2_w])
        return params

model = MiniGPT(vocab_size=256, dim=64, num_heads=4, num_layers=2, max_len=16)
params = model.get_params()
total_params = sum(p.numel() for p in params)
print(f"Model initialized with {total_params:,} parameters")

We assemble the full MiniGPT architecture using the components built earlier. We embed tokens, add positional information, stack multiple transformer blocks, and project the final outputs back to vocab logits. As we initialize the model, we begin to appreciate how a compact transformer can be built with surprisingly few moving parts. Check out the FULL CODES here.

print("\n\n PART 4: Training Loop")
print("-" * 60)

def gen_data(batch_size, seq_len):
    x = np.random.randint(0, 256, (batch_size, seq_len))
    y = np.roll(x, 1, axis=1)
    y[:, 0] = x[:, 0]
    return Tensor(x, dtype='int32'), Tensor(y, dtype='int32')

optimizer = optim.Adam(params, lr=0.001)
losses = []

print("Training to predict previous token in sequence...")
with Tensor.train():
    for step in range(20):
        start = time.time()
        x_batch, y_batch = gen_data(batch_size=16, seq_len=16)
        logits = model(x_batch)
        B, T, V = logits.shape[0], logits.shape[1], logits.shape[2]
        loss = logits.reshape(B * T, V).sparse_categorical_crossentropy(y_batch.reshape(B * T))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.numpy())
        elapsed = time.time() - start
        if step % 5 == 0:
            print(f"Step {step:3d} | Loss: {loss.numpy():.4f} | Time: {elapsed*1000:.1f}ms")

print("\n\n PART 5: Lazy Evaluation & Kernel Fusion")
print("-" * 60)

N = 512
a = Tensor.randn(N, N)
b = Tensor.randn(N, N)

print("Creating computation: (A @ B.T + A).sum()")
lazy_result = (a @ b.T + a).sum()
print("→ No computation done yet (lazy evaluation)")

print("\nCalling .realize() to execute...")
start = time.time()
realized = lazy_result.realize()
elapsed = time.time() - start

print(f"✓ Computed in {elapsed*1000:.2f}ms")
print(f"Result: {realized.numpy():.4f}")
print("\nNote: Operations were fused into optimized kernels!")

We train the MiniGPT model on simple synthetic data and observe the loss decreasing across steps. We also explore Tinygrad’s lazy execution model by creating a fused kernel that executes only when it is realized. As we monitor timings, we understand how kernel fusion improves performance. Check out the FULL CODES here.

print("\n\n PART 6: Custom Operations")
print("-" * 60)

def custom_activation(x):
    return x * x.sigmoid()

x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]], requires_grad=True)
y = custom_activation(x)
loss = y.sum()
loss.backward()

print(f"Input: {x.numpy()}")
print(f"Swish(x): {y.numpy()}")
print(f"Gradient: {x.grad.numpy()}")

print("\n\n" + "=" * 60)
print(" Tutorial Complete!")
print("=" * 60)
print("""
Key Concepts Covered:
1. Tensor operations with automatic differentiation
2. Custom neural network layers (Attention, Transformer)
3. Building a mini-GPT language model from scratch
4. Training loop with Adam optimizer
5. Lazy evaluation and kernel fusion
6. Custom activation functions
""")

We implement a custom activation function and verify that gradients propagate correctly through it. We then print a summary of all major concepts covered in the tutorial. As we finish, we reflect on how each section builds our ability to understand, modify, and extend deep learning internals using Tinygrad.

In conclusion, we reinforce our understanding of how neural networks truly operate beneath modern abstractions, and we experience firsthand how Tinygrad empowers us to tinker with every internal detail. We have built a transformer, trained it on synthetic data, experimented with lazy evaluation and kernel fusion, and even created custom operations, all within a minimal, transparent framework. At last, we recognize how this workflow prepares us for deeper experimentation, whether we extend the model, integrate real datasets, or continue exploring Tinygrad’s low-level capabilities.

Check out the FULL CODES here.

Black Forest Labs Releases FLUX.2: A 32B Flow Matching Transformer for Production Image Pipelines

Black Forest Labs has released FLUX.2, its second generation image generation and editing system. FLUX.2 targets real world creative workflows such as marketing assets, product photography, design layouts, and complex infographics, with editing support up to 4 megapixels and strong control over layout, logos, and typography.

FLUX.2 product family and FLUX.2 [dev]

The FLUX.2 family spans hosted APIs and open weights:

FLUX.2 [pro] is the managed API tier. It targets state of the art quality relative to closed models, with high prompt adherence and low inference cost, and is available in the BFL Playground, BFL API, and partner platforms.

FLUX.2 [flex] exposes parameters such as number of steps and guidance scale, so developers can trade off latency, text rendering accuracy, and visual detail.

FLUX.2 [dev] is the open weight checkpoint, derived from the base FLUX.2 model. It is described as the most powerful open weight image generation and editing model, combining text to image and multi image editing in one checkpoint, with 32 billion parameters.

FLUX.2 [klein] is an upcoming open source Apache 2.0 variant, size distilled from the base model for smaller setups, with many of the same capabilities.

All variants support image editing from text and multiple references in a single model, which removes the need to maintain separate checkpoints for generation and editing.

Architecture, latent flow, and the FLUX.2 VAE

FLUX.2 uses a latent flow matching architecture. The core design couples a Mistral-3 24B vision language model with a rectified flow transformer that operates on latent image representations. The vision language model provides semantic grounding and world knowledge, while the transformer backbone learns spatial structure, materials, and composition.

The model is trained to map noise latents to image latents under text conditioning, so the same architecture supports both text driven synthesis and editing. For editing, latents are initialized from existing images, then updated under the same flow process while preserving structure.
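Conceptually, sampling in such a model amounts to integrating a learned velocity field from a noise latent toward an image latent, and editing starts the same integration from a partially noised latent of an existing image. The sketch below illustrates this with a plain Euler loop; the placeholder velocity_model, latent shape, linear time grid, and step counts are assumptions, not Black Forest Labs' implementation.

import numpy as np

def velocity_model(z, t, cond):
    # Placeholder for the FLUX.2 transformer: predicts a velocity for latent z at time t.
    # Here it simply pulls the latent toward zero so the sketch runs end to end.
    return -z

def flow_sample(z_noise, cond, steps=28):
    # Euler integration of the flow from t=1 (noise) to t=0 (image latent).
    ts = np.linspace(1.0, 0.0, steps + 1)
    z = z_noise
    for t_curr, t_next in zip(ts[:-1], ts[1:]):
        z = z + (t_next - t_curr) * velocity_model(z, t_curr, cond)
    return z

def flow_edit(image_latent, cond, strength=0.6, steps=28):
    # Editing: partially noise an existing image latent, then run the same flow
    # from that intermediate time so the original structure is preserved.
    noise = np.random.randn(*image_latent.shape)
    z = (1.0 - strength) * image_latent + strength * noise
    ts = np.linspace(strength, 0.0, max(1, int(steps * strength)) + 1)
    for t_curr, t_next in zip(ts[:-1], ts[1:]):
        z = z + (t_next - t_curr) * velocity_model(z, t_curr, cond)
    return z

latent = np.random.randn(1, 16, 64, 64)      # assumed latent shape, not FLUX.2's real one
generated = flow_sample(latent, cond=None)
edited = flow_edit(generated, cond=None)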

A new FLUX.2 VAE defines the latent space. It is designed to balance learnability, reconstruction quality, and compression, and is released separately on Hugging Face under an Apache 2.0 license. This autoencoder is the backbone for all FLUX.2 flow models and can also be reused in other generative systems.

https://bfl.ai/blog/flux-2

Capabilities for production workflows

The FLUX.2 Docs and Diffusers integration highlight several key capabilities:

Multi reference support: FLUX.2 can combine up to 10 reference images to maintain character identity, product appearance, and style across outputs.

Photoreal detail at 4MP: the model can edit and generate images up to 4 megapixels, with improved textures, skin, fabrics, hands, and lighting suitable for product shots and photo like use cases.

Robust text and layout rendering: it can render complex typography, infographics, memes, and user interface layouts with small legible text, which is a common weakness in many older models.

World knowledge and spatial logic: the model is trained for more grounded lighting, perspective, and scene composition, which reduces artifacts and the synthetic look.


Key Takeaways

FLUX.2 is a 32B latent flow matching transformer that unifies text to image, image editing, and multi reference composition in a single checkpoint.

FLUX.2 [dev] is the open weight variant, paired with the Apache 2.0 FLUX.2 VAE, while the core model weights use the FLUX.2-dev Non Commercial License with mandatory safety filtering.

The system supports up to 4 megapixel generation and editing, robust text and layout rendering, and up to 10 visual references for consistent characters, products, and styles.

Full precision inference requires more than 80GB VRAM, but 4 bit and FP8 quantized pipelines with offloading make FLUX.2 [dev] usable on 18GB to 24GB GPUs and even 8GB cards with sufficient system RAM.

Editorial Notes

FLUX.2 is an important step for open weight visual generation, since it combines a 32B rectified flow transformer, a Mistral 3 24B vision language model, and the FLUX.2 VAE into a single high fidelity pipeline for text to image and editing. The clear VRAM profiles, quantized variants, and strong integrations with Diffusers, ComfyUI, and Cloudflare Workers make it practical for real workloads, not only benchmarks. This release pushes open image models closer to production grade creative infrastructure.

Check out the Technical details, Model weights, and Repo.

How Myriad Genetics achieved fast, accurate, and cost-efficient docume …

This post was written with Martyna Shallenberg and Brode Mccrady from Myriad Genetics.
Healthcare organizations face challenges in processing and managing high volumes of complex medical documentation while maintaining quality in patient care. These organizations need solutions to process documents effectively to meet growing demands. Myriad Genetics, a provider of genetic testing and precision medicine solutions serving healthcare providers and patients worldwide, addresses this challenge.
Myriad’s Revenue Engineering Department processes thousands of healthcare documents daily across Women’s Health, Oncology, and Mental Health divisions. The company classifies incoming documents into classes such as Test Request Forms, Lab Results, Clinical Notes, and Insurance to automate Prior Authorization workflows. The system routes these documents to appropriate external vendors for processing based on their identified document class. They manually perform Key Information Extraction (KIE) including insurance details, patient information, and test results to determine Medicare eligibility and support downstream processes.
As document volumes increased, Myriad faced challenges with its existing system. The automated document classification solution worked but was costly and time-consuming. Information extraction remained manual due to complexity. To address high costs and slow processing, Myriad needed a better solution.
This post explores how Myriad Genetics partnered with the AWS Generative AI Innovation Center (GenAIIC) to transform their healthcare document processing pipeline using Amazon Bedrock and Amazon Nova foundation models. We detail the challenges with their existing solution, and how generative AI reduced costs and improved processing speed.
We examine the technical implementation using AWS’s open source GenAI Intelligent Document Processing (GenAI IDP) Accelerator solution, the optimization strategies used for document classification and key information extraction, and the measurable business impact on Myriad’s prior authorization workflows. We cover how we used prompt engineering techniques, model selection strategies, and architectural decisions to build a scalable solution that processes complex medical documents with high accuracy while reducing operational costs.
Document processing bottlenecks limiting healthcare operations
Myriad Genetics’ daily operations depend on efficiently processing complex medical documents containing critical information for patient care workflows and regulatory compliance. Their existing solution combined Amazon Textract for Optical Character Recognition (OCR) with Amazon Comprehend for document classification.
Despite 94% classification accuracy, this solution had operational challenges:

Operational costs: 3 cents per page resulting in $15,000 monthly expenses per business unit
Classification latency: 8.5 minutes per document, delaying downstream prior authorization workflows

Information extraction was entirely manual, requiring contextual understanding to differentiate critical clinical distinctions (like “is metastatic” versus “is not metastatic”) and to locate information like insurance numbers and patient information across varying document formats. The processing burden was substantial: in the Women’s Health business unit alone, customer service required up to 10 full-time employees contributing 78 hours daily.
Myriad needed a solution to:

Reduce document classification costs while maintaining or improving accuracy
Accelerate document processing to eliminate workflow bottlenecks
Automate information extraction for medical documents
Scale across multiple business units and document types

Amazon Bedrock and generative AI
Modern large language models (LLMs) process complex healthcare documents with high accuracy due to pre-training on massive text corpora. This pre-training enables LLMs to understand language patterns and document structures without feature engineering or large labeled datasets. Amazon Bedrock is a fully managed service that offers a broad range of high-performing LLMs from leading AI companies. It provides the security, privacy, and responsible AI capabilities that healthcare organizations require when processing sensitive medical information. For this solution, we used Amazon’s newest foundation models:

Amazon Nova Pro: A cost-effective, low-latency model ideal for document classification
Amazon Nova Premier: An advanced model with reasoning capabilities for information extraction

Solution overview
We implemented a solution with Myriad using AWS’s open source GenAI IDP Accelerator. The accelerator provides a scalable, serverless architecture that converts unstructured documents into structured data. The accelerator processes multiple documents in parallel through configurable concurrency limits without overwhelming downstream services. Its built-in evaluation framework lets users provide expected output through the user interface (UI) and evaluate generated results to iteratively customize configuration and improve accuracy.

The accelerator offers 1-click deployment with a choice of pre-built patterns optimized for different workloads with different configurability, cost, and accuracy requirements:

Pattern 1 – Uses Amazon Bedrock Data Automation, a fully managed service that offers rich out-of-the-box features, ease of use, and straightforward per-page pricing. This pattern is recommended for most use cases.
Pattern 2 – Uses Amazon Textract and Amazon Bedrock with Amazon Nova, Anthropic’s Claude, or custom fine-tuned Amazon Nova models. This pattern is ideal for complex documents requiring custom logic.
Pattern 3 – Uses Amazon Textract, Amazon SageMaker with a fine-tuned model for classification, and Amazon Bedrock for extraction. This pattern is ideal for documents requiring specialized classification.

Pattern 2 proved most suitable for this project, meeting the critical requirement of low cost while offering flexibility to optimize accuracy through prompt engineering and LLM selection. This pattern offers a no-code configuration – customize document types, extraction fields, and processing logic through configuration, editable in the web UI.
We customized the definitions of document classes, key attributes and their definitions per document class, LLM choice, LLM hyperparameters, and classification and extraction LLM prompts via Pattern 2’s config file. In production, Myriad integrated this solution into their existing event-driven architecture. The following diagram illustrates the production pipeline:

Document Ingestion: Incoming order events trigger document retrieval from source document management systems, with cache optimization for previously processed documents.
Concurrency Management: DynamoDB tracked concurrent AWS Step Function jobs while Amazon Simple Queue Service (SQS) queues files exceeding concurrency limits for document processing.
Text Extraction: Amazon Textract extracted text, layout information, tables and forms from the normalized documents.
Classification: The configured LLM analyzed the extracted content based on the customized document classification prompt provided in the config file and classifies documents into appropriate categories.
Key Information Extraction: The configured LLM extracted medical information using extraction prompt provided in the config file.
Structured Output: The pipeline formatted the results in a structured manner and delivered to Myriad’s Authorization System via RESTful operations.

Document classification with generative AI
While Myriad’s existing solution achieved 94% accuracy, misclassifications occurred due to structural similarities, overlapping content, and shared formatting patterns across document types. This semantic ambiguity made it difficult to distinguish between similar documents. We guided Myriad on prompt optimization techniques that used LLM’s contextual understanding capabilities. This approach moved beyond pattern matching to enable semantic analysis of document context and purpose, identifying distinguishing features that human experts recognize but previous automated systems missed.
AI-driven prompt engineering for document classification
We developed class definitions with distinguishing characteristics between similar document types. To identify these differentiators, we provided document samples from each class to Anthropic Claude Sonnet 3.7 on Amazon Bedrock with model reasoning enabled (a feature that allows the model to demonstrate its step-by-step analysis process). The model identified distinguishing features between similar document classes, which Myriad’s subject matter experts refined and incorporated into the GenAI IDP Accelerator’s Pattern 2 config file for document classification prompts.
Format-based classification strategies
We used document structure and formatting as key differentiators to distinguish between similar document types that shared comparable content but differed in structure. We enabled the classification models to recognize format-specific characteristics such as layout structures, field arrangements, and visual elements, allowing the system to differentiate between documents that textual content alone cannot distinguish. For example, lab reports and test results both contain patient information and medical data, but lab reports display numerical values in tabular format while test results follow a narrative format. We instructed the LLM: “Lab reports contain numerical results organized in tables with reference ranges and units. Test results present findings in paragraph format with clinical interpretations.”
Implementing negative prompting for enhanced accuracy
We implemented negative prompting techniques to resolve confusion between similar documents by explicitly instructing the model what classifications to avoid. This approach added exclusionary language to classification prompts, specifying characteristics that should not be associated with each document type. Initially, the system frequently misclassified Test Request Forms (TRFs) as Test Results due to confusion between patient medical history and lab measurements. Adding a negative prompt like “These forms contain patient medical history. DO NOT confuse them with test results which contain current/recent lab measurements” to the TRF definition improved the classification accuracy by 4%. By providing explicit guidance on common misclassification patterns, the system avoided typical errors and confusion between similar document types.
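In production this behavior is configured through the accelerator’s Pattern 2 config file rather than hand-written code, but the sketch below illustrates the idea: class definitions with exclusionary language are sent to Amazon Nova Pro through the Bedrock Converse API. The class definitions, prompt wording, and model ID string are illustrative assumptions.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

CLASS_DEFINITIONS = """
Test Request Form (TRF): order forms listing requested tests and patient medical history.
DO NOT confuse them with Test Results, which contain current/recent lab measurements.
Lab Results: numerical results organized in tables with reference ranges and units.
Test Results: findings in paragraph format with clinical interpretations.
Insurance: member ID cards or coverage letters from payers.
"""  # illustrative definitions; the real ones live in the accelerator's config file

def classify(document_text):
    # Single Converse call; a production system adds retries, logging, and guardrails.
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",   # assumed model ID string
        messages=[{"role": "user", "content": [{"text":
            f"{CLASS_DEFINITIONS}\n\nClassify the document below into exactly one class. "
            f"Reply with the class name only.\n\n<document>\n{document_text}\n</document>"}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 20},
    )
    return response["output"]["message"]["content"][0]["text"].strip()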
Model selection for cost and performance optimization
Model selection drives optimal cost-performance at scale, so we conducted comprehensive benchmarking using the GenAI IDP Accelerator’s evaluation framework. We tested four foundation models—Amazon Nova Lite, Amazon Nova Pro, Amazon Nova Premier, and Anthropic Claude Sonnet 3.7—using 1,200 healthcare documents across three document classes: Test Request Forms, Lab Results, and Insurance. We assessed each model using three critical metrics: classification accuracy, processing latency, and cost per document. The accelerator’s cost tracking enabled direct comparison of operational expenses across different model configurations, ensuring performance improvements translate into measurable business value at scale.
The evaluation results demonstrated that Amazon Nova Pro achieved optimal balance for Myriad’s use case. We transitioned from Myriad’s Amazon Comprehend implementation to Amazon Nova Pro with optimized prompts for document classification, achieving significant improvements: classification accuracy increased from 94% to 98%, processing costs decreased by 77%, and processing speed improved by 80%—reducing classification time from 8.5 minutes to 1.5 minutes per document.
Automating Key Information Extraction with generative AI
Myriad’s information extraction was manual, requiring up to 10 full-time employees contributing 78 hours daily in the Women’s Health unit alone, which created operational bottlenecks and scalability constraints. Automating healthcare KIE presented challenges: checkbox fields required distinguishing between marking styles (checkmarks, X’s, handwritten marks); documents contained ambiguous visual elements like overlapping marks or content spanning multiple fields; extraction needed contextual understanding to differentiate clinical distinctions and locate information across varying document formats. We worked with Myriad to develop an automated KIE solution, implementing the following optimization techniques to address extraction complexity.
Enhanced OCR configuration for checkbox recognition
To address checkbox identification challenges, we enabled Amazon Textract’s specialized TABLES and FORMS features on the GenAI IDP Accelerator portal as shown in the following image, to improve OCR discrimination between selected and unselected checkbox elements. These features enhanced the system’s ability to detect and interpret marking styles found in medical forms.

We enhanced accuracy by incorporating visual cues into the extraction prompts. We updated the prompts with instructions such as “look for visible marks in or around the small square boxes (✓, x, or handwritten marks)” to guide the language model in identifying checkbox selections. This combination of enhanced OCR capabilities and targeted prompting improved checkbox extraction in medical forms.
Visual context learning through few-shot examples
Configuring Textract and improving prompts alone could not handle complex visual elements effectively. We implemented a multimodal approach that sent both document images and extracted text from Textract to the foundation model, enabling simultaneous analysis of visual layout and textual content for accurate extraction decisions. We implemented few-shot learning by providing example document images paired with their expected extraction outputs to guide the model’s understanding of various form layouts and marking styles. Multiple document image examples with their correct extraction patterns create lengthy LLM prompts. We leveraged the GenAI IDP Accelerator’s built-in integration with Amazon Bedrock’s prompt caching feature to reduce costs and latency. Prompt caching stores lengthy few-shot examples in memory for 5 minutes—when processing multiple similar documents within that timeframe, Bedrock reuses cached examples instead of reprocessing them, reducing both cost and processing time.
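The accelerator wires prompt caching up for you, but a minimal sketch of the underlying mechanism looks like the following: the lengthy, static few-shot block is marked with a cache point in the Converse request so Bedrock can reuse it across documents processed within the cache window. The model ID and the exact cachePoint fields are assumptions to verify against the current Bedrock documentation.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

FEW_SHOT_EXAMPLES = "<examples>...long example images and expected extractions...</examples>"

def extract(document_text):
    response = bedrock.converse(
        modelId="us.amazon.nova-premier-v1:0",     # assumed model ID string
        messages=[{"role": "user", "content": [
            {"text": FEW_SHOT_EXAMPLES},
            {"cachePoint": {"type": "default"}},   # everything above this marker is cached
            {"text": f"Extract the key fields from this document:\n{document_text}"},
        ]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 1024},
    )
    return response["output"]["message"]["content"][0]["text"]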
Chain of thought reasoning for complex extraction
While this multimodal approach improved extraction accuracy, we still faced challenges with overlapping and ambiguous tick marks in complex form layouts. To perform well in ambiguous and complex situations, we used Amazon Nova Premier and implemented Chain of Thought reasoning to have the model think through extraction decisions step-by-step using thinking tags. For example:

Analyze the checkbox marks in this form:

<thinking>
1. What checkboxes are present? [List all visible options]
2. Where are the marks positioned? [Describe mark locations]
3. Which marks are clear vs ambiguous? [Assess mark quality]
4. For overlapping marks: Which checkbox contains most of the mark?
5. Are marks positioned in the center or touching edges? [Prioritize center positioning]
</thinking>

Additionally, we included reasoning explanations in the few-shot examples, demonstrating how we reached conclusions in ambiguous cases. This approach enabled the model to work through complex visual evidence and contextual clues before making final determinations, improving performance with ambiguous tick marks.
Testing across 32 document samples with varying complexity levels via the GenAI IDP Accelerator revealed that Amazon Textract with Layout, TABLES, and FORMS features enabled, paired with Amazon Nova Premier’s advanced reasoning capabilities and the inclusion of few-shot examples, delivered the best results. The solution achieved 90% accuracy (same as human evaluator baseline accuracy) while processing documents in approximately 1.3 minutes each.
Results and business impact
Through our new solution, we delivered measurable improvements that met the business goals established at the project outset:
Document classification performance:

We increased accuracy from 94% to 98% through prompt optimization techniques for Amazon Nova Pro, including AI-driven prompt engineering, document-format based classification strategies, and negative prompting.
We reduced classification costs by 77% (from 3.1 to 0.7 cents per page) by migrating from Amazon Comprehend to Amazon Nova Pro with optimized prompts.
We reduced classification time by 80% (from 8.5 to 1.5 minutes per document) by choosing Amazon Nova Pro to provide a low-latency and cost-effective solution.

New automated Key Information Extraction performance:

We achieved 90% extraction accuracy (same as the baseline manual process): Delivered through a combination of Amazon Textract’s document analysis capabilities, visual context learning through few-shot examples and Amazon Nova Premier’s reasoning for complex data interpretation.
We achieved processing costs of 9 cents per page and processing time of 1.3 minutes per document compared to manual baseline requiring up to 10 full-time employees working 78 hours daily per business unit.

Business impact and rollout
Myriad has planned a phased rollout beginning with document classification. They plan to launch our new classification solution in the Women’s Health business unit, followed by Oncology and Mental Health divisions. As a result of our work, Myriad will realize up to $132K in annual savings in their document classification costs. The solution reduces each prior authorization submission time by 2 minutes—specialists now complete orders in four minutes instead of six minutes due to faster access to tagged documents. This improvement saves 300 hours monthly across 9,000 prior authorizations in Women’s Health alone, equivalent to 50 hours per prior authorization specialist.
These measurable improvements have transformed Myriad’s operations, as their engineering leadership confirms:

“Partnering with the GenAIIC to migrate our Intelligent Document Processing solution from AWS Comprehend to Bedrock has been a transformative step forward. By improving both performance and accuracy, the solution is projected to deliver savings of more than $10,000 per month. The team’s close collaboration with Myriad’s internal engineering team delivered a high-quality, scalable solution, while their deep expertise in advanced language models has elevated our capabilities. This has been an excellent example of how innovation and partnership can drive measurable business impact.” – Martyna Shallenberg, Senior Director of Software Engineering, Myriad Genetics

Conclusion
The AWS GenAI IDP Accelerator enabled Myriad’s rapid implementation, providing a flexible framework that reduced development time. Healthcare organizations need tailored solutions—the accelerator delivers extensive customization capabilities that let users adapt solutions to specific document types and workflows without requiring extensive code changes or frequent redeployment during development. Our approach demonstrates the power of strategic prompt engineering and model selection. We achieved high accuracy in a specialized domain by focusing on prompt design, including negative prompting and visual cues. We optimized both cost and performance by selecting Amazon Nova Pro for classification and Nova Premier for complex extraction—matching the right model to each specific task.
Explore the solution for yourself
Organizations looking to improve their document processing workflows can experience these benefits firsthand. The open source GenAI IDP Accelerator that powered Myriad’s transformation is available to deploy and test in your environment. The accelerator’s straightforward setup process lets users quickly evaluate how generative AI can transform document processing challenges.
Once you’ve explored the accelerator and seen its potential impact on your workflows, reach out to the AWS GenAIIC team to explore how the GenAI IDP Accelerator can be customized and optimized for your specific use case. This hands-on approach ensures you can make informed decisions about implementing intelligent document processing in your organization.

About the authors
Priyashree Roy is a Data Scientist II at the AWS Generative AI Innovation Center, where she applies her expertise in machine learning and generative AI to develop innovative solutions for strategic AWS customers. She brings a rigorous scientific approach to complex business challenges, informed by her PhD in experimental particle physics from Florida State University and postdoctoral research at the University of Michigan.
Mofijul Islam is an Applied Scientist II and Tech Lead at the AWS Generative AI Innovation Center, where he helps customers tackle customer-centric research and business challenges using generative AI, large language models (LLM), multi-agent learning, code generation, and multimodal learning. He holds a PhD in machine learning from the University of Virginia, where his work focused on multimodal machine learning, multilingual natural language processing (NLP), and multitask learning. His research has been published in top-tier conferences like NeurIPS, International Conference on Learning Representations (ICLR), Empirical Methods in Natural Language Processing (EMNLP), Society for Artificial Intelligence and Statistics (AISTATS), and Association for the Advancement of Artificial Intelligence (AAAI), as well as Institute of Electrical and Electronics Engineers (IEEE) and Association for Computing Machinery (ACM) Transactions.
Nivedha Balakrishnan is a Deep Learning Architect II at the AWS Generative AI Innovation Center, where she helps customers design and deploy generative AI applications to solve complex business challenges. Her expertise spans large language models (LLMs), multimodal learning, and AI-driven automation. She holds a Master’s in Applied Data Science from San Jose State University and a Master’s in Biomedical Engineering from Linköping University, Sweden. Her previous research focused on AI for drug discovery and healthcare applications, bridging life sciences with machine learning.
Martyna Shallenberg is a Senior Director of Software Engineering at Myriad Genetics, where she leads cross-functional teams in building AI-driven enterprise solutions that transform revenue cycle operations and healthcare delivery. With a unique background spanning genomics, molecular diagnostics, and software engineering, she has scaled innovative platforms ranging from Intelligent Document Processing (IDP) to modular LIMS solutions. Martyna is also the Founder & President of BioHive’s HealthTech Hub, fostering cross-domain collaboration to accelerate precision medicine and healthcare innovation.
Brode Mccrady is a Software Engineering Manager at Myriad Genetics, where he leads initiatives in AI, revenue systems, and intelligent document processing. With over a decade of experience in business intelligence and strategic analytics, Brode brings deep expertise in translating complex business needs into scalable technical solutions. He holds a degree in Economics, which informs his data-driven approach to problem-solving and business strategy.
Randheer Gehlot is a Principal Customer Solutions Manager at AWS who specializes in healthcare and life sciences transformation. With a deep focus on AI/ML applications in healthcare, he helps enterprises design and implement efficient cloud solutions that address real business challenges. His work involves partnering with organizations to modernize their infrastructure, enable innovation, and accelerate their cloud adoption journey while ensuring practical, sustainable outcomes.
Acknowledgements
We would like to thank Bob Strahan, Kurt Mason, Akhil Nooney and Taylor Jensen for their significant contributions, strategic decisions and guidance throughout.

How CBRE powers unified property management search and digital assista …

This post was written with Lokesha Thimmegowda, Muppirala Venkata Krishna Kumar, and Maraka Vishwadev of CBRE.
CBRE is the world’s largest commercial real estate services and investment firm. The company serves clients in more than 100 countries and offers services ranging from capital markets and leasing advisory to investment management, project management and facilities management.
CBRE uses AI to improve commercial real estate solutions with advanced analytics, automated workflows, and predictive insights. The chance to unlock value with AI in the commercial real estate lifecycle begins with data at scale. With the industry’s largest dataset and a comprehensive suite of enterprise-grade technology, the company has implemented a range of AI solutions to boost individual productivity and support broad-scale transformation.
This blog post describes how CBRE and AWS partnered to transform how property management professionals access information, creating a next-generation search and digital assistant experience that unifies access across many types of property data using Amazon Bedrock, Amazon OpenSearch Service, Amazon Relational Database Service, Amazon Elastic Container Service, and AWS Lambda.
Unified property management search challenges
CBRE’s proprietary PULSE system consolidates a wide range of essential property data—covering structured data from relational databases that record transactions and unstructured data stored in document repositories containing everything from lease agreements to property inspections. In the past, property management professionals had to sift through millions of documents and switch between multiple different systems to locate property maintenance details. Data was scattered across 10 distinct sources and four separate databases, which made it hard to get complete answers. This fragmented setup reduced productivity and made it difficult to uncover key insights about property operations.
Experts in property management, not database syntax, needed to ask complex questions in natural language, quickly synthesize disparate information, and avoid manual review of lengthy documents.
The challenge: deliver an intuitive, unified search solution bridging structured and unstructured content, with robust security, enterprise-grade performance and reliability.
Solution architecture
CBRE implemented a global search solution within PULSE, powered by Amazon Bedrock, to address these challenges. The search architecture is designed for a seamless, intelligent, and secure information retrieval experience across diverse data types. It orchestrates an interplay of user interaction, AI-driven processing, and robust data storage.
CBRE’s PULSE search solution uses Amazon Bedrock for the rapid deployment of generative AI capabilities by using multiple foundation models through a single API. CBRE’s implementation uses Amazon Nova Pro for SQL query generation, achieving a 67% reduction in processing time, while Claude Haiku powers intelligent document interactions. The solution maintains enterprise-grade security for all property data. By combining Amazon Bedrock capabilities with Retrieval Augmented Generation (RAG) and Amazon OpenSearch Service, CBRE created a unified search experience across more than eight million documents and multiple databases, fundamentally transforming how property professionals access and analyze business-critical information.
The following diagram illustrates the architecture for the solution that CBRE implemented in AWS:

Let us go through the flow for the solution:

Property Manager and PULSE UI: Property managers interact through the intuitive PULSE user interface, which serves as the gateway for both traditional keyword searches and natural language queries (NLQ). The UI displays search results, supports document conversations, and presents intelligent summaries in desktop and mobile.
Dynamic search execution: When users submit requests, the system first retrieves user-specific permissions from Amazon ElastiCache for Redis, chosen for its low latency and high throughput. Search operations across Amazon OpenSearch and transactional databases are then constrained by these user-specific permissions, making sure users only access authorized results with real-time granular control.
Orchestration layer: This central control hub serves as the application’s brain, receiving user requests from PULSE UI and intelligently routing them to appropriate backend services. Key responsibilities include:

Routing queries to relevant data systems (structured databases, unstructured documents, or both for deep search).
Initiating parallel searches across SQL Interact and Doc Interact components.
Merging, de-duplicating, and ranking results from disparate sources for unified outcomes.
Managing conversation history through Amazon DynamoDB integration.

SQL interact component (structured data search): This pathway manages interactions with structured relational databases (RDBMS) through these key steps:

4.1 Database metadata retrieval: Dynamically fetches schema details (for example, table names, column names, data types, relationships, constraints) for entities like property, contacts, and tenants from an Amazon OpenSearch index.
4.2 Amazon Bedrock LLM (Amazon Nova Pro): Interprets the user’s natural language query alongside schema metadata, translating it into accurate, optimized SQL queries tailored to the database. The solution reduced SQL query generation time from an average of 12 seconds to 4 seconds using Amazon Nova Pro.
4.3 RDBMS systems (PostgreSQL, MS SQL): Actual transactional databases, such as PostgreSQL and MS SQL, which house the core structured property management data (for example, properties, contacts, tenants, K2 forms). They execute the LLM-generated SQL queries and return the structured tabular results back to the SQL Interact component.

DocInteract Component (Unstructured Document Search): This pathway is specifically designed for intelligent search and interaction with unstructured documents.

5.1 Vector Store (OpenSearch Cluster): Stores documents, including those from OpenText, as high-dimensional vectors for efficient semantic search using techniques like k-Nearest Neighbors while prioritizing speed and accuracy with metadata filtering.
5.2 Amazon Bedrock LLM (Claude Haiku): Interprets NLQs and translates them into optimized OpenSearch DSL queries, while powering the “Chat With AI” feature for direct document interaction, generating concise, conversational responses including answers, summaries, and natural dialogue.

Having established the core architecture with both SQL Interact and DocInteract components, the following sections explore the specific optimizations and innovations implemented for each data type, beginning with structured data search enhancements.
Structured data search
Building on the SQL interact component outlined in the architecture, the PULSE Search application offers two search methods for accessing structured data in PostgreSQL and MS SQL. Keyword Search scans fields and schemas for specific terms, providing comprehensive coverage of the entire data system. Natural Language Query (NLQ) Search lets users interact with the databases in everyday language, which is translated into database queries. Both methods help property managers efficiently locate and retrieve information across the database modules.
Database layer search performance enhancement at the SQL level
Our unique challenge involved implementing application-wide keyword searches that needed to scan across the columns in database tables – a non-conventional requirement compared to traditional indexed column-specific searches in RDBMS systems. This universal search capability was essential for user experience, allowing information discovery without knowing specific column names or data structures.
We leveraged native full-text search capabilities in both PostgreSQL and MS SQL Server databases:

PostgreSQL Implementation:

SELECT * FROM dbo.pg_db_view_name bd WHERE textsearchable_all_col @@ to_tsquery('english', 'keyword')

Microsoft SQL Server Implementation:

SELECT * FROM [dbo].ms_db_view_name WHERE CONTAINS(*, '8384F')

Note: Our implementation uses a specialized text search column (textsearchable_all_col) that concatenates the searchable fields from the view pg_db_view_name, while ms_db_view_name represents a view created with full-text search indexing.
This optimization delivered an 80% improvement in query performance by harnessing native database capabilities while balancing comprehensive search coverage with optimal database performance through specialized indexing algorithms.
Database layer search performance enhancement at the SQL interact API level
We implemented several optimizations in database search functionality targeting three key performance indicators (KPIs): Accuracy (precision of results), Consistency (reproducible outcomes), and Relevancy (making sure results align with user intent). The enhancements reduced response latency while simultaneously boosting these ACR metrics, resulting in faster and more dependable search results.
Prompt Engineering Changes: We implemented a comprehensive approach to prompt management and optimization, focusing on the following factors.

Configurability: We implemented modular prompt templates stored in external files to enable version control, simplified management, and reduced prompt size, improving performance and maintainability.
Dynamic field selection for context window reduction: The system uses KNN-based similarity search to filter and select only the most relevant schema fields aligned with user intent, reducing context window size and optimizing prompt effectiveness.
Dynamic few-shot examples: The system selects the most relevant few-shot example from a configuration file using KNN-based similarity search for SQL generation (see the selection sketch below). This context-aware approach makes sure that only the most pertinent example is included in the prompt, minimizing unnecessary data overhead, and helped the LLM generate consistent and accurate SQL.
Business rule integration: The system maintains a centralized repository of business rules in a dedicated schema wise configuration file, making rule management and updates streamlined and efficient. During prompt generation, relevant business rules are dynamically integrated into prompts, facilitating consistency in rule application while providing flexibility for updates and maintenance.
LLM score-based relevancy: We added a fourth LLM call to evaluate and reorder schema relevance after the initial KNN retrieval, addressing cases where vector search returned irrelevant or poorly ordered schemas. For example, when processing a user query about property or contact information, the vector search might return three schemas, but:

The third schema might be irrelevant to the query.
The ordering of the two relevant schemas might not reflect their true relevancy to the query.
To address these challenges, we introduced an additional LLM processing step (a fourth, parallel LLM call) that:

Evaluates the relevance of each schema to the user query.
Assigns relevancy scores to determine schema importance.
Reorders schemas based on their actual relevance to the query.
This enhancement improved our schema selection process by:

Making sure only truly relevant schemas are selected.
Maintaining proper relevancy ordering.
Providing more accurate context for subsequent query processing.

The result was more precise, contextually appropriate responses and improved overall application performance.
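The following is a minimal sketch of the KNN-based few-shot selection described above. The cosine-similarity ranking and the shape of the example repository are assumptions for illustration; the production system stores its examples in a configuration file and computes embeddings with its own embedding model.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_few_shot_examples(query_embedding: np.ndarray, example_repository: list, top_k: int = 1) -> list:
    """KNN-style selection: each repository entry carries a precomputed embedding of its
    example question; keep only the top_k entries most similar to the user query."""
    scored = [
        (cosine_similarity(query_embedding, example["embedding"]), example)
        for example in example_repository
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [example for _, example in scored[:top_k]]

# Usage (embeddings would come from an embedding model, for example one hosted on Amazon Bedrock):
# examples = select_few_shot_examples(embed(user_query), example_repository, top_k=1)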
Parallel LLM inference for SQL generation with Amazon Nova Pro
We implemented a comprehensive parallel processing architecture for NLQ to SQL conversion, enhancing system performance and efficiency. The solution introduces concurrent schema-based API calls to the LLM inference engine, with asynchronous processing for multiple schema evaluations. Our security-first approach authenticates and validates user entitlements while performing context-aware schema identification that incorporates similarity search and enforces access permissions. The system only processes schemas for which the user has explicit authorization, facilitating foundational data security. Following authentication, the system dynamically generates prompts (as detailed in our prompt engineering framework) and initiates concurrent processing of the most relevant schemas through parallel LLM inference calls. Before execution, it enhances the generated SQL queries with mandatory security joins that enforce building-level access controls, restricting users to their authorized buildings only.
Finalized SQL queries are executed on respective database systems (PostgreSQL or SQL Server). The system processes the query results and returns them as a structured API response, maintaining security and data integrity throughout the entire workflow. This architecture facilitates both optimal performance through parallel processing and comprehensive security through multi-layered access controls.
This integrated approach incorporates concurrent validation of generated SQL queries, reducing processing time and improving system throughput. The introduction of Amazon Nova Pro brought a significant further improvement in inference latency. The framework’s architecture facilitates efficient resource utilization while maintaining high accuracy in SQL query generation, making it particularly effective for complex database operations and high-volume query processing requirements.
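As a rough illustration of the concurrent schema-based calls, here is a minimal sketch using the Amazon Bedrock Converse API and a thread pool. The model ID, prompt shapes, and pool sizing are assumptions for illustration, not the production configuration.

import boto3
from concurrent.futures import ThreadPoolExecutor

# Model ID is an illustrative assumption; use the Nova Pro ID enabled in your account and Region.
NOVA_PRO_MODEL_ID = "amazon.nova-pro-v1:0"
bedrock = boto3.client("bedrock-runtime")

def generate_sql_for_schema(schema_prompt: str, user_query: str) -> str:
    """Single schema-specific SQL-generation call to Amazon Nova Pro."""
    response = bedrock.converse(
        modelId=NOVA_PRO_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": f"{schema_prompt}\n\nQuestion: {user_query}"}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

def generate_sql_in_parallel(schema_prompts: list, user_query: str) -> list:
    """Fan out one LLM call per relevant schema and collect the candidate SQL queries."""
    with ThreadPoolExecutor(max_workers=len(schema_prompts)) as pool:
        futures = [pool.submit(generate_sql_for_schema, p, user_query) for p in schema_prompts]
        return [f.result() for f in futures]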

Enhancing unstructured data search
The PULSE document search uses two main methods, enhanced by purpose-built specialized search functions. Users can use the streamlined Keyword Search to precisely locate terms within documents and metadata for fast retrieval when exact search terms are known. This straightforward approach makes sure users can quickly locate exact matches across the entire document landscape. The second method, Natural Language Query (NLQ) Search, supports interaction with documents using everyday language, interpreting intent and converting queries into search parameters, which is particularly powerful for complex or concept-based queries. Complementing these core search methods, the system offers specialized search capabilities including Favorites and Collections search so users can efficiently navigate their personally curated document sets and shared collections. Additionally, the system provides intelligent document upload search functionality that helps users quickly locate appropriate document categories and upload locations based on document types and property contexts.
The search infrastructure supports comprehensive file formats including PDFs, Microsoft Office documents (Word, Excel, PowerPoint), emails (MSG), images (JPG, PNG), text files, HTML files, and various other document types, facilitating comprehensive coverage across the document categories in the property management environment.
Prompt engineering and management optimization
Our Document Search system incorporates advanced prompt engineering techniques to enhance search accuracy, efficiency, and maintainability. Let’s explore the key features of our prompt management system and the value they bring to the search experience.
Two-stage prompt architecture and modular prompt management:
At the core of our system is a two-stage prompt architecture. This design separates tool selection from task execution for more efficient and accurate query processing.

# Modular prompt loading from configuration
import json

get_doc_detect_prompt = get_prompts("doc_prompts/tool_detect/Get_Document_data_detect")
get_doc_prompt = get_prompts("doc_prompts/prepare_prompt/Get_Document_data_prompt")
keyword_search_detect_prompt = get_prompts("doc_prompts/tool_detect/keyword_search_detect")
# fav_collection_detect_prompt and upload_document_detect_prompt are loaded the same way
# (their configuration paths are not shown in this snippet)

def detect_tool(user_prompt):
    tool_descriptions = {
        "Get_Document_data": get_doc_detect_prompt,
        "keyword_search": keyword_search_detect_prompt,
        "Get_Favdocs_collections": fav_collection_detect_prompt,
        "upload_documents": upload_document_detect_prompt
    }

    messages = [
        {"role": "system", "content": "You are an AI assistant that determines the most appropriate tool…"},
        {"role": "user", "content": f"Here are the tool descriptions:\n{json.dumps(tool_descriptions, indent=2)}\n\nUser query: {user_prompt}\n\nWhich tool should be used?"}
    ]

This architecture reduces token usage by up to 60% by loading only necessary prompts per query processing stage. The lightweight initial stage quickly routes queries to appropriate tools, while specialized prompts handle the actual execution with focused context, improving both performance and accuracy in tool selection and query execution.
Our modular prompt management system stores prompts in external configuration files for dynamic loading based on context and supporting personalization. It supports prompt updates without code deployments, cutting update cycles from hours to minutes. This architecture facilitates A/B testing of different prompt variations and quick rollbacks, enhancing system adaptability and reliability.

def prepare_tool_prompt(detected_tool, userid):
    tool_prompts = {
        "keyword_search": keyword_search_prompt,
        "Get_Document_data": get_doc_prompt.replace("userid", userid),
        "upload_documents": upload_document_prompt,
        "Get_Favdocs_collections": fav_collection_prompt
    }
    return tool_prompts[detected_tool]

The system implements context-aware prompt selection, adapting to query types, document characteristics, and search contexts so that the most appropriate prompt and query structure are used for each search scenario. For example, the system distinguishes between question types (such as 'list_question') to tailor processing to different query intents.
Search algorithm optimization
Our document search system implements search algorithms that combine vector-based semantic search with traditional text-based approaches to search across document metadata and content. We use different query strategies optimized for specific search scenarios.
Keyword search:
Keyword search uses a dual strategy combining both metadata and content searches using phrase matching. A fixed query template structure facilitates efficiency and consistency, incorporating predefined metadata, content, permission rules, and building ID constraints, while dynamically integrating user-specific terms and roles. This approach allows for fast and reliable searches while maintaining proper access controls and relevance.
User queries like “lease agreement” or “property tax 2023” are parsed into component words, each requiring a match in the document content for relevancy, facilitating precise results.

"bool": {
    "must": [
        {"match_phrase": {"srccontent": word}} for word in search_words
    ]
}

Similarly, for metadata searches, the system uses phrase searching across metadata fields:

"multi_match": {
    "query": search_words,
    "type": "phrase",
    "fields": ["srcmetadata"]
}

This approach provides exact matching capabilities across document metadata, facilitating precise results when users are searching for specific document properties. The system executes both search types concurrently and results from both searches are then merged and deduplicated, with scoring normalized across both result sets.
Natural language query search:
Our NLQ search combines LLM-generated queries with vector-based semantic search through two main components. The metadata search uses an LLM to generate OpenSearch queries from natural language input. For instance, “Find lease agreements mentioning early termination for tech companies from last year” is transformed into a structured query that searches across document types, dates, property names and other metadata fields.
For content searches, we employ KNN vector search with a K-factor of 5 to identify semantically similar content. The system converts queries into vector embeddings and executes both metadata and content searches simultaneously, combining results while minimizing duplicates.
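For illustration, a content-side k-NN query against OpenSearch might look like the following sketch using the opensearch-py client; the host, index name, and vector field name are assumptions, and the query embedding is assumed to have been computed beforehand.

from opensearchpy import OpenSearch

# Host, index, and field names are illustrative assumptions, not the production configuration.
client = OpenSearch(hosts=[{"host": "my-opensearch-domain", "port": 443}], use_ssl=True)

def knn_content_search(query_vector: list, k: int = 5):
    """Run a k-NN search against the content embedding field with an already-embedded query."""
    body = {
        "size": k,
        "query": {
            "knn": {
                "content_vector": {   # assumed name of the embedding field
                    "vector": query_vector,
                    "k": k
                }
            }
        }
    }
    return client.search(index="documents", body=body)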
Chat with Document (digital assistant for in-depth document interaction):
The Chat with Document feature supports natural conversation with specific documents after initial search. Users can ask questions, request summaries, or seek specific information from selected documents through a straightforward interaction process.
When engaged, the system retrieves the complete document content using its node identifier and processes user queries through a streamlined pipeline. Each query is handled by an LLM using carefully constructed prompts that combine the user’s question with relevant document context.
With this capability users can extract information from complex documents efficiently. For example, property managers can quickly understand lease terms or payment schedules without manually scanning lengthy agreements. The feature provides instant summaries and explanations for rapid information access and decision-making in document-intensive workflows.
Scaling document ingestion
To handle high-throughput document processing and large-scale enterprise ingestion, our ingestion pipeline uses asynchronous Amazon Textract for scalable, parallel text extraction. The architecture efficiently processes diverse file types (PDFs, PPTs, Word documents, Excel files, and images), even with hundreds of pages or high-resolution content. Once a document is uploaded to an Amazon S3 bucket, a message is sent to an Amazon SQS queue, which invokes a Lambda function that initiates an asynchronous Textract job, offloading heavy extraction and OCR tasks without blocking execution.
For text documents, the system reads the file from Amazon S3 and submits it to Amazon Textract’s asynchronous API, which processes the document in the background. Once the job completes, the results are retrieved and parsed to extract structured text. This text is then chunked intelligently, based on token count or semantic boundaries, and passed through an Amazon Bedrock embedding model (for example, Amazon Titan Text Embeddings V2). Each chunk is enriched with metadata and indexed into Amazon OpenSearch for fast and context-aware search capabilities. Once ingested, our intelligent query strategy, driven by user and CBRE market lookups, dynamically directs searches to the relevant OpenSearch indexes.
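A simplified sketch of this asynchronous extraction and embedding flow is shown below. The polling loop, pagination handling, and embedding model ID are simplified assumptions; the production pipeline is event-driven through SQS and Lambda rather than a blocking poll.

import json
import time
import boto3

textract = boto3.client("textract")
bedrock = boto3.client("bedrock-runtime")
# The embedding model ID is an illustrative assumption.
EMBED_MODEL_ID = "amazon.titan-embed-text-v2:0"

def start_text_extraction(bucket: str, key: str) -> str:
    """Kick off an asynchronous Textract job for a document stored in Amazon S3."""
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]

def wait_for_text(job_id: str) -> str:
    """Poll the async job and concatenate detected lines (result pagination omitted for brevity)."""
    while True:
        result = textract.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)
    lines = [b["Text"] for b in result.get("Blocks", []) if b["BlockType"] == "LINE"]
    return "\n".join(lines)

def embed_chunk(chunk: str) -> list:
    """Create an embedding for a text chunk with a Bedrock embedding model."""
    response = bedrock.invoke_model(modelId=EMBED_MODEL_ID, body=json.dumps({"inputText": chunk}))
    return json.loads(response["body"].read())["embedding"]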
Image files follow a similar flow but use Amazon Bedrock Claude 3 Haiku for OCR after base64 conversion. Extracted text is then chunked, embedded, and indexed like standard text documents.
Security and access control
User authentication and authorization occurs through a multi-layered security process:

Access token validation: The system verifies the user’s identity by validating it in Microsoft B2C and checking their access token on each request. The user is also checked for authorization to access the application.
Entitlement verification: Simultaneously, the system checks the user’s permissions in a Redis database to verify they have the appropriate access rights to specific application modules and the database schemas (entitlements) they’re authorized to query.
Property access validation: The system also retrieves the user’s authorized building list from the Redis database (the list of building IDs to which the user is mapped), making sure they can only access data for the properties in their business portfolio.

This parallel validation process facilitates secure and appropriate access while maintaining optimal performance through Redis’s high-speed data retrieval. Redis is populated at application load from the user entitlement and building mappings maintained in the database. If the user’s details are not found in Redis, an API is invoked to replenish the Redis database.
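As an illustration of the Redis-backed entitlement checks, the following is a minimal sketch; the key naming convention and data shapes are assumptions, not the actual cache layout.

import json
import redis

# Key names and data shapes are illustrative assumptions only.
cache = redis.Redis(host="my-redis-endpoint", port=6379, decode_responses=True)

def get_user_access(user_id: str) -> dict:
    """Fetch cached entitlements and the authorized building list for a user."""
    entitlements = cache.get(f"entitlements:{user_id}")
    buildings = cache.get(f"buildings:{user_id}")
    if entitlements is None or buildings is None:
        # Cache miss: the application would call its replenishment API here and retry (call not shown).
        raise LookupError(f"Access data for {user_id} not cached")
    return {"entitlements": json.loads(entitlements), "building_ids": json.loads(buildings)}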

Results and impact
CBRE’s experience with this initiative has led to enhanced operational efficiency and data reliability, directly translating into tangible business benefits:

Cost savings and resource optimization: By reducing hours of manual effort annually per user, the business can realize substantial cost savings (for example, in labor costs, reduced overtime, or reallocated personnel). This frees up valuable user time so that the team can focus on more strategic, high-value tasks that drive building performance, innovation and growth rather than repetitive manual processes.
Improved decision-making and risk mitigation: The solution delivers results with 95% accuracy, so business decisions are based on highly reliable data. This minimizes the risk of errors, leading to more informed strategies, fewer costly mistakes, and ultimately, better business outcomes.
Increased productivity and throughput: With less time spent on manual tasks and a higher assurance of data quality, workflows can become smoother and faster. This translates to increased overall productivity and potentially higher throughput for related processes, enhancing service delivery.

Lessons learned and best practices
The following are our lessons learned and best practices based on our experience building this solution:

Use prompt modularization: Prompt engineering is essential for optimizing application performance and maintaining consistent results. Breaking prompts into modular components improved prompt management, control, and maintainability through streamlined version control, simplified testing and validation, and better performance tracking. The modular approach reduced token usage, which in turn decreased LLM response times and improved overall system performance. It also improved SQL generation efficiency through faster troubleshooting, reduced implementation time, and more reliable query generation, resulting in quicker resolution of edge cases and business rule updates.
Provide accurate few-shot examples: For increased accuracy and consistency of SQL generation, use dynamic few-shot examples with modular components so the example repository can be updated seamlessly.

Include examples covering common use cases and edge scenarios.
Maintain a diverse set of high-quality example pairs covering various business scenarios.
Keep examples concise and focused on specific patterns.
Regularly update examples based on new business requirements. Remove or update outdated examples.
Limit to top-1 or top-2 most relevant examples to manage token usage.
Regularly validate the relevance of selected examples.
Set up feedback loops to continuously improve example matching accuracy.
Fine-tune similarity thresholds for optimal example matching.

Reduce the context window: To reduce the size of the context passed to the LLM, select only the top-N KNN-matched fields from the schema definition along with key or mandatory fields. Apply dynamic field selection only to schemas with a large number of fields that would otherwise inflate the context window.
Improve relevancy: The LLM scoring mechanism helped us select the right set of relevant schemas (modules). Applying LLM reasoning over the KNN results gave us the most relevant modules in the correct order. Also consider:

Vector similarity alone may not capture true semantic relevance.
Top-K nearest neighbors don’t always guarantee contextual accuracy.
Order of results may not reflect actual relevance to the query.
Use of LLM Scoring provided a more accurate schema relevancy determination.

Conclusion
CBRE Property Management and AWS together demonstrated how innovative cloud AI solutions can unlock real business value at scale. By using AWS services and best practices, enterprises can reimagine how they access, manage, and derive insight from their data and take real action.
To learn how your organization can accelerate digital transformation with AWS, contact your AWS account team or start exploring AWS AI and data analytics services today.
Further reading on AWS services featured in this solution:

Amazon Bedrock: Foundation Model Service
Amazon Nova
Amazon OpenSearch Service documentation

About the authors
Lokesha Thimmegowda is a Senior Principal Software Engineer at CBRE, specializing in artificial intelligence and AWS. With four AWS certifications, including Solutions Architect Professional and AWS AI Practitioner, he excels at guiding teams through complex challenges with innovative solutions. Lokesha is passionate about designing transformative solution architectures that drive efficiency. Outside of work, he enjoys daily tennis with his daughters and weekend cricket.
Muppirala Venkata Krishna Kumar is a Principal Software Engineer at CBRE with over 18 years of expertise in leading technical teams and designing end-to-end solutions across diverse domains. He is a strategic technical lead with a strong command of both front-end and back-end technologies, cloud architecture using AWS, and AI/ML-driven innovations, passionate about staying at the forefront of technology, continuously learning, and implementing modern tools to drive impactful results. Outside of work, he values quality time with family and enjoys spiritual travel experiences that bring balance and inspiration.
Maraka Vishwadev is a Senior Staff Engineer at CBRE with 18 years of experience in enterprise software development, specializing in backend–frontend technologies and AWS Cloud. He leads impactful initiatives in Generative AI, leveraging Large Language Models to drive intelligent automation, enhance user experiences, and unlock new business capabilities. He is deeply involved in architecting and delivering scalable, secure, and cloud-native solutions, aligning technology with business strategy. Vishwa balances his professional life with cooking, movies, and quality family time.
Chanpreet Singh is a Senior Consultant at AWS with 18+ years of industry experience, specializing in Data Analytics and AI/ML solutions. He partners with enterprise customers to architect and implement cutting-edge solutions in Big Data, Machine Learning, and Generative AI using AWS native services, partner solutions and open-source technologies. A passionate technologist and problem solver, he balances his professional life with nature exploration, reading, and quality family time.
Sachin Khanna is a Lead Consultant specializing in Artificial Intelligence and Machine Learning (AI/ML) within the AWS Professional Services team. With a strong background in data management, generative AI, large language models, and machine learning, he brings extensive expertise to projects involving data, databases, and AI-driven solutions. His proficiency in cloud migration and cost optimization has enabled him to guide customers through successful cloud adoption journeys, delivering tailored solutions and strategic insights.
Dwaragha Sivalingam is a Senior Solutions Architect specializing in generative AI at AWS, serving as a trusted advisor to customers on cloud transformation and AI strategy. With seven AWS certifications including ML Specialty, he has helped customers in many industries, including insurance, telecom, utilities, engineering, construction, and real estate. A machine learning enthusiast, he balances his professional life with family time, enjoying road trips, movies, and drone photography.

Managed Tiered KV Cache and Intelligent Routing for Amazon SageMaker H …

Modern AI applications demand fast, cost-effective responses from large language models, especially when handling long documents or extended conversations. However, LLM inference can become prohibitively slow and expensive as context length increases, with latency growing exponentially and costs mounting with each interaction.
LLM inference requires recalculating attention mechanisms for the previous tokens when generating each new token. This creates significant computational overhead and high latency for long sequences. Key-value (KV) caching addresses this bottleneck by storing and reusing key-value vectors from previous computations, reducing inference latency and time-to-first-token (TTFT). Intelligent routing in LLMs is a technique that sends requests with shared prompts to the same inference instance to maximize the efficiency of the KV cache. It routes a new request to an instance that has already processed the same prefix, allowing it to reuse the cached KV data to accelerate processing and reduce latency. However, customers have told us that setting up and configuring the right framework for KV caching and intelligent routing at production scale is challenging and takes long experimental cycles.
Today we’re excited to announce that Amazon SageMaker HyperPod now supports Managed Tiered KV Cache and Intelligent Routing capabilities through the HyperPod Inference Operator. These new capabilities can deliver significant performance improvements for LLM inference workloads, reducing time to first token (TTFT) by up to 40%, increasing throughput, and lowering compute costs by up to 25% for long context prompts and multi-turn chat conversations, as measured with our internal benchmarking tools. These capabilities are available with the HyperPod Inference Operator, which automatically manages the routing and distributed KV caching infrastructure, significantly reducing operational overhead while delivering enterprise-grade performance for production LLM deployments. With the new Managed Tiered KV Cache feature, you can efficiently offload attention caches to CPU memory (L1 cache) and distribute an L2 cache for cross-instance sharing through a tiered storage architecture in HyperPod, for optimal resource utilization and cost efficiency at scale.
Efficient KV caching combined with intelligent routing maximizes cache hits across workers so you can achieve higher throughput and lower costs for your model deployments. These features are particularly beneficial in applications that are processing long documents where the same context or prefix is referenced, or in multi-turn conversations where context from previous exchanges needs to be maintained efficiently across multiple interactions.
For example, legal teams analyzing 200 page contracts can now receive instant answers to follow-up questions instead of waiting 5+ seconds per query, healthcare chatbots maintain natural conversation flow across 20+ turn patient dialogues, and customer service systems process millions of daily requests with both better performance and lower infrastructure costs. These optimizations make document analysis, multi-turn conversations, and high-throughput inference applications economically viable at enterprise scale.
Optimizing LLM inference with Managed Tiered KV Cache and Intelligent Routing
Let’s break down the new features:

Managed Tiered KV Cache: Automatic management of attention states across CPU memory (L1) and distributed tiered storage (L2) with configurable cache sizes and eviction policies. SageMaker HyperPod handles the distributed cache infrastructure through the newly launched tiered storage, alleviating operational overhead for cross node cache sharing across clusters. KV cache entries are accessible cluster-wide (L2) so that a node can benefit from computations performed by other nodes.
Intelligent Routing: Configurable request routing to maximize cache hits using strategies like prefix-aware, KV-aware, and round-robin routing.
Observability: Built-in HyperPod Observability integration for observability of metrics and logs for Managed Tiered KV Cache and Intelligent Routing in Amazon Managed Grafana.

Sample flow for inference requests with KV caching and Intelligent Routing
As a user sends an inference request to the HyperPod Load Balancer, it forwards the request to the Intelligent Router within the HyperPod cluster. The Intelligent Router dynamically distributes requests to the most appropriate model pod (Instance A or Instance B) based on the routing strategy to maximize KV cache hits and minimize inference latency. When the request reaches the model pod, the pod first checks the L1 cache (CPU) for frequently used key-value pairs, then queries the shared L2 cache (Managed Tiered KV Cache) if needed, before falling back to full computation for the token. Newly generated KV pairs are stored in both cache tiers for future reuse. After computation completes, the inference result flows back through the Intelligent Router and Load Balancer to the user.

Managed Tiered KV Cache
Managed Tiered KV Cache and Intelligent Routing are configurable opt-in features. When Managed Tiered KV Cache is enabled, the L1 cache is on by default, and both L1 and L2 caches can be individually enabled or disabled. The L1 cache resides locally on each inference node and uses CPU memory. This local cache provides significantly faster access, making it ideal for frequently accessed data within a single model instance. The cache automatically manages memory allocation and eviction policies to optimize for the most valuable cached content. The L2 cache operates as a distributed cache layer spanning the entire cluster, enabling cache sharing across multiple model instances. We support two backend options for the L2 cache, each with the following benefits:

Managed Tiered KV Cache (Recommended): A HyperPod disaggregated memory solution that offers scalability to terabyte-scale pools, low latency, an AWS network-optimized and GPU-aware design with zero-copy support, and cost efficiency at scale.
Redis: Simple to set up, works well for small to medium workloads, and offers a rich environment of tools and integrations.

The two-tier architecture works together seamlessly. When a request arrives, the system first checks the L1 cache for the required KV pairs. If found, they are used immediately with minimal latency. If not found in L1, the system queries the L2 cache. If found there, the data is retrieved and optionally promoted to L1 for faster future access. Only if the data is not present in either cache does the system perform the full computation, storing the results in both L1 and L2 for future reuse.
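The lookup-and-promote flow can be summarized with a short, illustrative sketch. The cache interfaces here are hypothetical stand-ins for the managed L1 and L2 tiers, which HyperPod handles automatically; this is not the actual implementation.

def get_kv(prefix_key, l1_cache, l2_cache, compute_kv):
    """Illustrative two-tier lookup: L1 (local CPU) first, then L2 (cluster-wide), else compute."""
    kv = l1_cache.get(prefix_key)
    if kv is not None:
        return kv                      # L1 hit: lowest latency
    kv = l2_cache.get(prefix_key)
    if kv is not None:
        l1_cache.put(prefix_key, kv)   # optionally promote to L1 for faster future access
        return kv
    kv = compute_kv(prefix_key)        # full attention computation for this prefix
    l1_cache.put(prefix_key, kv)       # store in both tiers for future reuse
    l2_cache.put(prefix_key, kv)
    return kv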
Intelligent Routing
Our Intelligent Routing system offers multiple configurable strategies to optimize request distribution based on your workload characteristics, with the routing strategy being user-configurable at deployment time to match your application’s specific requirements.

Prefix-aware routing serves as the default strategy, maintaining a tree structure to track which prefixes are cached on which endpoints, delivering strong general-purpose performance for applications with common prompt templates such as multi-turn conversations, customer service bots with standard greetings, and code generation with common imports.
KV-aware routing provides the most sophisticated cache management through a centralized controller that tracks cache locations and handles eviction events in real-time, excelling at long conversation threads, document processing workflows, and extended coding sessions where maximum cache efficiency is critical.
Round-robin routing offers the most straightforward approach, distributing requests evenly across the available workers, best suited for scenarios where requests are independent, such as batch inference jobs, stateless API calls, and load testing scenarios.

Strategy: Prefix-aware routing (default)
Best for: Multi-turn conversations, customer service bots, code generation with common headers

Strategy: KV-aware routing
Best for: Long conversations, document processing, extended coding sessions

Strategy: Round-robin routing
Best for: Batch inference, stateless API calls, load testing
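To make the default strategy concrete, the following is a toy sketch of the idea behind prefix-aware routing, not the HyperPod router implementation; the prefix length, lookup step, and round-robin fallback are arbitrary illustrative choices.

class PrefixAwareRouter:
    """Toy prefix-aware router: remember which worker served which prompt prefixes
    and route new requests to the worker that already holds a matching cached prefix."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.prefix_owner = {}          # prompt prefix -> worker id
        self.next_rr = 0                # round-robin fallback cursor

    def route(self, prompt: str, prefix_len: int = 64) -> str:
        # Check progressively shorter prefixes for a worker that has already seen them.
        for length in range(min(prefix_len, len(prompt)), 0, -8):
            owner = self.prefix_owner.get(prompt[:length])
            if owner is not None:
                return owner
        # No cached prefix found: fall back to round-robin and record the new prefix.
        worker = self.workers[self.next_rr % len(self.workers)]
        self.next_rr += 1
        self.prefix_owner[prompt[:prefix_len]] = worker
        return worker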

Deploying the Managed Tiered KV Cache and Intelligent Routing solution
Prerequisites
Create a HyperPod cluster with Amazon EKS as an orchestrator.

In Amazon SageMaker AI console, navigate to HyperPod Clusters, then Cluster Management.
On the Cluster Management page, select Create HyperPod cluster, then Orchestrated by Amazon EKS.
You can use one-click deployment from the SageMaker AI console. For cluster set up details see Creating a SageMaker HyperPod cluster with Amazon EKS orchestration.
Verify that the HyperPod cluster status is InService.

Verify that the inference operator is up and running. The Inference add-on is installed as a default option when you create the HyperPod cluster from the console. If you want to use an existing EKS cluster, see Setting up your HyperPod clusters for model deployment to manually install the inference operator.

From the command line, run the following command: 

kubectl get pods -n hyperpod-inference-system

Output:

The hyperpod-inference-operator-controller-manager-xxxxxx pod is in the Running state in the hyperpod-inference-system namespace.

Or, verify that the operator is running from the console: navigate to the EKS cluster, choose Resources, then Pods, and pick the hyperpod-inference-system namespace.

Preparing your model deployment manifest files
You can enable these features by adding configurations to your InferenceEndpointConfig custom CRD file.
For the complete example, visit the AWS samples GitHub repository.

export MODEL_NAME="Llama-3.1-8B-Instruct"
export INSTANCE_TYPE="ml.g5.24xlarge"
export MODEL_IMAGE="public.ecr.aws/deep-learning-containers/vllm:0.11.1-gpu-py312-cu129-ubuntu22.04-ec2-v1.0"
export S3_BUCKET="my-model-bucket"
export S3_MODEL_PATH="models/Llama-3.1-8B-Instruct"
export AWS_REGION="us-west-2"
export CERT_S3_URI="s3://my-bucket/certs/"
export NAMESPACE="default"
export NAME="demo"

cat << EOF > inference_endpoint_config.yaml
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: ${NAME}
  namespace: ${NAMESPACE}
spec:
  modelName: ${MODEL_NAME}
  instanceType: ${INSTANCE_TYPE}
  replicas: 1
  invocationEndpoint: v1/chat/completions
  modelSourceConfig:
    modelSourceType: s3
    s3Storage:
      bucketName: ${S3_BUCKET}
      region: ${AWS_REGION}
    modelLocation: ${S3_MODEL_PATH}
    prefetchEnabled: false
  kvCacheSpec:
    enableL1Cache: true
    enableL2Cache: true
    l2CacheSpec:
      l2CacheBackend: "tieredstorage" # can also be "redis"
      # Set l2CacheLocalUrl if selecting "redis"
      # l2CacheLocalUrl: "redis://redis.default.svc.cluster.local:6379"
  intelligentRoutingSpec:
    enabled: true
    routingStrategy: prefixaware
  tlsConfig:
    tlsCertificateOutputS3Uri: ${CERT_S3_URI}
  metrics:
    enabled: true
    modelMetrics:
      port: 8000
  loadBalancer:
    healthCheckPath: /health
  worker:
    resources:
      limits:
        nvidia.com/gpu: "4"
      requests:
        cpu: "6"
        memory: 30Gi
        nvidia.com/gpu: "4"
    image: ${MODEL_IMAGE}
    args:
      - "--model"
      - "/opt/ml/model"
      - "--max-model-len"
      - "20000"
      - "--tensor-parallel-size"
      - "4"
    modelInvocationPort:
      containerPort: 8000
      name: http
    modelVolumeMount:
      name: model-weights
      mountPath: /opt/ml/model
    environmentVariables:
      - name: OPTION_ROLLING_BATCH
        value: "vllm"
      - name: SAGEMAKER_SUBMIT_DIRECTORY
        value: "/opt/ml/model/code"
      - name: MODEL_CACHE_ROOT
        value: "/opt/ml/model"
      - name: SAGEMAKER_MODEL_SERVER_WORKERS
        value: "1"
      - name: SAGEMAKER_MODEL_SERVER_TIMEOUT
        value: "3600"
EOF

kubectl apply -f inference_endpoint_config.yaml

# Check inferenceendpointconfig status
kubectl get inferenceendpointconfig ${NAME} -n ${NAMESPACE}
NAME AGE
demo 8s

# Check pods status – you should see worker pods
kubectl get pods -n ${NAMESPACE}
NAME READY STATUS RESTARTS AGE
demo-675886c7bb-7bhhg 3/3 Running 0 30s

# Router pods are under hyperpod-inference-system namespace
kubectl get pods -n hyperpod-inference-system
NAME READY STATUS RESTARTS AGE
hyperpod-inference-operator-controller-manager-dff64b947-m5nqk 1/1 Running 0 5h49m
demo-default-router-8787cf46c-jmgqd 2/2 Running 0 2m16s

Observability
You can monitor Managed KV Cache and Intelligent Routing metrics through the SageMaker HyperPod Observability features. For more information, see Accelerate foundation model development with one-click observability in Amazon SageMaker HyperPod.
KV Cache Metrics are available in the Inference dashboard.

Benchmarking
We conducted comprehensive benchmarking to validate real-world performance improvements for production LLM deployments. Our benchmarks were run with the Managed Tiered KV Cache and Intelligent Routing features using the Llama-3.1-70B-Instruct model deployed across 7 replicas on p5.48xlarge instances (each equipped with eight NVIDIA GPUs), under a steady-load traffic pattern. The benchmark environment used a dedicated client node group, with one c5.12xlarge instance per 100 concurrent requests to generate a controlled load, and a dedicated server node group so that model servers operated in isolation and avoided resource contention under high concurrency.
Our benchmarks demonstrate that a combination of L1 and L2 Managed Tiered KV Cache and Intelligent Routing delivers substantial performance improvements across multiple dimensions. For medium context scenarios (8k tokens), we observed a 40% reduction in time to first token (TTFT) at P90, 72% reduction at P50, 24% increase in throughput, and 21% cost reduction compared to baseline configurations without optimization. The benefits are even more pronounced for long context workloads (64K tokens), achieving a 35% reduction in TTFT at P90, 94% reduction at P50, 38% throughput increase, and 28% cost savings. The optimization benefits scale dramatically with context length. While 8K token scenarios demonstrate solid improvements across the metrics, 64K token workloads experience transformative gains that fundamentally change the user experience. Our testing also confirmed that AWS-managed tiered storage consistently outperformed Redis-based L2 caching across the scenarios. The tiered storage backend delivered better latency and throughput without requiring the operational overhead of managing separate Redis infrastructure, making it the recommended choice for most deployments. Finally, unlike traditional performance optimizations that require tradeoffs between cost and speed, this solution delivers both simultaneously.
Benchmark charts: TTFT (P90), TTFT (P50), Throughput (TPS), and Cost per 1,000 tokens ($).

Conclusion
Managed Tiered KV Cache and Intelligent Routing in Amazon SageMaker HyperPod Model Deployment help you optimize LLM inference performance and costs through efficient memory management and smart request routing. You can get started today by adding these configurations to your HyperPod model deployments in the AWS Regions where SageMaker HyperPod is available.
To learn more, visit the Amazon SageMaker HyperPod documentation or follow the model deployment getting started guide.

About the authors
Chaitanya Hazarey is the Software Development Manager for SageMaker HyperPod Inference at Amazon, bringing extensive expertise in full-stack engineering, ML/AI, and data science. As a passionate advocate for responsible AI development, he combines technical leadership with a deep commitment to advancing AI capabilities while maintaining ethical considerations. His comprehensive understanding of modern product development drives innovation in machine learning infrastructure.
Pradeep Cruz is a Senior SDM at Amazon Web Services (AWS), driving AI infrastructure and applications at enterprise scale. Leading cross-functional organizations at Amazon SageMaker AI, he has built and scaled multiple high-impact services for enterprise customers including SageMaker HyperPod-EKS Inference, Task Governance, Feature Store, AIOps, and JumpStart Model Hub at AWS, alongside enterprise AI platforms at T-Mobile and Ericsson. His technical depth spans distributed systems, GenAI/ML, Kubernetes, cloud computing, and full-stack software development.
Vinay Arora is a Specialist Solution Architect for Generative AI at AWS, where he collaborates with customers in designing cutting-edge AI solutions leveraging AWS technologies. Prior to AWS, Vinay has over two decades of experience in finance—including roles at banks and hedge funds—he has built risk models, trading systems, and market data platforms. Vinay holds a master’s degree in computer science and business management.
Piyush Daftary is a Senior Software Engineer at AWS, working on Amazon SageMaker with a focus on building performant, scalable inference systems for large language models. His technical interests span AI/ML, databases, and search technologies, where he specializes in developing production-ready solutions that enable efficient model deployment and inference at scale. His work involves optimizing system performance, implementing intelligent routing mechanisms, and designing architectures that support both research and production workloads, with a passion for solving complex distributed systems challenges and making advanced AI capabilities more accessible to developers and organizations. Outside of work, he enjoys traveling, hiking, and spending time with family.
Ziwen Ning is a Senior Software Development Engineer at AWS, currently working on SageMaker Hyperpod Inference with a focus on building scalable infrastructure for large-scale AI model inference. His technical expertise spans container technologies, Kubernetes orchestration, and ML infrastructure, developed through extensive work across the AWS ecosystem. He has deep experience in container registries and distribution, container runtime development and open source contributions, and containerizing ML workloads with custom resource management and monitoring. Ziwen is passionate about designing production-grade systems that make advanced AI capabilities more accessible. In his free time, he enjoys kickboxing, badminton, and immersing himself in music.
Roman Blagovirnyy is a Sr. User Experience Designer on the SageMaker AI team with 19 years of diverse experience in interactive, workflow, and UI design, working on enterprise and B2B applications and features for the finance, healthcare, security, and HR industries prior to joining Amazon. At AWS Roman was a key contributor to the design of SageMaker AI Studio, SageMaker Studio Lab, data and model governance capabilities, and HyperPod. Roman currently works on new features and improvements to the administrator experience for HyperPod. In addition to this, Roman has a keen interest in design operations and process.
Caesar Chen is the Software Development Manager for SageMaker HyperPod at AWS, where he leads the development of cutting-edge machine learning infrastructure. With extensive experience in building production-grade ML systems, he drives technical innovation while fostering team excellence. His work in scalable model hosting infrastructure empowers data scientists and ML engineers to deploy and manage models with greater efficiency and reliability.
Chandra Lohit Reddy Tekulapally is a Software Development Engineer with the Amazon SageMaker HyperPod team. He is passionate about designing and building reliable, high-performance distributed systems that power large-scale AI workloads. Outside of work, he enjoys traveling and exploring new coffee spots.
Kunal Jha is a Principal Product Manager at AWS. He is focused on building Amazon SageMaker HyperPod as the best-in-class choice for generative AI model training and inference. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest.
Vivek Gangasani is a Worldwide Lead GenAI Specialist Solutions Architect for SageMaker Inference. He drives Go-to-Market (GTM) and Outbound Product strategy for SageMaker Inference. He also helps enterprises and startups deploy, manage, and scale their GenAI models with SageMaker and GPUs. Currently, he is focused on developing strategies and content for optimizing inference performance and GPU efficiency for hosting Large Language Models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Salesforce AI Research Introduces xRouter: A Reinforcement Learning Ro …

When your application can call many different LLMs with very different prices and capabilities, who should decide which one answers each request? The Salesforce AI Research team introduces ‘xRouter’, a tool-calling based routing system that addresses this question with a reinforcement learning based router that learns when to answer locally and when to call external models, while tracking cost at the token level.

What is xRouter?

xRouter is a tool calling based orchestration system built on Qwen2.5-7B-Instruct as the router backbone. The router is an instruction tuned model with tool calling capabilities that decides which downstream model to invoke, how to prompt it, and whether to synthesize or select an answer. The implementation uses DAPO, Distributional Advantage Policy Optimization, inside the Verl reinforcement learning framework, and exposes an OpenAI compatible API.

The router operates over more than 20 LLM tools in the full system. These tools span premium, standard, budget and specialized tiers, including GPT-5, GPT-4.1, GPT-5-Mini, GPT-5-Nano, o3, Kimi K2, DeepSeek-R1, Qwen3-235B variants and GPT-OSS models. The offloading pool is a 12 model subset that includes GPT-5, GPT-5-Mini, GPT-5-Nano, GPT-4o, GPT-4.1, o3, o3-Pro, o4-Mini, GPT-OSS-120B, GPT-OSS-20B and two Gemini-2.5 variants.

https://arxiv.org/pdf/2510.08439

Cost Aware Reward and Success Gating

Routing is framed as a reinforcement learning problem. For each episode, the reward combines a binary success signal and a cost penalty. The research team defines a reward that gives a fixed bonus when the final answer is correct, then subtracts a term proportional to the total normalized cost of all model calls. If the answer is wrong, the reward is zero regardless of how cheap it was.

As per the Model weights page, reward = quality − λ × normalized_cost, where λ is a cost penalty coefficient. Episodes with failures effectively have zero quality. This ‘success gated, cost shaped’ objective forces the router to first achieve correctness, then optimize cost among successful strategies. In practice, training uses 3 cost penalty settings, which produce the xRouter-7B-1, xRouter-7B-2 and xRouter-7B-3 variants.
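A minimal sketch of this success-gated, cost-shaped objective is shown below; the quality bonus, the value of λ, and the cost normalization are illustrative placeholders rather than the paper’s exact settings.

def xrouter_reward(is_correct: bool, total_cost: float, max_cost: float, lam: float = 0.5) -> float:
    """Success-gated, cost-shaped reward: zero unless the final answer is correct,
    otherwise a fixed quality bonus minus a penalty proportional to normalized cost.
    The quality bonus, lambda, and normalization here are illustrative values."""
    if not is_correct:
        return 0.0
    quality = 1.0
    normalized_cost = total_cost / max_cost if max_cost > 0 else 0.0
    return quality - lam * normalized_cost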

https://arxiv.org/pdf/2510.08439

Training Data and Signal Design

xRouter training data comes from Reasoning360, which includes math, code and general reasoning tasks with difficulty estimates derived from a strong reference model, Qwen3-32B. The research team stratify samples into easy, medium and hard bands, and add simpler chit chat, retrieval and factual questions to teach the router when it can answer directly without delegation. Each sample includes descriptions and prices for models from different tiers. The system also refreshes the model catalog and perturbs costs to avoid overfitting to a static price table.

Failed trajectories, such as wrong answers from expensive models or unnecessary calls when the router could have answered itself, still incur full cost and receive zero reward. This produces a clean learning signal, where correctness gates reward and cost shapes the routing policy.

How the Router Behaves at Inference Time

The router supports three execution modes. It can answer directly from the backbone without calling tools. It can call one or more downstream models, then synthesize a response using its own reasoning over their outputs. It can also call downstream models and use a special select_response tool to pick one of the replies as the final answer. These modes are implemented through function calls in an OpenAI style interface, which the orchestration engine executes through LiteLLM and SGLang.

Empirically, trained xRouter instances use a mix of direct and synthesized responses. Off the shelf routers such as GPT-4o, GPT-4.1, GPT-5, Qwen2.5-7B and Qwen3-8B tend to respond directly most of the time, even when instructed to offload when uncertain. This is an important behavioral difference and explains part of the efficiency gain.

Quantitative Results and Cost Utility

On static routing baselines across Minerva, MATH-500, Olympiad Bench, AIME-24, AMC-23, Codeforces, Code-Contests and Human-EvalPlus, xRouter-7B variants consistently improve accuracy compared to using the same base model as an untrained router. xRouter-7B-2, for example, reaches near GPT-5 accuracy on Olympiad Bench while using about one eighth of the GPT-5 evaluation cost.

In the system level comparison on LiveCodeBench-v5, GPQA-Diamond, AIME-25, MT-Bench, IFEval and LiveBench, xRouter-7B-3 achieves the highest average accuracy on LiveCodeBench-v5 among all tested systems, and does this with moderate cost. Across tasks such as GPQA, xRouter variants reach around 80 to 90 percent of GPT-5 accuracy while consuming less than one fifth of the cost. The research team summarizes that their cost aware reward can reduce inference cost by up to 80 percent at similar completion rates. The Hugging Face model card reports up to 60 percent cost reduction for comparable quality under other settings.

The research team also defines ‘cost utility’ as accuracy divided by cost. Open source single models with very low API prices often reach higher cost utility, but with lower absolute accuracy. xRouter sits in the middle, trading some cost utility for stronger task performance, which is usually what production systems care about.

Key Takeaways

xRouter is a tool calling router built on Qwen2.5 7B Instruct that learns to select among 20 plus external LLMs with a reinforcement learning policy that is explicitly cost aware.

The router uses a success gated reward, tasks only get positive reward when the final answer is correct, and within successful trajectories it applies a cost penalty term λ times normalized cost, which yields three xRouter 7B variants with different cost accuracy trade offs.

Training on Reasoning360 with difficulty stratification and synthetic easy queries teaches xRouter when to answer directly and when to offload, while perturbing prices and model pools improves robustness to changing provider catalogs.

Across math, coding and reasoning benchmarks, xRouter 7B models achieve near GPT 5 accuracy on hard tasks like Olympiad Bench and around 80 to 90 percent of GPT 5 accuracy on GPQA, while cutting offloading cost by up to 60 to 80 percent depending on the evaluation setup.

Editorial Notes

xRouter is a practical step toward cost aware orchestration for heterogeneous LLM fleets. It shows that a mid size router, trained with DAPO on Reasoning360 using a success gated, cost shaped reward, can consistently approach GPT 5 accuracy while reducing offloading cost by up to 60 to 80 percent.

Check out the Paper and Model Weights.
The post Salesforce AI Research Introduces xRouter: A Reinforcement Learning Router for Cost Aware LLM Orchestration appeared first on MarkTechPost.

Agent0: A Fully Autonomous AI Framework that Evolves High-Performing A …

Large language models need huge human datasets, so what happens if the model must create all its own curriculum and teach itself to use tools? A team of researchers from UNC-Chapel Hill, Salesforce Research and Stanford University introduce ‘Agent0’, a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and seamless tool integration.

Agent0 targets mathematical and general reasoning. It shows that careful task generation and tool integrated rollouts can push a base model beyond its original capabilities, across ten benchmarks.

https://arxiv.org/pdf/2511.16043

Two agents from one base model

Agent0 starts from a base policy π_base, for example Qwen3 4B Base or Qwen3 8B Base. It clones this policy into:

a Curriculum Agent πθ that generates tasks,

an Executor Agent πϕ that solves those tasks with a Python tool.

Training proceeds in iterations with two stages per iteration:

Curriculum evolution: The curriculum agent generates a batch of tasks. For each task, the executor samples multiple responses. A composite reward measures how uncertain the executor is, how often it uses the tool and how diverse the batch is. πθ is updated with Group Relative Policy Optimization (GRPO) using this reward.

Executor evolution: The trained curriculum agent is frozen. It generates a large pool of tasks. Agent0 filters this pool to keep only tasks near the executor’s capability frontier, then trains the executor on these tasks using an ambiguity aware RL objective called Ambiguity Dynamic Policy Optimization (ADPO).

This loop creates a feedback cycle. As the executor becomes stronger by using the code interpreter, the curriculum must generate more complex, tool reliant problems to keep its reward high.

https://arxiv.org/pdf/2511.16043

How the curriculum agent scores tasks

The curriculum reward combines three signals:

Uncertainty reward: For each generated task x, the executor samples k responses and majority votes a pseudo answer. Self consistency p̂(x) is the fraction of responses that agree with this majority. The reward is maximal when p̂ is close to 0.5 and low when tasks are too easy or too hard. This encourages tasks that are challenging but still solvable for the current executor.

Tool use reward: The executor can trigger a sandboxed code interpreter using python tags and receives results tagged as output. Agent0 counts the number of tool calls in a trajectory and gives a scaled, capped reward, with a cap C set to 4 in experiments. This favors tasks that actually require tool calls rather than pure mental arithmetic.

Repetition penalty: Within each curriculum batch, Agent0 measures pairwise similarity between tasks using a BLEU based distance. Tasks are clustered, and a penalty term increases with cluster size. This discourages the curriculum from generating many near duplicates.

A composite reward multiplies a format check with a weighted sum of uncertainty and tool rewards minus the repetition penalty. This composite value feeds into GRPO to update πθ.
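A rough sketch of how these signals combine is shown below; the weights, the shape of the uncertainty curve, and the repetition penalty input are simplified stand-ins for the paper’s exact formulation.

def uncertainty_reward(p_hat: float) -> float:
    # Maximal when self-consistency is near 0.5, low for tasks that are too easy or too hard.
    return 1.0 - 2.0 * abs(p_hat - 0.5)

def tool_use_reward(num_tool_calls: int, cap: int = 4) -> float:
    # Scaled and capped count of code-interpreter calls (cap C = 4 in the paper's experiments).
    return min(num_tool_calls, cap) / cap

def composite_curriculum_reward(p_hat, num_tool_calls, repetition_penalty,
                                format_ok, w_unc=1.0, w_tool=0.5):
    # A format check gates the reward; the weights here are illustrative, not the paper's values.
    if not format_ok:
        return 0.0
    return w_unc * uncertainty_reward(p_hat) + w_tool * tool_use_reward(num_tool_calls) - repetition_penalty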

How the executor learns from noisy self labels

The executor is also trained with GRPO but on multi turn, tool integrated trajectories and pseudo labels instead of ground truth answers.

Frontier dataset construction: After curriculum training in an iteration, the frozen curriculum generates a large candidate pool. For each task, Agent0 computes self consistency p̂(x) with the current executor and keeps only tasks where p̂ lies in an informative band, for example between 0.3 and 0.8. This defines a challenging frontier dataset that avoids trivial or impossible problems.

Multi turn tool integrated rollouts: For each frontier task, the executor generates a trajectory that can interleave:

natural language reasoning tokens,

python code segments,

output tool feedback.

Generation pauses when a tool call appears, executes the code in a sandboxed interpreter built on VeRL Tool, then resumes conditioned on the result. The trajectory terminates when the model produces a final answer inside \boxed{} tags.

A majority vote across sampled trajectories defines a pseudo label and a terminal reward for each trajectory.

ADPO, ambiguity aware RL: Standard GRPO treats all samples equally, which is unstable when labels come from majority voting on ambiguous tasks. ADPO modifies GRPO in two ways using p̂ as an ambiguity signal.

It scales the normalized advantage with a factor that increases with self consistency, so trajectories from low confidence tasks contribute less.

It sets a dynamic upper clipping bound for the importance ratio, which depends on self consistency. Empirical analysis shows that fixed upper clipping mainly affects low probability tokens. ADPO relaxes this bound adaptively, which improves exploration on uncertain tasks, as visualized by the up clipped token probability statistics.
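The following sketch illustrates the structure of these two ADPO modifications; the specific scaling and clipping functions are illustrative assumptions, not the paper’s exact definitions.

import numpy as np

def adpo_update_terms(advantage: float, ratio: float, p_hat: float,
                      eps_low: float = 0.2, eps_high_base: float = 0.2, k: float = 0.2):
    """Structural illustration of ADPO's two changes to GRPO (functional forms simplified):
    1) scale the normalized advantage by self-consistency, so ambiguous tasks contribute less;
    2) relax the upper clipping bound for low self-consistency tasks to allow more exploration."""
    scaled_advantage = p_hat * advantage                 # down-weight trajectories from low-confidence tasks
    eps_high = eps_high_base + k * (1.0 - p_hat)         # dynamic upper clip bound (illustrative form)
    clipped_ratio = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return scaled_advantage, clipped_ratio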

https://arxiv.org/pdf/2511.16043

Results on mathematical and general reasoning

Agent0 is implemented on top of VeRL and evaluated on Qwen3 4B Base and Qwen3 8B Base. It uses a sandboxed Python interpreter as the single external tool.

The research team evaluate on ten benchmarks:

Mathematical reasoning: AMC, Minerva, MATH, GSM8K, Olympiad Bench, AIME24, AIME25.

General reasoning: SuperGPQA, MMLU Pro, BBEH.

They report pass@1 for most datasets and mean@32 for AMC and AIME tasks.

For Qwen3 8B Base, Agent0 reaches:

math average 58.2 versus 49.2 for the base model,

overall general average 42.1 versus 34.5 for the base model.

Agent0 also improves over strong data free baselines such as R Zero, Absolute Zero, SPIRAL and Socratic Zero, both with and without tools. On Qwen3 8B, it surpasses R Zero by 6.4 percentage points and Absolute Zero by 10.6 points on the overall average. It also beats Socratic Zero, which relies on external OpenAI APIs.

Across three co evolution iterations, average math performance on Qwen3 8B increases from 55.1 to 58.2 and general reasoning also improves per iteration. This confirms stable self improvement rather than collapse.

Qualitative examples show that curriculum tasks evolve from basic geometry questions to complex constraint satisfaction problems, while executor trajectories mix reasoning text with Python calls to reach correct answers.

Key Takeaways

Fully data free co evolution: Agent0 eliminates external datasets and human annotations. Two agents, a curriculum agent and an executor agent, are initialized from the same base LLM and co evolve only via reinforcement learning and a Python tool.

Frontier curriculum from self uncertainty: The curriculum agent uses the executor’s self consistency and tool usage to score tasks. It learns to generate frontier tasks that are neither trivial nor impossible, and that explicitly require tool integrated reasoning.

ADPO stabilizes RL with pseudo labels: The executor is trained with Ambiguity Dynamic Policy Optimization. ADPO down weights highly ambiguous tasks and adapts the clipping range based on self consistency, which makes GRPO style updates stable when rewards come from majority vote pseudo labels.

Consistent gains on math and general reasoning: On Qwen3 8B Base, Agent0 improves math benchmarks from 49.2 to 58.2 average and general reasoning from 34.5 to 42.1, which corresponds to relative gains of about 18 percent and 24 percent.

Outperforms prior zero data frameworks: Across ten benchmarks, Agent0 surpasses previous self evolving methods such as R Zero, Absolute Zero, SPIRAL and Socratic Zero, including those that already use tools or external APIs. This shows that the co evolution plus tool integration design is a meaningful step beyond earlier single round self play approaches.

Editorial Notes

Agent0 is an important step toward practical, data free reinforcement learning for tool integrated reasoning. It shows that a base LLM can act as both Curriculum Agent and Executor Agent, and that GRPO with ADPO and VeRL Tool can drive stable improvement from majority vote pseudo labels. The method also demonstrates that tool integrated co evolution can outperform prior zero data frameworks such as R Zero and Absolute Zero on strong Qwen3 baselines. Agent0 makes a strong case that self evolving, tool integrated LLM agents are becoming a realistic training paradigm.

Check out the PAPER and Repo.
The post Agent0: A Fully Autonomous AI Framework that Evolves High-Performing Agents without External Data through Multi-Step Co-Evolution appeared first on MarkTechPost.

How to Build a Neuro-Symbolic Hybrid Agent that Combines Logical Plann …

In this tutorial, we demonstrate how to combine the strengths of symbolic reasoning with neural learning to build a powerful hybrid agent. We focus on creating a neuro-symbolic architecture that uses classical planning for structure, rules, and goal-directed behavior, while neural networks handle perception and action refinement. As we walk through the code, we see how both layers interact in real time, allowing us to navigate an environment, overcome uncertainty, and adapt intelligently. At last, we understand how neuro-symbolic systems bring interpretability, robustness, and flexibility together in a single agentic framework. Check out the FULL CODES here.

import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass, field
from typing import List, Dict, Tuple, Set, Optional
from collections import deque
import warnings
warnings.filterwarnings('ignore')

@dataclass
class State:
    robot_pos: Tuple[int, int]
    holding: Optional[str] = None
    visited: Set[Tuple[int, int]] = field(default_factory=set)
    objects_collected: Set[str] = field(default_factory=set)

    def __hash__(self):
        return hash((self.robot_pos, self.holding))

class SymbolicPlanner:
    def __init__(self, grid_size: int = 8):
        self.grid_size = grid_size
        self.actions = ['up', 'down', 'left', 'right', 'pickup', 'drop']

    def get_successors(self, state: State, obstacles: Set[Tuple[int, int]], objects: Dict[str, Tuple[int, int]]) -> List[Tuple[str, State]]:
        successors = []
        x, y = state.robot_pos
        moves = {'up': (x, y-1), 'down': (x, y+1), 'left': (x-1, y), 'right': (x+1, y)}
        for action, new_pos in moves.items():
            nx, ny = new_pos
            if (0 <= nx < self.grid_size and 0 <= ny < self.grid_size and new_pos not in obstacles):
                new_state = State(new_pos, state.holding, state.visited | {new_pos}, state.objects_collected.copy())
                successors.append((action, new_state))
        if state.holding is None:
            for obj_name, obj_pos in objects.items():
                if state.robot_pos == obj_pos and obj_name not in state.objects_collected:
                    new_state = State(state.robot_pos, obj_name, state.visited.copy(), state.objects_collected.copy())
                    successors.append(('pickup', new_state))
        if state.holding is not None:
            new_state = State(state.robot_pos, None, state.visited.copy(), state.objects_collected | {state.holding})
            successors.append(('drop', new_state))
        return successors

    def heuristic(self, state: State, goal: Tuple[int, int]) -> float:
        return abs(state.robot_pos[0] - goal[0]) + abs(state.robot_pos[1] - goal[1])

    def a_star_plan(self, start_state: State, goal: Tuple[int, int], obstacles: Set[Tuple[int, int]], objects: Dict[str, Tuple[int, int]]) -> List[str]:
        counter = 0
        frontier = [(self.heuristic(start_state, goal), counter, 0, start_state, [])]
        visited = set()
        while frontier:
            frontier.sort()
            _, _, cost, state, plan = frontier.pop(0)
            counter += 1
            if state.robot_pos == goal and len(state.objects_collected) >= len(objects):
                return plan
            state_key = (state.robot_pos, state.holding)
            if state_key in visited:
                continue
            visited.add(state_key)
            for action, next_state in self.get_successors(state, obstacles, objects):
                new_cost = cost + 1
                new_plan = plan + [action]
                priority = new_cost + self.heuristic(next_state, goal)
                frontier.append((priority, counter, new_cost, next_state, new_plan))
                counter += 1
        return []

We lay the foundation for our symbolic reasoning system and define how states, actions, and transitions work. We implement classical planning logic using A* search to generate goal-directed, interpretable action sequences. As we build this part, we establish the rule-based backbone that guides the agent’s high-level decisions. Check out the FULL CODES here.

class NeuralPerception:
    def __init__(self, grid_size: int = 8):
        self.grid_size = grid_size
        self.W1 = np.random.randn(grid_size * grid_size, 64) * 0.1
        self.b1 = np.zeros(64)
        self.W2 = np.random.randn(64, 32) * 0.1
        self.b2 = np.zeros(32)
        self.W3 = np.random.randn(32, grid_size * grid_size) * 0.1
        self.b3 = np.zeros(grid_size * grid_size)

    def relu(self, x):
        return np.maximum(0, x)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def perceive(self, noisy_grid: np.ndarray) -> np.ndarray:
        x = noisy_grid.flatten()
        h1 = self.relu(x @ self.W1 + self.b1)
        h2 = self.relu(h1 @ self.W2 + self.b2)
        out = self.sigmoid(h2 @ self.W3 + self.b3)
        return out.reshape(self.grid_size, self.grid_size)

class NeuralPolicy:
    def __init__(self, state_dim: int = 4, action_dim: int = 4):
        self.W = np.random.randn(state_dim, action_dim) * 0.1
        self.b = np.zeros(action_dim)
        self.action_map = ['up', 'down', 'left', 'right']

    def softmax(self, x):
        exp_x = np.exp(x - np.max(x))
        return exp_x / exp_x.sum()

    def get_action_probs(self, state_features: np.ndarray) -> np.ndarray:
        logits = state_features @ self.W + self.b
        return self.softmax(logits)

    def select_action(self, state_features: np.ndarray, symbolic_action: str) -> str:
        probs = self.get_action_probs(state_features)
        if symbolic_action in self.action_map:
            sym_idx = self.action_map.index(symbolic_action)
            probs[sym_idx] += 0.7
            probs = probs / probs.sum()
        return np.random.choice(self.action_map, p=probs)

We introduce the neural components that allow our agent to sense and adapt. We design a lightweight neural network to denoise the environment and a simple policy network to refine actions based on features. As we integrate these elements, we ensure that our agent can handle uncertainty and adjust behavior dynamically. Check out the FULL CODES here.

class NeuroSymbolicAgent:
    def __init__(self, grid_size: int = 8):
        self.grid_size = grid_size
        self.planner = SymbolicPlanner(grid_size)
        self.perception = NeuralPerception(grid_size)
        self.policy = NeuralPolicy()
        self.obstacles = {(3, 3), (3, 4), (4, 3), (5, 5), (6, 2)}
        self.objects = {'key': (2, 6), 'gem': (6, 6)}
        self.goal = (7, 7)

    def create_noisy_observation(self, true_grid: np.ndarray) -> np.ndarray:
        noise = np.random.randn(*true_grid.shape) * 0.2
        return np.clip(true_grid + noise, 0, 1)

    def extract_state_features(self, pos: Tuple[int, int], goal: Tuple[int, int]) -> np.ndarray:
        return np.array([pos[0]/self.grid_size, pos[1]/self.grid_size, goal[0]/self.grid_size, goal[1]/self.grid_size])

    def execute_mission(self, verbose: bool = True) -> Tuple[List, List]:
        start_state = State(robot_pos=(0, 0), visited={(0, 0)})
        symbolic_plan = self.planner.a_star_plan(start_state, self.goal, self.obstacles, self.objects)
        if verbose:
            print(f"Symbolic Plan Generated: {len(symbolic_plan)} steps")
            print(f"Plan: {symbolic_plan[:10]}{'...' if len(symbolic_plan) > 10 else ''}\n")
        true_grid = np.zeros((self.grid_size, self.grid_size))
        for obs in self.obstacles:
            true_grid[obs[1], obs[0]] = 1.0
        noisy_obs = self.create_noisy_observation(true_grid)
        perceived_grid = self.perception.perceive(noisy_obs)
        if verbose:
            print("Neural Perception: Denoised obstacle map")
            print(f"Perception accuracy: {np.mean((perceived_grid > 0.5) == true_grid):.2%}\n")
        trajectory = [(0, 0)]
        current_pos = (0, 0)
        actions_taken = []
        for i, sym_action in enumerate(symbolic_plan[:30]):
            features = self.extract_state_features(current_pos, self.goal)
            refined_action = self.policy.select_action(features, sym_action) if sym_action in ['up', 'down', 'left', 'right'] else sym_action
            actions_taken.append(refined_action)
            if refined_action == 'up': current_pos = (current_pos[0], max(0, current_pos[1]-1))
            elif refined_action == 'down': current_pos = (current_pos[0], min(self.grid_size-1, current_pos[1]+1))
            elif refined_action == 'left': current_pos = (max(0, current_pos[0]-1), current_pos[1])
            elif refined_action == 'right': current_pos = (min(self.grid_size-1, current_pos[0]+1), current_pos[1])
            if current_pos not in self.obstacles:
                trajectory.append(current_pos)
        return trajectory, actions_taken

We bring the symbolic and neural layers together into a unified agent. We generate a symbolic plan, perceive the environment through neural processing, and refine each planned action using the neural policy. As we execute the mission loop, we observe how both systems interact seamlessly to produce robust behavior. Check out the FULL CODES here.

def visualize_execution(agent: NeuroSymbolicAgent, trajectory: List, title: str = "Neuro-Symbolic Agent Execution"):
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    ax = axes[0]
    grid = np.zeros((agent.grid_size, agent.grid_size, 3))
    for obs in agent.obstacles:
        grid[obs[1], obs[0]] = [0.3, 0.3, 0.3]
    for obj_pos in agent.objects.values():
        grid[obj_pos[1], obj_pos[0]] = [1.0, 0.8, 0.0]
    grid[agent.goal[1], agent.goal[0]] = [0.0, 1.0, 0.0]
    for i, pos in enumerate(trajectory):
        intensity = 0.3 + 0.7 * (i / len(trajectory))
        grid[pos[1], pos[0]] = [intensity, 0.0, 1.0]
    if trajectory:
        grid[trajectory[0][1], trajectory[0][0]] = [1.0, 0.0, 0.0]
    ax.imshow(grid)
    ax.set_title("Agent Trajectory in Environment", fontsize=14, fontweight='bold')
    ax.set_xlabel("X Position")
    ax.set_ylabel("Y Position")
    ax.grid(True, alpha=0.3)
    ax = axes[1]
    ax.axis('off')
    ax.text(0.5, 0.95, "Neuro-Symbolic Architecture", ha='center', fontsize=16, fontweight='bold', transform=ax.transAxes)
    layers = [("SYMBOLIC LAYER", 0.75, "Planning • State Logic • Rules"), ("INTEGRATION", 0.60, "Feature Extraction • Action Blending"), ("NEURAL LAYER", 0.45, "Perception • Policy Learning"), ("EXECUTION", 0.30, "Action Refinement • Feedback"), ("ENVIRONMENT", 0.15, "State Transitions • Observations")]
    colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7']
    for i, (name, y, desc) in enumerate(layers):
        ax.add_patch(plt.Rectangle((0.1, y-0.05), 0.8, 0.08, facecolor=colors[i], alpha=0.7, transform=ax.transAxes))
        ax.text(0.5, y, f"{name}\n{desc}", ha='center', va='center', fontsize=10, fontweight='bold', transform=ax.transAxes)
    plt.tight_layout()
    plt.savefig('neurosymbolic_agent.png', dpi=150, bbox_inches='tight')
    plt.show()
    print(f"\nExecution complete! Trajectory length: {len(trajectory)} steps")

We visualize how the agent moves through the environment and how the architecture is structured. We plot obstacles, objects, the goal, and the full trajectory so that we can clearly see the agent’s decision process. As we render the architecture layers, we understand how the hybrid design flows from planning to perception to action. Check out the FULL CODES here.

if __name__ == "__main__":
    print("=" * 70)
    print("NEURO-SYMBOLIC HYBRID AGENT TUTORIAL")
    print("Combining Classical AI Planning with Modern Neural Networks")
    print("=" * 70)
    print()
    agent = NeuroSymbolicAgent(grid_size=8)
    trajectory, actions = agent.execute_mission(verbose=True)
    visualize_execution(agent, trajectory)
    print("\n" + "=" * 70)
    print("KEY INSIGHTS:")
    print("=" * 70)
    print("✦ Symbolic Layer: Provides interpretable, verifiable plans")
    print("✦ Neural Layer: Handles noisy perception & adapts to uncertainty")
    print("✦ Integration: Combines strengths of both paradigms")
    print("✦ Benefits: Explainability + Flexibility + Robustness")
    print("=" * 70)

We run the complete neuro-symbolic pipeline from planning to execution to visualization. We instantiate the agent, execute the mission, and display key insights to summarize the system’s behavior. As we run this final block, we see the overall hybrid architecture in action and appreciate how each component contributes to the outcome.

In conclusion, we observe how smoothly the symbolic and neural components work together to produce a more capable and reliable agent. We appreciate how the symbolic planner gives us transparent, verifiable steps, while the neural layer adds adaptability and perceptual grounding that pure logic cannot offer. Through this hybrid approach, we can build agents that reason, perceive, and act in ways that are both intelligent and interpretable. We end with a deeper understanding of how neuro-symbolic AI moves us closer to practical, resilient agentic systems.

Check out the FULL CODES here.
The post How to Build a Neuro-Symbolic Hybrid Agent that Combines Logical Planning with Neural Perception for Robust Autonomous Decision-Making appeared first on MarkTechPost.

Amazon SageMaker AI introduces EAGLE based adaptive speculative decodi …

Generative AI models continue to expand in scale and capability, increasing the demand for faster and more efficient inference. Applications need low latency and consistent performance without compromising output quality. Amazon SageMaker AI introduces new enhancements to its inference optimization toolkit that bring EAGLE based adaptive speculative decoding to more model architectures. These updates make it easier to accelerate decoding, optimize performance using your own data and deploy higher-throughput models using the familiar SageMaker AI workflow.
EAGLE, short for Extrapolation Algorithm for Greater Language-model Efficiency, is a technique that speeds up large language model decoding by predicting future tokens directly from the hidden layers of the model. When you guide optimization using your own application data, the improvements align with the actual patterns and domains you serve, producing faster inference that reflects your real workloads rather than generic benchmarks. Based on the model architecture, SageMaker AI trains EAGLE 3 or EAGLE 2 heads.
Note that this training and optimization is not a one-time operation. You can start with the datasets provided by SageMaker for the initial training, then fine-tune with your own curated dataset as you gather more data, for highly adaptive, workload-specific performance. For example, you can use a tool such as Data Capture to curate a dataset over time from the real-time requests hitting your hosted model, and run multiple cycles of training to continuously improve performance.
In this post we’ll explain how to use EAGLE 2 and EAGLE 3 speculative decoding in Amazon SageMaker AI.
Solution overview
SageMaker AI now offers native support for both EAGLE 2 and EAGLE 3 speculative decoding, enabling each model architecture to apply the technique that best matches its internal design. For your base LLM, you can utilize either SageMaker JumpStart models or bring your own model artifacts to S3 from other model hubs, such as HuggingFace.
Speculative decoding is a widely employed technique for accelerating inference in LLMs without compromising quality. This method involves using a smaller draft model to generate preliminary tokens, which are then verified by the target LLM. The extent of the speedup achieved through speculative decoding is heavily dependent on the selection of the draft model.

The sequential nature of modern LLMs makes them expensive and slow, and speculative decoding has proven to be an effective solution to this problem. Methods like EAGLE improve upon this by reusing features from the target model, leading to better results. However, a current trend in the LLM community is to increase training data to boost model intelligence without adding inference costs. Unfortunately, this approach has limited benefits for EAGLE. This limitation is due to EAGLE’s constraints on feature prediction. To address this, EAGLE-3 is introduced, which predicts tokens directly instead of features and combines features from multiple layers using a technique called training-time testing. These changes significantly improve performance and allow the model to fully benefit from increased training data.

To give customers maximum flexibility, SageMaker supports every major workflow for building or refining an EAGLE model. You can train an EAGLE model entirely from scratch using the SageMaker curated open dataset, or train it from scratch with your own data to align speculative behavior with your traffic patterns. You can also start from an existing EAGLE base model: either retraining it with the default open dataset for a fast, high-quality baseline, or fine-tuning that base model with your own dataset for highly adaptive, workload-specific performance. In addition, SageMaker JumpStart provides fully pre-trained EAGLE models so you can begin optimizing immediately without preparing any artifacts.
The solution spans six supported architectures and includes a pre-trained, pre-cached EAGLE base to accelerate experimentation. SageMaker AI also supports widely used training data formats, specifically ShareGPT and OpenAI chat and completions, so existing corpora can be used directly. Customers can also provide data captured from their own SageMaker AI endpoints, provided it is in the formats specified above. Whether you rely on the SageMaker open dataset or bring your own, optimization jobs typically deliver around a 2.5x throughput improvement over standard decoding while adapting naturally to the nuances of your specific use case.
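
As a rough illustration of the two accepted formats, one conversation record might look like the following (the contents are invented, and the field layouts follow the common OpenAI chat and ShareGPT conventions rather than a SageMaker-specific schema):

import json

# Illustrative records; values are made up.
openai_chat_record = {
    "messages": [
        {"role": "user", "content": "Summarize the attached contract clause."},
        {"role": "assistant", "content": "The clause limits liability to direct damages."},
    ]
}
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "Summarize the attached contract clause."},
        {"from": "gpt", "value": "The clause limits liability to direct damages."},
    ]
}

# Training data is supplied as JSON Lines, one record per line.
with open("eagle_train.jsonl", "w") as f:
    f.write(json.dumps(openai_chat_record) + "\n")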
All optimization jobs automatically produce benchmark results, giving you clear visibility into latency and throughput improvements. You can run the entire workflow using SageMaker Studio or the AWS CLI, and you deploy the optimized model through the same interface you already use for standard SageMaker AI inference.
SageMaker AI currently supports LlamaForCausalLM, Qwen3ForCausalLM, Qwen3MoeForCausalLM, Qwen2ForCausalLM and GptOssForCausalLM with EAGLE 3, and Qwen3NextForCausalLM with EAGLE 2. You can use one optimization pipeline across a mix of architectures while still gaining the benefits of model-specific behavior.
How EAGLE works inside the model
Speculative decoding can be thought of like a seasoned chief scientist guiding the flow of discovery. In traditional setups, a smaller “assistant” model runs ahead, quickly sketching out several possible token continuations, while the larger model examines and corrects those suggestions. This pairing reduces the number of slow, sequential steps by verifying multiple drafts at once.
EAGLE streamlines this process even further. Instead of depending on an external assistant, the model effectively becomes its own lab partner: it inspects its internal hidden-layer representations to anticipate several future tokens in parallel. Because these predictions arise from the model’s own learned structure, they tend to be more accurate upfront, leading to deeper speculative steps, fewer rejections, and smoother throughput.
By removing the overhead of coordinating a secondary model and enabling highly parallel verification, this approach alleviates memory bandwidth bottlenecks and delivers notable speedups, often around 2.5x, while maintaining the same output quality the baseline model would produce.
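
To make the flow concrete, here is a schematic sketch of the generic draft-and-verify loop described above; draft_tokens and verify are illustrative stand-ins for the EAGLE head and the target model, not SageMaker or EAGLE APIs.

def speculative_decode(prompt_ids, draft_tokens, verify, max_new_tokens=256, k=4):
    """Generic draft-and-verify loop (schematic).

    draft_tokens(ids, k) -> list of k proposed next tokens (the EAGLE head in
                            EAGLE-style decoding, a separate small model in
                            classic speculative decoding)
    verify(ids, proposal) -> (number of leading proposed tokens the target
                              model accepts, corrected next token)
    """
    ids = list(prompt_ids)
    while len(ids) - len(prompt_ids) < max_new_tokens:
        proposal = draft_tokens(ids, k)               # cheap, parallel draft
        accepted, correction = verify(ids, proposal)  # one target-model pass verifies all drafts
        ids.extend(proposal[:accepted])               # keep the verified prefix
        ids.append(correction)                        # always emit one verified token
    return ids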
Running optimization jobs from the SDK or CLI
You can interface with the optimization toolkit using the AWS Python Boto3 SDK, the Studio UI, or the AWS CLI. In this section we use the AWS CLI; the same API calls map directly onto the Boto3 SDK. The core API calls for endpoint creation remain the same: create_model, create_endpoint_config, and create_endpoint. The workflow we showcase here begins with model registration using the create_model API call, which is where you specify your serving container and stack. Alternatively, you can skip creating a SageMaker model object and specify the model data directly in the optimization job API call.
For the EAGLE heads optimization, we specify the model data through the Model Data Source parameter; at the moment, specifying a Hugging Face Hub model ID is not supported. Pull your artifacts, upload them to an S3 bucket, and reference that location in the Model Data Source parameter. By default, checks verify that the appropriate files are uploaded, so you need the standard model data expected for LLMs:

# traditional model data needed
model/
config.json
tokenizer.json
tokenizer_config.json
special_tokens_map.json
generation_config.json
vocab.json
model.safetensors
model.safetensors.index.json

Let’s look at a few paths here:

Using your own model data with your own EAGLE curated dataset
Bringing your own trained EAGLE that you may want to train more
Bring your own model data and use SageMaker AI built-in datasets

1. Using your own model data with your own EAGLE curated dataset
We can start an optimization job with the create-optimization-job API call. Here is an example with a Qwen3 32B model. Note that you can bring your own data or also use the built-in SageMaker provided datasets. First we can create a SageMaker Model object that specifies the S3 bucket with our model artifacts:

aws sagemaker --region us-west-2 create-model \
    --model-name <target-model-name> \
    --primary-container '{ "Image": "763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}",
    "ModelDataSource": { "S3DataSource": { "S3Uri": "Enter model path",
    "S3DataType": "S3Prefix", "CompressionType": "None" } } }' \
    --execution-role-arn "Enter Execution Role ARN"

Our optimization call then pulls down these model artifacts when you specify the SageMaker model and a TrainingDataSource parameter, as follows:

aws sagemaker --region us-west-2 create-optimization-job \
    --optimization-job-name <job-name> \
    --account-id <account-id> \
    --deployment-instance-type ml.p5.48xlarge \
    --max-instance-count 10 \
    --model-source '{
        "SageMakerModel": { "ModelName": "Created Model name" }
    }' \
    --optimization-configs '{
        "ModelSpeculativeDecodingConfig": {
            "Technique": "EAGLE",
            "TrainingDataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "Enter custom train data location"
            }
        }
    }' \
    --output-config '{
        "S3OutputLocation": "Enter optimization output location"
    }' \
    --stopping-condition '{"MaxRuntimeInSeconds": 432000}' \
    --role-arn "Enter Execution Role ARN"
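
If you prefer the Boto3 SDK, the same request maps over with PascalCase parameter names; here is a minimal sketch under the assumption that the parameter shapes mirror the CLI options above (check the CreateOptimizationJob API reference for the exact structures):

import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

# Boto3 counterpart of the create-optimization-job CLI call above (placeholder values).
# Assumption: OptimizationConfigs is passed as a list of config objects.
sm.create_optimization_job(
    OptimizationJobName="<job-name>",
    DeploymentInstanceType="ml.p5.48xlarge",
    ModelSource={"SageMakerModel": {"ModelName": "<created-model-name>"}},
    OptimizationConfigs=[{
        "ModelSpeculativeDecodingConfig": {
            "Technique": "EAGLE",
            "TrainingDataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://<bucket>/<custom-train-data>/",
            },
        }
    }],
    OutputConfig={"S3OutputLocation": "s3://<bucket>/<optimization-output>/"},
    StoppingCondition={"MaxRuntimeInSeconds": 432000},
    RoleArn="<execution-role-arn>",
)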

2. Bringing your own trained EAGLE that you may want to train more
For your own trained EAGLE, you can specify an additional parameter in the create_model API call that points to your EAGLE artifacts. Optionally, you can specify a SageMaker JumpStart model ID to pull down the packaged model artifacts.

# Enable additional model data source with EAGLE artifacts
aws sagemaker --region us-west-2 create-model \
    --model-name <target-model-name> \
    --primary-container '{ "Image": "763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}",
    "ModelDataSource": { "S3DataSource": { "S3Uri": "<model path>",
    "S3DataType": "S3Prefix", "CompressionType": "None" } },
    "AdditionalModelDataSources": [ { "ChannelName": "eagle_model",
    "S3DataSource": { "S3Uri": "<pre-trained EAGLE path>",
    "S3DataType": "S3Prefix", "CompressionType": "None" } } ] }' \
    --execution-role-arn "Enter Execution Role ARN"

Similarly the optimization API then inherits this model object with the necessary model data:

aws sagemaker --region us-west-2 create-optimization-job \
    --account-id <account-id> \
    --optimization-job-name <job-name> \
    --deployment-instance-type ml.p5.48xlarge \
    --max-instance-count 10 \
    --model-source '{
        "SageMakerModel": {
            "ModelName": "Created Model Name"
        }
    }' \
    --optimization-configs '{
        "ModelSpeculativeDecodingConfig": {
            "Technique": "EAGLE",
            "TrainingDataSource": {
                "S3Uri": "Enter training data path",
                "S3DataType": "S3Prefix"
            }
        }
    }' \
    --output-config '{
        "SageMakerModel": {
            "ModelName": "Model Name"
        },
        "S3OutputLocation": "Enter output data location"
    }' \
    --stopping-condition '{"MaxRuntimeInSeconds": 432000}' \
    --role-arn "Enter Execution Role ARN"

3. Bring your own model data and use SageMaker built-in datasets
Optionally, we can utilize the SageMaker provided datasets:

# SageMaker Provided Optimization Datasets
gsm8k_training.jsonl (https://huggingface.co/datasets/openai/gsm8k)
magicoder.jsonl (https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K)
opencodeinstruct.jsonl (https://huggingface.co/datasets/nvidia/OpenCodeInstruct)
swebench_oracle_train.jsonl (https://huggingface.co/datasets/nvidia/OpenCodeInstruct)
ultrachat_0_8k_515292.jsonl (https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)

After completion, SageMaker AI stores evaluation metrics in S3 and records the optimization lineage in Studio. You can deploy the optimized model to an inference endpoint with either the create_endpoint API call or in the UI.
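
For example, deploying through the API looks like the following minimal Boto3 sketch, assuming the optimized model was registered under a name of your choosing; the instance type is illustrative:

import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

# Assumes the optimization job registered the optimized model as <optimized-model-name>
# (for example through the SageMakerModel name in the job's OutputConfig).
sm.create_endpoint_config(
    EndpointConfigName="<endpoint-config-name>",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "<optimized-model-name>",
        "InstanceType": "ml.p5.48xlarge",  # match the deployment instance type used above
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(
    EndpointName="<endpoint-name>",
    EndpointConfigName="<endpoint-config-name>",
)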
Benchmarks
To benchmark this further we compared three states:

No EAGLE: Base model without EAGLE as a baseline
Base EAGLE: EAGLE training using built-in datasets provided by SageMaker AI
Trained EAGLE: EAGLE training using built-in datasets provided by SageMaker AI and retraining with own custom dataset

The numbers displayed below are for qwen3-32B across metrics such as Time to First Token (TTFT) and overall throughput.

| Configuration | Concurrency | TTFT (ms) | TPOT (ms) | ITL (ms) | Request Throughput | Output Throughput (tokens/sec) | OTPS per request (tokens/sec) |
|---------------|-------------|-----------|-----------|----------|--------------------|--------------------------------|-------------------------------|
| No EAGLE      | 4           | 168.04    | 45.95     | 45.95    | 0.04               | 86.76                          | 21.76                         |
| No EAGLE      | 8           | 219.53    | 51.02     | 51.01    | 0.08               | 156.46                         | 19.6                          |
| Base EAGLE    | 1           | 89.76     | 21.71     | 53.01    | 0.02               | 45.87                          | 46.07                         |
| Base EAGLE    | 2           | 132.15    | 20.78     | 50.75    | 0.05               | 95.73                          | 48.13                         |
| Base EAGLE    | 4           | 133.06    | 20.11     | 49.06    | 0.1                | 196.67                         | 49.73                         |
| Base EAGLE    | 8           | 154.44    | 20.58     | 50.15    | 0.19               | 381.86                         | 48.59                         |
| Trained EAGLE | 1           | 83.6      | 17.32     | 46.37    | 0.03               | 57.63                          | 57.73                         |
| Trained EAGLE | 2           | 129.07    | 18        | 48.38    | 0.05               | 110.86                         | 55.55                         |
| Trained EAGLE | 4           | 133.11    | 18.46     | 49.43    | 0.1                | 214.27                         | 54.16                         |
| Trained EAGLE | 8           | 151.19    | 19.15     | 51.5     | 0.2                | 412.25                         | 52.22                         |

Pricing considerations
Optimization jobs run on SageMaker AI training instances, so you are billed based on the instance type and job duration. Deployment of the resulting optimized model uses standard SageMaker AI Inference pricing.
Conclusion
EAGLE based adaptive speculative decoding gives you a faster and more effective path to improve generative AI inference performance on Amazon SageMaker AI. By working inside the model rather than relying on a separate draft network, EAGLE accelerates decoding, increases throughput and maintains generation quality. When you optimize using your own dataset, the improvements reflect the unique behavior of your applications, resulting in better end-to-end performance. With built-in dataset support, benchmark automation and streamlined deployment, the inference optimization toolkit helps you deliver low-latency generative applications at scale.

About the authors
Kareem Syed-Mohammed is a Product Manager at AWS. He focuses on enabling generative AI model development and governance on SageMaker HyperPod. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, worked on Local Expert and Ads for Expedia, and was a management consultant at McKinsey.
Xu Deng is a Software Engineer Manager with the SageMaker team. He focuses on helping customers build and optimize their AI/ML inference experience on Amazon SageMaker. In his spare time, he loves traveling and snowboarding.
Ram Vegiraju is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on SageMaker. In his spare time, he loves traveling and writing.
Vinay Arora is a Specialist Solution Architect for Generative AI at AWS, where he collaborates with customers in designing cutting-edge AI solutions leveraging AWS technologies. Prior to AWS, Vinay gained over two decades of experience in finance, including roles at banks and hedge funds, where he built risk models, trading systems, and market data platforms. Vinay holds a master's degree in computer science and business management.
Siddharth Shah is a Principal Engineer at AWS SageMaker, specializing in large-scale model hosting and optimization for Large Language Models. He previously worked on the launch of Amazon Textract, performance improvements in the model-hosting platform, and expedited retrieval systems for Amazon S3 Glacier. Outside of work, he enjoys hiking, video games, and hobby robotics.
Andy Peng is a builder with curiosity, motivated by scientific research and product innovation. He helped build key initiatives that span AWS SageMaker and Bedrock, Amazon S3, AWS App Runner, AWS Fargate, Alexa Health & Wellness, and AWS Payments, from 0-1 incubation to 10x scaling. Open-source enthusiast.
Johna Liu is a Software Development Engineer on the Amazon SageMaker team, where she builds and explores AI/LLM-powered tools that enhance efficiency and enable new capabilities. Outside of work, she enjoys tennis, basketball and baseball.
Anisha Kolla is a Software Development Engineer with SageMaker Inference team with over 10+ years of industry experience. She is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. Anisha thrives on tackling complex technical challenges and contributing to innovative AI capabilities. Outside of work, she enjoys exploring new Seattle restaurants, traveling, and spending time with family and friends.

Train custom computer vision defect detection model using Amazon SageM …

On October 10, 2024, Amazon announced the discontinuation of the Amazon Lookout for Vision service, with a scheduled shut down date of October 31, 2025 (see Exploring alternatives and seamlessly migrating data from Amazon Lookout for Vision blog post). As part of our transition guidance for customers, we recommend the use of Amazon SageMaker AI tools to build applications for customers who are interested in AI/ML computer vision models for automated quality inspection use cases. To support that effort, AWS has made a pre-trained computer vision defect detection model available on AWS Marketplace that can be fine-tuned using Amazon SageMaker AI for a customer’s specific use case. If run in the cloud, this model only requires paying for infrastructure costs for training or inference. This approach provides the tools to accelerate solution development while facilitating complete flexibility to build a solution that integrates with any existing hardware and software infrastructure.
In this blog post, you will learn how to migrate your computer vision workloads from Amazon Lookout for Vision to Amazon SageMaker AI by following our step-by-step guidance.
AWS is sharing the main underlying models used for the service to end users in the AWS Marketplace. You can use the two main types of models, binary classification and semantic segmentation, when you train in your own AWS accounts for deployment on AWS or at the edge.
This model helps customers continue to use AWS defect detection technology at their own pace with greater flexibility. For example, you can train your models on larger instance types for faster training times. With access to hyperparameters, you can also adjust model behavior in ways that were not previously available in the AWS console. For example, you can configure the multi-head semantic segmentation model to disable the binary classifier head, which can make the model more tolerant of changing background and lighting conditions. You can also adjust the maximum training time, which was fixed at a non-changeable 24-hour limit in Amazon Lookout for Vision (L4V).
The GitHub repository for Amazon Lookout for Vision has been updated with a Jupyter Notebook to help you train datasets with these two model types and package them up. From there you can deploy the models by using a SageMaker endpoint, or edge devices.
To label the images beyond the sample data, you can use Amazon SageMaker Ground Truth to enable crowdsourcing or allow private teams to label the data, or use a partner solution such as Edge Impulse, Roboflow, or SuperbAI to do so. When you have the manifest file of the labeled data, the marketplace models can be used for training. You will no longer have a thumbnail-based dataset management tool like the Amazon Lookout for Vision console, so consider one of the previously mentioned partner solutions to help manage datasets. You can also export your existing data from the Lookout for Vision service using this guide.
Prerequisites
Before you begin, make sure you have the following components and permissions in place:

Amazon SageMaker Studio or Amazon SageMaker Unified Studio for integrated development environment (IDE)
AWS Identity and Access Management (IAM) role with these permissions to follow the principle of least privilege

Amazon S3

s3:GetObject
s3:PutObject
s3:DeleteObject
s3:ListBucket

SageMaker

sagemaker:CreateTrainingJob
sagemaker:CreateModel
sagemaker:CreateEndpoint
sagemaker:CreateEndpointConfig
sagemaker:CreateTransformJob
sagemaker:DescribeTrainingJob
sagemaker:DescribeModel
sagemaker:DescribeEndpoint
sagemaker:DescribeEndpointConfig
sagemaker:DescribeTransformJob
sagemaker:InvokeEndpoint
sagemaker:DeleteEndpoint
sagemaker:DeleteEndpointConfig
sagemaker:DeleteModel

Model subscription:

An AWS account with a subscription to Computer Vision Defect Detection Model or
An IAM role with these three permissions to make AWS Marketplace subscriptions in the AWS account you use:

aws-marketplace:ViewSubscriptions
aws-marketplace:Unsubscribe
aws-marketplace:Subscribe

Labeled data (you can use the cookie data sample in Github) or label your own data with SageMaker Ground Truth or an AWS Partner tool
Basic knowledge of creating a SageMaker notebook instance and running Jupyter notebook

Architecture overview
The following diagram illustrates the end-to-end flow, from image acquisition to inferencing at the edge. This blog focuses on steps 2 and 3.

Use an edge application to configure cameras or sensors and capture training images.
Use SageMaker Ground Truth or AWS Partner platforms to export and label images.
Use Amazon SageMaker AI for model training.
Use REST, PLC, or digital input for image acquisition and processing.
Run real-time inference using the trained and deployed model.
Publish inference results to analytics and monitoring for alerts and analytics.
Perform automated action on the machine of concern or notify plant personnel of anomalies from inspection station component using OPC-UA or digital output.
Line operators and plant managers receive notifications for action.

Set up the labeling process
This section covers the steps to set up the labeling process using Amazon SageMaker Ground Truth, including creating a private labeling team and configuring the labeling job.

Configure Amazon SageMaker Ground Truth private team:

Select Amazon SageMaker AI, Ground Truth, Labeling workforces.
Select Private, then Create Private Team.
Enter a team name.
Leave other values as their defaults.
Select Create a new Amazon Cognito user group.
Select Create private Team.

On the Workers tab, select Invite New Workers.
Enter your team members’ email addresses to send sign-up invitations.

Label the dataset
After successfully completing the workforce setup for labelling, the next step is to label the dataset. This section explains how to prepare the dataset by uploading the images to an Amazon Simple Storage Service (Amazon S3) bucket, then create and run the SageMaker Ground Truth labeling job to label the images as normal or anomaly.

Upload the image datasets to an Amazon S3 bucket that SageMaker Ground Truth can access. If you don’t have a dataset, you can use either the cookie-dataset or aliens-dataset.

Copy all of the images from the "normal" and "anomaly" folders into a single directory for SageMaker Ground Truth to access, or you will get an error message in the next step.
To use AWS CloudShell, run the following script:

#!/bin/bash
# Clone the repository
git clone https://github.com/aws-samples/amazon-lookout-for-vision.git
cd amazon-lookout-for-vision/aliens-dataset
# Remove the existing all directory if it exists
rm -rf all
# Create a new all directory
mkdir -p all
# Copy normal images to the all directory
cp normal/*.png all/
# Copy anomaly images with a .anomaly.png suffix to avoid filename clashes
for file in anomaly/*.png; do
  if [ -f "$file" ]; then
    filename=$(basename "$file")
    cp "$file" "all/${filename}.anomaly.png"
  fi
done
# Count files to verify
echo "Normal images: $(find normal -name "*.png" | wc -l)"
echo "Anomaly images: $(find anomaly -name "*.png" | wc -l)"
echo "Total images in all directory: $(find all -type f | wc -l)"
# Upload to S3
aws s3 cp all/ s3://<BUCKET_NAME>/aliens-dataset-all/ --recursive
# Clean up - remove the cloned repository
cd ../..
rm -rf amazon-lookout-for-vision

Alternatively, if you have the AWS CLI installed, you can copy them with the following commands (See setting up AWS CLI for how to do this):

sh-4.2$ git clone https://github.com/aws-samples/amazon-lookout-for-vision.git
sh-4.2$ cd amazon-lookout-for-vision/aliens-dataset ## keep in mind the filenames here clash, the following Linux commands can help fix this
sh-4.2$ mkdir all
sh-4.2$ cp normal/*.png all/
sh-4.2$ aws s3 cp s3://aws-blogs-artifacts-public/artifacts/ML-19308/copy_conflicts.sh .

sh-4.2$ bash copy_conflicts.sh

sh-4.2$ ls -al all/

-rwxrwxr-x 1 ec2-user ec2-user 120035 Feb 17 16:39 59.png
-rwxrwxr-x 1 ec2-user ec2-user 93407 Feb 17 16:39 5.png
-rwxrwxr-x 1 ec2-user ec2-user 125477 Feb 17 16:39 5.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 123679 Feb 17 16:39 60.png
-rwxrwxr-x 1 ec2-user ec2-user 96330 Feb 17 16:39 6.png
-rwxrwxr-x 1 ec2-user ec2-user 126014 Feb 17 16:39 6.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 81051 Feb 17 16:39 7.png
-rwxrwxr-x 1 ec2-user ec2-user 128985 Feb 17 16:39 7.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 94216 Feb 17 16:39 8.png
-rwxrwxr-x 1 ec2-user ec2-user 128002 Feb 17 16:39 8.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 110814 Feb 17 16:39 9.png
-rwxrwxr-x 1 ec2-user ec2-user 131385 Feb 17 16:39 9.png.anomaly.png

sh-4.2$ aws s3 cp all/ s3://<BUCKET_NAME>/aliens-dataset-all/ --recursive
Note: To prevent filename clashes between the two folders, a .anomaly.png suffix was added. The uploaded files should now be in your <BUCKET_NAME>/aliens-dataset-all bucket for the Ground Truth job.

In the AWS Console, navigate to Amazon SageMaker AI, Ground Truth, Labeling Jobs, Create labeling job.

There are several options here to fill in; the most important fields to fill or select are:

Input data setup: Select Automated data setup
S3 location for input datasets: <Full path where your dataset exists>
S3 location data output datasets: <Same location as input dataset>
Data type: Select Image
IAM Role: Select Create new role if you do not have one set up to allow Ground Truth to interact with SageMaker services.

Choose Complete data setup. An Input data connection successful message displays. If you get an error, check your IAM role to make sure S3 access is enabled, and the directory has image files in it, as it will not recurse through sub-directories.

Select the task type. These models support Image Classification (Single Label), which is binary classification (think good or bad), or Semantic segmentation. You cannot use a bounding box type with these models. You can change your selection later.
Choose Next.
For Worker types, select Private. You can read more about Amazon Mechanical Turks or labeling subscriptions in the Developer Guide.
Under Private teams, select the private team you created in the previous steps.
For Task timeout and Task expiration time, leave the default values.
Leave Enable automated data labeling unselected. You can read more about automated data labeling here; however, it is not compatible with semantic segmentation.
On the Image classification screen, add two new labels: normal and anomaly. You can fill in the rest as needed. Choose Preview to see a preview of what it will look like to the end user.
Choose Create.
Select Ground Truth, and then select the Private tab.

Open the labeling portal sign-in URL in a new tab in your browser and then sign in to see your assigned tasks.
Select an assigned task and choose Start working to label the data.
Select normal or anomaly.

When the job is complete, make note of the output dataset location. You will need this for the training step.

If you need to add workers to the labelling job:

On the Amazon SageMaker AI Ground Truth page, select Labeling workforces.
Select the Private tab.
Click on the private team that was created earlier (CV-team).
Select the Workers tab
Select the desired worker from the list and choose Add workers to team.

You will then be redirected to the Amazon SageMaker AI Labeling workforces page with a confirmation message that the worker has been added.

After you complete the labeling task, the output of the task is used to train the Computer Vision Detection model from the AWS Marketplace.
Train the model
This section discusses training the computer vision model using the AWS Marketplace Computer Vision Detection model and the labeled dataset from the previous step.

Go to the AWS Marketplace to subscribe to the model, https://aws.amazon.com/marketplace/pp/prodview-j72hhmlt6avp6.
Choose Continue to Subscribe.
Choose Continue to configuration.
Select the latest software version, your Region, and make sure Create a training job is selected.

Note: Copy the Product ARN and store it in a text editor or notepad for later use.

Go to SageMaker AI, Notebook instances, Create notebook instance.

Note: A GPU-enabled notebook instance is not required. Amazon SageMaker training jobs will spin up the GPU instances needed during training, so most basic instances will be sufficient.

Select an ml.m5.2xlarge instance and JupyterLab 4, with a volume size of 128 GB. The default is 5 GB, which is too small.
Select an IAM role to allow the notebook to access resources in your account. You will need access to S3.
In the Git Repositories – optional section, select Clone a public Git repository to this notebook instance only.
Enter the Git repository URL. Leave all the other fields as their default, then choose Create notebook instance to start the instance.
After the instance starts, (the status will display as InService), select Open JupyterLab action for the new notebook instance.

JupyterLab opens:

On the left navigation pane, open the computer-vision-defect-detection folder.

In the AWS Console, go to Marketplace, Manage subscriptions, and then copy the ARN of your model subscription.

In the Jupyter notebook, locate the snippet below and update the placeholder value for algorithm_name variable with the Product Arn you copied in the previous step.

# TODO: change this to use the subscribed SageMaker algorithm
algorithm_name = "<Customer to specify the algorithm name after subscription>"

The bucket used for this step is created automatically and named in the format sagemaker-<REGION>-<ACCOUNT_ID>.

import sagemaker
from sagemaker import get_execution_role

# Initialize SageMaker session and get execution role
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
# bucket = sagemaker_session.default_bucket()
role = get_execution_role()
# Project name is used as part of the S3 output path
project = "ComputerVisionDefectDetection"

In the AWS Console, navigate to Amazon SageMaker AI, Ground Truth, Labeling jobs and select the job that was completed.
Identify and take note of the output images folder (Output dataset location)

Note: To start the training job, look at the path for the output manifest in <BUCKET NAME>/aliens-dataset/all/aliensv2/manifests/output/output.manifest—this will be the training manifest for the next step.

Set the bucket variable to the image bucket name you used earlier and the object key to the path of your manifest (see the short sketch after this list):

bucket: where to store the manifest file
classification_manifest_key: where the output manifest file is stored (for example, aliens-dataset-all/[job-name]/manifests/output/output.manifest)
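
As a quick illustration, the two values and the manifest URI they produce might be wired up as follows; the values are placeholders, and the classification_s3_path assembly is an assumption about how the notebook builds the URI consumed later by the TrainingInput:

# Placeholder values: replace with your bucket and your labeling job's output manifest key.
bucket = "<BUCKET_NAME>"
classification_manifest_key = "aliens-dataset-all/<job-name>/manifests/output/output.manifest"

# Assumed assembly of the full S3 URI used later as classification_s3_path.
classification_s3_path = f"s3://{bucket}/{classification_manifest_key}"
print(classification_s3_path)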

Review the model training configuration in the Classification Model with Algorithm Estimator section.

# Create AlgorithmEstimator for classification
classification_estimator = AlgorithmEstimator(
    algorithm_arn=algorithm_name,
    role=role,
    instance_count=1,
    instance_type='ml.g4dn.2xlarge',
    volume_size=20,
    max_run=7200,
    input_mode='Pipe',  # REQUIRED: Algorithm only supports Pipe mode
    sagemaker_session=sagemaker_session,
    enable_network_isolation=True
)

# Set hyperparameters
classification_estimator.set_hyperparameters(
    ModelType='classification',
    TestInputDataAttributeNames='source-ref,anomaly-label-metadata,anomaly-label',
    TrainingInputDataAttributeNames='source-ref,anomaly-label-metadata,anomaly-label'
)

print("Classification estimator configured successfully")

# Define training input using TrainingInput class
classification_training_input = TrainingInput(
    s3_data=classification_s3_path,
    s3_data_type='AugmentedManifestFile',
    attribute_names=[
        'source-ref',
        'anomaly-label-metadata',
        'anomaly-label'
    ],
    record_wrapping='RecordIO',
    input_mode='Pipe'  # Must match the estimator's input_mode
)

# Start training job
classification_job_name = f'defect-detection-classification-{datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")}'
print(f"Starting classification training job: {classification_job_name}")
classification_estimator.fit(
    inputs={'training': classification_training_input},
    job_name=classification_job_name,
    wait=True,
    logs=True
)

Note: The job uses NVIDIA G4DN instances. They can be sized up to a larger instance type to decrease training time, but with a dataset of only 118 images, training finishes in less than 10 minutes on a g4dn.2xlarge. You can experiment with other instance types; however, results may vary because the models were extensively tested on G4DN instances.

Validate the values of TestInputDataAttributeNames and TrainingInputDataAttributeNames in the Hyperparameters section, as well as AttributeNames in the TrainingInput section. The labels on all three must match the structure of your manifest file. Here is a sample manifest:

{
  "source-ref": "s3://[bucketname]/getting-started/training-images/anomaly-1.jpg",
  "anomaly-label-metadata": {
    "job-name": "anomaly-label",
    "class-name": "anomaly",
    "human-annotated": "yes",
    "creation-date": "2022-08-22T20:52:51.851Z",
    "type": "groundtruth/image-classification"
  },
  "anomaly-label": 1
}
{
  "source-ref": "s3://[bucketname]/getting-started/training-images/anomaly-2.jpg",
  "anomaly-label-metadata": {
    "job-name": "anomaly-label",
    "class-name": "anomaly",
    "human-annotated": "yes",
    "creation-date": "2022-08-22T21:11:39.545Z",
    "type": "groundtruth/image-classification"
  },
  "anomaly-label": 1
}

Note: Two of the three values include the labelling job name.

response = sagemaker.create_training_job(
    TrainingJobName=classification_training_job_name,
    HyperParameters={
        'ModelType': 'classification',
        'TestInputDataAttributeNames': 'source-ref,aliens-v3,aliens-v3-metadata',
        'TrainingInputDataAttributeNames': 'source-ref,aliens-v3,aliens-v3-metadata'
    }
)

Run all the cells or blocks listed in the Classification Model with Algorithm Estimator section to start the training job.
If you want to train a segmentation model as well, follow the steps in the Segmentation Model with Algorithm Estimator section.

Note: After the training is completed, you are ready to test it! There are a few inference options available for this:

Real-time inference using Amazon SageMaker endpoints
Amazon SageMaker AI Batch Transform inference.
Edge deployment

Deploy the model
Amazon SageMaker AI endpoints and Amazon SageMaker AI Batch Transform inference are both used for inference but serve different purposes.
Amazon SageMaker AI endpoints
Amazon SageMaker AI endpoints are used for real-time inference, providing low-latency predictions suitable for applications requiring immediate responses. Endpoints remain active while they’re deployed, making them better suited for continuous and steady traffic, but potentially more costly due to ongoing resource usage.

In the Jupyter notebook, navigate to the (Optional) Running real-time inference using Amazon SageMaker endpoints section.
Run the following cell blocks to set up and invoke the endpoint:

# classification_training_job_name = "defect-detection-classification-2025-10-01-00-29-57" # remove

classification_training_job_name = "<provide training job name here>"

# Create estimator from training job
estimator = AlgorithmEstimator.attach(classification_training_job_name)

# Deploy endpoint using SageMaker v2 SDK
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.2xlarge'
)

print(f"Endpoint deployed: {predictor.endpoint_name}")

# Invoke the endpoint using predictor
result = predictor.predict(image_data)

# Clean up the temporary file
os.remove(local_file)

# Print the result
print("\nEndpoint Response:")
print(json.dumps(result, indent=2))

Validate the inference, then delete the endpoint by running the following block:

# Delete the endpoint

predictor.delete_endpoint()
print("Endpoint deleted")

Note: If you start an endpoint, keep in mind you will be billed while it is running until you turn it off.
Amazon SageMaker AI Batch Transform
Batch Transform is designed for offline inference and making predictions on large datasets stored in S3, and is ideal for bulk processing where low latency is not critical. After the job is complete, the resources are released, making it cost-effective for sporadic workloads.

Navigate to the (Optional) Run Batch Transform Inference using SageMaker SDK v2 section.
Define the s3_input_data and s3_output_path parameters.

# Run batch transform job

#############################################
# Change to your input/output data S3 path  #
#############################################

s3_input_data = "s3://<Specify-s3-path-to-test-images>"
s3_output_path = f"s3://{bucket}/{project}/batch-transform-output"
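
For reference, a minimal sketch of what those cells typically do with the SageMaker Python SDK v2, reusing the training job name from the real-time section and the paths defined above; the instance type and content type are illustrative and may differ from the notebook:

from sagemaker.algorithm import AlgorithmEstimator

# Re-attach to the completed classification training job and build a batch transformer.
estimator = AlgorithmEstimator.attach(classification_training_job_name)
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.c5.2xlarge",   # illustrative choice
    output_path=s3_output_path,
)

# Run offline inference over every image under s3_input_data.
transformer.transform(
    data=s3_input_data,
    content_type="image/jpeg",       # assumption: adjust to your image format
    wait=True,
)
print(f"Batch transform output written to: {s3_output_path}")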

Run all the cells and blocks in the (Optional) Run Batch Transform Inference using SageMaker SDK v2 section to complete the batch inference.
Validate the batch transform job after completion by navigating to the s3_output_path folder. The following is a sample inference output file:

{
  "Source": {
    "Type": "direct"
  },
  "IsAnomalous": true,
  "Confidence": 0.92744799389183
}

Clean up
To avoid incurring unnecessary charges, delete the following resources when you no longer need them:

Delete SageMaker endpoints.

Navigate to the Amazon SageMaker Console.
Select Endpoints.
Select the endpoint you created.
Choose Delete.

Delete SageMaker Notebook instances.

Navigate to the Amazon SageMaker Console.
Select Notebook instances.
Select the notebook instance you created.
Choose Stop if the instance is running.
Once stopped, choose Delete.

Delete S3 objects and buckets.

Navigate to the Amazon S3 Console.
Delete all objects in the buckets you created for this tutorial.
Delete the empty buckets.

Delete the Ground Truth labeling team.

Navigate to Ground Truth.
Select Labeling workforces.
Select the Private tab.
Select the private team you created.
Choose Delete team.

Conclusion
In this blog post, we’ve demonstrated how to transition from Amazon Lookout for Vision to using the underlying Computer Vision Detection models available through the AWS Marketplace, showing the step-by-step process of setting up labeling, training the model, and running inference through batch transformation. The transition provides customers with greater flexibility in terms of training options, hyperparameter adjustments, and deployment choices while continuing to use AWS defect detection technology at their own pace. Also be sure to check out our edge-based open source integrated Defect Detection Application on GitHub if you would like to combine what you have learned here.

About the authors
Ryan Vanderwerf is a senior partner solutions architect at Amazon Web Services specializing in smart manufacturing, vision, and machine learning. Ryan previously provided Java virtual machine-focused consulting and project development as a software engineer at OCI on the Grails and Micronaut team. He was chief architect/director of products at ReachForce, with a focus on software and system architecture for AWS Cloud SaaS solutions for marketing data management. Ryan has built SaaS solutions in several domains, such as finance, media, telecom, and e-learning, since 1996.
Lu Min is a Software Development Engineer for AWS Edge ML services, focused on developing machine learning solutions that operate at the edge for AWS customers. With expertise in optimizing ML models for resource-constrained environments, Lu helps customers implement efficient inference capabilities on edge devices and cloud communication, as well as manage model lifecycle using AWS SageMaker.
Tim Westman is the Product Manager and Go-to-Market Lead for Edge Machine Learning, AWS. Tim leads the Product Management and Business Development for the Edge Machine Learning business at Amazon Web Services. In this role, he works with customers to help build computer vision solutions at the edge to solve complex operational challenges. Tim has more than 30 years of experience in sales, business development and product management roles for leading hardware and software companies, with the last 8 years specializing in AI and computer vision for IoT applications.
Kunle Adeleke is an enterprise solutions architect, helping large AWS commercial customers in diverse industries craft their technology strategy. Kunle has led enterprise architecture teams and software development teams in both government and commercial sectors. His deep expertise spans software development, solution architecture, enterprise architecture, security, and data and AI/ML.

Practical implementation considerations to close the AI value gap

Artificial Intelligence (AI) is changing how businesses operate. Gartner® predicts at least 15% of day-to-day work decisions will be made autonomously through agentic AI by 2028. And 92% of companies are boosting their AI spending, according to McKinsey.
But here’s the problem: most companies are yet to realize a positive impact of AI on their profit and loss (P&L). According to analysis from S&P Global Market Intelligence,

“The share of companies abandoning most of their AI initiatives jumped to 42%, up from 17% last year [2024]” in the first half of 2025.

According to Gartner,

“Over 40% of agentic AI projects will be canceled by the end of 2027.”

The gap between spending and results is clear. To make AI work, companies need to stop running scattered experiments and start building enterprise-wide programs. As McKinsey puts it:

“The organizations that are building a genuine and lasting competitive advantage from their AI efforts are the ones that are thinking in terms of holistic transformative change that stands to alter their business models, cost structures, and revenue streams—rather than proceeding incrementally.”

The AWS Customer Success Center of Excellence (CS COE) helps customers get tangible value from their AWS investments. We’ve seen a pattern: customers who build AI strategies that address people, process, and technology together succeed more often.
In this post, we share practical considerations that can help close the AI value gap.
Implementation considerations
The following sections include practical implementation considerations for aligning leadership, redesigning incentives, building governance frameworks, and measuring outcomes—all grounded in real-world examples from organizations that have successfully closed their AI value gap. These practical insights can help you avoid common pitfalls and accelerate your path from AI investment to measurable business impact.
Figure 1: Six considerations for successful AI transformation and sustained value realization
Business leaders — not just tech leaders — need to drive your AI agenda
AI transformation requires translating vision into specific business outcomes with clear tracking mechanisms—and this demands broad cross-functional leadership from day one.
Roles like Chief Revenue Officers and line-of-business leaders need a seat at the decision-making table alongside technology leaders right from the start. These leaders have typically joined digital or cloud transformations much later in the process, but AI is different. The most impactful AI use cases come from two sources: line-of-business leaders who understand customer pain points and industry opportunities intimately, and employees across business functions who are willing to change their mindsets and fundamentally alter their operating models. Consider a large global institutional investment organization that embarked on an AI transformation program. They started by defining and creating relevant data and AI technical and business professions. Then, the organization designed and implemented the mechanisms and operating model needed to create data and AI products. Ultimately, they launched a new data and AI organization that helps them create new products, better serve customers, and monetize data assets by addressing new business opportunities. While engineering and product management remained at its core, their entire leadership team treated this as a business development initiative and partnered to make it possible.
Redesign incentives to reward AI-first operations
Transform organizational behavior to reward actual AI adoption, not just theoretical interest. Restructure career pathways to create advancement opportunities tied to effective AI use and measurable business outcomes. Critical to success is defining what outcomes matter. AI can generate voluminous output with little business impact, making measurement of outcomes essential.
One organization introduced standardized definitions for business processes and automation levels. They then redesigned their performance management framework to incorporate automation achievement as a key metric for Product Managers. This approach shifted focus from traditional input metrics toward measurable automation outcomes. It encouraged leaders to prioritize AI-augmented structures and intelligent process redesign over manual operations.
This alignment demonstrates how organizations must clearly define and measure desired outcomes—and tie individual rewards directly to tangible AI-driven business results.
Put people first and have HR lead the change as a strategic partner
HR serves as the cornerstone for aligning culture, talent, and incentives with AI transformation goals. Success requires HR to partner with executives in communicating the rationale for AI initiatives, addressing employee concerns, and fostering organizational buy-in through coaching and thought leadership.
Build AI fluency through tailored learning pathways. Provide focused training with practical tools like pre-populated prompt catalogs and quick-start demonstrations. Strengthen employee engagement through continuous feedback loops, celebrate AI learning participation across teams, and invest in retention strategies that value AI-skilled talent. HR champions adoption by collaborating with business and operations teams to develop role-based “What’s in it for me” content and current versus future process comparisons.
For example, HR at a global financial institution took a leadership role to accelerate adoption of a reimagined product operating model. After the institution had invested significantly in a bottom-up transformation, HR designed and led—in partnership with AWS—a top-down approach. They empowered business leaders from lines of business, operations, and technology with extensive executive-level training to help them lead product teams, not just operate them. These leaders worked with technology teams to build mechanisms that helped accelerate adoption of their product operating model. The resulting mechanisms enabled them to create AI solutions focused on industry opportunities and customer needs.
HR support is key to transforming resistance into enthusiasm by embedding AI-first behaviors into the cultural DNA.
Set guardrails that help protect—without slowing down
Establish AI governance frameworks from day one that balance centralization and federation. This facilitates compliance alignment and integration while enabling rapid innovation at the edge. Pure centralization offers simpler governance but slows innovation. Complete federation creates integration challenges and compliance gaps.
For both centralized and federated models, create cross-functional AI governance councils with representation from legal, risk, IT, and business units. Define clear guardrails, approval thresholds, and escalation paths. This approach accelerates AI delivery by creating clear paths to production and reducing bureaucratic friction while maintaining enterprise-wide coherence and risk management.
One financial services customer implemented a three-layered AI governance approach. At the enterprise level, they automated security and compliance policies through policy as code. At the line-of-business level, they created data policies that support AI solutions within the value stream. At the solution level, they addressed individual AI model risks and performance thresholds. This approach facilitated necessary guardrails and policy adherence while allowing builders to focus on value-added AI solution features. It unlocked true innovation at the edge while maintaining compliance alignment with critical policies.
Work with the right partners to move faster on AI
According to Gartner,

“Scaling AI solutions across the enterprise is challenging and requires intentional plans to address AI skills, infrastructure, governance policies and forums to facilitate collaboration, integration, and shared best practices.”

Organizations achieve higher success rates when working with partners who provide AI innovation, cloud expertise, and industry-specific knowledge at the right time. Effective AI transformation partners serve three roles: industry advisors who reimagine existing value streams and workflows to uncover high-value use cases, technical experts who bring deep experience building scalable AI solutions, and change champions who manage cultural shifts through training and governance frameworks.
A global insurance company engaged an AI transformation partner for a long-term engagement focused on building durable capabilities. The partner established business case frameworks and assets to prioritize use cases and baseline KPIs. They developed detailed adoption strategies using train-the-trainer methodologies. They implemented measurement systems to continuously track productivity impact. Together, they established governance models for ongoing AI agent creation and enterprise-wide deployment. This “teach to fish” model meant the insurance company could independently sustain and expand their AI transformation beyond the partnership engagement.
Track results that matter—not just what AI costs
Traditional cost prediction models struggle with AI’s continuously changing pricing and capabilities. Success requires anchoring to one or two measurable business outcomes that can be baselined and tracked—such as customer conversations handled entirely by AI agents or revenue uplift per recommendation accepted.
Build adaptive ROI frameworks that can be seamlessly adjusted to changes in token pricing, inference efficiency, and model capabilities rather than fixed cost projections. Focus on outcome-based metrics that demonstrate clear business value as use cases scale. With these metrics, executives can make informed investment decisions despite technological uncertainty. This approach transforms AI economics from unpredictable cost centers into measurable value drivers, providing the financial clarity needed for confident scaling decisions.
A marketing team implemented generative AI for long-form content creation and quality assurance. They analyzed their end-to-end process to determine the distribution of their production capacity and identify the costliest failure point: localization errors. They anchored against measurable baselines of 150+ annual localization errors and 300 monthly QA hours across 150 assets. The solution delivered immediate impact by catching errors earlier, minimizing costly localization rework while accelerating production speed. Return on investment in the solution was measured through localization cost savings and top-line value through increased content output, providing a clear path to assess the impact of scaling the solution.
Conclusion
Becoming an AI-first organization requires synchronized transformation across seven critical dimensions: Data and AI Vision and Strategy that establishes a data-driven foundation while embedding AI into core business objectives; Business Process Redesign to optimize human-AI collaboration; Culture and Change Management to drive both top-down and bottom-up adoption; Infrastructure and Operations for scalable, self-healing systems; AI Skills and Talent development with continuous learning to build core AI capabilities beyond basic awareness; Security, Governance, and Ethics to facilitate responsible AI deployment; and AI Industrialization for seamless integration and automation.

Figure 2: Seven dimensions of AI-First organizational transformation
These dimensions provide a framework for systematically evaluating and implementing AI transformation. But here’s what matters most: technology alone delivers marginal gains. When orchestrated with organizational change and process redesign, it creates measurable business value. Organizations that succeed see dramatically better results than those that do not: 45% more in cost savings and 60% more in revenue growth, according to Boston Consulting Group (BCG).
The AWS Customer Success Center of Excellence collaborates with AWS partners to define programmatic implementation plans that can help customers embed AI into their operations, product development, business processes, and go-to-market strategies. Because becoming AI-first isn’t about isolated technology initiatives—it requires synchronized evolution across people, process, and technology, with comprehensive change management as the enabler.
For more information about becoming an AI-first company, contact your AWS account team. To learn more about delivering agents, see the AWS Artificial Intelligence blog.

About the authors
Bhargs Srivathsan leads the Customer Success Center of Excellence for Amazon Web Services (AWS), where she is responsible for defining and executing on the strategic vision for customer success across AWS’ services. In this role, she focuses on ensuring AWS customers and partners realize maximum value from their technology investments, particularly as the pace of innovation accelerates with AI and other emerging technologies. She works closely with the field, specialist GTM leaders, and partners across AWS to build and scale customer success capabilities that drive adoption and business outcomes for customers.
Sergio Klarreich is a Senior Manager of Customer Success at AWS, within the Customer Success Center of Excellence. Sergio leads a team focused on enabling enterprises to realize tangible business outcomes from AI investments. With hands-on experience leading Fortune 500 companies through successful AI-first transformation journeys and over 20 years driving technology innovation across global markets, he specializes in bridging the gap between AI strategy and measurable business results.
Joseph Badalamenti is a Senior Customer Success AI Specialist at AWS, within the Customer Success Center of Excellence. As a Customer Success Specialist, he partners with enterprise customers to accelerate their AI transformation journeys. Joseph specializes in Generative AI and Agentic AI implementations, helping organizations realize measurable business value through strategic AI adoption. Joseph has more than 20 years of experience supporting customers with Digital, Cloud, and AI Transformation journeys.

Microsoft AI Releases Fara-7B: An Efficient Agentic Model for Computer …

How do we safely let an AI agent handle real web tasks like booking, searching, and form filling directly on our own devices without sending everything to the cloud? Microsoft Research has released Fara-7B, a 7 billion parameter agentic small language model designed specifically for computer use. It is an open weight Computer Use Agent that runs from screenshots, predicts mouse and keyboard actions, and is small enough to execute on a single user device, which reduces latency and keeps browsing data local.

Fara-7B: An Efficient Agentic Model for Computer Use

From Chatbots to Computer Use Agents

Conventional chat oriented LLMs return text. Computer Use Agents such as Fara-7B instead control the browser or desktop user interface to complete tasks like filling forms, booking travel, or comparing prices. They perceive the screen, reason about the page layout, then emit low level actions such as click, scroll, type, web_search, or visit_url.

Many existing systems rely on large multimodal models wrapped in complex scaffolding that parses accessibility trees and orchestrates multiple tools. This increases latency and often requires server side deployment. Fara-7B compresses the behavior of such multi agent systems into a single multimodal decoder only model built on Qwen2.5-VL-7B. It consumes browser screenshots and text context, then directly outputs thought text followed by a tool call with grounded arguments such as coordinates, text, or URLs.

FaraGen, Synthetic Trajectories for Web Interaction

The key bottleneck for Computer Use Agents is data. High quality logs of human web interaction with multi step actions are rare and expensive to collect. The Fara project introduces FaraGen, a synthetic data engine that generates and filters web trajectories on live sites.

FaraGen uses a three stage pipeline. Task Proposal starts from seed URLs drawn from public corpora such as ClueWeb22 and Tranco, which are categorized into domains like e commerce, travel, entertainment, or forums. Large language models convert each URL into realistic tasks that users might attempt on that page, for example booking specific movie tickets or creating a shopping list with constraints on reviews and materials. Tasks must be achievable without login or paywall, fully specified, useful, and automatically verifiable.

Fara-7B: An Efficient Agentic Model for Computer Use

Task Solving runs a multi agent system based on Magentic-One and Magentic-UI. An Orchestrator agent plans the high level strategy and keeps a ledger of task state. A WebSurfer agent receives accessibility trees and Set-of-Marks screenshots, then emits browser actions through Playwright, such as click, type, scroll, visit_url, or web_search. A UserSimulator agent supplies follow up instructions when the task needs clarification.

Trajectory Verification uses three LLM based verifiers. An Alignment Verifier checks that the actions and final answer match the task intent. A Rubric Verifier generates a rubric of subgoals and scores partial completion. A Multimodal Verifier inspects screenshots plus the final answer to catch hallucinations and confirm that visible evidence supports success. These verifiers agree with human labels on 83.3 percent of cases, with reported false positive and false negative rates around 17 to 18 percent.
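
To make the filtering step concrete, here is a minimal sketch of how the three verdicts might be combined into an accept or reject decision. The dataclass fields mirror the verifiers named above, but the rubric threshold and the require-all rule are illustrative assumptions, not details from the report.

```python
from dataclasses import dataclass

@dataclass
class VerifierVerdicts:
    """Outputs of the three LLM-based verifiers for one trajectory (names from the report)."""
    alignment_ok: bool      # Alignment Verifier: actions and final answer match the task intent
    rubric_score: float     # Rubric Verifier: fraction of generated subgoals judged complete (0.0 to 1.0)
    multimodal_ok: bool     # Multimodal Verifier: screenshots support the claimed success

def keep_trajectory(v: VerifierVerdicts, rubric_threshold: float = 0.8) -> bool:
    # Hypothetical filtering rule: require all three verifiers to pass.
    # The actual FaraGen thresholds and combination logic are not spelled out in this detail.
    return v.alignment_ok and v.multimodal_ok and v.rubric_score >= rubric_threshold

# Example: a trajectory that completes most subgoals but fails multimodal verification is dropped.
print(keep_trajectory(VerifierVerdicts(True, 0.9, False)))  # False
```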

After filtering, FaraGen yields 145,603 trajectories with 1,010,797 steps over 70,117 unique domains. The trajectories range from 3 to 84 steps, with an average of 6.9 steps and about 0.5 unique domains per trajectory, which indicates that many tasks involve sites not seen elsewhere in the dataset. Generating data with premium models such as GPT-5 and o3 costs roughly 1 dollar per verified trajectory.

https://www.microsoft.com/en-us/research/wp-content/uploads/2025/11/Fara-7B-An-Efficient-Agentic-Model-for-Computer-Use.pdf

Model Architecture

Fara-7B is a multimodal decoder only model that uses Qwen2.5-VL-7B as the base. It takes as input a user goal, the latest screenshots from the browser, and the full history of previous thoughts and actions. The context window is 128,000 tokens. At each step the model first generates a chain of thought describing the current state and the plan, then outputs a tool call that specifies the next action and its arguments.

The tool space matches the Magentic-UI computer_use interface. It includes key, type, mouse_move, left_click, scroll, visit_url, web_search, history_back, pause_and_memorize_fact, wait, and terminate. Coordinates are predicted directly as pixel positions on the screenshot, which allows the model to operate without access to the accessibility tree at inference time.
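
As a rough illustration of this interface, the sketch below lists the named tools and parses a hypothetical model output into thought text plus a tool call. The JSON-style call format and the argument names are assumptions for the example, not the exact Magentic-UI schema.

```python
import json
from typing import Any, Dict

# Tool names listed for the Magentic-UI computer_use interface.
TOOLS = {
    "key", "type", "mouse_move", "left_click", "scroll", "visit_url",
    "web_search", "history_back", "pause_and_memorize_fact", "wait", "terminate",
}

def parse_step(model_output: str) -> Dict[str, Any]:
    """Split a generated step into thought text and a tool call.

    Assumes (hypothetically) that the model emits its chain of thought followed by a
    JSON tool call whose coordinates are pixel positions on the screenshot.
    """
    thought, _, call_str = model_output.rpartition("{")
    call = json.loads("{" + call_str)
    if call.get("tool") not in TOOLS:
        raise ValueError(f"unknown tool: {call.get('tool')}")
    return {"thought": thought.strip(), "call": call}

step = parse_step('The search box is at the top. {"tool": "type", "text": "flights to Tokyo"}')
print(step["call"]["tool"])  # type
```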

Training uses supervised finetuning over approximately 1.8 million samples that mix multiple data sources. These include the FaraGen trajectories broken into observe think act steps, grounding and UI localization tasks, screenshot based visual question answering and captioning, and safety and refusal datasets.

https://www.microsoft.com/en-us/research/wp-content/uploads/2025/11/Fara-7B-An-Efficient-Agentic-Model-for-Computer-Use.pdf

Benchmarks and Efficiency

Microsoft evaluates Fara-7B on four live web benchmarks: WebVoyager, Online-Mind2Web, DeepShop, and the new WebTailBench, which focuses on under represented segments such as restaurant reservations, job applications, real estate search, comparison shopping, and multi site compositional tasks.

On these benchmarks, Fara-7B achieves 73.5 percent success on WebVoyager, 34.1 percent on Online-Mind2Web, 26.2 percent on DeepShop, and 38.4 percent on WebTailBench. This outperforms the 7B Computer Use Agent baseline UI-TARS-1.5-7B, which scores 66.4, 31.3, 11.6, and 19.5 respectively, and compares favorably to larger systems like OpenAI computer-use-preview and SoM Agent configurations built on GPT-4o.

On WebVoyager, Fara-7B uses on average 124,000 input tokens and 1,100 output tokens per task, with about 16.5 actions. Using market token prices, the research team estimates an average cost of 0.025 dollars per task, versus around 0.30 dollars for SoM agents backed by proprietary reasoning models such as GPT-5 and o3. Fara-7B uses a similar number of input tokens but about one tenth the output tokens of these SoM agents.

Key Takeaways

Fara-7B is a 7B parameter, open weight Computer Use Agent built on Qwen2.5-VL-7B that operates directly from screenshots and text, then outputs grounded actions such as clicks, typing and navigation, without relying on accessibility trees at inference time.

The model is trained with 145,603 verified browser trajectories and 1,010,797 steps generated by the FaraGen pipeline, which uses multi agent task proposal, solving, and LLM based verification on live websites across 70,117 domains.

Fara-7B achieves 73.5 percent success on WebVoyager, 34.1 percent on Online-Mind2Web, 26.2 percent on DeepShop, and 38.4 percent on WebTailBench, improving substantially over the 7B UI-TARS-1.5 baseline on all four benchmarks.

On WebVoyager, Fara-7B uses about 124,000 input tokens and 1,100 output tokens per task, with an average of 16.5 actions, yielding an estimated cost of around 0.025 dollars per task, which is around an order of magnitude cheaper in output token usage than SoM agents backed by GPT 5 class models.

Editorial Notes

Fara-7B is a useful step toward practical Computer Use Agents that can run on local hardware with lower inference cost while preserving privacy. The combination of Qwen2.5 VL 7B, FaraGen synthetic trajectories and WebTailBench gives a clear and well instrumented path from multi agent data generation to a single compact model that matches or exceeds larger systems on key benchmarks while enforcing Critical Point and refusal safeguards.


NVIDIA AI Releases Nemotron-Elastic-12B: A Single AI Model that Gives …

Why are AI dev teams still training and storing multiple large language models for different deployment needs when one elastic model can generate several sizes at the same cost? NVIDIA is collapsing the usual ‘model family’ stack into a single training job. The NVIDIA AI team has released Nemotron-Elastic-12B, a 12B parameter reasoning model that embeds nested 9B and 6B variants in the same parameter space, so all three sizes come from one elastic checkpoint with no extra distillation runs per size.

Many in one model family

Most production systems need several model sizes, a larger model for server side workloads, a mid size model for strong edge GPUs, and a smaller model for tight latency or power budgets. The usual pipeline trains or distills each size separately, so token cost and checkpoint storage scale with the number of variants.

Nemotron Elastic takes a different route. It starts from the Nemotron Nano V2 12B reasoning model and trains an elastic hybrid Mamba Attention network that exposes multiple nested submodels. The released Nemotron-Elastic-12B checkpoint can be sliced into 9B and 6B variants, Nemotron-Elastic-9B and Nemotron-Elastic-6B, using a provided slicing script, without any extra optimization.

All variants share weights and routing metadata, so training cost and deployment memory are tied to the largest model, not to the number of sizes in the family.
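
Conceptually, extracting a smaller variant amounts to taking prefix slices of the shared weights. The toy function below illustrates that idea on a single weight matrix; it is not NVIDIA’s released slicing script, and real Mamba and attention layers need the group-aware handling described below.

```python
import torch

def slice_linear(weight: torch.Tensor, out_keep: int, in_keep: int) -> torch.Tensor:
    """Prefix-select rows and columns of a weight matrix for a smaller budget.

    Because the smaller variants are nested prefixes of the 12B parent, extracting a
    sub-model is conceptually just copying the leading slices, with no retraining.
    """
    return weight[:out_keep, :in_keep].clone()

# Toy example: shrink a 512x512 projection to a hypothetical smaller-budget width of 384.
parent = torch.randn(512, 512)
child = slice_linear(parent, out_keep=384, in_keep=384)
print(child.shape)  # torch.Size([384, 384])
```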

https://arxiv.org/pdf/2511.16664v1

Hybrid Mamba Transformer with elastic masks

Architecturally, Nemotron Elastic is a Mamba-2 Transformer hybrid. The base network follows the Nemotron-H style design, where most layers are Mamba-2 based sequence state space blocks plus MLP, and a small set of attention layers preserve global receptive field.

Elasticity is implemented by turning this hybrid into a dynamic model controlled by masks:

Width: embedding channels, Mamba heads and head channels, attention heads, and FFN intermediate size can be reduced through binary masks.

Depth: layers can be dropped according to a learned importance ordering, with residual paths preserving signal flow.

A router module outputs discrete configuration choices per budget. These choices are converted to masks with Gumbel Softmax, then applied to embeddings, Mamba projections, attention projections, and FFN matrices. The research team adds several details to keep the SSM structure valid:

Group aware SSM elastification that respects Mamba head and channel grouping.

Heterogeneous MLP elastification where different layers can have distinct intermediate sizes.

Normalized MSE based layer importance to decide which layers stay when depth is reduced.

Smaller variants are always prefix selections in the ranked component lists, which makes the 6B and 9B models true nested subnetworks of the 12B parent.
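
A minimal sketch of the masking mechanism, assuming a single FFN layer with hypothetical candidate widths: the router’s logits are relaxed with straight-through Gumbel Softmax into a one-hot choice, which selects a prefix mask over the channels, so smaller budgets remain nested subnetworks while gradients still reach the router.

```python
import torch
import torch.nn.functional as F

hidden = 64
candidate_widths = [16, 32, 64]  # hypothetical FFN intermediate widths for the 6B/9B/12B budgets

# One prefix mask per candidate width: row i keeps the first candidate_widths[i] channels.
prefix_masks = torch.stack(
    [(torch.arange(hidden) < w).float() for w in candidate_widths]
)

# Router logits over the candidate widths (in the paper the router is a learned module per budget).
router_logits = torch.randn(len(candidate_widths), requires_grad=True)

# Straight-through Gumbel-Softmax turns the discrete width choice into a differentiable one-hot.
choice = F.gumbel_softmax(router_logits, tau=1.0, hard=True)   # shape [3], one-hot
mask = choice @ prefix_masks                                    # shape [hidden], a prefix mask

x = torch.randn(2, hidden)     # toy FFN intermediate activations
masked = x * mask              # smaller budgets see a nested prefix of the channels
masked.sum().backward()        # gradients flow back to the router logits
print(int(mask.sum().item()))  # number of channels kept for the sampled budget
```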

https://arxiv.org/pdf/2511.16664v1

Two stage training for reasoning workloads

Nemotron Elastic is trained as a reasoning model with a frozen teacher. The teacher is the original Nemotron-Nano-V2-12B reasoning model. The elastic-12B student is optimized jointly for all three budgets, 6B, 9B, 12B, using knowledge distillation plus language modeling loss.

Training runs in two stages:

Stage 1: short context, sequence length 8192, batch size 1536, around 65B tokens, with uniform sampling over the three budgets.

Stage 2: extended context, sequence length 49152, batch size 512, around 45B tokens, with non uniform sampling that favors the full 12B budget.

https://arxiv.org/pdf/2511.16664v1

The second stage is important for reasoning tasks. The above table shows that for AIME 2025, the 6B model improves from 56.88 to 68.13, a 19.8 percent relative gain, while the 9B model gains 9.7 percent and the 12B model gains 4.0 percent after extended context training.

Budget sampling is also tuned. In Stage 2, non uniform weights of 0.5, 0.3, 0.2 for 12B, 9B, 6B avoid degradation of the largest model and keep all variants competitive on Math 500, AIME 2025, and GPQA.
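
A tiny sketch of that Stage 2 sampling schedule, assuming one budget is drawn per optimizer step with the stated 0.5, 0.3, 0.2 weights; the surrounding training loop and loss computation are omitted.

```python
import random

BUDGETS = ["12B", "9B", "6B"]
STAGE2_WEIGHTS = [0.5, 0.3, 0.2]   # non-uniform sampling reported for Stage 2

def sample_budget() -> str:
    # Each step trains one nested sub-model; favoring 12B avoids degrading the largest variant.
    return random.choices(BUDGETS, weights=STAGE2_WEIGHTS, k=1)[0]

counts = {b: 0 for b in BUDGETS}
for _ in range(10_000):
    counts[sample_budget()] += 1
print(counts)  # roughly 5000 / 3000 / 2000
```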

Benchmark results

Nemotron Elastic is evaluated on reasoning heavy benchmarks, MATH 500, AIME 2024, AIME 2025, GPQA, LiveCodeBench v5, and MMLU Pro. The below table summarizes pass at 1 accuracy.

https://arxiv.org/pdf/2511.16664v1

The 12B elastic model matches the NanoV2-12B baseline on average, 77.41 versus 77.38, while also providing 9B and 6B variants from the same run. The 9B elastic model tracks the NanoV2-9B baseline closely, 75.95 versus 75.99. The 6B elastic model reaches 70.61, slightly below Qwen3-8B at 72.68 but still strong for its parameter count given that it is not trained separately.

Training token and memory savings

Nemotron Elastic targets the cost problem directly. The below table compares the token budgets needed to derive 6B and 9B models from a 12B parent:

NanoV2 pretraining for 6B and 9B, 40T tokens total.

NanoV2 Compression with Minitron SSM, 480B exploratory plus 270B final, 750B tokens.

Nemotron Elastic, 110B tokens in a single elastic distillation run.

https://arxiv.org/pdf/2511.16664v1

The research team reports that this gives around 360 times reduction versus training the two extra models from scratch, and around 7 times reduction versus the compression baseline.

Deployment memory is reduced as well. The below table states that storing Nemotron Elastic 6B, 9B, and 12B together requires 24GB of BF16 weights, while storing NanoV2 9B plus 12B requires 42GB. This is a 43 percent memory reduction while also exposing an extra 6B size.
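
These figures follow from 2 bytes per parameter in BF16 and the nominal parameter counts, as this quick check shows:

```python
BYTES_PER_PARAM_BF16 = 2

elastic_family_gb = 12e9 * BYTES_PER_PARAM_BF16 / 1e9            # one shared 12B checkpoint covers all sizes
separate_nanov2_gb = (9e9 + 12e9) * BYTES_PER_PARAM_BF16 / 1e9    # distinct NanoV2 9B + 12B checkpoints

print(elastic_family_gb, separate_nanov2_gb)                      # 24.0 GB vs 42.0 GB
print(round(100 * (1 - elastic_family_gb / separate_nanov2_gb)))  # ~43 percent saved
```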

https://arxiv.org/pdf/2511.16664v1

Comparison

| System | Sizes (B) | Avg reasoning score* | Tokens for 6B + 9B | BF16 memory |
| --- | --- | --- | --- | --- |
| Nemotron Elastic | 6, 9, 12 | 70.61 / 75.95 / 77.41 | 110B | 24GB |
| NanoV2 Compression | 9, 12 | 75.99 / 77.38 | 750B | 42GB |
| Qwen3 | 8 | 72.68 | n/a | n/a |

Key Takeaways

Nemotron Elastic trains one 12B reasoning model that contains nested 9B and 6B variants which can be extracted zero shot without extra training.

The elastic family uses a hybrid Mamba-2 and Transformer architecture plus a learned router that applies structured masks over width and depth to define each submodel.

The approach needs 110B training tokens to derive 6B and 9B from the 12B parent which is about 7 times fewer tokens than the 750B token Minitron SSM compression baseline and about 360 times fewer than training extra models from scratch.

On reasoning benchmarks such as MATH 500, AIME 2024 and 2025, GPQA, LiveCodeBench and MMLU Pro the 6B, 9B and 12B elastic models reach average scores of about 70.61, 75.95 and 77.41 which are on par with or close to the NanoV2 baselines and competitive with Qwen3-8B.

All three sizes share one 24GB BF16 checkpoint so deployment memory stays constant for the family compared with around 42GB for separate NanoV2-9B and 12B models which gives about 43 percent memory savings while adding a 6B option.

Editorial Comments

Nemotron-Elastic-12B is a practical step toward making reasoning model families cheaper to build and operate. One elastic checkpoint produces 6B, 9B, and 12B variants with a hybrid Mamba-2 and Transformer architecture, a learned router, and structured masks that preserve reasoning performance. The approach cuts token cost relative to separate compression or pretraining runs and keeps deployment memory at 24GB for all sizes, which simplifies fleet management for multi tier LLM deployments. Overall, Nemotron-Elastic-12B turns multi size reasoning LLMs into a single elastic systems design problem.


AI Interview Series #3: Explain Federated Learning

Question:

“You’re an ML engineer at a fitness company like Fitbit or Apple Health.

Millions of users generate sensitive sensor data every day — heart rate, sleep cycles, step counts, workout patterns, etc.

You want to build a model that predicts health risk or recommends personalized workouts.

But due to privacy laws (GDPR, HIPAA), none of this raw data can ever leave the user’s device.

How would you train such a model?”

Training a model in this scenario seems impossible at first—after all, you can’t collect or centralize any of the user’s sensor data. But the trick is this: instead of bringing the data to the model, you bring the model to the data.

Using techniques like federated learning, the model is sent to each user’s device, trained locally on their private data, and only the model updates (not the raw data) are sent back. These updates are then securely aggregated to improve the global model while keeping every user’s data fully private.

This approach allows you to leverage massive, real-world datasets without ever violating privacy laws.

What is Federated Learning?

Federated Learning is a technique for training machine learning models without ever collecting user data centrally. Instead of uploading private data (like heart rate, sleep cycles, or workout logs), the model is sent to each device, trained locally, and only the model updates are returned. These updates are securely aggregated to improve the global model—ensuring privacy and compliance with laws like GDPR and HIPAA.

There are multiple variants:

Centralized FL: A central server coordinates training and aggregates updates.

Decentralized FL: Devices share updates with each other directly—no single point of failure.

Heterogeneous FL: Designed for devices with different compute capabilities (phones, watches, IoT sensors).

The workflow is simple; a minimal code sketch of one round follows the list:

A global model is sent to user devices.

Each device trains on its private data (e.g., a user’s fitness and health metrics).

Only the model updates—not the data—are encrypted and sent back.

The server aggregates all updates into a new global model.
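
Here is the minimal sketch referenced above: one federated round with FedAvg-style weighted averaging, using NumPy arrays as stand-in model weights. The local training step is a placeholder, and the device data sizes are made up for illustration.

```python
import numpy as np

def local_update(global_weights: np.ndarray, local_data_size: int) -> np.ndarray:
    """Placeholder for on-device training: in practice each device runs a few epochs
    of SGD on its private data and returns only the updated weights (or the delta)."""
    return global_weights + 0.01 * np.random.randn(*global_weights.shape)

def fedavg(updates: list, sizes: list) -> np.ndarray:
    """Federated Averaging: weight each device's model by its share of the data."""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(updates, sizes))

global_model = np.zeros(10)                       # toy global model
device_sizes = [500, 1200, 300]                   # samples held by each participating device

for round_idx in range(3):                        # a few federated rounds
    updates = [local_update(global_model, n) for n in device_sizes]
    global_model = fedavg(updates, device_sizes)  # only updates leave the devices, never raw data
print(global_model[:3])
```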

Challenges in Federated Learning

Device Constraints: User devices (phones, smartwatches, fitness trackers) have limited CPU/GPU power, small RAM, and rely on battery. Training must be lightweight, energy-efficient, and scheduled intelligently so it doesn’t interfere with normal device usage.

Model Aggregation: Even after training locally on thousands or millions of devices, we still need to combine all these model updates into a single global model. Techniques like Federated Averaging (FedAvg) help, but updates can be delayed, incomplete, or inconsistent depending on device participation.

Skewed Local Data (Non-IID Data): Each user’s fitness data reflects personal habits and lifestyle:

Some users run daily; others never run.

Some have high resting heart rates; others have low.

Sleep cycles vary drastically by age, culture, and work pattern.

Workout types differ—yoga, strength training, cycling, HIIT, etc.

This leads to non-uniform, biased local datasets, making it harder for the global model to learn generalized patterns.

Intermittent Client Availability: Many devices may be offline, locked, low on battery, or not connected to Wi-Fi. Training must only happen under safe conditions (charging, idle, Wi-Fi), reducing the number of active participants at any moment.

Communication Efficiency: Sending model updates frequently can drain bandwidth and battery. Updates must be compressed, sparse, or limited to smaller subsets of parameters.
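
One common way to achieve this, sketched below, is top-k sparsification, where each device transmits only the largest-magnitude entries of its update. This is a generic illustration of the idea, not a recommendation for any specific framework.

```python
import numpy as np

def sparsify_topk(update: np.ndarray, k: int):
    """Keep only the k largest-magnitude entries of an update; send (indices, values)."""
    idx = np.argpartition(np.abs(update), -k)[-k:]
    return idx, update[idx]

def densify(indices: np.ndarray, values: np.ndarray, size: int) -> np.ndarray:
    out = np.zeros(size, dtype=values.dtype)
    out[indices] = values
    return out

update = np.random.randn(1_000_000).astype(np.float32)  # ~4 MB if sent uncompressed
idx, vals = sparsify_topk(update, k=10_000)             # ~1% of the entries actually transmitted
reconstructed = densify(idx, vals, update.size)
print(idx.size, float(np.abs(reconstructed).sum() / np.abs(update).sum()))
```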

Security & Privacy Guarantees: Even though raw data never leaves the device, updates must be encrypted. Additional protections like differential privacy or secure aggregation may be required to prevent reconstructing sensitive patterns from gradients.
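
As a sketch of the differential privacy idea, the snippet below clips each update to a fixed L2 norm and adds Gaussian noise before it leaves the device. The clip norm and noise scale are illustrative values, not a calibrated privacy budget, and real deployments would pair this with secure aggregation.

```python
import numpy as np

def dp_sanitize(update: np.ndarray, clip_norm: float = 1.0, noise_std: float = 0.1) -> np.ndarray:
    """Clip the update's L2 norm, then add Gaussian noise so that a single user's
    gradients cannot be reconstructed exactly from what leaves the device."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + np.random.normal(0.0, noise_std, size=update.shape)

raw_update = np.random.randn(10) * 5.0
print(np.linalg.norm(raw_update), np.linalg.norm(dp_sanitize(raw_update)))
```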

