How This Agentic Memory Research Unifies Long Term and Short Term Memory for LLM Agents

How do you design an LLM agent that decides for itself what to store in long term memory, what to keep in short term context and what to discard, without hand tuned heuristics or extra controllers? Can a single policy learn to manage both memory types through the same action space as text generation?

Researchers from Alibaba Group and Wuhan University introduce Agentic Memory, or AgeMem, a framework that lets large language model agents learn how to manage both long term and short term memory as part of a single policy. Instead of relying on hand written rules or external controllers, the agent decides when to store, retrieve, summarize and forget, using memory tools that are integrated into the action space of the model.

Why current LLM agents struggle with memory

Most agent frameworks treat memory as two loosely coupled systems.

Long term memory stores user profiles, task information and previous interactions across sessions. Short term memory is the current context window, which holds the active dialogue and retrieved documents.

Existing systems design these two parts in isolation. Long term memory is handled through external stores such as vector databases with simple add and retrieve triggers. Short term memory is managed with retrieval augmented generation, sliding windows or summarization schedules.

This separation creates several issues.

Long term and short term memory are optimized independently. Their interaction is not trained end to end.

Heuristics decide when to write to memory and when to summarize. These rules are brittle and miss rare but important events.

Additional controllers or expert models increase cost and system complexity.

AgeMem removes the external controller and folds memory operations into the agent policy itself.

Memory as tools in the agent action space

In AgeMem, memory operations are exposed as tools. At each step, the model can emit either normal text tokens or a tool call. The framework defines 6 tools.

For long term memory:

ADD stores a new memory item with content and metadata.

UPDATE modifies an existing memory entry.

DELETE removes obsolete or low value items.

For short term memory:

RETRIEVE performs semantic search over long term memory and injects the retrieved items into the current context.

SUMMARY compresses spans of the dialogue into shorter summaries.

FILTER removes context segments that are not useful for future reasoning.

The interaction protocol has a structured format. Each step starts with a <think> block where the model reasons privately. Then the model either emits a <tool_call> block with a JSON list of tool invocations, or an <answer> block with the user facing response. Memory actions are therefore first class decisions, not side effects.
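The paper’s exact serialization is not reproduced here, but a single step in this protocol could look roughly like the following illustrative example, where the tool names match the framework and the JSON field names are assumptions:

<think> The user mentioned that they live in Hangzhou; this may matter for a later question, so it should be stored. </think>
<tool_call>
[{"name": "ADD", "arguments": {"content": "User lives in Hangzhou", "metadata": {"topic": "user_profile"}}}]
</tool_call>

A step that responds to the user would instead end with an <answer> block containing the user facing reply.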

Three stage reinforcement learning for unified memory

AgeMem is trained with reinforcement learning in a way that couples long term and short term memory behavior.

The state at time t includes the current conversational context, the long term memory store and the task specification. The policy chooses either a token or a tool call as the action. The training trajectory for each sample is divided into 3 stages:

Stage 1, long term memory construction: The agent interacts in a casual setting and observes information that will later become relevant. It uses ADD, UPDATE and DELETE to build and maintain long term memory. The short term context grows naturally during this stage.

Stage 2, short term memory control under distractors: The short term context is reset. Long term memory persists. The agent now receives distractor content that is related but not necessary. It must manage short term memory using SUMMARY and FILTER to keep useful content and remove noise.

Stage 3, integrated reasoning: The final query arrives. The agent retrieves from long term memory using RETRIEVE, controls the short term context, and produces the answer.

The crucial detail is that long term memory persists across all stages while short term memory is cleared between Stage 1 and Stage 2. This design forces the model to rely on retrieval rather than on residual context and exposes realistic long horizon dependencies.

Reward design and step wise GRPO

AgeMem uses a step wise variant of Group Relative Policy Optimization (GRPO). For each task, the system samples multiple trajectories that form a group. A terminal reward is computed for each trajectory, then normalized within the group to obtain an advantage signal. This advantage is broadcast to all steps in the trajectory so that intermediate tool choices are trained using the final outcome.
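As a rough sketch of this group relative advantage computation, not the authors’ code and with the normalization details assumed:

import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # rewards: terminal rewards of the trajectories sampled for one task (the group)
    r = np.asarray(rewards, dtype=float)
    # Normalize within the group; the same advantage is then broadcast to every step
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled trajectories for the same task
advantages = group_relative_advantages([0.9, 0.4, 0.7, 0.2])
# Each trajectory's advantage is applied to all of its token and tool call steps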

The total reward has three main components:

A task reward that scores answer quality between 0 and 1 using an LLM judge.

A context reward that measures the quality of short term memory operations, including compression, early summarization and preservation of query relevant content.

A memory reward that measures long term memory quality, including the fraction of high quality stored items, the usefulness of maintenance operations and the relevance of retrieved items to the query.

Uniform weights are used for these three components so that each contributes equally to the learning signal. A penalty term is added when the agent exceeds the maximum allowed dialogue length or when the context overflows the limit.
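A minimal sketch of how such a terminal reward could be assembled, assuming the three components are already scored in [0, 1] and using hypothetical limits and penalty values:

def terminal_reward(task_r, context_r, memory_r,
                    dialogue_turns, context_tokens,
                    max_turns=40, max_context=8192, penalty=0.2):
    # Uniform weights over the task, context, and memory components
    reward = (task_r + context_r + memory_r) / 3.0
    # Penalties for exceeding the dialogue length or overflowing the context limit
    if dialogue_turns > max_turns:
        reward -= penalty
    if context_tokens > max_context:
        reward -= penalty
    return reward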

Figure source: https://arxiv.org/pdf/2601.01885

Experimental setup and main results

The research team fine-tunes AgeMem on the HotpotQA training split and evaluates on 5 benchmarks:

ALFWorld for text based embodied tasks.

SciWorld for science themed environments.

BabyAI for instruction following.

PDDL tasks for planning.

HotpotQA for multi hop question answering.

Metrics include success rate for ALFWorld, SciWorld and BabyAI, progress rate for PDDL tasks, and an LLM judge score for HotpotQA. They also define a Memory Quality metric using an LLM evaluator that compares stored memories to the supporting facts of HotpotQA.


Baselines include LangMem, A Mem, Mem0, Mem0g and a no memory agent. Backbones are Qwen2.5-7B-Instruct and Qwen3-4B-Instruct.

On Qwen2.5-7B-Instruct, AgeMem reaches an average score of 41.96 across the 5 benchmarks, while the best baseline, Mem0, reaches 37.14. On Qwen3-4B-Instruct, AgeMem reaches 54.31, compared to 45.74 for the best baseline, A Mem.

Memory quality also improves. On HotpotQA, AgeMem reaches 0.533 with Qwen2.5-7B and 0.605 with Qwen3-4B, which is higher than all baselines.

Short term memory tools reduce prompt length while preserving performance. On HotpotQA, configurations with STM tools use about 3 to 5 percent fewer tokens per prompt than variants that replace STM tools with a retrieval pipeline.

Ablation studies confirm that each component matters. Adding only long term memory tools on top of a no memory baseline already yields clear gains. Adding reinforcement learning on these tools improves scores further. The full system with both long term and short term tools plus RL gives up to 21.7 percentage points improvement over the no memory baseline on SciWorld.

Implications for LLM agent design

AgeMem suggests a design pattern for future agentic systems. Memory should be handled as part of the learned policy, not as two external subsystems. By turning storage, retrieval, summarization and filtering into explicit tools and training them jointly with language generation, the agent learns when to remember, when to forget and how to manage context efficiently across long horizons.

Key Takeaways

AgeMem turns memory operations into explicit tools, so the same policy that generates text also decides when to ADD, UPDATE, DELETE, RETRIEVE, SUMMARY and FILTER memory.

Long term and short term memory are trained jointly through a three stage RL setup where long term memory persists across stages and short term context is reset to enforce retrieval based reasoning.

The reward function combines task accuracy, context management quality and long term memory quality with uniform weights, plus penalties for context overflow and excessive dialogue length.

Across ALFWorld, SciWorld, BabyAI, PDDL tasks and HotpotQA, AgeMem on Qwen2.5-7B and Qwen3-4B consistently outperforms memory baselines such as LangMem, A Mem and Mem0 on average scores and memory quality metrics.

Short term memory tools reduce prompt length by about 3 to 5 percent compared to RAG style baselines while keeping or improving performance, showing that learned summarization and filtering can replace handcrafted context handling rules.


How Omada Health scaled patient care by fine-tuning Llama models on Amazon SageMaker AI

This post is co-written with Sunaina Kavi, AI/ML Product Manager at Omada Health.
Omada Health, a longtime innovator in virtual healthcare delivery, launched a new nutrition experience in 2025, featuring OmadaSpark, an AI agent trained with robust clinical input that delivers real-time motivational interviewing and nutrition education. It was built on AWS. OmadaSpark was designed to help members identify their own motivational challenges like emotional eating, improve food decisions, set goals, and sustain lasting behavior change. The following screenshot shows an example of OmadaSpark’s Nutritional Education feature, demonstrating how members receive personalized nutrition education in real time.

In this post, we examine how Omada partnered with AWS and Meta to develop this healthcare-aligned AI solution using Llama models on Amazon SageMaker AI. We explore the technical implementation, architecture, and evaluation process that helped Omada scale personalized nutrition guidance while maintaining their commitment to evidence-based care.
The opportunity for AI-powered nutrition guidance
Nutrition education serves as a cornerstone of Omada’s chronic condition management programs. Although health coaches excel at providing personalized care, the growing demand for quick, convenient nutritional information presented an opportunity to enhance their coaches’ impact through technology. Omada sought an innovative solution that would complement their coaches’ expertise by handling routine analytical tasks, so they could focus more deeply on meaningful member interactions. The goal was to provide immediate, high-quality nutrition education while maintaining strict healthcare compliance with Omada’s care protocols and the personal touches that make their program effective.
Omada Health’s OmadaSpark aims to help members identify real-world emotional and practical barriers to healthy eating in today’s environment, where ultra-processed foods are prevalent and diets can fail to deliver long-term results. OmadaSpark features motivational interviewing, using questions to help members identify their own goals, reinforce autonomy, and find motivation to change habits. OmadaSpark’s Nutritional Education feature can reduce the mental load of real-time food decisions and encourage members to gradually incorporate healthier food alternatives. Omada’s nutrition experience offers updated tracking capabilities, like water tracking, barcode scanning, and photo-recognition technology, that offer flexible and non-restrictive support designed to promote a healthy relationship with food.
“We see AI as a force multiplier for our health coaches, not a replacement,” explains Terry Miller, Omada’s Vice President, Machine Learning, AI and Data Strategy. “Our collaboration with AWS and Meta allowed us to implement an AI solution that aligns with our values of evidence-based, personalized care.”
Solution overview
Omada Health developed the Nutritional Education feature using a fine-tuned Llama 3.1 model on SageMaker AI. The implementation included the Llama 3.1 8B model fine-tuned using Quantized Low Rank Adaptation (QLoRA) techniques, a fine-tuning method that allows language models to efficiently learn from smaller datasets. Initial training used 1,000 question-answer pairs created from Omada’s internal care protocols, peer-reviewed literature, and specialty society guidelines to provide evidence-based nutritional education.
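Omada has not published its training code, but a QLoRA setup for Llama 3.1 8B with the Hugging Face libraries typically looks like the following sketch; the model ID, rank, and target modules below are illustrative assumptions rather than Omada’s configuration:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model ID

# Load the base model in 4-bit so the low-rank adapters can be trained cheaply
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Attach LoRA adapters to the attention projections; rank and alpha are illustrative
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# The adapted model is then fine-tuned on the curated question-answer pairs with a standard Trainer loop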
The following diagram illustrates the high-level architecture of Omada Health’s Llama implementation on AWS.

The solution workflow consists of the following high-level steps:

The Q&A pairs for the nutritional education datasets are uploaded to Amazon Simple Storage Service (Amazon S3) for model training.
Amazon SageMaker Studio is used to launch a training job with Hugging Face estimators to fine-tune the Llama 3.1 8B model. QLoRA techniques are used to train the model, and the model artifacts are saved to Amazon S3.
The inference workflow is triggered when a member asks a question through the mobile client for OmadaSpark’s nutritional education feature. A request fetches member personal data based on the user profile, along with conversation history, so that the response is personalized. For example, a roast beef recipe won’t be delivered to a vegetarian. At the same time, this feature does not provide medical information related to a particular person’s medical situation, such as their latest blood glucose test. The SageMaker AI endpoint is invoked for nutrition generation based on the member’s query and historical conversations as context (a sketch of this call appears after this list).
The model generates personalized nutrition education, which is fed back to the mobile client, providing evidence-based education for people in Omada’s cardiometabolic programs.
For evaluation of model performance, LangSmith, an observability and evaluation service where teams can monitor AI application performance, is used to capture inference quality and conversation analytics for continuous model improvement.
Registered Dietitians conduct human review processes, verifying the clinical accuracy and safety of the nutrition education provided to users. Upvoted and downvoted responses are reviewed in LangSmith annotation queues to inform future fine-tuning and system prompt updates.
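The post does not include the inference code, but calling a SageMaker AI real-time endpoint from an application backend generally follows the pattern below; the endpoint name and payload schema are assumptions for illustration:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "Member question: What are some high-protein snack options?",
    "parameters": {"max_new_tokens": 512, "temperature": 0.2},
}

response = runtime.invoke_endpoint(
    EndpointName="omadaspark-nutrition-education",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
generated = json.loads(response["Body"].read())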

The following diagram illustrates the workflow sequence in more detail.

Collaboration and data fine-tuning
A critical aspect of Omada Health’s success with AI implementation was the close collaboration between their clinical team and the AI development team. Omada AI/ML Product Manager Sunaina Kavi, a key figure in this collaboration, highlights the importance of this synergy:
“Our work with the clinical team was pivotal in building trust and making sure the model was optimized to meet real-world healthcare needs,” says Kavi. “By closely working on data selection and evaluation, we made sure that OmadaSpark Nutritional Education not only delivered accurate and personalized nutrition education but also upheld high standards of patient care.
“The AWS and Meta partnership gave us access to state-of-the-art foundation models while maintaining the self-hosted control we need in healthcare, for privacy, security, and quality purposes. The fine-tuning capabilities of SageMaker AI allowed us to adapt Llama to our specific nutrition use case while preserving our data sovereignty.”
Patient data protection remained paramount throughout development. Model training and inference occurred within HIPAA-compliant AWS environments (AWS is Omada’s HIPAA Business Associate), with fine-tuned model weights remaining under Omada’s control through model sovereignty capabilities in SageMaker AI. The AWS security infrastructure provided the foundation for implementation, helping maintain patient data protection throughout the AI development lifecycle. Llama models offered the flexibility needed for healthcare-specific customization without compromising performance. Omada centered their technical implementation around SageMaker AI for model training, fine-tuning, and deployment.
Finally, Omada implemented rigorous testing protocols, including regular human review of model outputs by qualified Registered Dietitians. Omada launched the entire workflow with the model in 4.5 months. Throughout this process, they continuously monitored response accuracy and member satisfaction, with iterative fine-tuning based on real-world feedback.
Business impact
The introduction of OmadaSpark significantly boosted engagement among members who used the tool. Members who interacted with the nutrition assistant were three times more likely to return to the Omada app in general compared to those who did not interact with the tool. By providing round-the-clock access to personalized nutritional education, Omada dramatically reduced the time it took to address member nutrition questions from days to seconds.
Following their successful launch, Omada is deepening their partnership with AWS and Meta to expand AI capabilities including fine-tuning models, context window optimization, and adding memory. They are developing a continuous training pipeline incorporating real member questions and enhancing AI features with additional health domains beyond nutrition.
“Our collaboration with AWS and Meta has shown the value of strategic partnerships in healthcare innovation,” shares Miller. “As we look to the future, we’re excited to build on this foundation to develop even more innovative ways to support our members.”
Conclusion
Omada Health’s implementation demonstrates how healthcare organizations can effectively adopt AI while addressing industry-specific requirements and member needs. By using Llama models on SageMaker AI, Omada amplifies the humanity of health coaches and further enriches the member experience. The Omada, AWS, and Meta collaboration showcases how organizations in highly regulated industries can rapidly build AI applications by using innovative foundation models on AWS, the trusted healthcare cloud provider. By combining clinical expertise with advanced AI models and secure infrastructure, they’ve created a solution that can transform care delivery at scale while maintaining the personalized, human-led approach that makes Omada effective.
“This project proves that responsible AI adoption in healthcare is not just possible—it’s essential for reaching more patients with high-quality care,” concludes Miller.
Omada remains committed to growing its human care teams with the efficiency of AI-enabled technology. Looking ahead, the team is dedicated to creating new innovations that foster a sense of real-time support, confidence, and autonomy among members.
For more information, see the following resources:

Explore generative AI on AWS
Learn about unlocking the business value of generative AI
Learn model deployment options on SageMaker AI
Llama on AWS GitHub repo

About the authors
Sunaina Kavi is an AI/ML product manager at Omada, dedicated to leveraging artificial intelligence for behavior change to improve outcomes in diabetes, hypertension, and weight management. She earned a Bachelor of Science in Biomedical Engineering and an MBA from the University of Michigan’s Ross School of Business, specializing in Entrepreneurship and Finance. Prior to transitioning to Omada, she gained experience as an investment banker in Technology, Media, and Telecom in San Francisco. She later joined Rivian, focusing on charging solutions within their infotainment group, and founded her own startup aimed at using AI to manage autoimmune flares. Sunaina is also actively involved in the Generative AI group in San Francisco, working to enhance safety, security, and systematic evaluations within the healthcare community.
Breanne Warner is an Enterprise Solutions Architect at Amazon Web Services supporting healthcare and life science (HCLS) customers. She is passionate about supporting customers to use generative AI on AWS and evangelizing model adoption for first-party and third-party models. Breanne is also Vice President of the Women at Amazon with the goal of fostering inclusive and diverse culture at Amazon. Breanne holds a Bachelor of Science in Computer Engineering from the University of Illinois Urbana-Champaign.
Baladithya Balamurugan is a Solutions Architect at AWS focused on ML deployments for inference and using AWS Neuron to accelerate training and inference. He works with customers to enable and accelerate their ML deployments on services such as Amazon SageMaker and Amazon EC2. Based out of San Francisco, Baladithya enjoys tinkering, developing applications and his homelab in his free time.
Amin Dashti, PhD, is a Senior Data Scientist at AWS, specializing in model customization and training using Amazon SageMaker. With a PhD in Physics, he brings a deep scientific rigor to his work in machine learning and applied AI. His multidisciplinary background—spanning academia, finance, and tech—enables him to tackle complex challenges from both theoretical and practical perspectives. Based in the San Francisco Bay Area, Amin enjoys spending his free time with his family exploring parks, beaches, and local trails.
Marco Punio is a Sr. Specialist Solutions Architect focused on GPU-accelerated AI workloads, large-scale model training, and applied AI solutions on AWS. As a member of the Gen AI Applied Sciences SA team at AWS, he specializes in high-performance computing for AI, optimizing GPU clusters for foundation model training and inference, and serves as a global lead for the Meta–AWS Partnership and technical strategy. Based in Seattle, Washington, Marco enjoys writing, reading, exercising, and building GPU-optimized AI applications in his free time.
Evan Grenda is a Sr. GenAI Specialist at AWS, where he works with top-tier third-party foundation model and agentic framework providers to develop and execute joint go-to-market strategies, enabling customers to effectively deploy and scale solutions to solve enterprise agentic AI challenges. Evan holds a BA in Business Administration from the University of South Carolina, an MBA from Auburn University, and an MS in Data Science from St. Joseph’s University.

A Coding Guide to Demonstrate Targeted Data Poisoning Attacks in Deep Learning by Label Flipping on CIFAR-10 with PyTorch

In this tutorial, we demonstrate a realistic data poisoning attack by manipulating labels in the CIFAR-10 dataset and observing its impact on model behavior. We construct a clean and a poisoned training pipeline side by side, using a ResNet-style convolutional network to ensure stable, comparable learning dynamics. By selectively flipping a fraction of samples from a target class to a malicious class during training, we show how subtle corruption in the data pipeline can propagate into systematic misclassification at inference time. Check out the FULL CODES here.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Dataset
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

CONFIG = {
    "batch_size": 128,
    "epochs": 10,
    "lr": 0.001,
    "target_class": 1,
    "malicious_label": 9,
    "poison_ratio": 0.4,
    "device": "cuda" if torch.cuda.is_available() else "cpu",
}

torch.manual_seed(42)
np.random.seed(42)

We set up the core environment required for the experiment and define all global configuration parameters in a single place. We ensure reproducibility by fixing random seeds across PyTorch and NumPy. We also explicitly select the compute device so the tutorial runs efficiently on both CPU and GPU. Check out the FULL CODES here.

class PoisonedCIFAR10(Dataset):
    def __init__(self, original_dataset, target_class, malicious_label, ratio, is_train=True):
        self.dataset = original_dataset
        self.targets = np.array(original_dataset.targets)
        self.is_train = is_train
        if is_train and ratio > 0:
            indices = np.where(self.targets == target_class)[0]
            n_poison = int(len(indices) * ratio)
            poison_indices = np.random.choice(indices, n_poison, replace=False)
            self.targets[poison_indices] = malicious_label

    def __getitem__(self, index):
        img, _ = self.dataset[index]
        return img, self.targets[index]

    def __len__(self):
        return len(self.dataset)

We implement a custom dataset wrapper that enables controlled label poisoning during training. We selectively flip a configurable fraction of samples from the target class to a malicious class while keeping the test data untouched. We preserve the original image data so that only label integrity is compromised. Check out the FULL CODES here.

def get_model():
    model = torchvision.models.resnet18(num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    model.maxpool = nn.Identity()
    return model.to(CONFIG["device"])

def train_and_evaluate(train_loader, description):
    model = get_model()
    optimizer = optim.Adam(model.parameters(), lr=CONFIG["lr"])
    criterion = nn.CrossEntropyLoss()
    for _ in range(CONFIG["epochs"]):
        model.train()
        for images, labels in train_loader:
            images = images.to(CONFIG["device"])
            labels = labels.to(CONFIG["device"])
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
    return model

We define a lightweight ResNet-based model tailored for CIFAR-10 and implement the full training loop. We train the network using standard cross-entropy loss and Adam optimization to ensure stable convergence. We keep the training logic identical for clean and poisoned data to isolate the effect of data poisoning. Check out the FULL CODES here.

def get_predictions(model, loader):
    model.eval()
    preds, labels_all = [], []
    with torch.no_grad():
        for images, labels in loader:
            images = images.to(CONFIG["device"])
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            preds.extend(predicted.cpu().numpy())
            labels_all.extend(labels.numpy())
    return np.array(preds), np.array(labels_all)

def plot_results(clean_preds, clean_labels, poisoned_preds, poisoned_labels, classes):
    fig, ax = plt.subplots(1, 2, figsize=(16, 6))
    for i, (preds, labels, title) in enumerate([
        (clean_preds, clean_labels, "Clean Model Confusion Matrix"),
        (poisoned_preds, poisoned_labels, "Poisoned Model Confusion Matrix"),
    ]):
        cm = confusion_matrix(labels, preds)
        sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax[i],
                    xticklabels=classes, yticklabels=classes)
        ax[i].set_title(title)
    plt.tight_layout()
    plt.show()

We run inference on the test set and collect predictions for quantitative analysis. We compute confusion matrices to visualize class-wise behavior for both clean and poisoned models. We use these visual diagnostics to highlight targeted misclassification patterns introduced by the attack. Check out the FULL CODES here.

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010))
])

base_train = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
base_test = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
classes = base_train.classes  # CIFAR-10 class names used for plots and reports

clean_ds = PoisonedCIFAR10(base_train, CONFIG["target_class"], CONFIG["malicious_label"], ratio=0)
poison_ds = PoisonedCIFAR10(base_train, CONFIG["target_class"], CONFIG["malicious_label"], ratio=CONFIG["poison_ratio"])

clean_loader = DataLoader(clean_ds, batch_size=CONFIG["batch_size"], shuffle=True)
poison_loader = DataLoader(poison_ds, batch_size=CONFIG["batch_size"], shuffle=True)
test_loader = DataLoader(base_test, batch_size=CONFIG["batch_size"], shuffle=False)

clean_model = train_and_evaluate(clean_loader, "Clean Training")
poisoned_model = train_and_evaluate(poison_loader, "Poisoned Training")

c_preds, c_true = get_predictions(clean_model, test_loader)
p_preds, p_true = get_predictions(poisoned_model, test_loader)

plot_results(c_preds, c_true, p_preds, p_true, classes)

# Report precision and recall for the targeted class only
print(classification_report(c_true, c_preds, labels=[1], target_names=[classes[1]]))
print(classification_report(p_true, p_preds, labels=[1], target_names=[classes[1]]))

We prepare the CIFAR-10 dataset, construct clean and poisoned dataloaders, and execute both training pipelines end to end. We evaluate the trained models on a shared test set to ensure a fair comparison. We finalize the analysis by reporting class-specific precision and recall to expose the impact of poisoning on the targeted class.
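As an optional follow-up check that is not part of the tutorial above, we can also quantify the targeted attack success rate, that is, the fraction of test images from the target class that the poisoned model assigns to the malicious label:

def attack_success_rate(preds, labels, target_class, malicious_label):
    # Among true target-class samples, count how many are predicted as the malicious label
    mask = labels == target_class
    return (preds[mask] == malicious_label).mean()

print("Clean model   :", attack_success_rate(c_preds, c_true, CONFIG["target_class"], CONFIG["malicious_label"]))
print("Poisoned model:", attack_success_rate(p_preds, p_true, CONFIG["target_class"], CONFIG["malicious_label"]))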

In conclusion, we observed how label-level data poisoning degrades class-specific performance without necessarily destroying overall accuracy. We analyzed this behavior using confusion matrices and per-class classification reports, which reveal targeted failure modes introduced by the attack. This experiment reinforces the importance of data provenance, validation, and monitoring in real-world machine learning systems, especially in safety-critical domains.


Meet SETA: Open Source Training Reinforcement Learning Environments for Terminal Agents with 400 Tasks and CAMEL Toolkit

What does an end to end stack for terminal agents look like when you combine structured toolkits, synthetic RL environments, and benchmark aligned evaluation? A team of researchers from CAMEL AI, Eigent AI and other collaborators has released SETA, a toolkit and environment stack that focuses on reinforcement learning for terminal agents. The project targets agents that operate inside a Unix style shell and must complete verifiable tasks under a benchmark harness such as Terminal Bench.

Three main contributions:

A state of the art terminal agent on Terminal Bench: They achieve state of the art performance with a Claude Sonnet 4.5 based agent on Terminal Bench 2.0 and with a GPT 4.1 based agent on Terminal Bench 1.0. The comparison is restricted to agents that use the same base model.

Scalable RL training with synthetic terminal environments: The research team releases an initial synthetic dataset with 400 terminal tasks that cover a range of difficulty levels. Out of these, 260 tasks are used for RLVR finetuning of a Qwen3-8B model.

A clean agent design that generalizes across training and evaluation frameworks: The same agent implementation is used for both local task runs and the official Terminal Bench evaluation harness.

Terminal Toolkit and log structure

The SETA code repository showcases a Terminal Toolkit that turns a language model into an executable terminal agent. For each task run, the framework creates a structured log directory under evaluation/terminal_bench_run. The README page shows a concrete layout for a task called play-zork.

Key files include:

chatagent.log, which records the full history of agent messages and tool calls, including test results.

A sessions directory with session_logs that capture terminal interactions from the toolkit.

Within session_logs, files such as blocking_commands.log, session_run_zork_1_correct_path.log, session_zork-1.log, and session_zork_start.log store command output for different sessions and modes.

tests.log and tests.log.strip, which record the test run output, with the latter stripping terminal control characters.

This structure gives a concrete way to debug an agent. You can trace from high level chat decisions in chatagent.log down to individual shell commands in the session logs and confirm success or failure from the test logs.

For official Terminal Bench evaluation, the GitHub repository provides a separate entry point under evaluation/terminal_bench_eval. A developer moves into that directory and runs run_eval.sh for Terminal Bench 1.0 and run_tb2.sh for Terminal Bench 2.0.

Results are written into evaluation/terminal_bench_eval/run/{run_id}/results.json. Task specific session logs are placed under evaluation/terminal_bench_eval/logs/camel_logs/{task_id}. The agent class that binds the CAMEL agent to the benchmark is implemented in tbench_camel_agent.py.

Note Taking Toolkit as persistent memory

The research team also introduces a Note Taking Toolkit described as persistent memory for long horizon tasks. They show example note taking tool calls where the agent writes and reads notes in a structured way while solving terminal tasks. The current public material focuses on the existence of this toolkit and the examples of use. It does not yet describe a full training objective for note usage.

The important point is that the agent has an explicit channel where it can externalize intermediate results and hints, separate from the raw terminal buffer.

Understanding the performance

SETA’s agent harness achieves leading results on Terminal Bench. With Claude Sonnet-4.5 as the backbone, the CAMEL terminal agent reaches 46.5% accuracy on Terminal Bench 2.0 across 89 real world tasks, ranking first and outperforming the second system by 3 percentage points, with especially strong results in git workflows, DevOps automation, and code security tasks. On Terminal Bench 1.0, a GPT 4.1 based agent attains 35% accuracy, which is 4.7 percentage points above the next entry, again within the same model family. In comparison, a supervised Qwen3 8B baseline attains 3.4% on Terminal Bench 2.0, and the Qwen3 8B terminal agent trained with the SETA RL pipeline improves over this baseline on the curated synthetic environments.

Key Takeaways

SETA is a joint community project that provides both agent toolkits and synthetic RL environments specifically for terminal agents, aligned with the Terminal Bench evaluation format.

The framework reports state of the art performance for CAMEL terminal agents on Terminal Bench 1.0 and 2.0 when using Claude Sonnet 4.5 and GPT 4.1 as the base models, evaluated against agents built on the same model families.

The SETA RL dataset on Hugging Face contains 400 synthetic terminal tasks, each packaged as task.yaml, Dockerfile, and run-tests.sh, with 260 tasks used for RLVR finetuning of a Qwen3-8B based agent.

The open source SETA codebase exposes a Terminal Toolkit with structured logging and a Note Taking Toolkit for long horizon memory, and integrates directly with Terminal Bench evaluation scripts and logging paths in the seta GitHub repository.

The overall design demonstrates a clean path from synthetic RL environments to benchmark verified agents, giving developers a reproducible stack to train, debug, and evaluate terminal agents rather than relying on ad hoc tool calling examples.


Meta and Harvard Researchers Introduce the Confucius Code Agent (CCA): A Software Engineering Agent that can Operate at Large-Scale Codebases

How far can a mid sized language model go if the real innovation moves from the backbone into the agent scaffold and tool stack? Meta and Harvard researchers have released the Confucius Code Agent, an open sourced AI software engineer built on the Confucius SDK that is designed for industrial scale software repositories and long running sessions. The system targets real GitHub projects, complex test toolchains at evaluation time, and reproducible results on benchmarks such as SWE Bench Pro and SWE Bench Verified, while exposing the full scaffold for developers.

Figure source: https://arxiv.org/pdf/2512.10398

Confucius SDK, scaffolding around the model

The Confucius SDK is an agent development platform that treats scaffolding as a primary design problem rather than a thin wrapper around a language model. It is organized around 3 axes, Agent Experience, User Experience, and Developer Experience.

Agent Experience controls what the model sees, including context layout, working memory and tool results. User Experience focuses on readable traces, code diffs and safeguards for human engineers. Developer Experience focuses on observability, configuration and debugging of the agent itself.

The SDK introduces 3 core mechanisms, a unified orchestrator with hierarchical working memory, a persistent note taking system, and a modular extension interface for tools. A meta agent then automates synthesis and refinement of agent configurations through a build, test, improve loop. The Confucius Code Agent is one concrete instantiation of this scaffold for software engineering.


Hierarchical working memory for long horizon coding

Real software tasks on SWE Bench Pro often require reasoning over dozens of files and many interaction steps. The orchestrator in Confucius SDK maintains hierarchical working memory, which partitions a trajectory into scopes, summarizes past steps and keeps compressed context for later turns.

This design helps keep prompts within model context limits while preserving important artifacts such as patches, error logs and design decisions. The key point is that effective tool based coding agents need an explicit memory architecture, not just a sliding window of previous messages.

Persistent note taking for cross session learning

The second mechanism is a note taking system that uses a dedicated agent to write structured Markdown notes from execution traces. These notes capture task specific strategies, repository conventions and common failure modes, and they are stored as long term memory that can be reused across sessions.

The research team ran Confucius Code Agent twice on 151 SWE Bench Pro instances with Claude 4.5 Sonnet. On the first run the agent solves tasks from scratch and generates notes. On the second run the agent reads these notes. In this setting, average turns drop from 64 to 61, token usage drops from about 104k to 93k, and Resolve@1 improves from 53.0 to 54.4. This shows that notes are not just logs, they function as effective cross session memory.

Modular extensions and tool use sophistication

Confucius SDK exposes tools as extensions, for example file editing, command execution, test runners and code search. Each extension can maintain its own state and prompt wiring.

The research team studies the impact of tool use sophistication using an ablation on a 100 example subset of SWE Bench Pro. With Claude 4 Sonnet, moving from a configuration without advanced context features to one with advanced context raises Resolve@1 from 42.0 to 48.6. With Claude 4.5 Sonnet, a simple tool use configuration reaches 44.0, while richer tool handling reaches 51.6, with 51.0 for an intermediate variant. These numbers indicate that how the agent chooses and sequences tools matters almost as much as the backbone model choice.


Meta agent for automatic agent design

On top of these mechanisms, the Confucius SDK includes a meta agent that takes a natural language specification of an agent and iteratively proposes configurations, prompts and extension sets. It then runs the candidate agent on tasks, inspects traces and metrics, and edits the configuration in a build, test, improve loop.

The Confucius Code Agent that the research team evaluates is produced with the help of this meta agent, rather than only hand tuned. This approach turns some of the agent engineering process itself into an LLM guided optimization problem.
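In spirit, the meta agent runs a loop like the sketch below, where the proposal, benchmark, and editing functions stand in for LLM calls and evaluation runs; all names here are illustrative, not the SDK’s API:

def meta_agent_loop(spec, propose, run_benchmark, edit, iterations=5):
    # Build: draft an initial agent configuration from the natural language specification
    config = propose(spec)
    best_config, best_score = config, float("-inf")
    for _ in range(iterations):
        # Test: run the candidate agent and collect traces plus metrics
        traces, score = run_benchmark(config)
        if score > best_score:
            best_config, best_score = config, score
        # Improve: edit prompts, tool sets, and settings based on the observed traces
        config = edit(config, traces, score)
    return best_config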

Results on SWE Bench Pro and SWE Bench Verified

The main evaluation uses SWE Bench Pro, which has 731 GitHub issues that require modifying real repositories until tests pass. All compared systems share the same repositories, tool environment and evaluation harness, so differences come from the scaffolds and models.

On SWE Bench Pro, the reported Resolve@1 scores are

Claude 4 Sonnet with SWE Agent, 42.7

Claude 4 Sonnet with Confucius Code Agent, 45.5

Claude 4.5 Sonnet with SWE Agent, 43.6

Claude 4.5 Sonnet with Live SWE Agent, 45.8

Claude 4.5 Sonnet with Confucius Code Agent, 52.7

Claude 4.5 Opus with Anthropic system card scaffold, 52.0

Claude 4.5 Opus with Confucius Code Agent, 54.3

These results show that a strong scaffold with a mid tier model, Claude 4.5 Sonnet with Confucius Code Agent at 52.7, can outperform a stronger model with a weaker scaffold, Claude 4.5 Opus with 52.0.

On SWE Bench Verified, Confucius Code Agent with Claude 4 Sonnet reaches Resolve@1 74.6, compared to 66.6 for SWE Agent and 72.8 for OpenHands. A mini SWE Agent variant with Claude 4.5 Sonnet reaches 70.6, which is also below Confucius Code Agent with Claude 4 Sonnet.

The research team also report performance as a function of edited file count. For tasks editing 1 to 2 files, Confucius Code Agent reaches 57.8 Resolve@1, for 3 to 4 files it reaches 49.2, for 5 to 6 files it reaches 44.1, for 7 to 10 files it reaches 52.6, and for more than 10 files it reaches 44.4. This indicates stable behavior on multi file changes in large codebases.

Key Takeaways

Scaffolding can outweigh model size: Confucius Code Agent shows that with strong scaffolding, Claude 4.5 Sonnet reaches 52.7 Resolve@1 on SWE-Bench-Pro, surpassing Claude 4.5 Opus with a weaker scaffold at 52.0.

Hierarchical working memory is essential for long horizon coding: The Confucius SDK orchestrator uses hierarchical working memory and context compression to manage long trajectories over large repositories, rather than relying on a simple rolling history.

Persistent notes act as effective cross session memory: On 151 SWE-Bench-Pro tasks with Claude 4.5 Sonnet, reusing structured notes reduces turns from 64 to 61, token usage from about 104k to 93k, and increases Resolve@1 from 53.0 to 54.4.

Tool configuration materially impacts success rates: On a 100 task SWE-Bench-Pro subset, moving from simple to richer tool handling with Claude 4.5 Sonnet increases Resolve@1 from 44.0 to 51.6, indicating that learned tool routing and recovery strategies are a major performance lever, not just an implementation detail.

Meta agent automates agent design and tuning: A meta agent iteratively proposes prompts, tool sets and configurations, then evaluates and edits them in a build, test, improve loop, and the production Confucius Code Agent is itself generated with this process rather than only manual tuning.


How to Build Portable, In-Database Feature Engineering Pipelines with Ibis Using Lazy Python APIs and DuckDB Execution

In this tutorial, we demonstrate how we use Ibis to build a portable, in-database feature engineering pipeline that looks and feels like Pandas but executes entirely inside the database. We show how we connect to DuckDB, register data safely inside the backend, and define complex transformations using window functions and aggregations without ever pulling raw data into local memory. By keeping all transformations lazy and backend-agnostic, we demonstrate how to write analytics code once in Python and rely on Ibis to translate it into efficient SQL. Check out the FULL CODES here.

!pip -q install "ibis-framework[duckdb,examples]" duckdb pyarrow pandas

import ibis
from ibis import _

print("Ibis version:", ibis.__version__)

con = ibis.duckdb.connect()
ibis.options.interactive = True

We install the required libraries and initialize the Ibis environment. We establish a DuckDB connection and enable interactive execution so that all subsequent operations remain lazy and backend-driven. Check out the FULL CODES here.

try:
    base_expr = ibis.examples.penguins.fetch(backend=con)
except TypeError:
    base_expr = ibis.examples.penguins.fetch()

if "penguins" not in con.list_tables():
    try:
        con.create_table("penguins", base_expr, overwrite=True)
    except Exception:
        con.create_table("penguins", base_expr.execute(), overwrite=True)

t = con.table("penguins")
print(t.schema())

We load the Penguins dataset and explicitly register it inside the DuckDB catalog to ensure it is available for SQL execution. We verify the table schema and confirm that the data now lives inside the database rather than in local memory. Check out the FULL CODES here.

def penguin_feature_pipeline(penguins):
    base = penguins.mutate(
        bill_ratio=_.bill_length_mm / _.bill_depth_mm,
        is_male=(_.sex == "male").ifelse(1, 0),
    )

    cleaned = base.filter(
        _.bill_length_mm.notnull()
        & _.bill_depth_mm.notnull()
        & _.body_mass_g.notnull()
        & _.flipper_length_mm.notnull()
        & _.species.notnull()
        & _.island.notnull()
        & _.year.notnull()
    )

    w_species = ibis.window(group_by=[cleaned.species])
    w_island_year = ibis.window(
        group_by=[cleaned.island],
        order_by=[cleaned.year],
        preceding=2,
        following=0,
    )

    feat = cleaned.mutate(
        species_avg_mass=cleaned.body_mass_g.mean().over(w_species),
        species_std_mass=cleaned.body_mass_g.std().over(w_species),
        mass_z=(
            cleaned.body_mass_g
            - cleaned.body_mass_g.mean().over(w_species)
        ) / cleaned.body_mass_g.std().over(w_species),
        island_mass_rank=cleaned.body_mass_g.rank().over(
            ibis.window(group_by=[cleaned.island])
        ),
        rolling_3yr_island_avg_mass=cleaned.body_mass_g.mean().over(
            w_island_year
        ),
    )

    return feat.group_by(["species", "island", "year"]).agg(
        n=feat.count(),
        avg_mass=feat.body_mass_g.mean(),
        avg_flipper=feat.flipper_length_mm.mean(),
        avg_bill_ratio=feat.bill_ratio.mean(),
        avg_mass_z=feat.mass_z.mean(),
        avg_rolling_3yr_mass=feat.rolling_3yr_island_avg_mass.mean(),
        pct_male=feat.is_male.mean(),
    ).order_by(["species", "island", "year"])

We define a reusable feature engineering pipeline using pure Ibis expressions. We compute derived features, apply data cleaning, and use window functions and grouped aggregations to build advanced, database-native features while keeping the entire pipeline lazy. Check out the FULL CODES here.

features = penguin_feature_pipeline(t)
print(con.compile(features))

try:
    df = features.to_pandas()
except Exception:
    df = features.execute()

display(df.head())

We invoke the feature pipeline and compile it into DuckDB SQL to validate that all transformations are pushed down to the database. We then run the pipeline and return only the final aggregated results for inspection. Check out the FULL CODES here.

con.create_table("penguin_features", features, overwrite=True)

feat_tbl = con.table("penguin_features")

try:
    preview = feat_tbl.limit(10).to_pandas()
except Exception:
    preview = feat_tbl.limit(10).execute()

display(preview)

out_path = "/content/penguin_features.parquet"
con.raw_sql(f"COPY penguin_features TO '{out_path}' (FORMAT PARQUET);")
print(out_path)

We materialize the engineered features as a table directly inside DuckDB and query it lazily for verification. We also export the results to a Parquet file, demonstrating how we can hand off database-computed features to downstream analytics or machine learning workflows.
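As a quick sanity check on that handoff, which is not part of the original pipeline, we can read the exported Parquet file back with pandas and confirm the schema matches the aggregated features:

import pandas as pd

check = pd.read_parquet("/content/penguin_features.parquet")
print(check.shape)
print(check.columns.tolist())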

In conclusion, we constructed, compiled, and executed an advanced feature engineering workflow fully inside DuckDB using Ibis. We demonstrated how to inspect the generated SQL, materialized results directly in the database, and exported them for downstream use while preserving portability across analytical backends. This approach reinforces the core idea behind Ibis: we keep computation close to the data, minimize unnecessary data movement, and maintain a single, reusable Python codebase that scales from local experimentation to production databases.


Crossmodal search with Amazon Nova Multimodal Embeddings

Amazon Nova Multimodal Embeddings processes text, documents, images, video, and audio through a single model architecture. Available through Amazon Bedrock, the model converts different input modalities into numerical embeddings within the same vector space, supporting direct similarity calculations regardless of content type. We developed this unified model to reduce the need for separate embedding models, which complicate architectures, are difficult to maintain and operate, and further limit use cases to a one-dimensional approach.
In this post, we explore how Amazon Nova Multimodal Embeddings addresses the challenges of crossmodal search through a practical ecommerce use case. We examine the technical limitations of traditional approaches and demonstrate how Amazon Nova Multimodal Embeddings enables retrieval across text, images, and other modalities. You learn how to implement a crossmodal search system by generating embeddings, handling queries, and measuring performance. We provide working code examples and share how to add these capabilities to your applications.
The search problem
Traditional approaches involve keyword-based search, text embeddings-based natural language search, or hybrid search, and they can’t process visual queries effectively, creating a gap between user intent and retrieval capabilities. Typical search architectures separate visual and textual processing, losing context in the process. Text queries execute against product descriptions using keyword matching or text embeddings. Image queries, when supported, operate through multiple computer vision pipelines with limited integration to textual content. This separation complicates system architecture and weakens the user experience. Multiple embedding models require separate maintenance and optimization cycles, while crossmodal queries cannot be processed natively within a single system. Visual and textual similarity scores operate in different mathematical spaces, making it difficult to rank results consistently across content types. This separation requires complex mapping that can’t always be done, so embedding systems are kept separate, creating data silos and limiting functionality. Complex product content compounds the problem, because product pages combine images, descriptions, specifications, and sometimes video demonstrations.
Crossmodal embeddings
Crossmodal embeddings map text, images, audio, and video into a shared vector space where semantically similar content clusters together. For example, when processing a text query red summer dress and an image of a red dress, both inputs generate vectors close together in the embedding space, reflecting their semantic similarity and unlocking crossmodal retrieval.
By using crossmodal embeddings, you can search across different content types without maintaining separate systems for each modality, solving the problem of segmented multimodal systems where organizations manage multiple embedding models that are nearly impossible to integrate effectively because embeddings from different modalities are incompatible. A single model architecture helps ensure that you have consistent embedding generation across all content types while related content, such as product images, videos, and their descriptions, generates similar embeddings because of joint training objectives. Applications can generate embeddings for all content types using identical API endpoints and vector dimensions, reducing system complexity.
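Because all modalities land in the same vector space, comparing a text query with a product image reduces to a single cosine similarity between two vectors. The following minimal sketch uses NumPy; the embedding calls that produce the vectors are shown later in this post:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# query_vec could come from embedding the text "red summer dress" and
# image_vec from embedding a catalog photo of a red dress; both have the same dimension
# score = cosine_similarity(query_vec, image_vec)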
Use case: Ecommerce search
Consider a customer who sees a shirt on TV and wants to find similar items for purchase. They can photograph the item with their phone or try to describe what they saw in text and use this to search for a product. Traditional search handles text queries that reference metadata reasonably well, but it falls short when customers want to search with an image or describe the visual attributes of an item. This TV-to-cart shopping experience shows how visual and text search work together. The customer uploads a photo, and the system matches it against product catalogs with both images and descriptions. The crossmodal ecommerce workflow is shown in the following figure.

How Amazon Nova Multimodal Embeddings helps
Amazon Nova handles different types of search queries through the same model, which creates both new search capabilities and technical advantages. Whether you upload images, enter descriptions using text, or combine both, the process works the same way.
Crossmodal search capabilities
As previously stated, Amazon Nova Multimodal Embeddings processes all supported modalities through a unified model architecture. Input content can be text, images, documents, video, or audio and then it generates embeddings in the same vector space. This supports direct similarity calculations between different content types without additional transformation layers. When customers upload images, the system converts them into embeddings and searches against the product catalog using cosine similarity. You get products with similar visual characteristics, regardless of how they’re described in text. Text queries work the same way—customers can describe what they want and find visually similar products, even when the product descriptions use different words. If the customer uploads an image with a text description, the system processes both inputs through the same embedding model for unified similarity scoring. The system also extracts product attributes from images automatically through automated product tagging, supporting semantic tag generation that goes beyond manual categorization.
Technical advantages
The unified architecture has several benefits over separate text and image embeddings. The single-model design and shared semantic space unlocks new use cases that aren’t attainable by managing multiple embedding systems. Applications generate embeddings for all content types using the same API endpoints and vector dimensions. A single model handles all five modalities, so related content, such as product images and their descriptions, produce similar embeddings. You can calculate distances between any combination of text, images, audio, and video to measure how similar they are.
The Amazon Nova Multimodal Embeddings model uses Matryoshka representation learning, supporting multiple embedding dimensions: 3072, 1024, 384, and 256. Matryoshka embedding learning stores the most important information in the first dimensions and less critical details in later dimensions. You can truncate from the end (shown in the following figure) to reduce storage space while maintaining accuracy for your specific use case.
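For example, a 3072-dimensional Matryoshka embedding can be shortened by keeping only the leading dimensions and re-normalizing before similarity search; this is a general sketch rather than code from this post:

import numpy as np

def truncate_embedding(vec, dim):
    # Keep the first `dim` dimensions, then re-normalize for cosine similarity
    v = np.asarray(vec, dtype=np.float32)[:dim]
    return v / np.linalg.norm(v)

# full = a 3072-dimensional embedding from the model
# short = truncate_embedding(full, 1024)  # or 384, or 256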

Architecture
Three main components are required to build this approach: embedding generation, vector storage, and similarity search. Product catalogs undergo preprocessing to generate embeddings for all content types. Query processing converts user inputs into embeddings using the same model. Similarity search compares query embeddings against stored product embeddings, as shown in the following figure.

Vector storage systems must support the chosen embedding dimensions and provide efficient similarity search operations. Options include purpose-built vector databases, traditional databases with vector extensions, or cloud-centered vector services such as Amazon S3 Vectors, a feature of Amazon S3 that provides native support for storing and querying vector embeddings directly within S3.
Prerequisites
To follow this implementation, you need an AWS account with Amazon Bedrock access permissions for the Amazon Nova Multimodal Embeddings model, as well as access to S3 Vectors. You can follow along in the notebook available in our Amazon Nova samples repository.
Implementation
In the following sections, we skip the initial data download and extraction steps, but the end-to-end approach is available for you to follow along in this notebook. The omitted steps include downloading the Amazon Berkeley Objects (ABO) dataset archives, which include product metadata, catalog images, and 3D models. These archives require extraction and preprocessing to parse approximately 398,212 images and 9,232 product listings from compressed JSON and tar files. After extraction, the data requires metadata alignment between product descriptions and their corresponding visual assets. We begin this walkthrough after these preliminary steps are complete, focusing on the core workflow: setting up S3 Vectors, generating embeddings with Amazon Nova Multimodal Embeddings, storing vectors at scale, and implementing crossmodal retrieval. Let’s get started.
S3 Vector bucket and index creation:
Create the vector storage infrastructure for embeddings. S3 Vectors is a managed service for storing and querying high-dimensional vectors at scale. The bucket acts as a container for your vector data, while the index defines the structure and search characteristics. We configure the index with the cosine distance metric, which measures similarity based on vector direction rather than magnitude, making it ideal for normalized embeddings such as those produced by Amazon Nova Multimodal Embeddings.

import boto3

# S3 Vectors configuration
s3vector_bucket = "amzn-s3-demo-vector-bucket-crossmodal-search"
s3vector_index = "product"
embedding_dimension = 1024
s3vectors = boto3.client("s3vectors", region_name="us-east-1")

# Create S3 vector bucket
s3vectors.create_vector_bucket(vectorBucketName=s3vector_bucket)

# Create index
s3vectors.create_index(
    vectorBucketName=s3vector_bucket,
    indexName=s3vector_index,
    dataType='float32',
    dimension=embedding_dimension,
    distanceMetric='cosine'
)

Product catalog preprocessing:
Here we generate embeddings. Both product images and textual descriptions require embedding generation and storage with appropriate metadata for retrieval. The Amazon Nova Embeddings API processes each modality independently, converting text descriptions and product images into 1024-dimensional vectors. These vectors live in a unified semantic space, which means a text embedding and an image embedding of the same product will be geometrically close to each other.

# Initialize Nova Embeddings client
import base64
import json

import boto3

class NovaEmbeddings:
    def __init__(self, region='us-east-1'):
        self.bedrock = boto3.client('bedrock-runtime', region_name=region)
        self.model_id = "amazon.nova-2-multimodal-embeddings-v1:0"

    def embed_text(self, text: str, dimension: int = 1024, purpose: str = "GENERIC_INDEX"):
        request_body = {
            "taskType": "SINGLE_EMBEDDING",
            "singleEmbeddingParams": {
                "embeddingDimension": dimension,
                "embeddingPurpose": purpose,
                "text": {
                    "truncationMode": "END",
                    "value": text
                }
            }
        }
        response = self.bedrock.invoke_model(modelId=self.model_id, body=json.dumps(request_body))
        result = json.loads(response['body'].read())
        return result['embeddings'][0]['embedding']

    def embed_image(self, image_bytes: bytes, dimension: int = 1024, purpose: str = "GENERIC_INDEX"):
        request_body = {
            "taskType": "SINGLE_EMBEDDING",
            "singleEmbeddingParams": {
                "embeddingDimension": dimension,
                "embeddingPurpose": purpose,
                "image": {
                    "format": "jpeg",
                    "source": {"bytes": base64.b64encode(image_bytes).decode()}
                }
            }
        }
        response = self.bedrock.invoke_model(modelId=self.model_id, body=json.dumps(request_body))
        result = json.loads(response['body'].read())
        return result['embeddings'][0]['embedding']

embeddings = NovaEmbeddings()

We use the following code to generate the embeddings and upload the data to our vector store.

# Generate embeddings and upload to Amazon S3 Vectors
import numpy as np
from tqdm import tqdm

def get_product_text(product):
    name = product.get('item_name', [{}])[0].get('value', '') if isinstance(product.get('item_name'), list) else str(product.get('item_name', ''))
    brand = product.get('brand', [{}])[0].get('value', '') if product.get('brand') else ''
    return f"{name}. {brand}".strip()

vectors_to_upload = []
batch_size = 10
catalog = []  # Keep for local reference

# sampled_products and get_image_path come from the preprocessing steps in the notebook
for product in tqdm(sampled_products, desc="Processing products"):
    img_path = get_image_path(product)
    text = get_product_text(product)
    product_id = product.get('item_id', str(len(catalog)))

    with open(img_path, 'rb') as f:
        img_bytes = f.read()

    # Generate embeddings
    text_emb = embeddings.embed_text(text)
    image_emb = embeddings.embed_image(img_bytes)

    # Store in catalog for local use
    catalog.append({
        'text': text,
        'image_path': str(img_path),
        'text_emb': text_emb,
        'image_emb': image_emb,
        'product_id': product_id
    })

    # Prepare vectors for S3 upload
    vectors_to_upload.extend([
        {
            "key": f"text-{product_id}",
            "data": {"float32": text_emb},
            "metadata": {"product_id": product_id, "text": text, "image_path": str(img_path), "type": "text"}
        },
        {
            "key": f"image-{product_id}",
            "data": {"float32": image_emb},
            "metadata": {"product_id": product_id, "text": text, "image_path": str(img_path), "type": "image"}
        },
        {
            "key": f"combined-{product_id}",
            "data": {"float32": np.mean([text_emb, image_emb], axis=0).tolist()},
            "metadata": {"product_id": product_id, "text": text, "image_path": str(img_path), "type": "combined"}
        }
    ])

    # Batch upload
    if len(vectors_to_upload) >= batch_size * 3:
        s3vectors.put_vectors(vectorBucketName=s3vector_bucket, indexName=s3vector_index, vectors=vectors_to_upload)
        vectors_to_upload = []

# Upload remaining vectors
if vectors_to_upload:
    s3vectors.put_vectors(vectorBucketName=s3vector_bucket, indexName=s3vector_index, vectors=vectors_to_upload)

Query processing: 
This code handles customer input through the API. Text queries, image uploads, or combinations convert into the same vector format used for your product catalog. For multimodal queries that combine text and image, we apply mean fusion to create a single query vector that captures information from both modalities. The query processing logic handles three distinct input types and prepares the appropriate embedding representation for similarity search against the S3 Vectors index.

def search_s3(query=None, query_image=None, query_type='text', search_mode='combined', top_k=5):
    """
    Search using S3 Vectors
    query_type: 'text', 'image', or 'both'
    search_mode: 'text', 'image', or 'combined'
    """
    # Get query embedding
    if query_type == 'both':
        text_emb = embeddings.embed_text(query)
        with open(query_image, 'rb') as f:
            image_emb = embeddings.embed_image(f.read())
        query_emb = np.mean([text_emb, image_emb], axis=0).tolist()
        query_image_path = query_image
    elif query_type == 'text':
        query_emb = embeddings.embed_text(query)
        query_image_path = None
    else:
        with open(query_image, 'rb') as f:
            query_emb = embeddings.embed_image(f.read())
        query_image_path = query_image

Vector similarity search: 
Next, we add crossmodal retrieval using the S3 Vectors query API. The system finds the closest embedding match to the query, regardless of whether it was text or an image. We use cosine similarity as the distance metric, which measures the angle between vectors rather than their absolute distance. This approach works well for normalized embeddings and is resource efficient, making it suitable for large catalogs when paired with approximate nearest neighbor algorithms. S3 Vectors handles the indexing and search infrastructure, so you can focus on the application logic while the service manages scalability and performance optimization.

    # (continuing inside search_s3) Query S3 Vectors
    response = s3vectors.query_vectors(
        vectorBucketName=s3vector_bucket,
        indexName=s3vector_index,
        queryVector={"float32": query_emb},
        topK=top_k,
        returnDistance=True,
        returnMetadata=True,
        filter={"metadata.type": {"equals": search_mode}}
    )

Result ranking: 
The similarity scores computed by S3 Vectors provide the ranking mechanism. Cosine similarity between query and catalog embeddings determines result order, with higher scores indicating better matches. In production systems, you would typically collect click-through data and relevance judgments to validate that the ranking correlates with actual user behavior. S3 Vectors returns distance values, which we convert to similarity scores (1 - distance) for intuitive interpretation where higher values indicate closer matches.

    # (continuing inside search_s3) Extract and rank results by similarity
    ranked_results = []
    for result in response['vectors']:
        metadata = result['metadata']
        distance = result.get('distance', 0)
        similarity = 1 - distance  # Convert distance to similarity score

        ranked_results.append({
            'product_id': metadata['product_id'],
            'text': metadata['text'],
            'image_path': metadata['image_path'],
            'similarity': similarity,
            'distance': distance
        })

    # Results are returned by S3 Vectors with the best matches first
    return ranked_results
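To make the retrieval flow concrete, here is a hypothetical set of calls to the search_s3 function defined above (the image path is a placeholder):

# Text-only query against the combined embeddings
text_results = search_s3(query="blue striped cotton shirt", query_type='text', search_mode='combined')

# Image-only query against image embeddings
image_results = search_s3(query_image="shirt_photo.jpg", query_type='image', search_mode='image')

# Multimodal query that fuses text and image into a single query vector
multimodal_results = search_s3(
    query="long sleeve version of this shirt",
    query_image="shirt_photo.jpg",
    query_type='both',
    search_mode='combined',
    top_k=5,
)

for result in text_results[:3]:
    print(f"{result['similarity']:.3f}  {result['text']}")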

Conclusion
Amazon Nova Multimodal Embeddings solves the core problem of crossmodal search by using one model instead of managing separate systems. You can use Amazon Nova Multimodal Embeddings to build search that works whether customers upload images, enter descriptions as text, or combine both approaches.
The implementation is straightforward using Amazon Bedrock APIs, and the Matryoshka embedding dimensions let you optimize for your specific accuracy and cost requirements. If you’re building ecommerce search, content discovery, or an application where users interact with multiple content types, this unified approach reduces both development complexity and operational overhead.
Matryoshka representation learning maintains embedding quality across different dimensions [2]. Performance degradation follows predictable patterns, allowing applications to optimize for specific use cases.
Next steps
Amazon Nova Multimodal Embeddings is available in Amazon Bedrock. See Using Nova Embeddings for API references, code examples, and integration patterns for common architectures.
The AWS samples repository contains implementation examples for multimodal embeddings.
Walk through this specific ecommerce example in the notebook.

About the authors
Tony Santiago is a Worldwide Partner Solutions Architect at AWS, dedicated to scaling generative AI adoption across Global Systems Integrators. He specializes in solution building, technical go-to-market alignment, and capability development—enabling tens of thousands of builders at GSI partners to deliver AI-powered solutions for their customers. Drawing on more than 20 years of global technology experience and a decade with AWS, Tony champions practical technologies that drive measurable business outcomes. Outside of work, he’s passionate about learning new things and spending time with family.
Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.
Sharon Li is a solutions architect at AWS, based in the Boston, MA area. She works with enterprise customers, helping them solve difficult problems and build on AWS. Outside of work, she likes to spend time with her family and explore local restaurants.
Sundaresh R. Iyer is a Partner Solutions Architect at Amazon Web Services (AWS), where he works closely with channel partners and system integrators to design, scale, and operationalize generative AI and agentic architectures. With over 15 years of experience spanning product management, developer platforms, and cloud infrastructure, he specializes in machine learning and AI-powered developer tooling. Sundaresh is passionate about helping partners move from experimentation to production by building secure, governed, and scalable AI systems that deliver measurable business outcomes.

Accelerating LLM inference with post-training weight and activation us …

Foundation models (FMs) and large language models (LLMs) have been rapidly scaling, often doubling in parameter count within months, leading to significant improvements in language understanding and generative capabilities. This rapid growth comes with steep costs: inference now requires enormous memory capacity, high-performance GPUs, and substantial energy consumption. This trend is evident in the open source space. In 2023, TII-UAE released Falcon 180B, the largest open model at the time. Meta surpassed that in 2024 with Llama 3.1, a 405B dense model. As of mid-2025, the largest publicly available model is DeepSeek (V3 – Instruct variant, R1 – Reasoning variant), a mixture of experts (MoE) architecture with 671 billion total parameters—of which 37 billion are active per token. These models deliver state-of-the-art performance across a wide range of tasks, including multi-modal search, code generation, summarization, idea generation, logical reasoning, and even PhD-level problem solving. Despite their value, deploying such models in real-world applications remains largely impractical because of their size, cost, and infrastructure requirements.
We often rely on the intelligence of large models for mission-critical applications such as customer-facing assistants, medical research, or enterprise agents, where hallucinations can lead to serious consequences. However, deploying models with over 100 billion parameters at scale is technically challenging—these models require significant GPU resources and memory bandwidth, making it difficult to spin up or scale down instances quickly in response to fluctuating user demand. As a result, scaling to thousands of users quickly becomes cost-prohibitive, because the high-performance infrastructure requirements make the return on investment (ROI) difficult to justify. Post-training quantization (PTQ) offers a practical alternative; by converting 16- or 32-bit weights and activations into lower-precision 8- or 4-bit integers after training, PTQ can shrink model size by 2–8 times, reduce memory bandwidth requirements, and speed up matrix operations, all without the need for retraining, making it suitable for deploying large models more efficiently. For example, the base DeepSeek-V3 model requires an ml.p5e.48xlarge instance (with 1128 GB H100 GPU memory) for inference, while its quantized variant (QuixiAI/DeepSeek-V3-0324-AWQ) can run on smaller instances such as ml.p5.48xlarge (with 640 GB H100 GPU memory) or even ml.p4de.24xlarge (with 640 GB A100 GPU memory). This efficiency is achieved by applying low-bit quantization to less influential weight channels, while preserving or rescaling the channels that have the greatest impact on activation responses, and keeping activations in full precision—dramatically reducing peak memory usage.
Quantized models are made possible by contributions from the developer community—including projects like Unsloth AI and QuixiAI (formerly: Cognitive Computations)—that invest significant time and resources into optimizing LLMs for efficient inference. These quantized models can be seamlessly deployed on Amazon SageMaker AI using a few lines of code. Amazon SageMaker Inference provides a fully managed service for hosting machine learning, deep learning, and large language or vision models at scale in a cost-effective and production-ready manner. In this post, we explore why quantization matters—how it enables lower-cost inference, supports deployment on resource-constrained hardware, and reduces both the financial and environmental impact of modern LLMs, while preserving most of their original performance. We also take a deep dive into the principles behind PTQ and demonstrate how to quantize the model of your choice and deploy it on Amazon SageMaker.
The steps are:

Choose model
Choose WxAy technique (WxAy here implies weights and activations, which will be discussed in depth later in this post)
Choose algorithm (AWQ, GPTQ, SmoothQuant, and so on)
Quantize
Deploy and run inference

To illustrate this workflow and help visualize the process, we’ve included the following flow diagram.

Prerequisites
To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created. For more information, see Create an AWS account.
If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain.
By default, the model runs in a shared AWS managed virtual private cloud (VPC) with internet access. To enhance security and control access, you should explicitly configure a private VPC with appropriate security groups and IAM policies based on your requirements.
Amazon SageMaker AI provides enterprise-grade security features to help keep your data and applications secure and private. We don’t share your data with model providers, providing you full control over your data. This applies to all models—both proprietary and publicly available, including DeepSeek-R1 on SageMaker. For more information, see Configure security in Amazon SageMaker AI.
As a best practice, it’s always recommended to deploy your LLM’s endpoints inside your VPC and behind a private subnet without internet gateways and preferably with no egress. Ingress from the internet should also be blocked to minimize security risks.
In this post, we use the LiteLLM Python SDK to standardize and abstract access to Amazon SageMaker real-time endpoints and the LLMPerf tool to evaluate the performance of our quantized models. See Installation in the LLMPerf GitHub repo for setup instructions.
Weights and activation techniques (WₓAᵧ)
As the scale of LLMs continues to grow, deploying them efficiently becomes less about raw performance and more about finding the right balance between speed, cost, and accuracy. In real-world scenarios, quantization starts with three core considerations:

The size of the model you need to host
The cost or target hardware available for inference
The acceptable trade-off between accuracy and inference speed

Understanding how these factors shape quantization choices is key to making LLMs viable in production environments. We’ll explore how post-training quantization techniques like AWQ and generative pre-trained transformers quantization (GPTQ) help navigate these constraints and make state-of-the-art models deployable at scale.
Weights and activation: A deep dive

In neural networks, weights are the static, learned parameters saved in the model—think of them as the fixed coefficients that shape how inputs are combined—while activations are the dynamic values produced at each layer when you run data through the network, representing the response of each neuron to its inputs. The preceding figure illustrates weights and activations in a model flow. We capture their respective precisions with the shorthand WₓAᵧ, where Wₓ is the bit-width for weights (for example, 4-bit or 8-bit) and Aᵧ is the bit-width for activations (for example, 8-bit or 16-bit). For example, W4A16 means weights are stored as 4-bit integers (often with per-channel, symmetric or asymmetric scaling) while activations remain in 16-bit floating point. This notation tells you which parts of the model are compressed and by how much, helping you balance memory use, compute speed, and accuracy.
W4A16 (or W4A16_symmetric)
W4A16 refers to 4-bit precision for weights and 16-bit for activations, using a symmetric quantization for weights. Symmetric quantization means the quantizer’s range is centered around zero (the absolute minimum and maximum of the weight distribution are set to be equal in magnitude). Using 4-bit integer weights yields an 8-times reduction in weight memory compared to FP32 (or 4 times compared to FP16), which is very attractive for deployment. However, with only 16 quantization levels (−8 to +7 for a 4-bit signed integer, in a symmetric scheme), the model is prone to quantization error. If the weight distribution isn’t perfectly zero-centered (for example, if weights have a slight bias or a few large outliers), a symmetric quantizer might waste range on one side and not have enough resolution where the bulk of values lie. Studies have found that a naive 4-bit symmetric quantization of LLM weights can incur a noticeable accuracy drop and is generally inferior to using an asymmetric scheme at this low bit-width. The symmetric W4A16 approach is mainly a baseline; without additional techniques (like AWQ’s scaling or GPTQ’s error compensation), 4-bit weight quantization needs careful handling to avoid serious degradation.
W4A16_asymmetric
Using 4-bit weights with an asymmetric quantization improves upon the symmetric case by introducing a zero-point offset. Asymmetric quantization maps the minimum weight to the lowest representable integer and the maximum weight to the highest integer, rather than forcing the range to be symmetric around zero. This allows the small 4-bit scale to cover the actual range of weight values more effectively. In practice, 4-bit weight quantization with asymmetric scaling significantly outperforms the symmetric approach in terms of model accuracy. By better utilizing all 16 levels of the quantizer (especially when the weight distribution has a non-zero mean or prominent outliers on one side), the asymmetric W4A16 scheme can reduce the quantization error. Modern PTQ methods for 4-bit LLMs almost always incorporate some form of asymmetric or per-channel scaling for this reason. For example, one approach is group-wise quantization where each group of weights (for example, each output channel) gets its own min-max range—effectively an asymmetric quantization per group—which has been identified as a sweet-spot when combined with 4-bit weights. W4A16 with asymmetric quantization is the preferred strategy for pushing weights to ultra-low precision, because it yields better perplexity and accuracy retention than a symmetric 4-bit mapping.
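The difference is easy to see in a toy example. The following sketch (illustrative only; real kernels quantize per channel or per group and pack the integers) quantizes a skewed weight vector to 4 bits both ways and compares the reconstruction error:

import numpy as np

def quantize_symmetric_int4(w):
    scale = np.abs(w).max() / 7              # map [-max|w|, max|w|] onto [-7, 7]
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale                          # dequantized values

def quantize_asymmetric_int4(w):
    scale = (w.max() - w.min()) / 15          # map [min, max] onto the 16 levels [0, 15]
    zero_point = np.round(-w.min() / scale)
    q = np.clip(np.round(w / scale + zero_point), 0, 15)
    return (q - zero_point) * scale

w = np.array([0.02, 0.11, 0.35, 0.37, 0.40, 0.42, 0.45, 2.1])   # not zero-centered, one outlier
print("symmetric error :", np.abs(w - quantize_symmetric_int4(w)).mean())
print("asymmetric error:", np.abs(w - quantize_asymmetric_int4(w)).mean())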
W8A8
This denotes fully quantizing both weights and activations to 8-bit integers. INT8 quantization is a well-understood, widely adopted PTQ technique that usually incurs minimal accuracy loss in many networks, because 256 distinct levels (per quantization range) are usually sufficient to capture the needed precision. For LLMs, weight quantization to 8-bit is relatively straightforward—research has shown that replacing 16-bit weights with INT8 often causes negligible change in perplexity. Activation quantization to 8-bit, however, is more challenging for transformers because of the presence of outliers—occasional very large activation values in certain layers. These outliers can force a quantizer to have an extremely large range, making most values use only a tiny fraction of the 8-bit levels (resulting in precision loss). To address this, techniques like SmoothQuant redistribute some of the quantization difficulty from activations to weights—essentially scaling down outlier activation channels and scaling up the corresponding weight channels (a mathematically equivalent transformation) so that activations have a tighter range that fits well in 8 bits. With such calibrations, LLMs can be quantized to W8A8 with very little performance drop. The benefit of W8A8 is that it enables end-to-end integer inference—both weights and activations are integers—which current hardware can exploit for faster matrix multiplication. Fully INT8 models often run faster than mixed precision models, because they can use optimized INT8 arithmetic throughout.
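A heavily simplified sketch of the SmoothQuant idea follows (the per-channel scale rule mirrors the published formula, but the alpha value, shapes, and data here are arbitrary assumptions): scaling activations down and weights up by the same per-channel factor leaves the matrix product unchanged while shrinking activation outliers into a range that quantizes well.

import numpy as np

def smooth_scales(activations, weights, alpha=0.5):
    """Per-input-channel scales: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = np.abs(activations).max(axis=0)     # per-channel activation range
    w_max = np.abs(weights).max(axis=1)           # per-channel weight range
    return (act_max ** alpha) / (w_max ** (1 - alpha) + 1e-8)

# X @ W == (X / s) @ (s[:, None] * W): the transformation is mathematically exact
X = np.random.randn(8, 4) * np.array([1.0, 1.0, 20.0, 1.0])   # channel 2 carries outliers
W = np.random.randn(4, 6) * 0.1
s = smooth_scales(X, W)
X_smooth, W_smooth = X / s, W * s[:, None]
assert np.allclose(X @ W, X_smooth @ W_smooth)                # same output, tamer activations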
W8A16
W8A16 uses 8-bit quantization for weights while keeping activations in 16-bit precision (often FP16). It can be seen as a weight-only quantization scenario. The memory savings from compressing weights to INT8 are significant (a 2 times reduction compared to FP16, and 4 times compared to FP32) and, as noted, INT8 weights usually don’t hurt accuracy in LLMs. Because activations remain in high precision, the model’s computation results are nearly as accurate as the original—the main source of error is the minor quantization noise in weights. Weight-only INT8 quantization is thus a very safe choice that yields substantial memory reduction with almost no model quality loss.
Many practical deployments start with weight-only INT8 PTQ as a baseline. This approach is especially useful when you want to reduce model size to fit on a device within a given memory budget without doing complex calibration for activations. In terms of speed, using INT8 weights reduces memory bandwidth requirements (benefiting memory-bound inference scenarios) and can slightly improve throughput; however, the activations are still 16-bit, and the compute units might not fully use integer math for accumulation. If the hardware converts INT8 weights to 16-bit on the fly to multiply by FP16 activations, the speed gain might be limited by that conversion. For memory-bound workloads (common with LLMs at small batch sizes), INT8 weights provide a noticeable speed-up because the bottleneck is often fetching weights from memory. For compute-bound scenarios (such as very large batch throughput), weight-only quantization alone yields less benefit—in those cases, you could quantize activations (moving to W8A8) to fully use fast INT8×INT8 matrix multiplication. In summary, W8A16 is a straightforward-to-implement quantization scheme that dramatically cuts model size with minimal risk, while W8A8 is the next step to maximize inference speed at the cost of a more involved calibration process.
Summary
The following table provides a high-level overview of the WₓAᵧ paradigm.

| Technique | Weight format | Activation format | Primary purpose and real-world use case |
| --- | --- | --- | --- |
| W4A16 symmetric | 4-bit signed integers (per-tensor, zero-centered) | FP16 | Baseline research and prototyping. Quick way to test ultra-low weight precision; helps gauge if 4-bit quantization is feasible before moving to more optimized schemes. |
| W4A16 asymmetric | 4-bit signed integers (per-channel minimum and maximum) | FP16 | Memory-constrained inference. Ideal when you must squeeze a large model into very tight device memory while tolerating minor calibration overhead. |
| W8A8 | 8-bit signed integers (per-tensor or per-channel) | INT8 | High-throughput, latency-sensitive deployment. Uses full INT8 pipelines on modern GPUs and CPUs or NPUs for maximum speed in batch or real-time inference. |
| W8A16 | 8-bit signed integers (per-tensor) | FP16 | Easy weight-only compression. Cuts model size in half with negligible accuracy loss; great first step on GPUs or servers when you prioritize memory savings over peak compute speed. |

Inference acceleration through PTQ techniques
As outlined earlier, LLMs with high parameter counts are extremely resource-intensive at inference. In the following sections, we explore how PTQ reduces these requirements, enabling more cost-effective and performant inference. For instance, a Llama 3 70B parameter model at FP16 precision doesn’t fit into a single A100 80 GB GPU and requires at least two A100 80 GB GPUs for reasonable inference at scale, making deployment both costly and impractical for many use cases. To address this challenge, PTQ converts a trained model’s weights (and sometimes activations) from high-precision floats (for example, 16- or 32-bit) to lower-bit integers (for example, 8-bit or 4-bit) after training. This compression can shrink model size by 2–8 times, enabling the model to fit in memory and reducing memory bandwidth demands, which in turn can speed up inference.
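A quick back-of-the-envelope calculation (weights only; KV cache and runtime overhead excluded) shows why lower bit-widths change which instances are viable:

def weight_memory_gb(params_in_billions, bits_per_weight):
    """Approximate weight-only memory footprint in GiB."""
    return params_in_billions * 1e9 * bits_per_weight / 8 / 1024**3

for bits in (16, 8, 4):
    print(f"70B parameters at {bits:>2}-bit weights: ~{weight_memory_gb(70, bits):.0f} GB")
# ~130 GB at 16-bit (more than a single 80 GB A100), ~65 GB at 8-bit, ~33 GB at 4-bit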

Crucially, PTQ requires no additional training—unlike quantization-aware training (QAT), which incorporates quantization into the fine-tuning process. PTQ avoids the prohibitive retraining cost associated with billion-parameter models. The challenge is to quantize the model carefully to minimize any drop in accuracy or increase in perplexity. Modern PTQ techniques strive to retain model performance while dramatically improving deployment efficiency.
Post-training quantization algorithms
Quantizing an entire model directly to 4-bit or 8-bit precision might seem straightforward, but doing so naïvely often results in substantial accuracy degradation—particularly under lower-bit configurations. To overcome this, specialized PTQ algorithms have been developed that intelligently compress model parameters while preserving fidelity. In this post, we focus on two widely adopted and well-researched PTQ techniques, each taking a distinct approach to high-accuracy compression:

Activation-aware weights quantization (AWQ)
Generative pre-trained transformers quantization (GPTQ)

Activation aware weights quantization
AWQ is a PTQ technique that targets weight-only quantization at very low bit widths (typically 4-bit) while keeping activations in higher precision, such as FP16. The core concept is that not all weights contribute equally to a model’s output; a small subset of salient weights disproportionately influences predictions. By identifying and preserving approximately 1% of these critical weight channels—those associated with the largest activation values—AWQ can dramatically close the gap between 4-bit quantized models and their original FP16 counterparts in terms of perplexity. Unlike traditional methods that rank importance based on weight magnitude alone, AWQ uses activation distributions to find which weights truly matter. Early results showed that leaving the top 1% of channels in higher precision was enough to maintain performance—but this introduces hardware inefficiencies due to mixed-precision execution. To get around this, AWQ introduces an elegant workaround of per-channel scaling.
During quantization, AWQ amplifies the weights of activation-salient channels to reduce relative quantization error and folds the inverse scaling into the model, so no explicit rescaling is needed during inference. This adjustment eliminates the overhead of mixed-precision computation while keeping inference purely low-bit. Importantly, AWQ achieves this without retraining—it uses a small calibration dataset to estimate activation statistics and derive scaling factors analytically. The method avoids overfitting to calibration data, ensuring strong generalization across tasks. In practice, AWQ delivers near-FP16 performance even at 4-bit precision, showing far smaller degradation than traditional post-training methods like RTN (round-to-nearest). While there’s still a marginal increase in perplexity compared to full-precision models, the trade-off is often negligible given the 3–4 times reduction in memory footprint and bandwidth. This efficiency enables deployment of very large models—up to 70 billion parameters—on a single high-end GPU such as an A100 or H100. In short, AWQ demonstrates that with careful, activation-aware scaling, precision can be focused where it matters most, achieving low-bit quantization with minimal impact on model quality.
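The following toy sketch captures the spirit of activation-aware scaling; it is not the AWQ implementation, and the grouping scheme, scale search, and synthetic data are simplified assumptions. Input channels with large activations are scaled up before round-to-nearest quantization, the inverse scale is folded back, and a small grid search picks the scaling strength that minimizes output error:

import numpy as np

def quantize_int4_groupwise(W, group=8):
    """Round-to-nearest 4-bit quantization with one scale per block of input channels (toy grouping)."""
    W_q = np.empty_like(W)
    for g in range(0, W.shape[0], group):
        block = W[g:g + group]
        scale = np.abs(block).max() / 7 + 1e-8
        W_q[g:g + group] = np.clip(np.round(block / scale), -8, 7) * scale
    return W_q

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))
X[:, :2] *= 20                                    # two activation-salient input channels
W = rng.normal(scale=0.05, size=(32, 64))
W[:2] *= 0.3                                      # keep their weights modest so scaling them up is cheap

act_mag = np.abs(X).mean(axis=0)
rtn_err = np.abs(X @ W - X @ quantize_int4_groupwise(W)).mean()

best_err, best_alpha = rtn_err, 0.0
for alpha in np.linspace(0, 1, 11):               # search the scaling strength, as AWQ does
    s = act_mag ** alpha                          # activation-aware per-channel scales
    W_q = quantize_int4_groupwise(W * s[:, None]) / s[:, None]   # scale up, quantize, fold back
    err = np.abs(X @ W - X @ W_q).mean()
    if err < best_err:
        best_err, best_alpha = err, alpha

print(f"round-to-nearest output error: {rtn_err:.5f}")
print(f"activation-aware output error (alpha={best_alpha:.1f}): {best_err:.5f}")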
Generative pre-trained transformers quantization (GPTQ)
GPTQ is another PTQ method that takes an error-compensation-driven approach to compressing large language models. GPTQ operates layer by layer, aiming to preserve each layer’s output as closely as possible to that of the original full-precision model. It follows a greedy, sequential quantization strategy: at each step, a single weight or a small group of weights is quantized, while the remaining unquantized weights are adjusted to compensate for the error introduced. This keeps the output of each layer tightly aligned with the original. The process is informed by approximate second-order statistics, specifically an approximation of the Hessian matrix, which estimates how sensitive the output is to changes in each weight. This optimization procedure is sometimes referred to as optimal brain quantization, where GPTQ carefully quantizes weights in an order that minimizes cumulative output error.
Despite its sophistication, GPTQ remains a one-shot PTQ method—it doesn’t require retraining or iterative fine-tuning. It uses a small calibration dataset to run forward passes, collecting activation statistics and estimating Hessians, but avoids any weight updates beyond the greedy compensation logic. The result is an impressively efficient compression technique: GPTQ can quantize models to 3–4 bits per weight with minimal accuracy loss, even for massive models. For example, the method demonstrated compressing a 175 billion-parameter GPT model to 3–4 bits in under 4 GPU-hours, with negligible increase in perplexity, enabling single-GPU inference for the first time at this scale. While GPTQ delivers high accuracy, its reliance on calibration data has led some researchers to note mild overfitting effects, especially for out-of-distribution inputs. Still, GPTQ has become a go-to baseline in LLM quantization because of its strong balance of fidelity and efficiency, aided by mathematical optimizations such as fast Cholesky-based Hessian updates that make it practical even for models with tens or hundreds of billions of parameters.
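As a rough, illustrative sketch of the error-compensation idea on a single weight row (a simplified toy, not the GPTQ implementation: it uses a plain matrix inverse with explicit downdating instead of the Cholesky formulation, a fixed column order, and synthetic calibration data), on correlated calibration inputs the compensated row typically reproduces the layer output more faithfully than plain rounding:

import numpy as np

def compensated_quantize_row(w, X, damp=0.01):
    """Greedily quantize one weight row to 4-bit, pushing each step's error onto the remaining weights."""
    n = w.size
    H = X.T @ X / len(X)
    H += damp * np.mean(np.diag(H)) * np.eye(n)             # dampening, as PTQ methods commonly apply
    Hinv = np.linalg.inv(H)
    w = w.astype(np.float64).copy()
    scale = np.abs(w).max() / 7
    q = np.zeros(n)
    for i in range(n):
        q[i] = np.clip(np.round(w[i] / scale), -8, 7) * scale
        err = (w[i] - q[i]) / Hinv[i, i]
        w[i:] -= err * Hinv[i, i:]                           # compensate not-yet-quantized weights
        Hinv[i:, i:] -= np.outer(Hinv[i:, i], Hinv[i, i:]) / Hinv[i, i]   # drop column i from the inverse
    return q

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 16)) @ rng.normal(size=(16, 64)) + 0.5 * rng.normal(size=(512, 64))
w = rng.normal(scale=0.1, size=64)
scale = np.abs(w).max() / 7
q_rtn = np.clip(np.round(w / scale), -8, 7) * scale
q_comp = compensated_quantize_row(w, X)
print("round-to-nearest layer-output error :", np.abs(X @ w - X @ q_rtn).mean())
print("error-compensated layer-output error:", np.abs(X @ w - X @ q_comp).mean())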
Using Amazon SageMaker AI for inference optimization and model quantization
In this section, we cover how to implement quantization using Amazon SageMaker AI. We walk through a codebase that you can use to quickly quantize a model using either the GPTQ or AWQ method on SageMaker training jobs backed by one or more GPU instances. The code uses the open source vllm-project/llm-compressor package to quantize dense LLM weights from FP32 to INT4.
All code for this process is available in the amazon-sagemaker-generativeai GitHub repository. The llm-compressor project provides a streamlined library for model optimization. It supports multiple algorithms—GPTQ, AWQ, and SmoothQuant—for converting full- or half-precision models into lower-precision formats. Quantization takes place in three steps, described in the following sections. The full implementation is available in post_training_sagemaker_quantizer.py, with arguments provided for straightforward execution.
Step 1: Load model using HuggingFace transformers
Load the model weights without attaching them to an accelerator. The llm-compressor library automatically detects available hardware and offloads weights to the accelerator as needed. Because it performs quantization layer by layer, the entire model does not need to fit in accelerator memory at once.

def quantize_model(
    args: argparse.Namespace
) -> None:
    try:

        ...
        # load model
        model = AutoModelForCausalLM.from_pretrained(
            args.model_id,
            torch_dtype="auto",
            device_map=None,
            trust_remote_code=True
        )
        # load tokenizer
        tokenizer_or_processor = AutoTokenizer.from_pretrained(
            args.model_id,
            trust_remote_code=True
        )
        ...

Step 2: Select and load the calibration dataset
A calibration dataset is used during PTQ to estimate activation ranges and statistical distributions in a pretrained LLM without retraining. Tools like llm-compressor use this small, representative dataset to run forward passes and collect statistics such as minimum and maximum values or percentiles. These statistics guide the quantization of weights and activations to reduce precision while preserving model accuracy. You can use any tokenized dataset that reflects the model’s expected input distribution for calibration.

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier
...

def preprocess_data(
    dataset: Any,
    tokenizer: AutoTokenizer,
    max_sequence_length: int
) -> Any:
    def preprocess(example):
        return {
            "text": tokenizer.apply_chat_template(
                example["messages"],
                tokenize=False,
            )
        }

    def tokenize(sample: Dict) -> Dict:
        return tokenizer(
            sample["text"],
            padding=False,
            max_length=max_sequence_length,
            truncation=True,
            add_special_tokens=False,
        )

    dataset = dataset.map(preprocess)
    dataset = dataset.map(tokenize, remove_columns=dataset.column_names)
    return dataset

Step 3: Run PTQ on the candidate model
The oneshot method in llm-compressor performs post-training quantization in a single pass (no iterative retraining) using a specified recipe, applying weight and activation quantization (and optionally sparsity) together.

num_calibration_samples defines how many input sequences (for example, 512) are used to simulate model behavior, gathering the activation statistics necessary for calibrating quantization ranges.
max_seq_length sets the maximum token length (for example, 2048) for those calibration samples, so activations reflect the worst-case sequence context, ensuring quantization remains accurate across input lengths.

Together, these hyperparameters control the representativeness and coverage of calibration, directly impacting quantization fidelity.
The modifier classes (GPTQModifier, AWQModifier) accept a scheme parameter that defines the bit-width for both weights and activations. Through this parameter, you can specify formats such as W8A8 (8-bit weights and activations) or W4A16 (4-bit weights with 16-bit activations), giving you fine-grained control over precision trade-offs across model layers.

        ...
        ...
        logger.info(f"Configuring {args.algorithm.upper()} quantization")
        if args.algorithm == "awq":

            quant_scheme = args.awq_quantization_scheme
            recipe = [
                AWQModifier(
                    ignore=[val.rstrip() for val in args.ignore_layers.split(',')],
                    scheme=args.awq_quantization_scheme,
                    targets=[val.rstrip() for val in args.include_targets.split(',')]
                )
            ]

        ...
        elif args.algorithm == "gptq":

            quant_scheme = args.gptq_quantization_scheme
            recipe = [
                GPTQModifier(
                    ignore=[val.rstrip() for val in args.ignore_layers.split(',')],
                    scheme=args.gptq_quantization_scheme,
                    targets=[val.rstrip() for val in args.include_targets.split(',')]
                )
            ]
        ...
        ...
        oneshot(
            model=model,
            dataset=processed_dataset,
            recipe=recipe,
            max_seq_length=args.max_sequence_length,  # <- Maximum sequence length for calibration samples
            num_calibration_samples=args.num_calibration_samples,  # <- Number of calibration samples for statistics
            output_dir=save_dir,
            trust_remote_code_model=True
        )

Architecture pattern for quantization on Amazon SageMaker AI
The entire workflow, shown in the following figure, is implemented in the post_training_sagemaker_quantizer.py script and can be executed as a SageMaker training job on an instance with NVIDIA GPU support (such as ml.g5.2xlarge) for accelerated quantization.
This process doesn’t involve training or fine-tuning the model. The training job is used solely to run PTQ with GPU acceleration.


from sagemaker.pytorch import PyTorch

hyperparameters = {
    'model-id': 'meta-llama/Llama-3.1-8B-Instruct',
    'dataset-id': 'HuggingFaceH4/ultrachat_200k',
    'dataset-split': 'train_sft',
    'dataset-seed': 42,
    'algorithm': 'gptq',
    'max-sequence-length': 1024,
    'num-calibration-samples': 256,
    'ignore-layers': 'lm_head',
    'include-targets': 'Linear',
    'gptq-quantization-scheme': 'W8A8',
}

quantization_estimator = PyTorch(
    entry_point='post_training_sagemaker_quantizer.py',
    source_dir='./scripts',
    instance_type='ml.g6e.2xlarge',
    instance_count=1,
    role=role,
    framework_version='2.4.0',
    py_version='py311',
    hyperparameters=hyperparameters,
    environment={"HF_TOKEN": "my-awesome-hf-token"}
)
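Launching the job is then a single fit() call on the estimator (a brief sketch; the training job writes the quantized artifacts to its S3 output location):

# Start the quantization job; the quantized model is written to the job's S3 output path
quantization_estimator.fit(wait=True)

# After completion, the artifact location is available as quantization_estimator.model_data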

After a model is quantized, it will be saved to Amazon Simple Storage Service (Amazon S3) directly as an output from the SageMaker training job. We’ll uncompress the model and host it as a SageMaker real-time endpoint using an Amazon SageMaker AI large model inference (LMI) container, powered by vLLM. To find the latest images, see AWS Deep Learning Framework Support Policy for LMI containers (see the SageMaker section).

prebaked_inference_image_uri = f"763104351884.dkr.ecr.{sagemaker.Session().boto_session.region_name}.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128"

quant_model = sagemaker.Model(
    image_uri=prebaked_inference_image_uri,
    env={
        "HF_MODEL_ID": f"{remote_upload_s3uri}/",  # <- Your model S3 path
        "OPTION_MAX_MODEL_LEN": "12000",
        "OPTION_GPU_MEMORY_UTILIZATION": "0.95",
        "OPTION_ENABLE_STREAMING": "false",
        "OPTION_ROLLING_BATCH": "auto",
        "OPTION_MODEL_LOADING_TIMEOUT": "3600",
        "OPTION_PAGED_ATTENTION": "false",
        "OPTION_DTYPE": "fp16",
    },
    role=role,
    name=model_name,
    sagemaker_session=sagemaker.Session()
)

pretrained_predictor = quant_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
    wait=False
)
print(f"Your Endpoint: {endpoint_name} is now deployed!")

You now have a SageMaker real-time endpoint serving your quantized model and ready for inference. You can query it using the SageMaker Python SDK or litellm, depending on your integration needs.

from litellm import completion

response = completion(
    model=f"sagemaker/{endpoint_name}",
    messages=[
        {"content": "Hello", "role": "user"},
        {"content": "You are a helpful assistant that follows instructions", "role": "system"},
    ],
    temperature=0.1,
    max_tokens=64
)
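Alternatively, you can call the endpoint directly with the SageMaker runtime client. The payload below follows a common vLLM-style messages schema and is an assumption; the exact request format depends on how the LMI container is configured:

import json
import boto3

smr = boto3.client("sagemaker-runtime")
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant that follows instructions"},
        {"role": "user", "content": "Hello"},
    ],
    "temperature": 0.1,
    "max_tokens": 64,
}
response = smr.invoke_endpoint(
    EndpointName=endpoint_name,          # endpoint created in the deployment step above
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))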

Model performance
We use an ml.g5.2xlarge instance for the Llama-3.1-8B and Qwen2.5-VL-7B models, an ml.p4d.24xlarge instance for the Llama-3.3-70B model, and an LMI container v15 with the vLLM backend as the serving framework.
The following is a code snippet from the deployment configuration:

lmi_env = {
    "SERVING_FAIL_FAST": "true",
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_MAX_MODEL_LEN": "8192",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
}

This performance evaluation’s primary goal is to show the relative performance of model versions on different hardware. The combinations aren’t fully optimized and shouldn’t be viewed as peak model performance on an instance type. Always make sure to test using your data, traffic, and I/O sequence lengths. The following is the performance benchmark script:

#!/bin/bash
export LLM_PERF_CONCURRENT=1
export LLM_PERF_MAX_REQUESTS=$((LLM_PERF_CONCURRENT * 10))
export LLM_PERF_SCRIPT_DIR=$HOME/5_projects/llmperf

export LLM_PERF_OUTPUT=outputs/test-2025-07-08-21-45-57-221

mkdir -p $LLM_PERF_OUTPUT
cp "$0" "${LLM_PERF_OUTPUT}"/

python3 ${LLM_PERF_SCRIPT_DIR}/token_benchmark_ray.py \
    --model "sagemaker/model-2025-07-08-21-01-10-147" \
    --mean-input-tokens 512 \
    --stddev-input-tokens 32 \
    --mean-output-tokens 256 \
    --stddev-output-tokens 16 \
    --max-num-completed-requests ${LLM_PERF_MAX_REQUESTS} \
    --timeout 1800 \
    --num-concurrent-requests ${LLM_PERF_CONCURRENT} \
    --results-dir "${LLM_PERF_OUTPUT}" \
    --llm-api litellm \
    --additional-sampling-params '{}'

Performance metrics
To understand the impact of PTQ optimization techniques, we focus on five key inference performance metrics—each offering a different lens on system efficiency and user experience:

GPU memory utilization: Indicates the proportion of total GPU memory actively used during inference. Higher memory utilization suggests more of the model or input data is loaded into GPU memory, which can improve throughput—but excessive usage might lead to memory bottlenecks or out-of-memory errors.
End-to-end latency: Measures the total time taken from input submission to final output. This is critical for applications where responsiveness is key, such as real-time systems or user-facing interfaces.
Time to first token (TTFT): Captures the delay between input submission and the generation of the first token. Lower TTFT is especially important for streaming or interactive workloads, where perceived responsiveness matters more than total latency.
Inter-token latency (ITL): Tracks the average time between successive token outputs. A lower ITL results in smoother, faster-seeming responses, particularly in long-form text generation.
Throughput: Measures the number of tokens generated per second across all concurrent requests. Higher throughput indicates better system efficiency and scalability, enabling faster processing of large workloads or more simultaneous user sessions.
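These definitions map directly onto per-request token timestamps. The following is a minimal sketch of that mapping (simplified; LLMPerf aggregates these across many concurrent requests):

def request_metrics(submit_time, token_times):
    """Compute per-request latency metrics from the wall-clock arrival time of each generated token."""
    ttft = token_times[0] - submit_time                                        # time to first token
    e2e_latency = token_times[-1] - submit_time                                # end-to-end latency
    itl = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)    # inter-token latency
    throughput = len(token_times) / e2e_latency                                # tokens per second (this request)
    return {"ttft": ttft, "e2e_latency": e2e_latency, "itl": itl, "throughput": throughput}

# Example: request submitted at t=0.0 s, four tokens arriving at the listed times
print(request_metrics(0.0, [0.21, 0.25, 0.29, 0.33]))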

Together, these metrics provide a holistic view of inference behavior—balancing raw efficiency with real-world usability. In the next sections of this post, we evaluate three candidate models—each varying in size and architecture—to validate inference performance metrics after quantization using AWQ and GPTQ algorithms across different WₓAᵧ strategies. The selected models include:

Llama-3.1-8B-Instruct: An 8-billion parameter dense decoder-only transformer model optimized for instruction following. Published by Meta, it belongs to the LLaMA (Large Language Model Meta AI) family and is well-suited for general-purpose natural language processing (NLP) tasks.
Llama-3.3-70B-Instruct: A 70-billion parameter model also from Meta’s LLaMA series, this larger variant offers significantly improved reasoning and factual grounding capabilities, making it ideal for high-performance enterprise use cases.
Qwen2.5-VL-7B-Instruct: A 7-billion parameter vision-language model developed by Alibaba’s Institute for Intelligent Computing. It supports both text and image inputs, combining a transformer-based text backbone with a visual encoder, making it suitable for multimodal applications.

Note that each model was tested on a different instance type: Llama-3.1-8B on ml.g5.2xlarge, Llama-3.3-70B on ml.p4d.24xlarge, and Qwen2.5-VL-7B on ml.g6e.4xlarge.
GPU memory utilization
GPU memory utilization reflects how much device memory is consumed during model execution and directly impacts deployability, batch size, and hardware selection. Lower memory usage enables running larger models on smaller GPUs or serving more concurrent requests on the same hardware. Quantization improves compute efficiency and significantly reduces the memory footprint of LLMs. By converting high-precision weights (for example, FP16 or FP32) into lower-bit formats such as INT8 or FP8, both AWQ and GPTQ strategies enable models to consume substantially less GPU memory during inference. This is critical for deploying large models on memory-constrained hardware or increasing batch sizes for higher throughput. In the following table and chart, we list and visualize the GPU memory utilization (in GB) across the models under multiple quantization configurations. The percentage reduction is compared against the base (unquantized) model size, highlighting the memory savings achieved with each WₓAᵧ strategy, which ranges from ~30%–70% less GPU memory utilization after PTQ.

Quantized values show GB in memory and the percentage decrease from the raw model.

| Model name | Raw (GB) | AWQ W4A16_ASYM | AWQ W4A16 | GPTQ W4A16 | GPTQ W8A8 | GPTQ W4A16_ASYM | GPTQ W8A16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Instruct (SLM) | 17.9 | 7.9 GB (−56.02%) | 7.8 GB (−56.13%) | 7.8 GB (−56.13%) | 11.3 GB (−37.05%) | 7.9 GB (−56.02%) | 11.3 GB (−37.05%) |
| Llama-3.3-70B-Instruct (LLM) | 142.9 | 41.7 GB (−70.82%) | 41.4 GB (−71.03%) | 41.4 GB (−71.03%) | 74.7 GB (−47.76%) | 41.7 GB (−70.82%) | 74.7 GB (−47.76%) |
| Qwen2.5-VL-7B-Instruct (VLM) | 18.5 | 9.1 GB (−50.94%) | 9.0 GB (−51.26%) | 9.0 GB (−51.26%) | 12.0 GB (−34.98%) | 9.1 GB (−50.94%) | 12.0 GB (−34.98%) |

The figure below illustrates the GPU memory footprint (in GB) of the model in its raw (unquantized) form compared to its quantized variants. Quantization results in ~30%–70% reduction in GPU memory consumption, significantly lowering the overall memory footprint.

End-to-end latency
End-to-end latency measures the total time taken from the moment a prompt is received to the delivery of the final output token. It’s a critical metric for evaluating user-perceived responsiveness and overall system performance, especially in real-time or interactive applications.
In the following table, we report end-to-end latency in seconds across varying concurrency levels (C=1 to C=128) for three models of varying size and modality (Llama-3.1-8B, Llama-3.3-70B, and Qwen2.5-VL-7B) under different quantization strategies.

| Model name | C=1 | C=8 | C=16 | C=32 | C=64 | C=128 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B | 8.65 | 10.68 | 12.19 | 14.76 | 28.31 | 56.67 |
| Llama-3.1-8B-AWQ-W4A16_ASYM | 3.33 | 4.67 | 5.41 | 8.1 | 18.29 | 35.83 |
| Llama-3.1-8B-AWQ-W4A16 | 3.34 | 4.67 | 5.37 | 8.02 | 18.05 | 35.32 |
| Llama-3.1-8B-GPTQ-W4A16 | 3.53 | 4.65 | 5.35 | 8 | 18.07 | 35.35 |
| Llama-3.1-8B-GPTQ-W4A16_ASYM | 3.36 | 4.69 | 5.41 | 8.09 | 18.28 | 35.69 |
| Llama-3.1-8B-GPTQ-W8A8 | 5.47 | 6.65 | 7.37 | 10.17 | 19.73 | 38.83 |
| Llama-3.1-8B-GPTQ-W8A16 | 5.03 | 6.36 | 7.15 | 10.88 | 20.83 | 40.76 |
| Llama-3.3-70B | 4.56 | 5.59 | 6.22 | 7.26 | 13.94 | 27.67 |
| Llama-3.3-70B-AWQ-W4A16_ASYM | 3.95 | 4.13 | 4.44 | 5.44 | 10.79 | 20.85 |
| Llama-3.3-70B-AWQ-W4A16 | 3.76 | 3.47 | 4.05 | 4.83 | 9.84 | 19.23 |
| Llama-3.3-70B-GPTQ-W4A16 | 3.51 | 3.43 | 4.09 | 5.72 | 10.69 | 21.59 |
| Llama-3.3-70B-GPTQ-W4A16_ASYM | 3.6 | 4.12 | 4.51 | 5.71 | 11.36 | 21.8 |
| Llama-3.3-70B-GPTQ-W8A8 | 3.85 | 4.31 | 4.88 | 5.61 | 10.95 | 21.29 |
| Llama-3.3-70B-GPTQ-W8A16 | 4.31 | 4.48 | 4.61 | 5.8 | 11.11 | 21.86 |
| Qwen2.5-VL-7B-Instruct (VLM) | 5.28 | 5.89 | 6.12 | 7.56 | 8.77 | 13.17 |
| Qwen2.5-VL-7B-AWQ-W4A16_ASYM | 2.14 | 2.56 | 2.77 | 3.39 | 5.13 | 9.22 |
| Qwen2.5-VL-7B-AWQ-W4A16 | 2.12 | 2.56 | 2.71 | 3.48 | 4.9 | 8.94 |
| Qwen2.5-VL-7B-GPTQ-W4A16 | 2.13 | 2.54 | 2.75 | 3.59 | 5.11 | 9.66 |
| Qwen2.5-VL-7B-GPTQ-W4A16_ASYM | 2.14 | 2.56 | 2.83 | 3.52 | 5.09 | 9.51 |
| Qwen2.5-VL-7B-GPTQ-W8A8 | 3.62 | 4.02 | 4.19 | 4.75 | 5.91 | 9.71 |
| Qwen2.5-VL-7B-GPTQ-W8A16 | 3.38 | 3.85 | 4.04 | 4.7 | 6.12 | 10.93 |

The following graphs show end-to-end latency at different concurrency levels for each model.

The figure above presents the end-to-end latency of the Llama-3.1-8B model in its raw (unquantized) form and its quantized variants across concurrency levels ranging from 1 to 128 on the same instance.

The figure above presents the end-to-end latency of the Qwen2.5-VL-7B model in its raw (unquantized) form and its quantized variants across concurrency levels ranging from 1 to 128 on the same instance.

The figure above presents the end-to-end latency of the Llama-3.3-70B model in its raw (unquantized) form and its quantized variants across concurrency levels ranging from 1 to 128 on the same instance.
Time to first token
TTFT measures the delay between prompt submission and the generation of the first token. This metric plays a crucial role in shaping perceived responsiveness—especially in chat-based, streaming, or interactive applications where initial feedback time is critical. In the following table, we compare TTFT in seconds for three models of varying size and modality—Llama-3.1-8B, Llama-3.3-70B, and Qwen2.5-VL-7B—under different quantization strategies. As concurrency increases (from C=1 to C=128), the results highlight how quantization techniques like AWQ and GPTQ help maintain low startup latency, ensuring a smoother and faster experience even under high load.

| Model name | C=1 | C=8 | C=16 | C=32 | C=64 | C=128 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B | 0.27 | 1.44 | 6.51 | 11.37 | 24.96 | 53.38 |
| Llama-3.1-8B-AWQ-W4A16_ASYM | 0.17 | 0.62 | 3 | 6.21 | 16.17 | 33.74 |
| Llama-3.1-8B-AWQ-W4A16 | 0.18 | 0.62 | 2.99 | 6.15 | 15.96 | 33.26 |
| Llama-3.1-8B-GPTQ-W4A16 | 0.37 | 0.63 | 2.94 | 6.14 | 15.97 | 33.29 |
| Llama-3.1-8B-GPTQ-W4A16_ASYM | 0.19 | 0.63 | 3 | 6.21 | 16.16 | 33.6 |
| Llama-3.1-8B-GPTQ-W8A8 | 0.17 | 0.86 | 4.09 | 7.86 | 17.44 | 36.57 |
| Llama-3.1-8B-GPTQ-W8A16 | 0.21 | 0.9 | 3.97 | 8.42 | 18.44 | 38.39 |
| Llama-3.3-70B | 0.16 | 0.19 | 0.19 | 0.21 | 6.87 | 20.52 |
| Llama-3.3-70B-AWQ-W4A16_ASYM | 0.17 | 0.18 | 0.16 | 0.21 | 5.34 | 15.46 |
| Llama-3.3-70B-AWQ-W4A16 | 0.15 | 0.17 | 0.16 | 0.2 | 4.88 | 14.28 |
| Llama-3.3-70B-GPTQ-W4A16 | 0.15 | 0.17 | 0.15 | 0.2 | 5.28 | 16.01 |
| Llama-3.3-70B-GPTQ-W4A16_ASYM | 0.16 | 0.17 | 0.17 | 0.2 | 5.61 | 16.17 |
| Llama-3.3-70B-GPTQ-W8A8 | 0.14 | 0.15 | 0.15 | 0.18 | 5.37 | 15.8 |
| Llama-3.3-70B-GPTQ-W8A16 | 0.1 | 0.17 | 0.15 | 0.19 | 5.47 | 16.22 |
| Qwen2.5-VL-7B-Instruct (VLM) | 0.042 | 0.056 | 0.058 | 0.081 | 0.074 | 0.122 |
| Qwen2.5-VL-7B-AWQ-W4A16_ASYM | 0.03 | 0.046 | 0.038 | 0.042 | 0.053 | 0.08 |
| Qwen2.5-VL-7B-AWQ-W4A16 | 0.037 | 0.046 | 0.037 | 0.043 | 0.052 | 0.08 |
| Qwen2.5-VL-7B-GPTQ-W4A16 | 0.037 | 0.047 | 0.036 | 0.043 | 0.053 | 0.08 |
| Qwen2.5-VL-7B-GPTQ-W4A16_ASYM | 0.038 | 0.048 | 0.038 | 0.042 | 0.053 | 0.082 |
| Qwen2.5-VL-7B-GPTQ-W8A8 | 0.035 | 0.041 | 0.042 | 0.046 | 0.055 | 0.081 |
| Qwen2.5-VL-7B-GPTQ-W8A16 | 0.042 | 0.048 | 0.046 | 0.052 | 0.062 | 0.093 |

Inter-token latency
ITL measures the average time delay between the generation of successive tokens. It directly affects the smoothness and speed of streamed outputs—particularly important in applications involving long-form text generation or voice synthesis, where delays between words or sentences can degrade user experience. In the following table, we analyze ITL in seconds across three models of varying size and modality—Llama-3.1-8B, Llama-3.3-70B, and Qwen2.5-VL-7B—under different quantization schemes. As concurrency scales up, the results illustrate how quantization strategies like AWQ and GPTQ help maintain low per-token latency, ensuring fluid generation even under high parallel loads.

| Model name | C=1 | C=8 | C=16 | C=32 | C=64 | C=128 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B | 0.035 | 0.041 | 0.047 | 0.057 | 0.111 | 0.223 |
| Llama-3.1-8B-AWQ-W4A16_ASYM | 0.013 | 0.018 | 0.021 | 0.031 | 0.072 | 0.141 |
| Llama-3.1-8B-AWQ-W4A16 | 0.013 | 0.018 | 0.02 | 0.031 | 0.071 | 0.139 |
| Llama-3.1-8B-GPTQ-W4A16 | 0.014 | 0.018 | 0.02 | 0.031 | 0.071 | 0.139 |
| Llama-3.1-8B-GPTQ-W4A16_ASYM | 0.013 | 0.018 | 0.021 | 0.031 | 0.072 | 0.14 |
| Llama-3.1-8B-GPTQ-W8A8 | 0.02 | 0.026 | 0.028 | 0.039 | 0.077 | 0.153 |
| Llama-3.1-8B-GPTQ-W8A16 | 0.02 | 0.024 | 0.027 | 0.042 | 0.081 | 0.16 |
| Llama-3.3-70B | 0.019 | 0.024 | 0.025 | 0.03 | 0.065 | 0.12 |
| Llama-3.3-70B-AWQ-W4A16_ASYM | 0.018 | 0.021 | 0.021 | 0.029 | 0.076 | 0.163 |
| Llama-3.3-70B-AWQ-W4A16 | 0.017 | 0.021 | 0.022 | 0.029 | 0.081 | 0.201 |
| Llama-3.3-70B-GPTQ-W4A16 | 0.014 | 0.018 | 0.019 | 0.028 | 0.068 | 0.152 |
| Llama-3.3-70B-GPTQ-W4A16_ASYM | 0.017 | 0.02 | 0.021 | 0.028 | 0.067 | 0.159 |
| Llama-3.3-70B-GPTQ-W8A8 | 0.016 | 0.02 | 0.022 | 0.026 | 0.058 | 0.131 |
| Llama-3.3-70B-GPTQ-W8A16 | 0.017 | 0.02 | 0.021 | 0.025 | 0.056 | 0.122 |
| Qwen2.5-VL-7B-Instruct (VLM) | 0.021 | 0.023 | 0.023 | 0.029 | 0.034 | 0.051 |
| Qwen2.5-VL-7B-AWQ-W4A16_ASYM | 0.008 | 0.01 | 0.01 | 0.013 | 0.02 | 0.038 |
| Qwen2.5-VL-7B-AWQ-W4A16 | 0.008 | 0.01 | 0.01 | 0.014 | 0.02 | 0.038 |
| Qwen2.5-VL-7B-GPTQ-W4A16 | 0.008 | 0.01 | 0.01 | 0.013 | 0.02 | 0.038 |
| Qwen2.5-VL-7B-GPTQ-W4A16_ASYM | 0.008 | 0.01 | 0.011 | 0.014 | 0.02 | 0.038 |
| Qwen2.5-VL-7B-GPTQ-W8A8 | 0.014 | 0.015 | 0.016 | 0.018 | 0.023 | 0.039 |
| Qwen2.5-VL-7B-GPTQ-W8A16 | 0.013 | 0.015 | 0.015 | 0.018 | 0.024 | 0.044 |

Throughput
Throughput measures the number of tokens generated per second and is a key indicator of how efficiently a model can scale under load. Higher throughput directly enables faster batch processing and supports more concurrent user sessions. In the following table, we present throughput results for Llama-3.1-8B, Llama-3.3-70B, and Qwen2.5-VL-7B across varying concurrency levels and quantization strategies. Quantized models maintain—and in many cases improve—throughput, thanks to reduced memory bandwidth and compute requirements. The substantial memory savings from quantization allow multiple model workers to be deployed on a single GPU, particularly on high-memory instances. This multi-worker setup further amplifies total system throughput at higher concurrency levels, making quantization a highly effective strategy for maximizing utilization in production environments.

| Model name | C=1 | C=8 | C=16 | C=32 | C=64 | C=128 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B | 33.09 | 27.41 | 24.37 | 20.05 | 10.71 | 5.53 |
| Llama-3.1-8B-AWQ-W4A16_ASYM | 85.03 | 62.14 | 55.25 | 37.27 | 16.44 | 9.06 |
| Llama-3.1-8B-AWQ-W4A16 | 83.21 | 61.86 | 55.31 | 37.69 | 16.59 | 9.19 |
| Llama-3.1-8B-GPTQ-W4A16 | 80.77 | 62.19 | 55.93 | 37.53 | 16.48 | 9.12 |
| Llama-3.1-8B-GPTQ-W4A16_ASYM | 81.85 | 61.75 | 54.74 | 37.32 | 16.4 | 9.13 |
| Llama-3.1-8B-GPTQ-W8A8 | 50.62 | 43.84 | 40.41 | 29.04 | 15.31 | 8.26 |
| Llama-3.1-8B-GPTQ-W8A16 | 55.24 | 46.47 | 41.79 | 27.21 | 14.6 | 7.94 |
| Llama-3.3-70B | 57.93 | 47.89 | 44.73 | 38 | 20.05 | 10.95 |
| Llama-3.3-70B-AWQ-W4A16_ASYM | 60.24 | 53.54 | 51.79 | 39.3 | 20.47 | 11.52 |
| Llama-3.3-70B-AWQ-W4A16 | 64 | 53.79 | 52.4 | 39.4 | 20.79 | 11.5 |
| Llama-3.3-70B-GPTQ-W4A16 | 78.07 | 61.68 | 58.18 | 41.07 | 21.21 | 11.77 |
| Llama-3.3-70B-GPTQ-W4A16_ASYM | 66.34 | 56.47 | 54.3 | 40.64 | 21.37 | 11.76 |
| Llama-3.3-70B-GPTQ-W8A8 | 66.79 | 55.67 | 51.73 | 44.63 | 23.7 | 12.85 |
| Llama-3.3-70B-GPTQ-W8A16 | 67.11 | 57.11 | 55.06 | 45.26 | 24.18 | 13.08 |
| Qwen2.5-VL-7B-Instruct (VLM) | 56.75 | 51.44 | 49.61 | 40.08 | 34.21 | 23.03 |
| Qwen2.5-VL-7B-AWQ-W4A16_ASYM | 140.89 | 117.47 | 107.49 | 86.33 | 58.56 | 30.25 |
| Qwen2.5-VL-7B-AWQ-W4A16 | 137.77 | 116.96 | 106.67 | 83.06 | 57.52 | 29.46 |
| Qwen2.5-VL-7B-GPTQ-W4A16 | 138.46 | 117.14 | 107.25 | 85.38 | 58.19 | 30.19 |
| Qwen2.5-VL-7B-GPTQ-W4A16_ASYM | 139.38 | 117.32 | 104.22 | 82.19 | 58 | 29.64 |
| Qwen2.5-VL-7B-GPTQ-W8A8 | 82.81 | 75.32 | 72.19 | 63.11 | 50.44 | 29.53 |
| Qwen2.5-VL-7B-GPTQ-W8A16 | 88.69 | 78.88 | 74.55 | 64.83 | 48.92 | 26.55 |

Conclusion
Post-training quantization (PTQ) techniques like AWQ and GPTQ have proven to be effective solutions for deploying foundation models in production environments. Our comprehensive testing across different model sizes and architectures demonstrates that PTQ significantly reduces GPU memory utilization. The benefits are evident across all key metrics, with quantized models showing better throughput and lower inference latency, including in high-concurrency scenarios. These improvements translate to reduced infrastructure costs, improved user experience through faster response times, and the flexibility of deploying larger models on resource-constrained hardware. As language models continue to grow in scale and complexity, PTQ offers a reliable approach for balancing performance requirements with infrastructure constraints, providing a clear path to efficient, cost-effective AI deployment.
In this post, we demonstrated how to streamline LLM quantization using Amazon SageMaker AI and the llm-compressor module. The process of converting a full-precision model to its quantized variant requires just a few simple steps, making it accessible and scalable for production deployments. By using the managed infrastructure of Amazon SageMaker AI, organizations can seamlessly implement and serve quantized models for real-time inference, simplifying the journey from development to production. To explore these quantization techniques further, refer to our GitHub repository.
Special thanks to everyone who contributed to this article: Giuseppe Zappia, Dan Ferguson, Frank McQuillan and Kareem Syed-Mohammed.

About the authors
Pranav Murthy is a Senior Generative AI Data Scientist at AWS, specializing in helping organizations innovate with Generative AI, Deep Learning, and Machine Learning on Amazon SageMaker AI. Over the past 10+ years, he has developed and scaled advanced computer vision (CV) and natural language processing (NLP) models to tackle high-impact problems—from optimizing global supply chains to enabling real-time video analytics and multilingual search. When he’s not building AI solutions, Pranav enjoys playing strategic games like chess, traveling to discover new cultures, and mentoring aspiring AI practitioners. You can find Pranav on LinkedIn
Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.

How Beekeeper optimized user personalization with Amazon Bedrock

This post is cowritten by Mike Koźmiński from Beekeeper.
Large Language Models (LLMs) are evolving rapidly, making it difficult for organizations to select the best model for each specific use case, optimize prompts for quality and cost, adapt to changing model capabilities, and personalize responses for different users.
Choosing the “right” LLM and prompt isn’t a one-time decision—it shifts as models, prices, and requirements change. System prompts are becoming larger (e.g. Anthropic system prompt) and more complex. A lot of mid-sized companies don’t have resources to quickly evaluate and improve them. To address this issue, Beekeeper built an Amazon Bedrock-powered system that continuously evaluates model+prompt candidates, ranks them on a live leaderboard, and routes each request to the current best choice for that use case.
Beekeeper: Connecting and empowering the frontline workforce
Beekeeper offers a comprehensive digital workplace system specifically designed for frontline workforce operations. The company provides a mobile-first communication and productivity solution that connects non-desk workers with each other and headquarters, enabling organizations to streamline operations, boost employee engagement, and manage tasks efficiently. Their system features robust integration capabilities with existing business systems (human resources, scheduling, payroll), while targeting industries with large deskless workforces such as hospitality, manufacturing, retail, healthcare, and transportation. At its core, Beekeeper addresses the traditional disconnect between frontline employees and their organizations by providing accessible digital tools that enhance communication, operational efficiency, and workforce retention, all delivered through a cloud-based SaaS system with mobile apps, administrative dashboards, and enterprise-grade security features.
Beekeeper’s solution: A dynamic evaluation system
Beekeeper solved this challenge with an automated system that continuously tests different model and prompt combinations, ranks options based on quality, cost, and speed, incorporates user feedback to personalize responses, and automatically routes requests to the current best option. Quality is scored with a small synthetic test set and validated in production with user feedback (thumbs up/down and comments). By incorporating prompt mutation, Beekeeper created an organic system that evolves over time. The result is a constantly-optimizing setup that balances quality, latency, and cost—and adapts automatically when the landscape changes.
Real-world example: Chat Summarization
Beekeeper’s Frontline Success Platform unifies communication for deskless workers across industries. One practical application of their LLM system is chat summarization. When a user returns to a shift, they might find a chat with many unread messages; instead of reading everything, they can request a summary. The system generates a concise overview with action items tailored to the user’s needs. Users can then provide feedback to improve future summaries. This seemingly simple feature relies on sophisticated technology behind the scenes. The system must understand conversation context, identify important points, recognize action items, and present information concisely—all while adapting to user preferences.

Solution overview
Beekeeper’s solution consists of two main phases: building a baseline leaderboard and personalizing with user feedback.
The system uses several AWS components, including Amazon EventBridge for scheduling, Amazon Elastic Kubernetes Service (EKS) for orchestration, AWS Lambda for evaluation functions, Amazon Relational Database Service (RDS) for data storage, and Amazon Mechanical Turk for manual validation.

The workflow begins with a synthetic rank creator that establishes baseline performance. A scheduler triggers the coordinator, which fetches test data and sends it to evaluators. These evaluators test each model/prompt pair and return results, with a portion sent for manual validation. The system mutates promising prompts to create variations, evaluates these again, and saves the best performers. When user feedback arrives, the system incorporates it through a second phase. The coordinator fetches ranked model/prompt pairs and sends them with user feedback to a mutator, which returns personalized prompts. A drift detector makes sure these personalized versions don’t stray too far from quality standards, and validated prompts are saved for specific users.
Building the baseline leaderboard
To kick-start the optimization journey, Beekeeper engineers selected various models and provided them with domain-specific human-written prompts. The tech team tested these prompts using LLM-generated examples to make sure they were error-free. A solid baseline is crucial here. This foundation helps them refine their approach when incorporating feedback from real users.
In the following sections, we dive into their success metrics, which guide their refinement of prompts and help create an optimal user experience.
Evaluation criteria for baseline
The quality of summaries generated by model/prompt pairs is measured using both quantitative and qualitative metrics, including the following:

Compression ratio – Measures summary length relative to the original text, rewarding adherence to target lengths and penalizing excessive length.
Presence of action items – Makes sure user-specific action items are clearly identified.
Lack of hallucinations – Validates factual accuracy and consistency.
Vector comparison – Assesses semantic similarity to human-generated perfect results.

In the following sections, we walk through each of the evaluation criteria and how they are implemented.
Compression ratio
The compression ratio evaluates the length of the summarized text compared to the original one and its adherence to a target length (it rewards compression ratios close to the target and penalizes texts that deviate from target length). The corresponding score, between 0 and 100, is computed programmatically with the following Python code:

def calculate_compression_score(original_text, compressed_text):
    max_length = 650
    target_ratio = 1 / 5
    margin = 0.05
    max_penalty_points = 100  # Maximum penalty if the text is too long

    original_length = len(original_text)
    compressed_length = len(compressed_text)

    # Calculate penalty for exceeding maximum length
    excess_length = max(0, original_length - max_length)
    penalty = (excess_length / original_length) * max_penalty_points

    # Calculate the actual compression ratio
    actual_ratio = compressed_length / original_length
    lower_bound = target_ratio * (1 - margin)
    upper_bound = target_ratio * (1 + margin)

    # Calculate the base score based on the compression ratio
    if actual_ratio < lower_bound:
        base_score = 100 * (actual_ratio / lower_bound)
    elif actual_ratio > upper_bound:
        base_score = 100 * (upper_bound / actual_ratio)
    else:
        base_score = 100

    # Apply the penalty to the base score
    score = base_score - penalty

    # Ensure the score does not go below 0
    score = max(0, score)

    return round(score, 2)

Presence of action items related to the user
To check whether the summary contains all the action items related to the user, Beekeeper relies on comparison to the ground truth. For this comparison, the expected output format requires a section labeled “Action items:” followed by bullet points, and a regular expression extracts the action item list, as in the following Python code:

import re

def extract_action_items(text):
    action_section = re.search(r'Action items:(.*?)(?=\n\n|\Z)', text, re.DOTALL)

    if action_section:
        action_content = action_section.group(1).strip()
        action_items = re.findall(r'^\s*-\s*(.+)$', action_content, re.MULTILINE)
        return action_items
    else:
        return []

They include this additional extraction step to make sure the data is formatted in a way that the LLM can easily process. The extracted list is sent to an LLM with a request to check whether it is correct. A +1 score is assigned for each correctly identified action item, and a -1 for each false positive. The scores are then normalized so that summaries with more or fewer action items are not unduly penalized or rewarded.
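The following is a minimal sketch of this scoring rule, assuming the LLM judgement has already labeled each extracted action item as correct or a false positive; the function and argument names are illustrative:

# Sketch of the action item score: +1 per correctly identified item, -1 per false
# positive, normalized by the number of ground truth items so that summaries with
# more or fewer action items remain comparable.

def action_item_score(num_correct, num_false_positive, num_ground_truth):
    if num_ground_truth == 0:
        return 0.0
    raw = num_correct - num_false_positive
    return max(-1.0, raw / num_ground_truth)

print(action_item_score(num_correct=3, num_false_positive=1, num_ground_truth=4))  # 0.5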
Lack of hallucinations
To evaluate hallucinations, Beekeeper uses two approaches: cross-LLM evaluation and manual validation.

In the cross-LLM evaluation, a summary created by LLM A (for example, Mistral Large) is passed to the evaluator component, together with the prompt and the initial input. The evaluator submits this text to LLM B (for example, Anthropic’s Claude), asking if the facts from the summary match the raw context. An LLM of a different family is used for this evaluation. Amazon Bedrock makes this exercise particularly simple through the Converse API—users can select different LLMs by changing the model identifier string.
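A minimal sketch of such a cross-LLM check with the Converse API could look like the following; the judge model ID is a placeholder for whichever model family differs from the one that produced the summary:

import boto3

bedrock = boto3.client("bedrock-runtime")

def judge_summary(judge_model_id, source_text, summary):
    # judge_model_id is a placeholder, for example a Claude model ID when the
    # summary was generated by Mistral Large
    prompt = (
        "Do the facts in the summary match the context? Answer yes or no.\n"
        f"Context: {source_text}\nSummary: {summary}"
    )
    response = bedrock.converse(
        modelId=judge_model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 5, "temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"].strip().lower()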
Beekeeper also performs manual verification on a small set of evaluations to avoid cases where both LLMs hallucinate. They assign a score of 1 if no hallucination is detected and -1 if any is detected. Across the whole pipeline, they use the same heuristic of manually evaluating 7% of responses (details are discussed later in this post).
Vector comparison
As an additional evaluation method, semantic similarity is used for data with available ground truth information. The embedding models are chosen from the MTEB Leaderboard (a multi-task, multi-language comparison of embedding models), favoring large vector dimensionality to maximize the amount of information stored in each vector. Beekeeper uses a Qwen3 embedding model as its baseline, which provides 4,096-dimensional vectors and supports 16-bit quantization for fast computation. Additional embedding models are also used directly from Amazon Bedrock. After computing the embedding vectors for both the ground truth answer and the one generated by a given model/prompt pair, cosine similarity is used to measure how close they are, as shown in the following Python code:

from sklearn.metrics.pairwise import cosine_similarity

# Both embeddings are expected as 2D arrays of shape (1, embedding_dim)
similarity = cosine_similarity(synthetic_summary_embed, generated_summary_embed)[0][0]

Evaluation baseline
The evaluation baseline for each model/prompt pair is established by collecting the generated output for a set of fixed, predefined queries that are manually annotated with ground truth outputs containing the “true answers” (in this case, ideal summaries from in-house and public datasets). As mentioned earlier, this set is created from a public dataset as well as hand-crafted examples that better represent a customer’s domain. The scores are computed automatically based on the metrics described earlier (compression, lack of hallucinations, presence of action items, and vector comparison) to build a baseline version of the leaderboard.
Manual evaluations
For additional validation, Beekeeper manually reviews a scientifically determined sample of evaluations using Amazon Mechanical Turk. This sample size is calculated using Cochran’s formula to support statistical significance.
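For illustration, Cochran's formula with a finite population correction can be computed as follows; the confidence level, margin of error, and population size shown are assumptions, not Beekeeper's actual parameters:

import math

def cochran_sample_size(population, z=1.96, p=0.5, margin=0.05):
    # Infinite-population estimate, then finite population correction
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(cochran_sample_size(population=1000))  # about 278 evaluations to review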
Amazon Mechanical Turk enables businesses to harness human intelligence for tasks computers can't perform effectively. This crowdsourcing marketplace connects users with a global, on-demand workforce to complete microtasks like data labeling, content moderation, and research validation—helping to scale operations without sacrificing quality or increasing overhead. As mentioned earlier, Beekeeper employs human feedback to verify that the automatic LLM-based rating system is working correctly. Based on their prior assumptions, they know what percentage of responses should be classified as containing hallucinations. If the number detected by human verification diverges by more than two percentage points from their estimate, they know that the automated process isn't working properly and needs revision.

Now that Beekeeper has established their baseline, they can provide the best results to their customers. By constantly updating their models, they can bring new value in an automated fashion. Whenever their engineers have an idea for a new prompt optimization, they can let the pipeline evaluate it against previous ones using the baseline results. Beekeeper can take this further and embed user feedback, allowing for more customizable results. However, they don't want user feedback to fully change the behavior of the model through prompt injection in feedback. In the following section, we examine the organic part of Beekeeper's pipeline that embeds user preferences into responses without affecting other users.
Evaluation of user feedback

Now that Beekeeper has established their baseline using the ground truth set, they can start incorporating human feedback. This works according to the same principles as the previously described hallucination detection process. User feedback is pulled together with the input and the LLM response, and questions are passed to the LLM in the following format:

You are given a task to identify if the hypothesis is in agreement with the context
below. You will only use the contents of the context and not rely on external knowledge.
Answer with yes/no.
"context": {{input}}
"summary": {{output}}
"hypothesis": {{statement}}
"agreement":

They use this to check whether the feedback provided is still applicable after the prompt/model pair was updated, which serves as a baseline for incorporating user feedback and avoids applying the same feedback multiple times. If a model change or an earlier mutation has already solved the problem, there is no need to apply the feedback again. They are then ready to start mutating the prompt.
The mutation process consists of reevaluating the user-generated dataset after each prompt mutation until the output incorporates the user feedback; the baseline is then used to understand the differences and discard changes that would degrade the model's behavior.
The four best-performing model/prompt pairs from the baseline evaluation are further processed through a prompt mutation process to check for residual improvement of the results. This is essential in an environment where even small modifications to a prompt can lead to dramatically different results when used in conjunction with user feedback.
The initial prompt is enriched with a prompt mutation, the received user feedback, a thinking style (a specific cognitive approach like “Make it creative” or “Think in steps” that guides how the LLM approaches the mutation task), and the user context, and is sent to the LLM to produce a mutated prompt. The mutated prompts are added to the list and evaluated, and the corresponding scores are incorporated into the leaderboard. Mutation prompts can also include user feedback when it is present.
Examples of generated mutation prompts include:

“Add hints which would help LLM solve this problem:”

“Modify Instructions to be simpler:”

“Repeat that instruction in another way:”

“What additional instructions would you give someone to include this feedback {feedback}
into that instructions:”

Solution example
The baseline evaluation process starts with eight pairs of prompts and associated models (Amazon Nova, Anthropic Claude 4 Sonnet, Meta Llama 3, and Mistral 8x7B). Beekeeper usually uses four base prompts and two models to start with. These prompts are used across all the models, but results are considered as prompt/model pairs. Models are automatically updated as newer versions become available via Amazon Bedrock.
Beekeeper starts by evaluating the eight existing pairs:

Each evaluation requires generating 20 summaries per pair (8 x 20 = 160)
Each summary is checked by three static checks and two LLM checks; only the LLM checks require model calls (160 x 2 = 320)

In total, this creates 480 LLM calls. Scores are compared, creating a leaderboard, and two prompt-model pairs are selected. These two prompts are mutated using user feedback, creating 10 new prompts, which are again evaluated, creating 600 calls to the LLM (10 x 20 + 10 x 20 x 2 = 600).
This process can be run n times to perform more creative mutations; Beekeeper usually performs two cycles.

In total, this exercise performs tests on (8 + 10 + 10) x 2 model/prompt pairs. The whole process on average requires around 8,352,000 input tokens and around 1,620,000 output tokens, costing around $48. Newly selected model/prompt pairs are used in production with ratios of 50% for the first, 30% for the second, and 20% for the third. After deploying the new model/prompt pairs, Beekeeper gathers feedback from the users. This feedback is fed to the mutator to create three new prompts. These prompts are sent for drift detection, which compares them to the baseline. In total, this creates four LLM calls, using around 4,800 input tokens and 500 output tokens.
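The call counts quoted above can be reproduced with simple arithmetic; the sketch below uses only the figures from the text and does not query any pricing API:

# Back-of-the-envelope reproduction of the LLM call counts described above.
pairs, summaries_per_pair, llm_checks_per_summary = 8, 20, 2

baseline_calls = pairs * summaries_per_pair * (1 + llm_checks_per_summary)          # 480
mutated_pairs = 10
mutation_calls = mutated_pairs * summaries_per_pair * (1 + llm_checks_per_summary)  # 600

print(baseline_calls, mutation_calls)  # 480 600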
Benefits
The key benefit of Beekeeper's solution is its ability to rapidly evolve and adapt to user needs. With this approach, they can make initial estimations of which model/prompt pairs would be optimal candidates for each task, while controlling both cost and the quality of results. By combining the benefits of synthetic data with user feedback, the solution is suitable even for smaller engineering teams.

Instead of focusing on generic prompts, Beekeeper prioritizes tailoring the prompt improvement process to meet the unique needs of each tenant. By doing so, they can refine prompts to be highly relevant and user-friendly. This approach allows users to develop their own style, which in turn enhances their experience as they provide feedback and see its impact. One of the side effects they observed is that certain groups of people prefer different styles of communication. By mapping these results to customer interactions, they aim to present a more tailored experience, while making sure that feedback given by one user doesn't impact another. Their preliminary results suggest 13–24% better ratings on responses when aggregated per tenant.

In summary, the proposed solution offers several notable benefits. It reduces manual labor by automating the LLM and prompt selection process, shortens the feedback cycle, enables user- or tenant-specific improvements, and provides the capacity to seamlessly integrate and evaluate new models in the same manner as the previous ones.
Conclusion
Beekeeper’s automated leaderboard approach and human feedback loop system for dynamic LLM and prompt pair selection addresses the key challenges organizations face in navigating the rapidly evolving landscape of language models. By continuously evaluating and optimizing quality, size, speed, and cost, the solution helps customers use the best-performing model/prompt combinations for their specific use cases.

Looking ahead, Beekeeper plans to further refine and expand the capabilities of this system, incorporating more advanced techniques for prompt engineering and evaluation. Additionally, the team is exploring ways to empower users to develop their own customized prompts, fostering a more personalized and engaging experience.

If your organization is exploring ways to optimize LLM selection and prompt engineering, there’s no need to start from scratch. Using AWS services like Amazon Bedrock for model access, AWS Lambda for lightweight evaluation, Amazon EKS for orchestration, and Amazon Mechanical Turk for human validation, a pipeline can be built that automatically evaluates, ranks, and evolves your prompts. Instead of manually updating prompts or re-benchmarking models, focus on creating a feedback-driven system that continuously improves results for your users. Start with a small set of models and prompts, define your evaluation metrics, and let the system scale as new models and use cases emerge.

About the authors
Mike (Michał) Koźmiński is a Zürich-based Principal Engineer at Beekeeper by LumApps, where he builds the foundations that make AI a first-class part of the product. With 10+ years spanning startups and enterprises, he focuses on translating new technology into reliable systems and real customer impact.
Magdalena Gargas is a Solutions Architect passionate about technology and solving customer challenges. At AWS, she works mostly with software companies, helping them innovate in the cloud.
Luca Perrozzi is a Solutions Architect at Amazon Web Services (AWS), based in Switzerland. He focuses on innovation topics at AWS, especially in the area of Artificial Intelligence. Luca holds a PhD in particle physics and has 15 years of hands-on experience as a research scientist and software engineer.
Simone Pomata is a Principal Solutions Architect at AWS. He has worked enthusiastically in the tech industry for more than 10 years. At AWS, he helps customers succeed in building new technologies every day.

Stanford Researchers Build SleepFM Clinical: A Multimodal Sleep Founda …

A team of Stanford Medicine researchers has introduced SleepFM Clinical, a multimodal sleep foundation model that learns from clinical polysomnography and predicts long term disease risk from a single night of sleep. The research work is published in Nature Medicine, and the team has released the clinical code as the open source sleepfm-clinical repository on GitHub under the MIT license.

From overnight polysomnography to a general representation

Polysomnography records brain activity, eye movements, heart signals, muscle tone, breathing effort and oxygen saturation during a full night in a sleep lab. It is the gold standard test in sleep medicine, but most clinical workflows use it only for sleep staging and sleep apnea diagnosis. The research team treat these multichannel signals as a dense physiological time series and train a foundation model to learn a shared representation across all modalities.

SleepFM is trained on about 585,000 hours of sleep recordings from about 65,000 people, drawn from multiple cohorts. The largest cohort comes from the Stanford Sleep Medicine Center, where about 35,000 adults and children had overnight studies between 1999 and 2024. That clinical cohort is linked to electronic health records, which later enables survival analysis for hundreds of disease categories.

https://www.nature.com/articles/s41591-025-04133-4

Model architecture and pretraining objective

At the modeling level, SleepFM uses a convolutional backbone to extract local features from each channel, followed by attention based aggregation across channels and a temporal transformer that operates over short segments of the night. The same core architecture already appeared in earlier work on SleepFM for sleep staging and sleep disordered breathing detection, where it showed that learning joint embeddings across brain activity, electrocardiography and respiratory signals improves downstream performance.

The pretraining objective is leave one out contrastive learning. For each short time segment, the model builds separate embeddings for each modality group, such as brain signals, heart signals and respiratory signals, and then learns to align these modality embeddings so that any subset predicts the joint representation of the remaining modalities. This approach makes the model robust to missing channels and heterogeneous recording montages, which are common in real world sleep labs.
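The paper's exact implementation is not reproduced here, but an illustrative PyTorch sketch of a leave one out contrastive objective could look as follows; the shapes, temperature, and the use of a simple mean over the remaining modalities are assumptions:

import torch
import torch.nn.functional as F

def leave_one_out_contrastive(embeddings, temperature=0.1):
    """embeddings: tensor of shape (num_modalities, batch, dim), L2-normalized."""
    num_modalities, batch, _ = embeddings.shape
    targets = torch.arange(batch, device=embeddings.device)
    loss = 0.0
    for m in range(num_modalities):
        # Average embedding of all other modalities for each segment in the batch
        rest = torch.cat([embeddings[:m], embeddings[m + 1:]], dim=0).mean(dim=0)
        rest = F.normalize(rest, dim=-1)
        # Contrast each segment against the other segments in the batch
        logits = embeddings[m] @ rest.T / temperature
        loss = loss + F.cross_entropy(logits, targets)
    return loss / num_modalities

x = F.normalize(torch.randn(3, 8, 128), dim=-1)  # 3 modality groups, batch of 8 segments
print(leave_one_out_contrastive(x).item())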

After pretraining on unlabeled polysomnography, the backbone is frozen and small task specific heads are trained. For standard sleep tasks, a lightweight recurrent or linear head maps embeddings to sleep stages or apnea labels. For clinical risk prediction, the model aggregates the full night into a single patient level embedding, concatenates basic demographics such as age and sex, and then feeds this representation into a Cox proportional hazards layer for time to event modeling.
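As an illustrative sketch of the survival step, not the released code, a Cox proportional hazards head can be fit on top of frozen patient level embeddings plus demographics with a library such as lifelines; the column names, embedding dimensionality, and synthetic data below are assumptions:

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n_patients, emb_dim = 200, 8  # tiny stand-in for the real patient-level embeddings

df = pd.DataFrame(rng.normal(size=(n_patients, emb_dim)),
                  columns=[f"emb_{i}" for i in range(emb_dim)])
df["age"] = rng.integers(20, 80, n_patients)
df["sex"] = rng.integers(0, 2, n_patients)
df["years_to_event"] = rng.exponential(5.0, n_patients)  # time to diagnosis or censoring
df["event_observed"] = rng.integers(0, 2, n_patients)    # 1 = diagnosed, 0 = censored

cph = CoxPHFitter(penalizer=0.1)
cph.fit(df, duration_col="years_to_event", event_col="event_observed")
print(cph.concordance_index_)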

Benchmarks on sleep staging and apnea

Before moving to disease prediction, the research team verified that SleepFM competes with specialist models on standard sleep analysis tasks. Prior work already showed that a simple classifier on top of SleepFM embeddings outperforms end to end convolutional networks for sleep stage classification and for detection of sleep disordered breathing, with gains in macro AUROC and AUPRC on several public datasets.

In the clinical study, the same pretrained backbone is reused for sleep staging and apnea severity classification across multi center cohorts. Results reported in the research paper show that SleepFM matches or exceeds existing tools such as traditional convolutional models and other automated sleep staging systems, which validates that the representation captures core sleep physiology and not only statistical artifacts from a single dataset.

Predicting 130 diseases and mortality from one night of sleep

The core contribution of this Stanford research paper is disease prediction. The research team maps diagnosis codes in the Stanford electronic health records to phecodes and defines more than 1,000 candidate disease groupings. For each phecode, they compute the time to first diagnosis after the sleep study and fit a Cox model on top of SleepFM embeddings.

SleepFM identifies 130 disease outcomes whose risks are predictable from a single night of polysomnography with strong discrimination. These include all cause mortality, dementia, myocardial infarction, heart failure, chronic kidney disease, stroke, atrial fibrillation, several cancers and multiple psychiatric and metabolic disorders. For many of these conditions, performance metrics such as concordance index and area under the receiver operating curve are in ranges comparable to established risk scores, even though the model uses only sleep recordings plus basic demographics.

The reporting also notes that for some cancers, pregnancy complications, circulatory conditions and mental health disorders, predictions based on SleepFM reach accuracy levels around 80 percent for multi year risk windows. This suggests that subtle patterns in the coordination between brain, heart and breathing signals carry information about latent disease processes that are not yet clinically visible.

Comparison with simpler baselines

To assess added value, the research team compared SleepFM based risk models with two baselines. The first uses only demographic features such as age, sex and body mass index. The second trains an end to end model directly on polysomnography and outcomes, without unsupervised pretraining. Across most disease categories, the pretrained SleepFM representation combined with a simple survival head yields higher concordance and higher long horizon AUROC than both baselines.

This research clearly shows that the gain comes less from a complex prediction head and more from the foundation model that has learned a general representation of sleep physiology. In practice, this means that clinical centers can reuse a single pretrained backbone, learn small site specific heads with relatively modest labeled cohorts and still approach state of the art performance.

Check out the Paper and FULL CODES here.

The post Stanford Researchers Build SleepFM Clinical: A Multimodal Sleep Foundation AI Model for 130+ Disease Prediction appeared first on MarkTechPost.

Scaling medical content review at Flo Health using Amazon Bedrock (Par …

This blog post is based on work co-developed with Flo Health.
Healthcare science is rapidly advancing. Maintaining accurate and up-to-date medical content directly impacts people’s lives, health decisions, and well-being. When someone searches for health information, they are often at their most vulnerable, making accuracy not just important, but potentially life-saving.
Flo Health creates thousands of medical articles every year, providing millions of users worldwide with medically credible information on women’s health. Verifying the accuracy and relevance of this vast content library is a significant challenge. Medical knowledge evolves continuously, and manual review of each article is not only time-consuming but also prone to human error. This is why the team at Flo Health, the company behind the leading women’s health app Flo, is using generative AI to help maintain medical content accuracy at scale. Through a partnership with the AWS Generative AI Innovation Center, Flo Health is developing an innovative approach, referred to as the Medical Automated Content Review and Revision Optimization Solution (MACROS), to verify and maintain the accuracy of its extensive health information library. This AI-powered solution is capable of:

Efficiently processing large volumes of medical content based on credible scientific sources.
Identifying potential inaccuracies or outdated information based on credible scientific resources.
Proposing updates based on the latest medical research and guidelines, as well as incorporating user feedback.

The system powered by Amazon Bedrock enables Flo Health to conduct medical content reviews and revision assessments at scale, ensuring up-to-date accuracy and supporting more informed healthcare decision-making. This system performs detailed content analysis, providing comprehensive insights on medical standards and guidelines adherence for Flo’s medical experts to review. It is also designed for seamless integration with Flo’s existing tech infrastructure, facilitating automatic updates where appropriate.
This two-part series explores Flo Health’s journey with generative AI for medical content verification. Part 1 examines our proof of concept (PoC), including the initial solution, capabilities, and early results. Part 2 focuses on scaling challenges and real-world implementation. Each article stands alone while collectively showing how AI transforms medical content management at scale.
Proof of Concept goals and success criteria
Before diving into the technical solution, we established clear objectives for our PoC medical content review system:
Key Objectives:

Validate the feasibility of using generative AI for medical content verification
Determine accuracy levels compared to manual review
Assess processing time and cost improvements

Success Metrics:

Accuracy: Content piece recall of 90%
Efficiency: Reduce detection time from hours to minutes per guideline
Cost Reduction: Reduce expert review workload
Quality: Maintain Flo’s editorial standards and medical accuracy
Speed: 10x faster than manual review process

To verify the solution meets Flo Health’s high standards for medical content, Flo Health’s medical experts and content teams worked closely with AWS technical specialists through regular review sessions, providing critical feedback and medical expertise to continuously enhance the AI model’s performance and accuracy. The result is MACROS, our custom-built solution for AI-assisted medical content verification.
Solution overview
In this section, we outline how the MACROS solution uses Amazon Bedrock and other AWS services to automate medical content review and revisions.

Figure 1. Medical Automated Content Review and Revision Optimization Solution Overview
As shown in Figure 1, the developed solution supports two major processes:

Content Review and Revision: Assesses existing medical articles at scale for adherence to medical standards and style, given pre-specified custom rules and guidelines, and proposes revisions that conform to the new medical standards as well as Flo’s style and tone guidelines.
Rule Optimization: MACROS accelerates the process of extracting new medical guidelines from medical research, pre-processing them into the format needed for content review, and optimizing their quality.

Both processes can be run through the user interface (UI) as well as through a direct API call. The UI enables medical experts to directly see the content review statistics, interact with changes, and make manual adjustments. The API support is intended for integration into a pipeline for periodic assessment.
Architecture
Figure 2 depicts the architecture of MACROS. It consists of two major parts: backend and frontend.

Figure 2. MACROS architecture

In the following, the flow of major app components is presented:
1. Users begin by gathering and preparing content that must meet medical standards and rules.
2. In the second step, the data is provided as PDF or TXT files, or as pasted text, through the Streamlit UI that is hosted in Amazon Elastic Container Service (ECS). The authentication for file upload happens through Amazon API Gateway.
3. Alternatively, custom Flo Health JSON files can be directly uploaded to the Amazon Simple Storage Service (S3) bucket of the solution stack.
4. The ECS hosted frontend has AWS IAM permissions to orchestrate tasks using AWS Step Functions.
5. Further, the ECS container has access to the S3 for listing, downloading and uploading files either via pre-signed URL or boto3.
6. Optionally, if the input file is uploaded via the UI, the solution invokes the AWS Step Functions service, which starts the pre-processing functionality hosted by an AWS Lambda function. This Lambda function has access to Amazon Textract for extracting text from PDF files. The files are stored in S3 and also returned to the UI.
7-9. Hosted on AWS Lambda, the Rule Optimizer, Content Review, and Revision functions are orchestrated via AWS Step Functions. They have access to Amazon Bedrock for generative AI capabilities to perform rule extraction from unstructured data, content review, and revision, respectively. Furthermore, they have access to S3 via the boto3 SDK to store the results.
10. The Compute Stats AWS Lambda function has access to S3 and can read and combine the results of individual revision and review runs.
11. The solution leverages Amazon CloudWatch for system monitoring and log management. For production deployments dealing with critical medical content, the monitoring capabilities could be extended with custom metrics and alarms to provide more granular insights into system performance and content processing patterns.
Future enhancements
While our current architecture utilizes AWS Step Functions for workflow orchestration, we’re exploring the potential of Amazon Bedrock Flows for future iterations. Bedrock Flows offers promising capabilities for streamlining AI-driven workflows, potentially simplifying our architecture and enhancing integration with other Bedrock services. This alternative could provide more seamless management of our AI processes, especially as we scale and evolve our solution.
Content review and revision
At the core of MACROS lies its Content Review and Revision functionality with Amazon Bedrock foundation models. The Content Review and Revision block consists of five major components: 1) The optional Filtering stage 2) Chunking 3) Review 4) Revision and 5) Post-processing, depicted in Figure 3.

Figure 3. Content review and revision pipeline

Here’s how MACROS processes the uploaded medical content:

Filtering (Optional): The journey begins with an optional filtering step. This smart feature checks whether the set of rules is relevant for the article, potentially saving time and resources on unnecessary processing.
Chunking: The source text is then split into paragraphs. This crucial step facilitates good quality assessment and helps prevent unintended revisions to unrelated text. Chunking can be conducted using heuristics, such as punctuation or regular expression-based splits, as well as using large language models (LLMs) to identify semantically complete chunks of text (a minimal heuristic sketch follows this list).
Review: Each paragraph or section undergoes a thorough review against the relevant rules and guidelines.
Revision: Only the paragraphs flagged as non-adherent move forward to the revision stage, streamlining the process and maintaining the integrity of adherent content. The AI suggests updates to bring non-adherent paragraphs in line with the latest guidelines and Flo’s style requirements.
Post-processing: Finally, the revised paragraphs are seamlessly integrated back into the original text, resulting in an updated, adherent document.
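The following is a minimal sketch of the heuristic variant of chunking mentioned in the list above: splitting on blank lines and merging very short fragments. The threshold is illustrative, and an LLM-based splitter could replace this entirely:

import re

def chunk_article(text, min_chars=200):
    # Split on blank lines, then merge fragments until each chunk is a reviewable unit
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        current = f"{current}\n\n{paragraph}".strip() if current else paragraph
        if len(current) >= min_chars:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks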

The Filtering step can be conducted using an additional LLM call via Amazon Bedrock that assesses each section separately with the following prompt structure:

Figure 4. Simplified LLM-based filtering step

Further, non-LLM approaches can be feasible to support the Filtering step:

Encoding the rules and the articles into dense embedding vectors and calculating the similarity between them. By setting a similarity threshold, we can identify which rule set is considered relevant for the input document.
Similarly, the direct keyword-level overlap between the document and the rule can be measured using BLEU or ROUGE metrics.

Content review, as already mentioned, is conducted on a per-section basis against a group of rules and produces a response in XML format, such as:

<xml>
  <section_text> Section text without any changes </section_text>
  <adherence> 0 </adherence>
  <rule_name> Text of the non-adherent rule </rule_name>
  <reason> Reason why the section is non-adherent to the rule </reason>
  <rule_name> Text of the non-adherent rule </rule_name>
  <reason> Reason why the section is non-adherent to the rule </reason>
  <section_text> Section text without any changes </section_text>
  <adherence> 1 </adherence>
  <section_text> Section text without any changes </section_text>
  <adherence> 1 </adherence>
</xml>

Here, 1 indicates adherence and 0 indicates non-adherence of the text to the specified rules. Using an XML format helps to achieve reliable parsing of the output.
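For illustration, the section and adherence fields in this format can be parsed with a simple regular expression, as in the following sketch; the tag names follow the example above and are an assumption about the production schema:

import re

def parse_review(xml_text):
    # Pair each section with the adherence flag that immediately follows it
    sections = []
    pattern = re.compile(
        r"<section_text>(.*?)</section_text>\s*<adherence>\s*(\d)\s*</adherence>",
        re.DOTALL,
    )
    for text, adherence in pattern.findall(xml_text):
        sections.append({"text": text.strip(), "adherent": adherence == "1"})
    return sections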
This Review step iterates over the sections in the text to make sure that the LLM pays attention to each section separately, which led to more robust results in our experimentation. To facilitate higher non-adherent section detection accuracy, the user can also use the Multi-call mode, where instead of one Amazon Bedrock call assessing adherence of the article against all rules, we have one independent call per rule.
The Revision step receives the output of the Review (non-adherent sections and the reasons for non-adherence), as well as the instruction to create the revision in a similar tone. It then suggests revisions of the non-adherent sentences in a style similar to the original text. Finally, the Post-processing step combines the original text with new revisions, making sure that no other sections are changed.
Different steps of the flow require different levels of LLM model complexity. While simpler tasks like chunking can be done efficiently with a relatively small model like Claude Haiku models family, more complex reasoning tasks like content review and revision require larger models like Claude Sonnet or Opus models family to facilitate accurate analysis and high-quality content generation. This tiered approach to model selection optimizes both performance and cost-efficiency of the solution.
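One simple way to express this tiered selection is a task-to-model mapping; the model IDs below are placeholders and should be replaced with the identifiers enabled in your account and Region:

# Illustrative mapping of pipeline steps to Amazon Bedrock model tiers.
MODEL_BY_TASK = {
    "chunking": "anthropic.claude-3-haiku-20240307-v1:0",           # cheap and fast
    "content_review": "anthropic.claude-3-5-sonnet-20240620-v1:0",  # stronger reasoning
    "revision": "anthropic.claude-3-5-sonnet-20240620-v1:0",
}

def model_for(task):
    return MODEL_BY_TASK[task]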
Operating modes
The Content Review and Revision feature operates in two UI modes: Detailed Document Processing and Multi Document Processing, each catering to different scales of content management. The Detailed Document Processing mode offers a granular approach to content assessment and is depicted in Figure 5. Users can upload documents in various formats (PDF, TXT, JSON or paste text directly) and specify the guidelines against which the content should be evaluated.

Figure 5. Detailed Document Processing example
Users can choose from predefined rule sets (here, Vitamin D; Breast Health; and Premenstrual Syndrome and Premenstrual Dysphoric Disorder, or PMS and PMDD) or input custom guidelines. These custom guidelines can include rules such as “The title of the article must be medically accurate” as well as examples of content that is adherent and non-adherent to the rule.
The rulesets make sure that the assessment aligns with specific medical standards and Flo’s unique style guide. The interface allows for on-the-fly adjustments, making it ideal for thorough, individual document reviews. For larger-scale operations, the Multi Document Processing mode should be used. This mode is designed to handle numerous custom JSON files simultaneously, mimicking how Flo would integrate MACROS into their content management system.
Extracting rules and guidelines from unstructured data
Actionable and well-prepared guidelines are not always immediately available. Sometimes they are given in unstructured files or need to be found. Using the Rule Optimizer feature, we can extract and refine actionable guidelines from multiple complex documents.
Rule Optimizer processes raw PDF documents to extract text, which is then chunked into meaningful sections based on document headers. This segmented content is processed through Amazon Bedrock using specialized system prompts, with two distinct modes: Style/tonality and Medical mode.
Style/tonality mode focuses on extracting the guidelines on how the text should be written, its style, what formats and words can or cannot be used.
Rule Optimizer assigns a priority to each rule: high, medium, or low. The priority level indicates the rule’s importance, guiding the order of content review and focusing attention on critical areas first. Rule Optimizer includes a manual editing interface where users can refine rule text, adjust classifications, and manage priorities. If users update a given rule, the changes are stored in Amazon S3 for future use.
The Medical mode is designed to process medical documents and is adapted to a more scientific language. It allows grouping of extracted rules into three classes:

Medical condition guidelines
Treatment specific guidelines
Changes to advice and trends in health

Figure 6. Simplified medical rule optimization prompt

Figure 6 provides an example of a medical rule optimization prompt, consisting of three main components: the role setting (medical AI expert), a description of what makes a good rule, and finally the expected output. We consider a rule to be of sufficient quality if it is:

Clear, unambiguous, and actionable
Relevant, consistent, and concise (max two sentences)
Written in active voice
Avoids unnecessary jargon

Implementation considerations and challenges
During our PoC development, we identified several crucial considerations that would benefit others implementing similar solutions:

Data preparation: This emerged as a fundamental challenge. We learned the importance of standardizing input formats for both medical content and guidelines while maintaining consistent document structures. Creating diverse test sets across different medical topics proved essential for comprehensive validation.
Cost management: Monitoring and optimizing cost quickly became a key priority. We implemented token usage tracking and optimized prompt design and batch processing to balance performance and efficiency.
Regulatory and ethical compliance: Given the sensitive nature of medical content, strict regulatory and ethical safeguards were critical. We established robust documentation practices for AI decisions, implemented strict version control for medical guidelines and continuous human medical expert oversight for the AI-generated suggestions. Regional healthcare regulations were carefully considered throughout implementation.
Integration and scaling: We recommend starting with a standalone testing environment while planning for future content management system (CMS) integration through well-designed API endpoints. Building with modularity in mind proved valuable for future enhancements. Throughout the process, we faced common challenges such as maintaining context in long medical articles, balancing processing speed with accuracy, and facilitating consistent tone across AI-suggested revisions.
Model optimization: The diverse model selection capability of Amazon Bedrock proved particularly valuable. Through its platform, we can choose optimal models for specific tasks, achieve cost efficiency without sacrificing accuracy, and smoothly upgrade to newer models – all while maintaining our existing architecture.

Preliminary Results
Our Proof of Concept delivered strong results across the critical success metrics, demonstrating the potential of AI-assisted medical content review. The solution exceeded target processing speed improvements while maintaining 80% accuracy and over 90% recall in identifying content requiring updates. Most notably, the AI-powered system applied medical guidelines more consistently than manual reviews and significantly reduced the time burden on medical experts.
Key Takeaways
During implementation, we uncovered several insights critical for optimizing AI performance in medical content analysis. Content chunking was essential for accurate assessment across long documents, and expert validation of parsing rules helped maintain clinical precision. Most importantly, the project confirmed that human-AI collaboration, not full automation, is key to successful implementation. Regular expert feedback and clear performance metrics guided system refinements and incremental improvements. While the system significantly streamlines the review process, it works best as an augmentation tool, with medical experts remaining essential for final validation, creating a more efficient hybrid approach to medical content management.
Conclusion and next steps
This first part of our series demonstrates how generative AI can make the medical content review process faster, more efficient, and scalable while maintaining high accuracy. Stay tuned for Part 2 of this series, where we cover the production journey, diving deep into challenges and scaling strategies. Are you ready to move your AI initiatives into production?

Learn more about the AWS Generative AI Innovation Center and contact your AWS Account Manager to be connected to our expert guidance and support.
Visit the Amazon Bedrock documentation to learn more about available foundation models and their capabilities
Join our AWS Builder community to connect with others on a similar AI journey.

About the authors
Liza (Elizaveta) Zinovyeva, Ph.D., is an Applied Scientist at AWS Generative AI Innovation Center and is based in Berlin. She helps customers across different industries to integrate Generative AI into their existing applications and workflows. She is passionate about AI/ML, finance and software security topics. In her spare time, she enjoys spending time with her family, sports, learning new technologies, and table quizzes.
Callum Macpherson is a Data Scientist at the AWS Generative AI Innovation Center, where cutting-edge AI meets real-world business transformation. Callum partners directly with AWS customers to design, build, and scale generative AI solutions that unlock new opportunities, accelerate innovation, and deliver measurable impact across industries.
Arefeh Ghahvechi is a Senior AI Strategist at the AWS GenAI Innovation Center, specializing in helping customers realize rapid value from generative AI technologies by bridging innovation and implementation. She identifies high-impact AI opportunities while building the organizational capabilities needed for scaled adoption across enterprises and national initiatives.
Nuno Castro is a Sr. Applied Science Manager. He’s has 19 years experience in the field in industries such as finance, manufacturing, and travel, leading ML teams for 11 years.
Dmitrii Ryzhov is a Senior Account Manager at Amazon Web Services (AWS), helping digital-native companies unlock business potential through AI, generative AI, and cloud technologies. He works closely with customers to identify high-impact business initiatives and accelerate execution by orchestrating strategic AWS support, including access to the right expertise, resources, and innovation programs.
Nikita Kozodoi, PhD, is a Senior Applied Scientist at the AWS Generative AI Innovation Center working on the frontier of AI research and business. Nikita builds and deploys generative AI and ML solutions that solve real-world problems and drive business impact for AWS customers across industries.
Aiham Taleb, PhD, is a Senior Applied Scientist at the Generative AI Innovation Center, working directly with AWS enterprise customers to leverage Gen AI across several high-impact use cases. Aiham has a PhD in unsupervised representation learning, and has industry experience that spans across various machine learning applications, including computer vision, natural language processing, and medical imaging.

Detect and redact personally identifiable information using Amazon Bed …

Organizations handle vast amounts of sensitive customer information through various communication channels. Protecting Personally Identifiable Information (PII), such as social security numbers (SSNs), driver’s license numbers, and phone numbers has become increasingly critical for maintaining compliance with data privacy regulations and building customer trust. However, manually reviewing and redacting PII is time-consuming, error-prone, and scales poorly as data volumes grow.
Organizations face challenges when dealing with PII scattered across different content types – from texts to images. Traditional approaches often require separate tools and workflows for handling text and image content, leading to inconsistent redaction practices and potential security gaps. This fragmented approach not only increases operational overhead but also raises the risk of accidental PII exposure.
This post shows an automated PII detection and redaction solution using Amazon Bedrock Data Automation and Amazon Bedrock Guardrails through a use case of processing text and image content in high volumes of incoming emails and attachments. The solution features a complete email processing workflow with a React-based user interface for authorized personnel to more securely manage and review redacted email communications and attachments. We walk through the step-by-step solution implementation procedures used to deploy this solution. Finally, we discuss the solution benefits, including operational efficiency, scalability, security and compliance, and adaptability.
Solution overview
The solution provides an automated system for protecting sensitive information in business communications through three main capabilities:

Automated PII detection and redaction for both email content and attachments using Amazon Bedrock Data Automation and Guardrails, making sure that sensitive data is consistently protected across different content types.
More secure data management workflows where processed communications are encrypted and stored with appropriate access controls, while maintaining a complete audit trail of operations.
Web-based interface options for authorized agents to efficiently manage redacted communications, supported by features like automated email categorization and customizable folder management.

This unified approach helps organizations maintain compliance with data privacy requirements while streamlining their communication workflows.
The following diagram outlines the solution architecture. 
The diagram illustrates the backend PII detection and redaction workflow and the frontend application user interface orchestrated by AWS Lambda and Amazon EventBridge. The process follows these steps:

The workflow starts with the user sending an email to the incoming email server hosted on Amazon Simple Email Service (Amazon SES). This is an optional step.
Alternatively, users can upload the emails and attachments directly into an Amazon Simple Storage Service (S3) landing bucket.
An S3 event notification triggers the initial processing AWS Lambda function that generates a unique case ID and creates a tracking record in Amazon DynamoDB.
Lambda orchestrates the PII detection and redaction workflow by extracting the email body and attachments from the email, saving them in a raw email bucket, and then invoking Amazon Bedrock Data Automation and Guardrails to detect and redact PII.
Amazon Bedrock Data Automation processes attachments to extract text from the files.
Amazon Bedrock Guardrails detects and redacts the PII from both the email body and the text extracted from attachments, and stores the redacted content in another S3 bucket (a minimal sketch of this call follows the list).
DynamoDB tables are updated with email messages, folders metadata, and email filtering rules.
An Amazon EventBridge Scheduler runs the Rules Engine Lambda function on a schedule; this function processes new emails that have not yet been categorized into folders, based on the enabled email filtering rule criteria.
The Rules Engine Lambda also communicates with DynamoDB to access the messages table and the rules table.
Users can access the optional application user interface through Amazon API Gateway, which manages user API requests and routes requests to render the user interface through S3 static hosting. Users may choose to enable authentication for the user interface based on their security requirements. Alternatively, users can check the status of their email processing in the DynamoDB table and S3 bucket with PII redacted content.
A Portal API Lambda fetches the case details based on user requests.
The static assets served by API Gateway are stored in a private S3 bucket.
Optionally, users may enable Amazon CloudWatch and AWS CloudTrail to provide visibility into the PII detection and redaction process, while using Amazon Simple Notification Service to deliver real-time alerts for any failures, facilitating immediate attention to issues.
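As a minimal sketch of the Guardrails call referenced above (not the full solution code), the ApplyGuardrail API can be invoked directly on extracted text; the guardrail ID and version are placeholders that would come from the deployed stack:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def redact_pii(text, guardrail_id, guardrail_version="1"):
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,   # placeholder from the ConsumerStack outputs
        guardrailVersion=guardrail_version,
        source="INPUT",
        content=[{"text": {"text": text}}],
    )
    if response["action"] == "GUARDRAIL_INTERVENED":
        return response["outputs"][0]["text"]  # text with PII anonymized
    return text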

In the following sections, we walk through the procedures for implementing this solution.
Walkthrough
The solution implementation involves infrastructure and optional portal setup.
Prerequisites
Before beginning the implementation, make sure to have the following components installed and configured.

An AWS account
Git
Python 3.7 or higher
Node v18 or higher
NPM v9.8 or higher
AWS CDK v2.166 or higher
Terminal/CLI such as macOS Terminal, PowerShell or Windows Terminal, or the Linux command line. AWS CloudShell can also be used when all code is located within an AWS account

Infrastructure setup and deployment process
Verify that an existing virtual private cloud (VPC) containing three private subnets with no internet access is available in your AWS account. All AWS CloudFormation stacks need to be deployed within the same AWS account.
CloudFormation stacks
The solution contains three stacks (two required, one optional) that deploy in your AWS account:

S3Stack – Provisions the core infrastructure including S3 buckets for raw and redacted email storage with automatic lifecycle policies, a DynamoDB table for email metadata tracking with time-to-live (TTL) and global secondary indexes, and VPC security groups for more secure Lambda function access. It also creates AWS Identity and Access Management (IAM) roles with comprehensive permissions for S3, DynamoDB, and Bedrock services, forming a more secure foundation for the entire PII detection and redaction workflow.
ConsumerStack – Provisions the core processing infrastructure including Amazon Bedrock Data Automation projects for document text extraction and Bedrock Guardrails configured to anonymize comprehensive PII entities, along with Lambda functions for email and attachment processing with Amazon Simple Notification Service (SNS) topics for success/failure notifications. It also creates Amazon Simple Email Service (SES) receipt rules for incoming email handling when a domain is configured and S3 event notifications to trigger the email processing workflow automatically.
PortalStack (optional) – This is only needed when users want to use a web-based user interface for managing emails. It provisions the optional web interface including a regional API Gateway, DynamoDB tables for redacted message storage, and S3 buckets for static web assets.

Amazon SES (optional)
Move directly to the Solution deployment section that follows if Amazon SES is not being used.
The following Amazon SES setup is optional, and the code can be tested without it. Steps to test the application with or without Amazon SES are covered in the Testing section.
Set up Amazon SES with production access and verify the domain/email identities that the solution will work with. You also need to add the MX records to the DNS provider maintaining the domain. Refer to the following links:

Request SES Production Access
Setting up Amazon SES email receiving

Create SMTP credentials and save them in an AWS Secrets Manager secret named SmtpCredentials. An IAM user is created as part of this process.
If any other name is used for the secret, update the secret_name line in context.json with the name of the secret you created.
When storing the credentials in AWS Secrets Manager, the key for the username in the secret should be smtp_username and the key for the password should be smtp_password.

Obtaining Amazon SES SMTP credentials
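
If you prefer to create this secret from code rather than the console, a minimal boto3 sketch follows. The username and password values are placeholders; the secret name and key names match the conventions described above.

import json

import boto3

secretsmanager = boto3.client("secretsmanager")
secretsmanager.create_secret(
    Name="SmtpCredentials",
    SecretString=json.dumps(
        {
            "smtp_username": "<your-smtp-username>",  # placeholder
            "smtp_password": "<your-smtp-password>",  # placeholder
        }
    ),
)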

Solution deployment
Run the following commands from within a terminal/CLI environment.

Clone the repository

git clone https://github.com/aws-samples/sample-bda-redaction.git

The infra/cdk.json file tells the CDK Toolkit how to execute your app

cd sample-bda-redaction/infra/

Optional: Create and activate a new Python virtual environment (make sure to use Python 3.12, because the Lambda runtime in the CDK code is configured for that version; if you use a different Python version, update the Lambda runtime in the CDK code accordingly)

python3 -m venv .venv
. .venv/bin/activate

Upgrade pip

pip install --upgrade pip

Install Python packages

pip install -r requirements.txt

Create context.json file

cp context.json.example context.json

Update the context.json file with the correct configuration options for your environment. The properties are described in the following tables, and a sample file sketch follows them.

vpc_id (default: "") – VPC ID where resources are deployed. The VPC needs to be created prior to execution.
raw_bucket (default: "") – S3 bucket storing raw messages and attachments. Created during CDK deployment.
redacted_bucket_name (default: "") – S3 bucket storing redacted messages and attachments. Created during CDK deployment.
inventory_table_name (default: "") – DynamoDB table name storing redacted message details. Created during CDK deployment.
resource_name_prefix (default: "") – Prefix used when naming resources during stack creation. Set during stack creation.
retention (default: 90) – Number of days for retention of the messages in the redacted and raw S3 buckets. Set during stack creation.

The following properties are only required when the portal is being provisioned.

environment (default: development) – The type of environment where resources are provisioned. Values are development or production.

Use cases that require Amazon SES to manage redacted email messages need to set the following configuration variables; otherwise, these are optional.

domain – The verified domain or email name that is used for Amazon SES. This can be left blank if not setting up Amazon SES.
auto_reply_from_email – Email address of the "from" field of the email message, also used as the email address that emails are forwarded from in the Portal application. This can be left blank if not setting up the Portal.
secret_name – AWS Secrets Manager secret containing SMTP credentials for the forward email functionality from the portal.
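
The following is a minimal sketch that writes a context.json containing the properties described in the preceding tables. Every value shown is a placeholder or default; adjust them for your environment.

import json

# All values below are placeholders or defaults; adjust them for your environment.
context = {
    "vpc_id": "vpc-0123456789abcdef0",   # existing VPC with three private subnets
    "raw_bucket": "",                     # created during CDK deployment
    "redacted_bucket_name": "",           # created during CDK deployment
    "inventory_table_name": "",           # created during CDK deployment
    "resource_name_prefix": "pii-redaction",
    "retention": 90,
    "environment": "development",         # only needed when provisioning the portal
    "domain": "",                         # leave blank if not using Amazon SES
    "auto_reply_from_email": "",          # leave blank if not using the portal
    "secret_name": "SmtpCredentials",
}

with open("context.json", "w") as f:
    json.dump(context, f, indent=2)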

Deploy the infrastructure by running the following commands from the root of the infra directory.

Bootstrap the AWS account to use AWS CDK

cdk bootstrap

Users can now synthesize the CloudFormation template for this code. Setting the following environment variables before running cdk synth suppresses the warnings. The deployment process should take approximately 10 minutes for a first-time deployment to complete.

JSII_DEPRECATED=quiet
JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet
cdk synth --no-notices

Replace <<resource_name_prefix>> with its chosen value and then run:

JSII_DEPRECATED=quiet
JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet
cdk deploy <<resource_name_prefix>>-S3Stack <<resource_name_prefix>>-ConsumerStack --no-notices

Testing

Testing the application with Amazon SES

Before starting the test, make sure that the Amazon SES email receiving rule set created by the <<resource_name_prefix>>-ConsumerStack stack is active. You can check by running aws ses describe-active-receipt-rule-set and confirming that the name in the output is <<resource_name_prefix>>-rule-set. If the name does not match or the output is blank, execute the following command to activate it:

# Replace <<resource_name_prefix>> with resource_name_prefix used in context.json

aws ses set-active-receipt-rule-set --rule-set-name <<resource_name_prefix>>-rule-set

Once we have the correct rule set active, we can test the application using Amazon SES by sending an email to the verified email or domain in Amazon SES, which automatically triggers the redaction pipeline. Progress can be tracked in the DynamoDB table <<inventory_table_name>>. The inventory table name can be found on the resources tab in the AWS CloudFormation Console for the <<resource_name_prefix>>-S3Stack stack and Logical ID EmailInventoryTable. A unique <<case_id>> is generated and used in the DynamoDB inventory table for each email being processed. Once redaction is complete, the redacted email body can be found in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/email_body/ and redacted attachments in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/attachments/.

Testing the application without Amazon SES

As described earlier, the solution redacts PII data in the email body and attachments. To test the application without Amazon SES, provide an email file to be redacted by uploading it directly to the raw S3 bucket. The raw bucket name can be found on the Outputs tab in the AWS CloudFormation console for the <<resource_name_prefix>>-S3Stack stack under the export name RawBucket. The S3 event notification triggers the Lambda function, which starts the workflow that redacts the email body and attachments. For your convenience, a sample email is available in the infra/pii_redaction/sample_email directory of the repository. The following steps test the application without Amazon SES using that email file.

# Replace <<raw_bucket>> with raw bucket name created during deployment

aws s3 cp pii_redaction/sample_email/ccvod0ot9mu6s67t0ce81f8m2fp5d2722a7hq8o1 s3://<<raw_bucket>>/domain_emails/

The preceding command triggers the email redaction process. You can track the progress in the DynamoDB table <<inventory_table_name>>. A unique <<case_id>> is generated and used in the DynamoDB inventory table for each email being processed. The inventory table name can be found on the Resources tab in the AWS CloudFormation console for the <<resource_name_prefix>>-S3Stack stack under the Logical ID EmailInventoryTable. Once redaction is complete, the redacted email body can be found in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/email_body/ and redacted attachments in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/attachments/.
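
If you prefer to check processing status programmatically rather than in the console, the following minimal boto3 sketch lists a few recent records from the inventory table; replace the table name placeholder with your <<inventory_table_name>> value.

import boto3

# Replace the table name with your <<inventory_table_name>> value.
table = boto3.resource("dynamodb").Table("<<inventory_table_name>>")
for item in table.scan(Limit=10)["Items"]:
    print(item)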
Portal setup
The installation of the portal is completely optional. If you skip this section, you can still view the resources created in the console of the AWS account where the solution is deployed. The portal serves as a web interface to manage the PII-redacted emails processed by the backend AWS infrastructure, allowing users to view sanitized email content. The Portal can be used to:

List messages: View processed emails with redacted content
Message details: View individual email content and attachments

Portal Prerequisites: This portal requires the installation of the following software tools:

TypeScript
Node v18 or higher
NPM v9.8 or higher

Infrastructure Deployment

Go to the root of the infra directory of the solution by running the following command:

cd sample-bda-redaction/infra/

Optional: Create and activate a new Python virtual environment (if the virtual environment has not been created previously):

python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

Users can now synthesize the CloudFormation template for this code.

JSII_DEPRECATED=quiet
JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet
cdk synth --no-notices

Deploy the React-based portal. Replace <<resource_name_prefix>> with its chosen value:

JSII_DEPRECATED=quiet
JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet
cdk deploy <<resource_name_prefix>>-PortalStack --no-notices

The first-time deployment should take approximately 10 minutes to complete.
Environment Variables

Create a new environment file by going to the root of the app directory and copying the .env.example file to .env, then update the variables described in the following table. Run the following command in a terminal/CLI environment to create the .env file:

cp .env.example .env

The file can be created using your preferred text editor as well.

VITE_APIGW (default: "", required) – The API Gateway invoke URL (including protocol) without the path (remove /portal from the value). This value can be found in the output of the PortalStack after deploying through AWS CDK. It can also be found under the Outputs tab of the PortalStack CloudFormation stack under the export name PiiPortalApiGatewayInvokeUrl.
VITE_BASE (default: /portal, required) – Specifies the path used to request the static files needed to render the portal.
VITE_API_PATH (default: /api, required) – Specifies the path needed to send requests to the API Gateway.

Portal deployment
Run the following commands from within a terminal/CLI environment.

Before running any of the following commands, go to the root of the app directory to build this application for production by running the following commands:

Install NPM packages

npm install

Build the files

npm run build

After the build succeeds, transfer all of the files within the dist/ directory into the Amazon S3 bucket that is designated for these assets (specified in the PortalStack provisioned via CDK).

Example: aws s3 sync dist/ s3://<<name-of-s3-bucket>> --delete

<<name-of-s3-bucket>> is the S3 bucket that has been created in the <<resource-name-prefix>>-PortalStack CloudFormation stack with the Logical ID of PrivateWebHostingAssets. This value can be obtained from the Resources tab of the CloudFormation stack in the AWS Console. This value is also output during the cdk deploy process when the PortalStack has completed successfully.

Accessing the portal
Use the API Gateway invoke URL from the API Gateway that has been created during the cdk deploy process to access the portal from a web browser. This URL can be found by following these steps:

Visit the AWS Console
Go to API Gateway and find the API Gateway that has been created during the cdk deploy process. The name of the API Gateway can be found in the Resources section of the <<resource-name-prefix>>-PortalStack CloudFormation stack.
Click on the Stages link in the left-hand menu.
Make sure that the portal stage is selected
Find the Invoke URL and copy that value
Enter that value in the address bar of your web browser

The portal’s user interface is now visible within the web browser. If any emails have been processed, they are listed on the home page of the portal.
Access control (optional)
For production deployments, we recommend implementing authentication and access controls for the portal based on your organization's security requirements.
Clean up
To avoid incurring future charges, follow these steps to remove the resources created by this solution:

Delete the contents of the S3 buckets created by the solution:

Raw email bucket
Redacted email bucket
Portal static assets bucket (if portal was deployed)

Delete or disable the Amazon SES rule set created by the solution using the following CLI commands:

# To disable the active rule set, run the command without a rule set name
aws ses set-active-receipt-rule-set

# To delete the rule set, run the following command
# Replace <<resource_name_prefix>> with the resource_name_prefix used in context.json
aws ses delete-receipt-rule-set --rule-set-name <<resource_name_prefix>>-rule-set

Remove the CloudFormation stacks in the following order:

cdk destroy <<resource_name_prefix>>-PortalStack (if deployed)
cdk destroy <<resource_name_prefix>>-ConsumerStack
cdk destroy <<resource_name_prefix>>-S3Stack

cdk destroy does not remove the access log Amazon S3 bucket created as part of the deployment. You can find the log bucket name on the Outputs tab of the <<resource_name_prefix>>-S3Stack stack under the export name AccessLogsBucket. Execute the following steps to delete the access log bucket:

To delete the contents of the access log bucket, follow the instructions on deleting an S3 bucket.
The access log bucket is versioning-enabled, and deleting the contents of the bucket in the preceding step does not delete versioned objects in the bucket. Remove them separately using the following AWS CLI commands:

# To remove versioned objects, use the following AWS CLI command
aws s3api delete-objects --bucket ${accesslogbucket} --delete "$(aws s3api list-object-versions --bucket ${accesslogbucket} --query='{Objects: Versions[].{Key:Key,VersionId:VersionId}}')"

# Once versioned objects are removed, remove the delete markers of the versioned objects using the following AWS CLI command
aws s3api delete-objects --bucket ${accesslogbucket} --delete "$(aws s3api list-object-versions --bucket ${accesslogbucket} --query='{Objects: DeleteMarkers[].{Key:Key,VersionId:VersionId}}')"

Delete the access log Amazon S3 bucket using the following AWS CLI command:

# Delete the access log bucket itself
aws s3api delete-bucket --bucket ${accesslogbucket}

If Amazon SES is configured:

Remove the verified domain/email identities
Delete the MX records from your DNS provider
Delete the SMTP credentials from AWS Secrets Manager

Delete any CloudWatch Log groups created by the Lambda functions

The VPC and its associated resources, which are prerequisites for this solution, should not be deleted if they are used by other applications.
Conclusion
In this post, we demonstrated how to automate the detection and redaction of PII across both text and image content using Amazon Bedrock Data Automation and Amazon Bedrock Guardrails. By centralizing and streamlining the redaction process, organizations can strengthen alignment with data privacy requirements, enhance security practices, and minimize operational overhead.
However, it is equally important to make sure that your solution is built with Amazon Bedrock Data Automation’s document processing constraints in mind. Amazon Bedrock Data Automation supports PDF, JPEG, and PNG file formats with a maximum console-processing size of 200 MB (500 MB via API), and single documents may not exceed 20 pages unless document splitting is enabled.
By using the centralized redaction capabilities of Amazon Bedrock Data Automation and Amazon Bedrock Guardrails, organizations can boost data privacy compliance management, cut operational overhead, and maintain stringent security across diverse workloads. This solution's extensibility further enables integration with other AWS services, fine-tuning detection logic for more advanced PII patterns, and broadening support for additional file types or languages in the future, thereby evolving into a more robust, enterprise-scale data protection framework.
We encourage exploration of the provided GitHub repository to deploy this solution within your organization. In addition to delivering operational efficiency, scalability, security, and adaptability, the solution also provides a unified interface and robust audit trail that simplifies data governance. By refining detection rules, users can integrate additional file formats where possible and build on the modular framework of Amazon Bedrock Data Automation and Amazon Bedrock Guardrails.
We invite you to implement this PII detection and redaction solution in the following GitHub repo to build a more secure, compliance-aligned, and highly adaptable data protection solution on Amazon Bedrock that addresses evolving business and regulatory requirements.

About the Authors
Himanshu Dixit is a Delivery Consultant at AWS Professional Services specializing in databases and analytics, bringing over 18 years of experience in technology. He is passionate about artificial intelligence, machine learning, and generative AI, leveraging these cutting-edge technologies to create innovative solutions that address real-world challenges faced by customers. Outside of work, he enjoys playing badminton, tennis, cricket, and table tennis, and spending time with his two daughters.
David Zhang is an Engagement Manager at AWS Professional Services, where he leads enterprise-scale AI/ML, cloud transformation initiatives for Fortune 100 customers in telecom, finance, media, and entertainment. Outside of work, he enjoys experimenting with new recipes in his kitchen, playing tenor saxophone, and capturing life’s moments through his camera.
Richard Session is a Lead User Interface Developer for AWS ProServe, bringing over 15 years of experience as a full-stack developer across marketing/advertising, enterprise technology, automotive, and ecommerce industries. With a passion for creating intuitive and engaging user experiences, he uses his extensive background to craft exceptional interfaces for AWS’s enterprise customers. When he’s not designing innovative user experiences, Richard can be found pursuing his love for coffee, spinning tracks as a DJ, or exploring new destinations around the globe.
Viyoma Sachdeva is a Principal Industry Specialist at AWS. She specializes in AWS DevOps, containerization, and IoT, helping customers accelerate their journey to the AWS Cloud.

Speed meets scale: Load testing SageMakerAI endpoints with Observe.AI …

This post is cowritten with Aashraya Sachdeva from Observe.ai.
You can use Amazon SageMaker to build, train and deploy machine learning (ML) models, including large language models (LLMs) and other foundation models (FMs). This helps you significantly reduce the time required for a range of generative AI and ML development tasks. An AI/ML development cycle typically involves data pre-processing, model development, training, testing and deployment lifecycles. By using SageMaker, your data science and ML engineering teams can offload a lot of the undifferentiated heavy lifting involved with model development.
While SageMaker can help teams offload a lot of heavy lifting, engineering teams still have to use manual steps to implement and fine-tune related services that are part of inference pipelines, such as queues and databases. In addition, teams have to test multiple GPU instance types to find the right balance between performance and cost.
Observe.ai provides a Conversation Intelligence (CI) product that integrates with contact center as a service (CCaaS) solutions. The tool analyzes calls in real time and after they're complete to enable features such as call summarizations, agent feedback, and auto response. The Conversation Intelligence (CI) features need to scale from customers that have fewer than 100 agents to customers that have thousands of agents, a tenfold increase in scale. To help with this, Observe.ai needed a mechanism to optimize their ML infrastructure and model serving costs. Without such a mechanism, developers had to write multiple test scripts and develop testing pipelines and debugging systems, which consumed a lot of time.
To solve this challenge, Observe.ai developed the One Load Audit Framework (OLAF), which integrates with SageMaker to identify bottlenecks and performance issues in ML services, offering latency and throughput measurements under both static and dynamic data loads. The framework also seamlessly incorporates ML performance testing into the software development lifecycle, facilitating accurate provisioning and cost savings. Using OLAF, Observe.ai's ML team was able to reduce testing time from a week to a few hours, which helped Observe.ai increase the frequency of endpoint deployments and customer onboarding severalfold. The OLAF utility is available on GitHub and is free to use. It is open source and distributed under the Apache 2.0 license.
In this blog post, you will learn how to use the OLAF utility to test and validate your SageMaker endpoint.
Solution overview
After you've deployed your model for inference and verified that it's functionally accurate, you'll want to improve the performance of your model. The first step is to load test the inference endpoint. You can use the load test metrics to apply optimizations to your model, decide on GPU instances, and fine tune the ML pipeline to increase performance without compromising on accuracy. Load testing needs to be repeated multiple times to measure the impact of any optimization. To load test, you need to configure load testing scripts to integrate with the relevant SageMaker APIs and extract metrics like latency, CPU, and memory utilization. You also need to set up a dashboard to view the results of the load test and export the load test metrics for further analysis, and you need a configurable framework to apply concurrent load to the endpoint.
How OLAF helps
OLAF saves you this heavy lifting by providing the preceding elements as a package. OLAF is integrated with Locust, a load testing framework, to create concurrent load and provide a dashboard for viewing results as the test progresses. OLAF also integrates with the SageMaker API to invoke the endpoint and extract the metrics used to measure performance.
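
Purely as an illustration of the kind of concurrent invocation and latency measurement that OLAF automates (this is not OLAF's code), a minimal sketch might look like the following; the endpoint name, Region, and payload are assumptions.

import json
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

# Assumptions: endpoint name, Region, and payload shape are illustrative only.
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")
payload = {"inputs": "translate English to French: Hello, how are you"}

def invoke_once(body: dict) -> float:
    start = time.perf_counter()
    runtime.invoke_endpoint(
        EndpointName="flan-t5-endpoint-blog",
        ContentType="application/json",
        Body=json.dumps(body),
    )
    return time.perf_counter() - start  # per-request latency in seconds

with ThreadPoolExecutor(max_workers=8) as pool:
    latencies = sorted(pool.map(invoke_once, [payload] * 64))

print(f"p50={latencies[len(latencies) // 2]:.3f}s max={latencies[-1]:.3f}s")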
In the following solution, you will learn how to deploy OLAF on your workstation as a Docker container. You provide the load test configuration through the Load test setup UI (shown in the following figure), and the OLAF framework uses the Boto3 SDK to push inference requests to a SageMaker inference endpoint. OLAF monitors latency and the available performance metrics through its Performance reports dashboard.

Prerequisites
For this solution walkthrough, you need the following:

An AWS account
Docker installed on your workstation
The AWS Command Line Interface (AWS CLI) installed and configured. If you’re using long term credentials such as access keys, see manage access keys for IAM users and secure access keys for best practices. This post uses temporary short term credentials generated by the AWS Security Token Service (AWS STS).

Generate your AWS credentials using AWS STS
To get started, use the AWS CLI to generate your credentials.
Note: Ensure that the role or user from which the access keys are generated has AmazonSageMakerFullAccess permission. Your AWS CLI role should have the necessary trust policy to assume the role from which the access keys are generated.
Getting the role-arn
In your AWS CLI type in the following command:

aws iam get-role --role-name sagemaker_role

The command generates JSON output similar to the following. The role ARN is the value of the Arn property.

{
    "Role": {
        "Path": "/",
        "RoleName": "sagemaker_role",
        "RoleId": "AROA123456789EXAMPLE",
        "Arn": "arn:aws:iam::111122223333:role/sagemaker_role",
        "CreateDate": "2025-12-05T13:02:33+00:00",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "ec2.amazonaws.com"
                    },
                    "Action": "sts:AssumeRole"
                }
            ]
        },
        "Description": "Allows EC2 instances to call AWS services on your behalf.",
        "MaxSessionDuration": 3600,
        "RoleLastUsed": {}
    }
}

Run the following command in your AWS CLI:

aws sts assume-role --role-arn <role arn to assume> --role-session-name <session name> --duration-seconds <timeout duration>

Set the role ARN value from the previous step in the --role-arn argument.
Provide the value olaf_session to the --role-session-name argument and set a value equivalent to how long you expect your load test to run in the --duration-seconds argument. In this blog we are setting it to 1800 seconds, which gives 30 minutes of load testing time.

The assume-role command generates temporary AWS credentials similar to the following:

{
    "Credentials": {
        "AccessKeyId": "ASIAIOSFODNN7EXAMPLE",
        "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "SessionToken": "IQoJb3JpZ2luX2VjEJf//////////wEaCXVzLWVhc3QtMSJFMEMCIFdzSaxLya/E71xi2SeG8KFDF46DkxvsWt6Ih0I5X2I6Ah9FYGmWi3fnQfyPQWuzE0Co44xp+qOAxbfaHJ53OYbBKpkCCF8QARoMNjE1NTE1NDU5MjM5IgyoWu5a5DJX3BMn7LYq9gHiRr2sQvStZT9tvvdS8QFjTntBYFEkDL636Crj4xw5rDieBoYFB9h+ozSqMXOtze79DHQLyCduT+McWOlB9Ic5x/xtzPT9HZsfMaEMUOPgI9LtKWUK367rVdcqBV8HH8wOwUS9RhwIyXg2vsGa+WanaS8o6sO8PVkvqOs4ea3CFguncGgSqIftJvgMg0OswzkAoUKXG6jMwL3Ppu13Dg9NV3YKOsS80vejhEJ8QFiKiTsJKX2QmQz/wUN4DN83y8qeFfYEpuYC92oZzv2gErrsXqFd+7/+2w97mInPlD6g1tyd8FlGdXg821WckmwdPu7TYqsCR9kwiM3LyQY6nwFM3U7f/sCre28o2Js31dig0WHb1iv3nTR6m/bIKqsQL4EtYXPGjHD6Ifsf9nQYtkPQC/PqzXg7anx6Q6OW5CzVvk4xU/G9+HcCej84MutK/hQGp3xnRPuJvUIs/q/QlddURk/MFZW9X3njLCn89FRmJ/tI1Mzy/yctwgLcBetE7RIPgaM/90HNXp62vBMK0tzqR1orm6/7eOGV5DXaprQ=",
        "Expiration": "2025-12-05T14:34:56+00:00"
    },
    "AssumedRoleUser": {
        "AssumedRoleId": "AROA123456789EXAMPLE:olaf-session",
        "Arn": "arn:aws:sts::111122223333:assumed-role/sm-blog-role/olaf-session"
    }
}

Make a note of the access key, secret key, and session token, which you will use to configure the test in the OLAF tool.
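
If you prefer to generate the same temporary credentials from Python instead of the CLI, a minimal boto3 sketch follows; the role ARN mirrors the example output above, and the session name and duration match the values used in this post.

import boto3

sts = boto3.client("sts")
response = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/sagemaker_role",  # role ARN from the get-role output
    RoleSessionName="olaf_session",
    DurationSeconds=1800,  # 30 minutes of load testing time
)
credentials = response["Credentials"]
print(credentials["AccessKeyId"])
print(credentials["SecretAccessKey"])
print(credentials["SessionToken"])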

Set up your SageMaker inference endpoint
In this step, you set up a SageMaker inference endpoint. The following is a CloudFormation template to set up the endpoint. Copy the content and save it as a YAML file for use in the subsequent steps.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'CloudFormation template for deploying FLAN-T5 model on Amazon SageMaker'
Parameters:
  ModelName:
    Type: String
    Default: flan-t5-model
    Description: Name of the SageMaker model
  EndpointName:
    Type: String
    Default: flan-t5-endpoint-blog
    Description: Name of the SageMaker endpoint
  InstanceType:
    Type: String
    Default: ml.g5.xlarge
    Description: Instance type for the SageMaker endpoint
    AllowedValues:
      - ml.g4dn.xlarge
      - ml.g4dn.2xlarge
      - ml.g5.xlarge
      - ml.g5.2xlarge
      - ml.p3.2xlarge
  ModelSize:
    Type: String
    Default: base
    Description: Size of the FLAN-T5 model
    AllowedValues:
      - small
      - base
      - large
      - xl
      - xxl
Resources:
  SageMakerExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
        - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
  SageMakerModel:
    Type: AWS::SageMaker::Model
    Properties:
      ModelName: !Sub '${AWS::StackName}-flan-t5-model'
      ExecutionRoleArn: !GetAtt SageMakerExecutionRole.Arn
      EnableNetworkIsolation: true
      PrimaryContainer:
        Image: !Sub '763104351884.dkr.ecr.${AWS::Region}.amazonaws.com/huggingface-pytorch-inference:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04'
        Environment:
          HF_MODEL_ID: !Sub 'google/flan-t5-${ModelSize}'
  SageMakerEndpointConfig:
    Type: AWS::SageMaker::EndpointConfig
    Properties:
      EndpointConfigName: !Sub '${EndpointName}-config'
      ProductionVariants:
        - VariantName: AllTraffic
          ModelName: !GetAtt SageMakerModel.ModelName
          InstanceType: !Ref InstanceType
          InitialInstanceCount: 1
  SageMakerEndpoint:
    Type: AWS::SageMaker::Endpoint
    Properties:
      EndpointName: !Ref EndpointName
      EndpointConfigName: !GetAtt SageMakerEndpointConfig.EndpointConfigName
Outputs:
  SageMakerEndpointId:
    Description: ID of the SageMaker Endpoint
    Value: !Ref SageMakerEndpoint
  SageMakerEndpointName:
    Description: Name of the SageMaker Endpoint
    Value: !Ref EndpointName
  ModelName:
    Description: Name of the deployed model
    Value: !Ref ModelName

Open an AWS CloudShell window by selecting the CloudShell icon at the top of the AWS Management Console in the AWS Region where you want the endpoint to be created.

In your CloudShell window, choose Actions and select Upload file. Select and upload the CloudFormation YAML file shared at the start of this section.

Run the following command at the CloudShell prompt

aws cloudformation create-stack \
  --stack-name flan-t5-endpoint-stack \
  --template-body file://<YAML_FILE_NAME> \
  --capabilities CAPABILITY_IAM

Navigate to the Amazon SageMaker AI Studio console. You might need to change the Region to match where you have deployed your SageMaker endpoint. Select Inference and then Endpoints in the navigation pane to view the deployed endpoint. The SageMaker endpoint takes a few minutes to complete provisioning. When ready, the value of the Status field is InService. Note the endpoint name.
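
Optionally, you can check the provisioning status programmatically instead of in the console; the following minimal boto3 sketch assumes the default endpoint name from the template.

import boto3

sagemaker = boto3.client("sagemaker")
endpoint = sagemaker.describe_endpoint(EndpointName="flan-t5-endpoint-blog")
print(endpoint["EndpointStatus"])  # "InService" once the endpoint is ready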

Install OLAF
You’re ready to install and configure OLAF to help you load test your SageMaker AI inference endpoint.

Clone the OLAF repository from the OLAF GitHub repo:

git clone https://github.com/Observeai-Research/olaf.git

Navigate to the olaf directory and build the docker image for OLAF:

cd olaf
docker build -t olaf .

Run OLAF:

docker run -p 80:8000 olaf

Open a browser window and enter the following URL to bring up the OLAF UI.

http://localhost

Enter olaf as the username and password to sign in to the OLAF dashboard. On the left is a series of radio buttons to select the resource to be tested, including SageMaker, S3, and so on. On the right is a setup screen that changes based on the resource selected.

OLAF supports additional options, including:

Multi-model
Enable batch mode

Test the SageMaker endpoint

Open the OLAF UI at http://localhost:80/.
Select Sagemaker from the navigation pane and configure the test:

SageMaker endpoint – Enter the name of the SageMaker endpoint from the SageMaker Unified Studio console.
Predictor type – OLAF supports pytorch, sklearn and tensorflow predictors. Keep the default values.
Input Serializer – Serialization options are numpy and json. Keep the default values.
Output Serializer – Serialization options are numpy and json. Keep the default values.
AWS Region – Select the Region where the SageMaker endpoint is deployed
AWS access key – Enter the AWS access key generated from AWS STS in the section “Generate your AWS credentials using AWS STS” above.
AWS secret key – Enter the AWS secret key generated from AWS STS in the section “Generate your AWS credentials using AWS STS” above.
AWS session token – Enter the session token generated from AWS STS in the section “Generate your AWS credentials using AWS STS” above.
Input query json – For this test, enter the following prompt to translate a phrase from English to French.

[
  {
    "inputs": "translate the following phrase in English to French : Hello, how are you"
  }
]
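
Optionally, before starting the load session, you can verify that the endpoint responds to this payload with a single direct invocation. The following minimal boto3 sketch assumes the default endpoint name from the template and an example Region; adjust the payload shape if your container expects a different format.

import json

import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")  # adjust Region as needed
payload = [{"inputs": "translate the following phrase in English to French : Hello, how are you"}]

response = runtime.invoke_endpoint(
    EndpointName="flan-t5-endpoint-blog",  # template default; use your endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))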

Choose START LOAD SESSION to start a load test session. The session is started and a link to the session is provided at the bottom of the page. If the link doesn't appear within a few seconds, choose START LOAD SESSION again to generate the link to the session.

Selecting the link takes you to a Locust dashboard. Enter the number of concurrent users that you want the test to simulate in the Number of users field and the rate at which users should be started in the Spawn rate field. Choose Start swarming to start the load test.

On starting the test, a reporting page, shown in the following figure, is presented that you can use to monitor the various performance parameters as the test proceeds. The information on this page provides a summary of the statistics, the p50 and p95 latency values, and the CPU and memory usage of the SageMaker workers.

Choose Charts at the top of the screen to view charts that show the Total Requests per Second and the Response Times in milliseconds. The Total Requests per Second chart shows the successful requests in green and the failed requests in red. The Response Times chart shows the fiftieth percentile response times in green and the ninety-fifth percentile response times in yellow.

Choose Workers at the top of the screen to view the worker statistics. Workers are created to generate the desired load. The # users column shows the number of users generated by the worker, and the CPU Usage and Memory Usage columns show the resource utilization of the worker.

You can view and download the final statistics for analysis. Choose Download Data at the top of the screen to view data download options. You can download the data as a CSV file from the Statistics, Failures, Exceptions, and Charts reporting pages.

You must stop the current load session before you can execute a new session. Choose STOP RUNNING LOAD SESSION to stop the session. If configured, the data can be uploaded into a specified Amazon Simple Storage Service (Amazon S3) bucket. Follow the instructions in Advanced OLAF Usage item 3, Automated Backup of Load Test Report, to configure the upload of test results to Amazon S3.

Hosting the client
For the solution described in this post, you used a desktop to host the OLAF container and set up the load tests. The choice of using your desktop or an Amazon Elastic Compute Cloud (Amazon EC2) instance can impact the measured latency because it affects the round trip time. Network bandwidth can also impact the latency. The key is to standardize the environment that you use to run the tests based on how your customers use the endpoints.
Clean up
When you’re done with this demonstration, remove any resources that you no longer need to avoid incurring future costs.

In the CloudShell terminal run the following command to delete the SageMaker endpoint:

aws cloudformation delete-stack --stack-name flan-t5-endpoint-stack

Run the following command to list the running Docker containers

docker ps

Note the container_id and then run the following command to stop the Docker container.

docker stop <container_id>

Conclusion
In this post, you’ve learned how to set up OLAF and use it to load test a SageMaker endpoint with a few basic steps. OLAF represents a significant step forward in streamlining the optimization of ML infrastructure and model serving costs. Through this demonstration, you’ve seen how OLAF seamlessly integrates with SageMaker to provide valuable insights into endpoint performance under various load conditions. Key benefits of OLAF include:

Straightforward setup and integration with existing SageMaker endpoints
Real-time monitoring of performance metrics including latency and throughput
Detailed statistics and downloadable reports for analysis
Ability to test different load patterns and concurrency levels
Support for multiple model types and serialization options

For organizations like Observe.ai that need to scale their ML operations efficiently, OLAF eliminates the need to develop custom testing infrastructure and debugging systems. This means that development teams can focus on their core product features while ensuring optimal performance and cost-effectiveness of their ML infrastructure. As the adoption of ML continues to grow, tools like OLAF become increasingly valuable in helping organizations optimize their ML operations. Whether you’re running a few models or managing a large-scale ML infrastructure, OLAF provides the insights needed to make informed decisions about instance types, scaling, and resource allocation.
In this sample solution, you used short term credentials generated by the AWS STS service to connect to SageMaker from OLAF. Ensure that the necessary steps are taken to secure your access keys and credentials in a production environment.
To get started with OLAF, visit the GitHub repository and follow the installation steps outlined in this post. The framework’s intuitive interface and comprehensive monitoring capabilities make it an essential tool for organizations that want to optimize their SageMaker deployments.

About the authors
Aashraya Sachdeva is a technology leader with deep expertise in genAI, product development, and platform engineering. As the Director of Engineering at Observe, he oversees teams building scalable, agentic solutions that enhance both customer experience and operational efficiency. With extensive experience guiding ML initiatives from early data exploration through deployment and large-scale operations, he brings a pragmatic, reliability-focused approach to delivering high-performing platforms. Throughout his career, he has played a key role in launching multiple products, leveraging his ML background to create innovative yet practical solutions, while consistently fostering collaboration, mentorship, and technical excellence across engineering teams.
Shibu Jacob is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers architect and implement cloud-native solutions. With over two decades of experience in software development and architecture, Shibu specializes in containerization, microservices, and event-driven architectures. He is particularly passionate about the transformative potential of AI in software development and architectural design. Prior to joining AWS, he spent 20 years working with enterprises and startups, bringing a wealth of practical experience to his current role. Outside of work, Shibu enjoys following Formula 1 racing, working on DIY automotive projects, going on long road trips, and spending time with his family.

A Coding Implementation to Build a Unified Apache Beam Pipeline Demons …

In this tutorial, we demonstrate how to build a unified Apache Beam pipeline that works seamlessly in both batch and stream-like modes using the DirectRunner. We generate synthetic, event-time–aware data and apply fixed windowing with triggers and allowed lateness to demonstrate how Apache Beam consistently handles both on-time and late events. By switching only the input source, we keep the core aggregation logic identical, which helps us clearly understand how Beam’s event-time model, windows, and panes behave without relying on external streaming infrastructure. Check out the FULL CODES here.

!pip -q install -U "grpcio>=1.71.2" "grpcio-status>=1.71.2"
!pip -q install -U apache-beam crcmod

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import AfterWatermark, AfterProcessingTime, AccumulationMode
from apache_beam.testing.test_stream import TestStream
import json
from datetime import datetime, timezone

We install the required dependencies and ensure version compatibility so that Apache Beam runs cleanly in the environment. We import the core Beam APIs along with windowing, triggers, and TestStream utilities needed later in the pipeline. We also bring in standard Python modules for time handling and JSON formatting. Check out the FULL CODES here.

MODE = "stream"
WINDOW_SIZE_SECS = 60
ALLOWED_LATENESS_SECS = 120

def make_event(user_id, event_type, amount, event_time_epoch_s):
    return {"user_id": user_id, "event_type": event_type, "amount": float(amount), "event_time": int(event_time_epoch_s)}

base = datetime.now(timezone.utc).replace(microsecond=0)
t0 = int(base.timestamp())

BATCH_EVENTS = [
    make_event("u1", "purchase", 20, t0 + 5),
    make_event("u1", "purchase", 15, t0 + 20),
    make_event("u2", "purchase", 8, t0 + 35),
    make_event("u1", "refund", -5, t0 + 62),
    make_event("u2", "purchase", 12, t0 + 70),
    make_event("u3", "purchase", 9, t0 + 75),
    make_event("u2", "purchase", 3, t0 + 50),
]

We define the global configuration that controls window size, lateness, and execution mode. We create synthetic events with explicit event-time timestamps so that windowing behavior is deterministic and easy to reason about. We prepare a small dataset that intentionally includes out-of-order and late events to observe Beam’s event-time semantics. Check out the FULL CODES here.

def format_joined_record(kv):
    user_id, d = kv
    return {
        "user_id": user_id,
        "count": int(d["count"][0]) if d["count"] else 0,
        "sum_amount": float(d["sum_amount"][0]) if d["sum_amount"] else 0.0,
    }

class WindowedUserAgg(beam.PTransform):
    def expand(self, pcoll):
        stamped = pcoll | beam.Map(lambda e: beam.window.TimestampedValue(e, e["event_time"]))
        windowed = stamped | beam.WindowInto(
            FixedWindows(WINDOW_SIZE_SECS),
            allowed_lateness=ALLOWED_LATENESS_SECS,
            trigger=AfterWatermark(
                early=AfterProcessingTime(10),
                late=AfterProcessingTime(10),
            ),
            accumulation_mode=AccumulationMode.ACCUMULATING,
        )
        keyed = windowed | beam.Map(lambda e: (e["user_id"], e["amount"]))
        counts = keyed | beam.combiners.Count.PerKey()
        sums = keyed | beam.CombinePerKey(sum)
        return (
            {"count": counts, "sum_amount": sums}
            | beam.CoGroupByKey()
            | beam.Map(format_joined_record)
        )

We build a reusable Beam PTransform that encapsulates all windowed aggregation logic. We apply fixed windows, triggers, and accumulation rules, then group events by user and compute counts and sums. We keep this transform independent of the data source, so the same logic applies to both batch and streaming inputs. Check out the FULL CODES here.

class AddWindowInfo(beam.DoFn):
    def process(self, element, window=beam.DoFn.WindowParam, pane_info=beam.DoFn.PaneInfoParam):
        ws = float(window.start)
        we = float(window.end)
        yield {
            **element,
            "window_start_utc": datetime.fromtimestamp(ws, tz=timezone.utc).strftime("%H:%M:%S"),
            "window_end_utc": datetime.fromtimestamp(we, tz=timezone.utc).strftime("%H:%M:%S"),
            "pane_timing": str(pane_info.timing),
            "pane_is_first": pane_info.is_first,
            "pane_is_last": pane_info.is_last,
        }

def build_test_stream():
    return (
        TestStream()
        .advance_watermark_to(t0)
        .add_elements([
            beam.window.TimestampedValue(make_event("u1", "purchase", 20, t0 + 5), t0 + 5),
            beam.window.TimestampedValue(make_event("u1", "purchase", 15, t0 + 20), t0 + 20),
            beam.window.TimestampedValue(make_event("u2", "purchase", 8, t0 + 35), t0 + 35),
        ])
        .advance_processing_time(5)
        .advance_watermark_to(t0 + 61)
        .add_elements([
            beam.window.TimestampedValue(make_event("u1", "refund", -5, t0 + 62), t0 + 62),
            beam.window.TimestampedValue(make_event("u2", "purchase", 12, t0 + 70), t0 + 70),
            beam.window.TimestampedValue(make_event("u3", "purchase", 9, t0 + 75), t0 + 75),
        ])
        .advance_processing_time(5)
        .add_elements([
            beam.window.TimestampedValue(make_event("u2", "purchase", 3, t0 + 50), t0 + 50),
        ])
        .advance_watermark_to(t0 + 121)
        .advance_watermark_to_infinity()
    )

We enrich each aggregated record with window and pane metadata so we can clearly see when and why results are emitted. We convert Beam’s internal timestamps into human-readable UTC times for clarity. We also define a TestStream that simulates real streaming behavior using watermarks, processing-time advances, and late data. Check out the FULL CODES here.

def run_batch():
    with beam.Pipeline(options=PipelineOptions([])) as p:
        (
            p
            | beam.Create(BATCH_EVENTS)
            | WindowedUserAgg()
            | beam.ParDo(AddWindowInfo())
            | beam.Map(json.dumps)
            | beam.Map(print)
        )

def run_stream():
    opts = PipelineOptions([])
    opts.view_as(StandardOptions).streaming = True
    with beam.Pipeline(options=opts) as p:
        (
            p
            | build_test_stream()
            | WindowedUserAgg()
            | beam.ParDo(AddWindowInfo())
            | beam.Map(json.dumps)
            | beam.Map(print)
        )

run_stream() if MODE == "stream" else run_batch()

We wire everything together into executable batch and stream-like pipelines. We toggle between modes by changing a single flag while reusing the same aggregation transform. We run the pipeline and print the windowed results directly, making the execution flow and outputs easy to inspect.

In conclusion, we demonstrated that the same Beam pipeline can process both bounded batch data and unbounded, stream-like data while preserving identical windowing and aggregation semantics. We observed how watermarks, triggers, and accumulation modes influence when results are emitted and how late data updates previously computed windows. Also, we focused on the conceptual foundations of Beam’s unified model, providing a solid base for later scaling the same design to real streaming runners and production environments.

Check out the FULL CODES here.
The post A Coding Implementation to Build a Unified Apache Beam Pipeline Demonstrating Batch and Stream Processing with Event-Time Windowing Using DirectRunner appeared first on MarkTechPost.

TII Abu-Dhabi Released Falcon H1R-7B: A New Reasoning Model Outperform …

Technology Innovation Institute (TII), Abu Dhabi, has released Falcon-H1R-7B, a 7B parameter reasoning specialized model that matches or exceeds many 14B to 47B reasoning models in math, code and general benchmarks, while staying compact and efficient. It builds on Falcon H1 7B Base and is available on Hugging Face under the Falcon-H1R collection.

Falcon-H1R-7B is interesting because it combines 3 design choices in one system: a hybrid Transformer plus Mamba2 backbone, a very long context that reaches 256k tokens in standard vLLM deployments, and a training recipe that mixes supervised long form reasoning with reinforcement learning using GRPO.

Hybrid Transformer plus Mamba2 architecture with long context

Falcon-H1R-7B is a causal decoder only model with a hybrid architecture that combines Transformer layers and Mamba2 state space components. The Transformer blocks provide standard attention based reasoning, while the Mamba2 blocks give linear time sequence modeling and better memory scaling as context length grows. This design targets the 3 axes of reasoning efficiency that the team describes, speed, token efficiency and accuracy.

The model runs with a default --max-model-len of 262144 when served through vLLM, which corresponds to a practical 256k token context window. This allows very long chain of thought traces, multi step tool use logs and large multi document prompts in a single pass. The hybrid backbone helps control memory use at these sequence lengths and improves throughput compared with a pure Transformer 7B baseline on the same hardware.
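
As a minimal sketch of what serving at this context length looks like with vLLM's Python API, the following assumes a hypothetical Hugging Face model ID; check the Falcon-H1R collection on Hugging Face for the exact repository name.

from vllm import LLM, SamplingParams

# Model ID is an assumption; check the Falcon-H1R collection on Hugging Face for the exact name.
# Serving at max_model_len=262144 requires substantial GPU memory; reduce it on smaller GPUs.
llm = LLM(model="tiiuae/Falcon-H1R-7B", max_model_len=262144)
params = SamplingParams(temperature=0.6, max_tokens=4096)
outputs = llm.generate(["Solve step by step: what is 17 * 24?"], params)
print(outputs[0].outputs[0].text)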

Training recipe for reasoning tasks

Falcon-H1R-7B uses a 2 stage training pipeline:

In the first stage, the team runs cold start supervised fine tuning on top of Falcon-H1-7B Base. The SFT (supervised fine tuning) data mixes step by step long form reasoning traces in 3 main domains, mathematics, coding and science, plus non reasoning domains such as chat, tool calling and safety. Difficulty aware filtering upweights harder problems and downweights trivial ones. Targets can reach up to 48k tokens, so the model sees long derivations and full solution paths during training.

In the second stage, the SFT checkpoint is refined with GRPO, which is a group relative policy optimization method for reinforcement learning. Rewards are given when the generated reasoning chain is verifiably correct. For math problems, the system uses symbolic checks on the final answer. For code, it executes the generated program against unit tests. This RL stage pushes the model to keep useful intermediate steps while staying within a token budget.
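
As an illustration of the verifiable-reward idea described above (not TII's actual implementation), a minimal sketch of reward functions for math and code might look like the following; the exact-match check stands in for the symbolic verification mentioned in the post.

import subprocess
import tempfile

def math_reward(model_answer: str, reference_answer: str) -> float:
    # Reward 1.0 only if the final answer matches the reference (stand-in for a symbolic check).
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(generated_program: str, unit_tests: str) -> float:
    # Execute the generated program together with its unit tests; reward only if they pass.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_program + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=30)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0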

The result is a 7B model that is tuned specifically for chain of thought reasoning, rather than general chat.

Benchmarks in math, coding and general reasoning

The Falcon-H1R-7B benchmark scores are grouped across math, code and agentic tasks, and general reasoning tasks.

In the math group, Falcon-H1R-7B reaches an aggregate score of 73.96%, ahead of Apriel-1.5-15B at 69.32% and larger models like Qwen3-32B and Nemotron-H-47B. On individual benchmarks:

AIME 24, 88.1%, higher than Apriel-1.5-15B at 86.2%

AIME 25, 83.1%, higher than Apriel-1.5-15B at 80%

HMMT 25, 64.9%, above all listed baselines

AMO Bench, 36.3%, compared with 23.3% for DeepSeek-R1-0528 Qwen3-8B

For code and agentic workloads, the model reaches 33.95% as a group score. On LiveCodeBench v6, Falcon-H1R-7B scores 68.6%, which is higher than Qwen3-32B and other baselines. It also scores 28.3% on the SciCode sub problem benchmark and 4.9% on Terminal Bench Hard, where it ranks second behind Apriel 1.5-15B but ahead of several 8B and 32B systems.

https://huggingface.co/blog/tiiuae/falcon-h1r-7b

On general reasoning, Falcon-H1R-7B achieves 49.48% as a group score. It records 61.3% on GPQA D, close to other 8B models, 72.1% on MMLU Pro, which is higher than the other 8B models in the comparison, 11.1% on HLE and 53.4% on IFBench, where it is second only to Apriel 1.5 15B.

The key takeaway is that a 7B model can sit in the same performance band as many 14B to 47B reasoning models, if the architecture and training pipeline are tuned for reasoning tasks.

Inference throughput and test time scaling

The team also benchmarked Falcon-H1R-7B on throughput and test time scaling under realistic batch settings.

For a 512 token input and 32k token output, Falcon-H1R-7B reaches about 1,000 tokens per second per GPU at batch size 32 and about 1,500 tokens per second per GPU at batch size 64, nearly double the throughput of Qwen3-8B in the same configuration. For an 8k input and 16k output, Falcon-H1R-7B reaches around 1,800 tokens per second per GPU, while Qwen3-8B stays below 900. The hybrid Transformer along with Mamba architecture is a key factor in this scaling behavior, because it reduces the quadratic cost of attention for long sequences.

Falcon-H1R-7B is also designed for test time scaling using Deep Think with confidence, known as DeepConf. The idea is to run many chains of thought in parallel, then use the model’s own next token confidence scores to filter noisy traces and keep only high quality candidates.

On AIME 24 and AIME 25, Falcon-H1R-7B reaches 96.7% accuracy with fewer than 100 million generated tokens, which puts it on a favorable Pareto frontier of accuracy versus token cost compared with other 8B, 14B and 32B reasoning models. On the parser verifiable subset of AMO Bench, it reaches 35.9% accuracy with 217 million tokens, again ahead of the comparison models at similar or larger scale.
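
As an illustration of the confidence-filtered selection idea behind DeepConf (not the actual implementation), a minimal sketch follows; the per-trace confidence scores are assumed to come from the serving stack, for example from averaged token log probabilities.

from collections import Counter

def select_answer(samples, confidence_threshold=0.8):
    # samples: list of (final_answer, mean_token_confidence) pairs produced by the serving stack.
    kept = [answer for answer, confidence in samples if confidence >= confidence_threshold]
    if not kept:  # fall back to the single most confident trace
        kept = [max(samples, key=lambda s: s[1])[0]]
    return Counter(kept).most_common(1)[0][0]

# Example: three sampled traces, one low-confidence outlier is filtered out before voting.
print(select_answer([("408", 0.93), ("408", 0.88), ("420", 0.41)]))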

Key Takeaways

Falcon-H1R-7B is a 7B parameter reasoning model that uses a hybrid Transformer along with Mamba2 architecture and supports a 256k token context for long chain of thought prompts.

The model is trained in 2 stages, supervised fine tuning on long reasoning traces in math, code and science up to 48k tokens, followed by GRPO based reinforcement learning with verifiable rewards for math and code.

Falcon-H1R-7B achieves strong math performance, including about 88.1% on AIME 24, 83.1% on AIME 25 and a 73.96% aggregate math score, which is competitive with or better than larger 14B to 47B models.

On coding and agentic tasks, Falcon-H1R-7B obtains 33.95% as a group score and 68.6% on LiveCodeBench v6, and it is also competitive on general reasoning benchmarks such as MMLU Pro and GPQA D.

The hybrid design improves throughput, reaching around 1,000 to 1,800 tokens per second per GPU in the reported settings, and the model supports test time scaling through Deep Think with confidence to improve accuracy using multiple reasoning samples under a controlled token budget.

Check out the Technical details and MODEL WEIGHTS here.
The post TII Abu-Dhabi Released Falcon H1R-7B: A New Reasoning Model Outperforming Others in Math and Coding with only 7B Params with 256k Context Window appeared first on MarkTechPost.