Checkpointless training on Amazon SageMaker HyperPod: Production-scale …

Foundation model training has reached an inflection point where traditional checkpoint-based recovery methods are becoming a bottleneck to efficiency and cost-effectiveness. As models grow to trillions of parameters and training clusters expand to thousands of AI accelerators, even minor disruptions can result in significant costs and delays.
In this post, we introduce checkpointless training on Amazon SageMaker HyperPod, a paradigm shift in model training that reduces the need for traditional checkpointing by enabling peer-to-peer state recovery. Results from production-scale validation show an 80–93% reduction in recovery time (from 15–30 minutes or more to under 2 minutes) and up to 95% training goodput on clusters with thousands of AI accelerators.
Understanding goodput
Foundation model training is one of the most resource-intensive processes in AI, often involving millions of dollars in compute spend across thousands of AI accelerators running for days to months. Because of the inherent all-or-none synchrony across distributed ranks, the loss of even a single rank to a software or hardware fault brings the training workload to a complete halt. To mitigate such localized faults, the industry has relied on checkpoint-based recovery: periodically saving training states (checkpoints) to a durable store at a user-defined checkpoint interval. When a fault occurs, the training workload resumes by restoring from the latest saved checkpoint. This traditional restart-to-recover model has become increasingly untenable as model sizes grow from billions to trillions of parameters and training workloads grow from hundreds to thousands of AI accelerators.
This challenge of maintaining efficient training operations at scale has led to the concept of goodput—the actual useful work accomplished in an AI training system compared to its theoretical maximum capacity. In foundation model training, goodput is impacted by system failures and recovery overhead. The gap between the system’s theoretical maximum throughput and its actual productive output (goodput) grows larger with: increased frequency of failures (which rises with cluster size), longer recovery times (which scale with model size and cluster size), and higher costs of idle resources during recovery. This definition helps frame why measuring and optimizing goodput becomes increasingly crucial as AI training scales to larger clusters and more complex models, where even small inefficiencies can result in significant financial and time costs.
A pre-training workload on a HyperPod cluster with 256 P5 instances, checkpointing every 20 minutes, incurs two costs when disrupted: 10 minutes of lost work plus 10 minutes for recovery. With ml.p5.48xlarge instances costing $55 per hour, each disruption costs $4,693 in compute time. For a month-long training run, daily disruptions would accumulate to $141,000 in extra costs and delay completion by 10 hours.
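To make the arithmetic concrete, the following back-of-the-envelope sketch reproduces the estimate above in Python. The instance price, cluster size, and one-disruption-per-day pattern are the assumptions from this example, not general figures.

# Cost of one disruption, using the assumed numbers from the example above
instances = 256
price_per_hour = 55.0        # assumed ml.p5 on-demand price, USD per instance-hour
lost_work_min = 10           # work since the last checkpoint that must be recomputed
recovery_min = 10            # time to restart and reload the checkpoint

wasted_hours = (lost_work_min + recovery_min) / 60
cost_per_disruption = instances * price_per_hour * wasted_hours
print(f"Cost per disruption: ${cost_per_disruption:,.0f}")    # ~ $4,693

# One disruption per day over a 30-day run
print(f"Monthly overhead: ${30 * cost_per_disruption:,.0f}")  # ~ $140,800
print(f"Schedule delay: {30 * (lost_work_min + recovery_min) / 60:.0f} hours")  # 10 hours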

As cluster sizes grow, the probability and frequency of failures can increase.

As the training spans across thousands of nodes, disruptions caused by faults become increasingly frequent. Meanwhile, recovery becomes slower because the workload reinitialization overhead grows linearly with cluster size. The cumulative impact of large-scale AI training failures can reach millions of dollars annually and translate directly to delayed time-to-market, slower model iteration cycles, and competitive disadvantage. Every hour of idle GPU time is an hour not spent advancing model capabilities.
Checkpoint-based recovery
Checkpoint-based recovery in distributed training is far more complex and time-consuming than commonly understood. When a failure occurs in traditional distributed training, the restart process involves far more than loading the last checkpoint. Understanding what happens during recovery reveals why it takes so long and why the entire cluster must sit idle.
The all-or-none cascade
A single failure—one GPU error, one network timeout, or one hardware fault—can trigger a complete training cluster shutdown. Because distributed training treats all processes as tightly coupled, any single failure necessitates a complete restart. When any process fails, the orchestration system (for example, TorchElastic or Kubernetes) must terminate every process across the job and restart from scratch. Each restart requires navigating a complex, multi-stage recovery process where every stage is sequential and blocking:

Stage 1: Training job restart – The training job orchestrator detects a failure, terminates all processes on all nodes, and then restarts the training job across the cluster.
Stage 2: Process and network initialization – Every process must re-execute the training script from the beginning. That includes rank initialization, loading Python modules from a durable store such as Network File System (NFS) or object storage, and establishing the training topology and communication backend through peer discovery and process group creation. The process group initialization alone can take tens of minutes on large clusters.
Stage 3: Checkpoint retrieval – Each process must first identify the last completely saved checkpoint, then retrieve it from persistent storage (for example, NFS or object storage) and load multiple state dictionaries: the model’s parameters and buffers, the optimizer’s internal state (momentum, variance, and so on), the learning rate scheduler, and training loop metadata (epoch, batch number). This step can take tens of minutes or longer depending on cluster and model size.
Stage 4: Data loader initialization – The data-loading ranks have the additional responsibility of initializing the data buffers. That includes retrieving the data checkpoint from durable storage such as Amazon FSx or Amazon Simple Storage Service (Amazon S3) and prefetching the training data to start the training loop. Data checkpointing is an essential step to avoid processing the same data samples multiple times or skipping samples after a training disruption. Depending on the data mix strategy, data locality, and bandwidth, this step can take a few minutes.
Stage 5: First step overhead – After the checkpoint and training data are retrieved and loaded, there is additional overhead to run the first training step, which we call first step overhead (FSO). This first step typically includes memory allocation, creating and setting up the CUDA context for communication with the GPUs, CUDA graph compilation, and so on.
Stage 6: Lost steps overhead – Only after all previous stages complete successfully can the training loop resume its regular progress. Because training resumes from the last saved checkpoint, all the steps computed between that checkpoint and the fault are lost. Those lost steps need to be recomputed, which we call lost steps overhead (LSO). Following the recomputation phase, the training job resumes productive work that directly contributes to goodput.

How checkpointless training eliminates these bottlenecks
The six stages outlined above—job termination and restart, process and network initialization, checkpoint retrieval, data loader initialization, first step overhead, and lost steps recomputation—represent the fundamental bottlenecks in checkpoint-based recovery. Each stage is sequential and blocking, and training recovery can take minutes to several hours for large models. Critically, the entire cluster must wait for every stage to complete before training can resume.
Checkpointless training eliminates this cascade. Checkpointless training preserves model state coherence across the distributed cluster, eliminating the need for periodic snapshots. When failures occur, the system quickly recovers by using healthy peers, avoiding both storage I/O operations and full process restarts typically required by traditional checkpointing approaches.

Checkpointless training architecture

Checkpointless training is built on five components that work together to eliminate the traditional checkpoint-restart bottlenecks. Each component addresses a specific bottleneck in the recovery process, and together they enable automatic detection and recovery of infrastructure faults in minutes with zero manual intervention, even with thousands of AI accelerators.
Component 1: TCPStore-less/root-less NCCL and Gloo initialization (optimizing stage 2)
In a typical distributed training setup (for example, using torch.distributed), all ranks must initialize a process group. The process group creates a communication layer, allowing all processes (or ranks, that is, individual training processes) to be aware of each other and exchange information. A TCPStore is often used as a rendezvous point where all ranks check in to discover each other’s connection information. When thousands of ranks try to contact a designated root server (typically rank 0) simultaneously, the resulting flood of network requests can cause congestion, add tens of minutes of latency, and slow the rest of initialization.
Checkpointless training eliminates this centralized dependency. Instead of funneling all connection requests through a single root server, the system uses a symmetric address pattern where each rank independently computes peer connection information using a global group counter. Ranks connect directly to each other using predetermined port assignments, avoiding the TCPStore bottleneck. Process group initialization drops from tens of minutes to seconds, even on clusters with thousands of nodes. The system also eliminates the single-point-of-failure risk inherent in root-based initialization.
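The following is a minimal sketch of the symmetric-addressing idea, not the actual NCCL/Gloo implementation: every rank derives each peer's endpoint from information it already holds (host list, ranks per host, and a global group counter), so no central TCPStore rendezvous is required. The function name, base port, and topology below are illustrative assumptions.

BASE_PORT = 29500  # assumed starting port, for illustration only

def peer_endpoint(peer_rank, group_counter, hosts, procs_per_host=8):
    """Compute a peer's (host, port) deterministically, without contacting a root server."""
    host = hosts[peer_rank // procs_per_host]
    # Offset by the group counter so a re-created process group (for example,
    # after an in-process restart) does not collide with the previous one.
    port = BASE_PORT + group_counter * procs_per_host + (peer_rank % procs_per_host)
    return host, port

hosts = ["node-0", "node-1"]  # assumed two-node topology
print(peer_endpoint(peer_rank=9, group_counter=0, hosts=hosts))  # ('node-1', 29501)
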
Component 2: Memory-mapped data loading (optimizing stage 4)
One of the hidden costs in traditional recovery is reloading training data. When a process restarts, it must reload batches from disk, rebuild data loader state, and carefully position itself to avoid processing duplicate samples or skipping data. On large-scale training runs, this data loading can add minutes to every recovery cycle.
Checkpointless training uses memory-mapped data loading to maintain cached data across accelerators. Training data is mapped into shared memory regions that persist even when individual processes fail. When a node recovers, it doesn’t reload data from disk but reconnects to the existing memory-mapped cache. The data loader state is preserved, helping to ensure that training continues from the correct position without duplicate or skipped samples. MMAP also reduces host CPU memory usage by maintaining only one copy of data per node (compared to eight copies with traditional data loaders on 8-GPU nodes), and training can resume immediately using cached batches while the data loader concurrently prefetches the next data in the background.
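The sketch below illustrates the memory-mapped cache concept using NumPy memory maps on a shared-memory path; it is not the MMAPDataModule implementation, and the cache path, dtype, and shapes are assumptions.

import numpy as np

CACHE_PATH = "/dev/shm/train_cache.npy"   # assumed node-local shared-memory location
NUM_SAMPLES, SEQ_LEN = 4096, 2048         # assumed token-ID cache shape

def create_cache():
    # The data-loading rank writes batches once into a file-backed mapping.
    return np.lib.format.open_memmap(
        CACHE_PATH, mode="w+", dtype=np.int32, shape=(NUM_SAMPLES, SEQ_LEN)
    )

def attach_cache():
    # A recovering process re-maps the existing file: nothing is re-read from
    # durable storage, and only one physical copy exists per node.
    return np.load(CACHE_PATH, mmap_mode="r")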

Memory-mapped data loading workflow

Component 3: In-process recovery (optimizing stage 1, 2, and 5)
Traditional checkpoint-based recovery treats failures as job-level events: a single GPU error triggers termination of the entire distributed training job. Every process across the cluster must be killed and restarted, even though only one component failed.
Checkpointless training uses in-process recovery to isolate failures at the process level. When a GPU or process fails, only the failed process executes an in-process recovery to rejoin the training loop within seconds, overcoming recoverable or transient errors. Healthy processes continue running without interruption. The failed process stays alive (avoiding full process teardown), preserving the CUDA context, compiler cache, and GPU state, hence eliminating minutes of reinitialization overhead. In cases where the error is non-recoverable (such as hardware failure), the system automatically swaps the faulty component with a pre-warmed hot spare, enabling training to continue without disruptions.
This eliminates the need for full cluster termination and restart, dramatically reducing recovery overhead.
Component 4: Peer-to-peer state replication (optimizing stage 3 and 6)
Checkpoint-based recovery requires loading model and optimizer state from persistent storage (such as Amazon S3 or FSx for Lustre). For models with billions to trillions of parameters, this means transferring tens to hundreds of gigabytes over the network, deserializing state dictionaries, and reconstructing optimizer buffers, which can take tens of minutes and creates a massive I/O bottleneck.
The most critical innovation in checkpointless training is continuous peer-to-peer state replication. Instead of periodically saving model state to centralized storage, each GPU maintains redundant copies of its model shards on peer GPUs. When a failure occurs, the recovering process doesn’t load from Amazon S3. It copies state directly from a healthy peer over the high-speed Elastic Fabric Adapter (EFA) network interconnect. This peer-to-peer architecture eliminates the I/O bottleneck that dominates traditional checkpoint recovery. State transfer happens in seconds, compared to minutes for loading multi-gigabyte checkpoints from storage. The recovering node pulls only the specific shards it needs, further reducing transfer time.
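The snippet below sketches the peer-to-peer recovery idea with plain torch.distributed point-to-point calls. Treat it as an illustration of the data flow under assumed shard layouts, not the production protocol used by checkpointless training.

import torch
import torch.distributed as dist

def send_shards_to_peer(shards, recovering_rank):
    # A healthy replica streams the redundant copies of the shards it holds.
    for name in sorted(shards):
        dist.send(shards[name].contiguous(), dst=recovering_rank)

def recover_shards_from_peer(template, healthy_rank):
    # The recovering rank receives only the shards it needs, over the EFA-backed
    # communication backend, instead of reading a checkpoint from S3 or FSx.
    recovered = {}
    for name in sorted(template):
        buf = torch.empty_like(template[name])
        dist.recv(buf, src=healthy_rank)
        recovered[name] = buf
    return recovered
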
Component 5: SageMaker HyperPod training operator (optimizing all stages)
The SageMaker HyperPod training operator orchestrates the checkpointless training components, serving as the coordination layer that ties together initialization, data loading, checkpointless recovery, and checkpoint fallback mechanisms. It maintains a centralized control plane with a global view of training process health across the entire cluster, coordinating fault detection, recovery decisions, and cluster-wide synchronization.
The operator implements intelligent recovery escalation: it first attempts in-process restart for failed components, and if that’s not feasible (for example, because of container crashes or node failures), it escalates to process-level recovery. During a process-level recovery, instead of restarting the entire job when failures occur, the operator restarts only training processes, keeping the containers alive. As a result, the recovery times are faster than a job-level restart, which requires tearing down and recreating the training infrastructure, involving pod rescheduling, container pulls, environment initialization, and re-loading from checkpoints. When failures occur, the operator broadcasts coordinated stop signals to prevent cascading timeouts and integrates with the SageMaker HyperPod health-monitoring agent to automatically detect hardware issues and trigger recovery without manual intervention.
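The escalation logic can be pictured as a simple decision ladder. The sketch below is an assumed simplification of that behavior; the error attributes and return labels are hypothetical and do not reflect the operator's actual API.

def choose_recovery_action(error):
    if error.is_transient:               # e.g., a recoverable CUDA/NCCL error
        return "in-process restart"      # failed process rejoins; CUDA context preserved
    if error.container_alive:            # process crashed but the pod is still healthy
        return "process-level restart"   # restart training processes, keep containers
    if error.node_faulty:                # unrecoverable hardware fault
        return "swap in pre-warmed spare, then process-level restart"
    return "job-level restart"           # last resort: fall back to checkpoint recovery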

Getting started with checkpointless training
This section guides you through setting up and configuring checkpointless training on SageMaker HyperPod to reduce fault recovery from hours to minutes.
Prerequisites
Before integrating checkpointless training into your training workload, verify that your environment meets the following requirements:
Infrastructure requirements:

Amazon SageMaker HyperPod cluster orchestrated by Amazon Elastic Kubernetes Service (Amazon EKS)
HyperPod training operator v1.2 or later installed on the cluster
Recommended instance types: ml.p5.48xlarge, ml.p5e.48xlarge, ml.p5en.48xlarge, ml.p6-b200.48xlarge, or ml.p6e-gb200.36xlarge
Minimum cluster size: Two nodes for peer-to-peer checkpointless recovery

Software requirements:

Supported frameworks: NeMo, PyTorch, PyTorch Lightning
Training data formats: JSON, JSONGZ (compressed JSON), or ARROW
Amazon Elastic Container Registry (Amazon ECR) repository for container images. Use the HyperPod checkpointless training container—required for rootless NCCL initialization (Tier 1) and peer-to-peer checkpointless recovery (Tier 4)

658645717510.dkr.ecr.<region>.amazonaws.com/sagemaker-hyperpod/pytorch-training:2.3.0-checkpointless

Checkpointless training workflow
Checkpointless training is designed for incremental adoption. You can start with basic capabilities and progressively enable advanced features as your training scales. The integration is organized into four tiers, each building on the previous one:
Tier 1: NCCL initialization optimization
NCCL initialization optimization eliminates the centralized root process bottleneck during initialization. Nodes discover and connect to peers independently using infrastructure signals. This enables faster process group initialization (seconds instead of minutes) and elimination of single-point-of-failure during startup.
Integration steps: Enable an environment variable as part of the job specification and verify that the job runs with the checkpointless training container.

# kubernetes job spec
env:
  - name: HPCT_USE_CONN_DATA   # Enable rootless initialization
    value: "1"
  - name: TORCH_SKIP_TCPSTORE  # Enable TCPStore removal
    value: "1"

Tier 2: Memory-mapped data loading
Memory-mapped data loading keeps training data cached in shared memory across process restarts, eliminating data reload overhead during recovery. This enables instant data access during recovery: there is no need to reload or re-shuffle data when a process restarts.
Integration steps: Augment the existing data loader with a memory-mapped cache:

from hyperpod_checkpointless_training.dataloader.mmap_data_module import MMAPDataModule
from hyperpod_checkpointless_training.dataloader.config import CacheResumeMMAPConfig

base_data_module = MY_DATA_MODULE(...)  # Customer's own data module

mmap_config = CacheResumeMMAPConfig(
    cache_dir=self.cfg.mmap.cache_dir,
)

mmap_dm = MMAPDataModule(
    data_module=base_data_module,
    mmap_config=mmap_config,
)

Tier 3: In-process recovery
In-process recovery isolates failures to individual processes instead of requiring full job restarts. Failed processes recover independently while healthy processes continue training. It enables sub-minute recovery from process-level failures. Healthy processes stay alive, while failed processes recover independently.
Integration steps:

from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck
from hyperpod_checkpointless_training.inprocess.wrap import HPCallWrapper, HPWrapper
from hyperpod_checkpointless_training.inprocess.train_utils import HPAgentK8sAPIFactory

@HPWrapper(
    health_check=CudaHealthCheck(),
    hp_api_factory=HPAgentK8sAPIFactory(),
    abort_timeout=60.0,
)
def re_executable_codeblock():
    # The re-executable code block defined by the user, usually the main function or training loop
    ...

Tier 4: Checkpointless (peer-to-peer recovery) (NeMo integration)
Checkpointless recovery enables complete peer-to-peer state replication and recovery. Failed processes recover model and optimizer state directly from healthy peers without loading from storage. This step enables elimination of checkpoint loading. Failed processes recover model and optimizer state from healthy replicas over the high-speed EFA interconnect.
Integration steps:

from typing import Optional

from hyperpod_checkpointless_training.inprocess.train_utils import wait_rank

wait_rank()

def main():
    @HPWrapper(
        health_check=CudaHealthCheck(),
        hp_api_factory=HPAgentK8sAPIFactory(),
        abort_timeout=60.0,
        checkpoint_manager=PEFTCheckpointManager(enable_offload=True),
        abort=CheckpointlessAbortManager.get_default_checkpointless_abort(),
        finalize=CheckpointlessFinalizeCleanup(),
    )
    def run_main(cfg, caller: Optional[HPCallWrapper] = None):
        trainer = Trainer(
            strategy=CheckpointlessMegatronStrategy(
                ...,
                num_distributed_optimizer_instances=2,
            ),
            callbacks=[..., CheckpointlessCallback(...)],
        )
        trainer.fresume = resume
        trainer._checkpoint_connector = CheckpointlessCompatibleConnector(trainer)
        trainer.wrapper = caller

wait_rank: All ranks wait for the rank information from the HyperPod training operator infrastructure.
HPWrapper: Python function wrapper that enables restart capabilities for a restart code block (RCB). The implementation uses a context manager instead of a Python decorator because the call wrapper lacks information about the number of RCBs it should monitor.
CudaHealthCheck: Helps ensure that the CUDA context for the current process is in a healthy state. It synchronizes with the GPU and uses the device corresponding to LOCAL_RANK environment variable, or the main thread’s default CUDA device if LOCAL_RANK was not specified in the environment.
HPAgentK8sAPIFactory: This is the API that checkpointless training will use to understand the training status from the other pods in a K8s training cluster. It also provides an infrastructure-level barrier, which makes sure every rank can successfully perform the abort and restart.
CheckpointManager: Manages in-memory checkpoints and peer-to-peer recovery for checkpointless fault tolerance.
We recommend starting with Tier 1 and validating it in your environment. Add Tier 2 when data loading overhead becomes a bottleneck. Adopt Tier 3 and Tier 4 for maximum resilience on the largest training clusters.
For NeMo users and HyperPod recipe users, Tier 4 is available out-of-the-box with minimal configuration changes for Llama and GPT open source recipes. NeMo examples for Llama and GPT open source models can be found in SageMaker HyperPod checkpointless training.
Performance results
Checkpointless training has been validated at production scale across multiple cluster configurations. The latest Amazon Nova models were trained using this technology on tens of thousands of AI accelerators.
In this section, we demonstrate results from extensive testing across a range of cluster sizes, spanning 16 GPUs to 2,304 GPUs. Checkpointless training demonstrated significant improvements in recovery time, consistently reducing downtime by 80–93% compared to traditional checkpoint-based recovery.

| Cluster (H100s) | Model | Traditional recovery | Checkpointless recovery | Improvement |
|---|---|---|---|---|
| 2,304 GPUs | Internal model | 15–30 minutes | Less than 2 minutes | ~87–93% faster |
| 256 GPUs | Llama-3 70B (pre-training) | 4 min, 52 sec | 47 seconds | ~84% faster |
| 16 GPUs | Llama-3 70B (fine-tuning) | 5 min, 10 sec | 50 seconds | ~84% faster |

These recovery time improvements have a direct relationship to ML goodput, defined as the percentage of time your cluster spends making forward progress on training rather than sitting idle during failures. As clusters scale to thousands of nodes, failure frequency increases proportionally. At the same time, traditional checkpoint-based recovery times also increase with cluster size due to growing coordination overhead. This creates a compounding problem: more frequent failures combined with longer recovery times rapidly erode goodput at scale.
Checkpointless training makes optimizations across the entire recovery stack, enabling more than 95% goodput even on clusters with thousands of AI accelerators. Based on our internal studies, we consistently observed goodput upwards of 95% across massive-scale deployments that exceeded 2,300 GPUs.
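A simple way to reason about these numbers is to model goodput as productive time divided by total time. The sketch below is a rough approximation for intuition only, with an assumed mean time between failures; it is not the measurement methodology behind the figures above.

def estimated_goodput(mtbf_hours, recovery_minutes, lost_work_minutes):
    """Approximate goodput as productive time over total time per failure cycle."""
    overhead_hours = (recovery_minutes + lost_work_minutes) / 60
    return mtbf_hours / (mtbf_hours + overhead_hours)

# Assumed example: one failure every 6 hours on a large cluster.
print(estimated_goodput(6, recovery_minutes=25, lost_work_minutes=10))  # ~0.91 (checkpoint-based)
print(estimated_goodput(6, recovery_minutes=2, lost_work_minutes=0))    # ~0.99 (checkpointless)
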
We also verified that model training accuracy is not impacted by checkpointless training. Specifically, we compared traditional checkpoint-based training and checkpointless training using checksum matching and verified a bit-wise match on the training loss at every training step. The following plot shows the training loss for a Llama-3 70B pre-training workload on 32 ml.p5.48xlarge instances for both traditional checkpointing and checkpointless training.

Conclusion
Foundation model training has reached an inflection point. As clusters scale to thousands of AI accelerators and training runs extend to months, the traditional checkpoint-based recovery paradigm is increasingly becoming a bottleneck. A single GPU failure that previously would have caused minutes of downtime now triggers tens of minutes of cluster-wide idle time on thousands of AI accelerators, with cumulative costs reaching millions of dollars annually.
Checkpointless training rethinks this paradigm entirely by treating failures as local, recoverable events rather than cluster-wide catastrophes. Failed processes recover state from healthy peers in seconds, enabling the rest of the cluster to continue making forward progress. The shift is fundamental: from How do we restart quickly? to How do we avoid stopping at all?
This technology has enabled more than 95% goodput when training on SageMaker HyperPod. Our internal studies on 2,304 GPUs show recovery times dropped from 15–30 minutes to under 90 seconds, translating to over 80% reduction in idle GPU time per failure.
To get started, explore What is Amazon SageMaker AI?. Sample implementations and recipes are available in the AWS GitHub HyperPod checkpointless training and SageMaker HyperPod recipes repositories.

About the Authors
Anirudh Viswanathan is a Senior Product Manager, Technical, at AWS with the SageMaker team, where he focuses on Machine Learning. He holds a Master’s in Robotics from Carnegie Mellon University and an MBA from the Wharton School of Business. Anirudh is a named inventor on more than 50 AI/ML patents. He enjoys long-distance running, exploring art galleries, and attending Broadway shows. You can connect with Anirudh on LinkedIn.
Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises, train and deploy foundation models efficiently on AWS. He has a background in microprocessor engineering and is passionate about computational optimization problems and improving the performance of AI workloads. You can connect with Roy on LinkedIn.
Fei Wu is a Senior Software Developer at AWS on the SageMaker team. Fei’s focus is on ML systems and distributed training techniques. He holds a PhD in Electrical Engineering from Stony Brook University. Outside of work, Fei enjoys playing basketball and watching movies. You can connect with Fei on LinkedIn.
Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. At AWS, Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
Anirban Roy is a Principal Engineer at AWS with the SageMaker team, primarily focusing on AI training infrastructure, resiliency, and observability. He holds a Master’s in Computer Science from the Indian Statistical Institute in Kolkata. Anirban is a seasoned distributed software system builder with more than 20 years of experience and multiple patents and publications. He enjoys road biking, reading non-fiction, gardening, and nature traveling. You can connect with Anirban on LinkedIn.
Arun Nagarajan is a Principal Engineer on the Amazon SageMaker AI team, where he currently focuses on distributed training across the entire stack. Since joining the SageMaker team during its launch year, Arun has contributed to multiple products within SageMaker AI, including real-time inference and MLOps solutions. When he’s not working on machine learning infrastructure, he enjoys exploring the outdoors in the Pacific Northwest and hitting the slopes for skiing.

Adaptive infrastructure for foundation model training with elastic tra …

Modern AI infrastructure serves multiple concurrent workloads on the same cluster, from foundation model (FM) pre-training and fine-tuning to production inference and evaluation. In this shared environment, the demand for AI accelerators fluctuates continuously as inference workloads scale with traffic patterns and experiments complete and release resources. Despite this dynamic availability of AI accelerators, traditional training workloads remain locked into their initial compute allocation, unable to take advantage of idle compute capacity without manual intervention.
Amazon SageMaker HyperPod now supports elastic training, enabling your machine learning (ML) workloads to automatically scale based on resource availability. In this post, we demonstrate how elastic training helps you maximize GPU utilization, reduce costs, and accelerate model development through dynamic resource adaptation, while maintaining training quality and minimizing manual intervention.
How static allocation impacts infrastructure utilization
Consider a 256-GPU cluster running both training and inference workloads. During off-peak hours at night, inference may release 96 GPUs, leaving them idle and available to speed up training. Traditional training jobs run at a fixed scale; such jobs can’t absorb idle compute capacity. As a result, a single training job that starts with 32 GPUs stays locked at that initial configuration while the 96 additional GPUs remain idle; this translates to 2,304 wasted GPU-hours per day, representing thousands of dollars spent daily on underutilized infrastructure. The problem compounds as the cluster size scales.
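The idle-capacity arithmetic from this example can be reproduced directly. The per-GPU-hour price below is an assumption (one eighth of an 8-GPU instance price) used only to show the order of magnitude.

idle_gpus = 96
wasted_gpu_hours_per_day = idle_gpus * 24                      # 2,304 GPU-hours per day
assumed_price_per_gpu_hour = 55.0 / 8                          # assumed: 1/8 of a $55/hour 8-GPU instance
print(wasted_gpu_hours_per_day)                                # 2304
print(wasted_gpu_hours_per_day * assumed_price_per_gpu_hour)   # ~ $15,840 of idle capacity per day
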
Scaling distributed training dynamically is technically complex. Even with infrastructure that supports elasticity, you need to halt jobs, reconfigure resources, adjust parallelization, and reshard checkpoints. This complexity is compounded by the need to maintain training progress and model accuracy throughout these transitions. Despite underlying support from SageMaker HyperPod with Amazon EKS and frameworks like PyTorch and NeMo, manual intervention can still consume hours of ML engineering time. The need to repeatedly adjust training runs based on accelerator availability distracts teams from their actual work in developing models.
Resource sharing and workload preemption add another layer of complexity. Current systems lack the ability to gracefully handle partial resource requests from higher-priority workloads. Consider a scenario where a critical fine-tuning job requires 8 GPUs from a cluster where a pre-training workload occupies all 32 GPUs. Today’s systems force a binary choice: either stop the entire pre-training job or deny resources to the higher-priority workload, even though 24 GPUs would suffice for continued pre-training at reduced scale. This limitation leads organizations to over-provision infrastructure to avoid resource contention, resulting in larger queues of pending jobs, increased costs, and reduced cluster efficiency.
Solution overview
SageMaker HyperPod now offers elastic training. Training workloads can automatically scale up to utilize available accelerators and gracefully contract when resources are needed elsewhere, all while maintaining training quality. SageMaker HyperPod manages the complex orchestration of checkpoint management, rank reassignment, and process coordination, minimizing manual intervention and helping teams focus on model development rather than infrastructure management.
The SageMaker HyperPod training operator integrates with the Kubernetes control plane and resource scheduler to make scaling decisions. It monitors pod lifecycle events, node availability, and scheduler priority signals. This lets it detect scaling opportunities almost instantly, whether from newly available resources or new requests from higher-priority workloads. The operator evaluates potential scaling actions against configured policies (minimum and maximum node boundaries, scaling frequency limits) before initiating any transition.

Elastic Training Scaling Event Workflow
Elastic training adds or removes data parallel replicas while keeping the global batch size constant. When resources become available, new replicas join and speed up throughput without affecting convergence. When a higher-priority workload needs resources, the system removes replicas instead of killing the entire job. Training continues at reduced capacity.
When a scaling event occurs, the operator broadcasts a synchronization signal to all ranks. Each process completes its current step and saves state using PyTorch Distributed Checkpoint (DCP). As new replicas join or existing replicas depart, the operator recalculates rank assignments and initiates process restarts across the training job. DCP then loads and redistributes the checkpoint data to match the new replica count, making sure each worker has the correct model and optimizer state. Training resumes with adjusted replicas, and the constant global batch size makes sure convergence remains unaffected.
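Keeping the global batch size constant while the number of data-parallel replicas changes usually means rederiving the gradient-accumulation steps per replica. The helper below is an assumed illustration of that bookkeeping, not a HyperPod or NeMo API.

def data_parallel_config(global_batch_size, micro_batch_size, dp_replicas):
    # The product micro_batch * replicas * accumulation_steps must equal the global batch size.
    assert global_batch_size % (micro_batch_size * dp_replicas) == 0, \
        "global batch size must be divisible by micro batch size times replica count"
    grad_accum_steps = global_batch_size // (micro_batch_size * dp_replicas)
    return {"dp_replicas": dp_replicas, "grad_accum_steps": grad_accum_steps}

print(data_parallel_config(512, micro_batch_size=4, dp_replicas=16))  # {'dp_replicas': 16, 'grad_accum_steps': 8}
print(data_parallel_config(512, micro_batch_size=4, dp_replicas=32))  # {'dp_replicas': 32, 'grad_accum_steps': 4}
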
For clusters using Kueue (including SageMaker HyperPod task governance), elastic training implements intelligent workload management through multiple admission requests. The operator first requests minimum required resources with high priority, then incrementally requests additional capacity with lower priority. This approach enables partial preemption: when higher-priority workloads need resources, only the lower-priority replicas are revoked, allowing training to continue on the guaranteed baseline rather than terminating completely.

Getting started with elastic training
In the following sections, we guide you through setting up and configuring elastic training on SageMaker HyperPod.
Prerequisites
Before integrating elastic training in your training workload, ensure your environment meets the following requirements:

SageMaker HyperPod cluster orchestrated by Amazon EKS with Kubernetes v1.32 and above. For information on creating a SageMaker HyperPod EKS cluster, see Creating a SageMaker HyperPod cluster with Amazon EKS orchestration.
HyperPod training operator v1.2 and above installed on the cluster.
SageMaker HyperPod task governance v1.3.1 and above for job queuing, prioritization, and scheduling.

Configure namespace isolation and resource controls
If you use cluster auto scaling (like Karpenter), set namespace-level ResourceQuotas. Without them, elastic training’s resource requests can trigger unlimited node provisioning. ResourceQuotas limit the maximum resources that jobs can request while still allowing elastic behavior within defined boundaries.
The following code is an example ResourceQuota for a namespace limited to 8 ml.p5.48xlarge instances (each instance has 8 NVIDIA H100 GPUs, 192 vCPUs, and 640 GiB memory, so 8 instances = 64 GPUs, 1,536 vCPUs, and 5,120 GiB memory):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: training-quota
  namespace: team-ml
spec:
  hard:
    nvidia.com/gpu: "64"
    vpc.amazonaws.com/efa: "256"
    requests.cpu: "1536"
    requests.memory: "5120Gi"
    limits.cpu: "1536"
    limits.memory: "5120Gi"

We recommend organizing workloads into separate namespaces per team or project, with AWS Identity and Access Management (IAM) role-based access control (RBAC) mappings to support proper access control and resource isolation.
Build HyperPod training container
The HyperPod training operator uses a custom PyTorch launcher from the HyperPod Elastic Agent Python package to detect scaling events, coordinate checkpoint operations, and manage the rendezvous process when the world size changes. Install the elastic agent, then replace torchrun with hyperpodrun in your launch command. For more details, see HyperPod elastic agent.
The following code is an example training container configuration:

FROM <YOUR-BASE-IMAGE>

RUN pip install hyperpod-elastic-agent  # install hyperpod-elastic-agent
ENTRYPOINT ["entrypoint.sh"]

# entrypoint.sh ...
hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
  --rdzv-backend hyperpod

Enable elastic scaling in training code
Complete the following steps to enable elastic scaling in your training code:

Add the HyperPod elastic agent import to your training script to detect when scaling events occur:

from hyperpod_elastic_agent.elastic_event_handler import elastic_event_detected

Modify your training loop to check for elastic events after each training batch. When a scaling event is detected, your training process needs to save a checkpoint and exit gracefully, allowing the operator to restart the job with a new world size:

def train_epoch(model, dataloader, optimizer, args):
    for batch_idx, batch_data in enumerate(dataloader):
        # Forward and backward pass
        loss = model(batch_data).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Check if we should checkpoint (periodic or scaling event)
        should_checkpoint = (batch_idx + 1) % args.checkpoint_freq == 0
        elastic_event = elastic_event_detected()  # Returns True when scaling is needed

        # Save checkpoint if scaling the job up or down
        if should_checkpoint or elastic_event:
            save_checkpoint(model, optimizer, scheduler,
                            checkpoint_dir=args.checkpoint_dir,
                            step=global_step)

        if elastic_event:
            # Exit gracefully - operator will restart with new world size
            print("Elastic scaling event detected. Checkpoint saved.")
            return

The key pattern here is checking for elastic_event_detected() during your training loop and returning from the training function after saving a checkpoint. This allows the training operator to coordinate the scaling transition across all workers.

Finally, implement checkpoint save and load functions using PyTorch DCP. DCP is essential for elastic training because it automatically reshards model and optimizer states when your job resumes with a different number of replicas:

import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

def save_checkpoint(model, optimizer, lr_scheduler, user_content, checkpoint_path):
    """Save checkpoint using DCP for elastic training."""
    state_dict = {
        "model": model,
        "optimizer": optimizer,
        "lr_scheduler": lr_scheduler,
        **user_content
    }

    dcp.save(
        state_dict=state_dict,
        storage_writer=dcp.FileSystemWriter(checkpoint_path)
    )

def load_checkpoint(model, optimizer, lr_scheduler, checkpoint_path):
    """Load checkpoint using DCP with automatic resharding."""
    state_dict = {
        "model": model,
        "optimizer": optimizer,
        "lr_scheduler": lr_scheduler
    }

    dcp.load(
        state_dict=state_dict,
        storage_reader=dcp.FileSystemReader(checkpoint_path)
    )

    return model, optimizer, lr_scheduler

For single-epoch training scenarios where each data sample must be seen exactly once, you must persist your dataloader state across scaling events. Without this, when your job resumes with a different world size, previously processed samples may be repeated or skipped, affecting training quality. A stateful dataloader saves and restores the dataloader’s position during checkpointing, making sure training continues from the exact point where it stopped. For implementation details, refer to the stateful dataloader guide in the documentation.
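As one illustration of this pattern, torchdata's StatefulDataLoader exposes state_dict and load_state_dict methods whose output can be persisted next to the DCP checkpoint. This is a sketch of the save/restore mechanics only; refer to the stateful dataloader guide mentioned above for the supported integration, and note that re-mapping saved positions onto a changed world size is handled there, not here.

import torch
from torchdata.stateful_dataloader import StatefulDataLoader

def save_data_state(dataloader, checkpoint_dir, rank):
    # Record the sampler/iterator position for this data-loading rank.
    torch.save(dataloader.state_dict(), f"{checkpoint_dir}/dataloader_rank{rank}.pt")

def load_data_state(dataloader, checkpoint_dir, rank):
    # Restore the position so previously seen samples are neither repeated nor skipped.
    state = torch.load(f"{checkpoint_dir}/dataloader_rank{rank}.pt", weights_only=False)
    dataloader.load_state_dict(state)
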
Submit elastic training job
With your training container built and code instrumented, you’re ready to submit an elastic training job. The job specification defines how your training workload scales in response to cluster resource availability through the elasticPolicy configuration.
Create a HyperPodPyTorchJob specification that defines your elastic scaling behavior using the following code:

apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
name: elastic-training-job
spec:
elasticPolicy:
minReplicas: 2 # Minimum replicas to keep training running
maxReplicas: 8 # Maximum replicas for scale-up
replicaIncrementStep: 2 # Scale in fixed increments of 2 nodes
# Alternative: use replicaDiscreteValues: [2, 4, 8] for specific scale points
gracefulShutdownTimeoutInSeconds: 600 # Time allowed for checkpoint save
scalingTimeoutInSeconds: 60 # Delay before initiating scale-up
faultyScaleDownTimeoutInSeconds: 30 # Wait time before scaling down on failures
replicaSpecs:
– name: worker
replicas: 2 # Initial replica count
maxReplicas: 8 # Must match elasticPolicy.maxReplicas
template:
spec:
containers:
– name: pytorch
image: <your-training-container>
command: [“hyperpodrun”]
args:
– “–nnodes=2”
– “–nproc-per-node=8”
– “–rdzv-backend=hyperpod”
– “train.py”
resources:
requests:
nvidia.com/gpu: 8
vpc.amazonaws.com/efa: 32
limits:
nvidia.com/gpu: 8
vpc.amazonaws.com/efa: 32

The elasticPolicy configuration controls how your training job responds to resource changes:

minReplicas and maxReplicas: These define the scaling boundaries. Your job will always maintain at least minReplicas and never exceed maxReplicas, maintaining predictable resource usage.
replicaIncrementStep vs. replicaDiscreteValues: Choose one approach for scaling granularity. Use replicaIncrementStep for uniform scaling (for example, a step of 2 means scaling to 2, 4, 6, 8 nodes). Use replicaDiscreteValues: [2, 4, 8] to specify exact allowed configurations. This is useful when certain world sizes work better for your model’s parallelization strategy.
gracefulShutdownTimeoutInSeconds: This gives your training process time to complete checkpointing before the operator forces a shutdown. Set this based on your checkpoint size and storage performance.
scalingTimeoutInSeconds: This introduces a stabilization delay before scale-up to prevent thrashing when resources fluctuate rapidly. The operator waits this duration after detecting available resources before triggering a scale-up event.
faultyScaleDownTimeoutInSeconds: When pods fail or crash, the operator waits this duration for recovery before scaling down. This prevents unnecessary scale-downs due to transient failures.

Elastic training incorporates anti-thrashing mechanisms to maintain stability in environments with rapidly fluctuating resource availability. These protections include enforced minimum stability periods between scaling events and an exponential backoff strategy for frequent transitions. By preventing excessive fluctuations, the system makes sure training jobs can make meaningful progress at each scale point rather than being overwhelmed by frequent checkpoint operations. You can tune these anti-thrashing policies in the elastic policy configuration, balancing responsive scaling against training stability in a way that aligns with your specific cluster dynamics and workload requirements.
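To make the anti-thrashing behavior concrete, the following sketch shows one way a stability window with exponential backoff could work. The class, defaults, and reset rule are assumptions for illustration, not the operator's implementation or configuration surface.

class ScalingGovernor:
    """Suppress scaling events during a stability window that widens under churn."""

    def __init__(self, min_stability_s=300.0, backoff_factor=2.0, max_window_s=3600.0):
        self.min_window = min_stability_s
        self.max_window = max_window_s
        self.backoff_factor = backoff_factor
        self.window = min_stability_s
        self.last_scale_time = float("-inf")

    def allow_scaling(self, now):
        # A scaling event is permitted only after the current stability window has elapsed.
        return (now - self.last_scale_time) >= self.window

    def record_scaling(self, now):
        # Frequent transitions widen the window exponentially; calm periods reset it.
        recent = (now - self.last_scale_time) < 2 * self.window
        self.window = min(self.window * self.backoff_factor, self.max_window) if recent else self.min_window
        self.last_scale_time = now
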
You can then submit the job using kubectl or the SageMaker HyperPod CLI, as covered in the documentation:

kubectl apply -f elastic-job.yaml

Using SageMaker HyperPod recipes
We have created SageMaker HyperPod recipes for elastic training for publicly available FMs, including Llama and GPT-OSS. These recipes provide pre-validated configurations that handle parallelization strategy, hyperparameter adjustments, and checkpoint management automatically, requiring only YAML configuration changes to specify the elastic policy with no code modifications. Teams simply specify minimum and maximum node boundaries in their job specification, and the system manages all scaling coordination as cluster resources fluctuate.

# Enable elastic training in an existing recipe
python launcher.py
recipes=llama/llama3_1_8b_sft
recipes.elastic_policy.is_elastic=true
recipes.elastic_policy.min_nodes=2
recipes.elastic_policy.max_nodes=8

Recipes also support scale-specific configurations through the scale_config field, so you can define different hyperparameters (batch size, learning rate) for each world size. This is particularly useful when scaling requires adjusting batch distribution or enabling uneven batch sizes. For detailed examples, see the SageMaker HyperPod Recipes repository.
Performance results
To demonstrate elastic training’s impact, we fine-tuned a Llama-3 70B model on the TAT-QA dataset using a SageMaker HyperPod cluster with up to 8 ml.p5.48xlarge instances. This benchmark illustrates how elastic training performs in practice when dynamically scaling in response to resource availability, simulating a realistic environment where training and inference workloads share cluster capacity.
We evaluated elastic training across two key dimensions: training throughput and model convergence during scaling transitions. We observed a consistent improvement in throughput across scaling configurations from 1 node to 8 nodes, as shown in the following figures. Training throughput improved from 2,000 tokens per second on 1 node to 14,000 tokens per second on 8 nodes. Throughout the training run, the loss continued to decrease as the model converged.

Training throughput with Elastic Training
Model convergence with Elastic Training
Integration with SageMaker HyperPod capabilities
Beyond its core scaling capabilities, elastic training takes advantage of the integration with the infrastructure capabilities of SageMaker HyperPod. Task governance policies automatically trigger scaling events when workload priorities shift, enabling training to yield resources to higher-priority inference or evaluation workloads. Support for SageMaker Training Plans allows training to opportunistically scale using cost-optimized capacity types while maintaining resilience through automatic scale-down when spot instances are reclaimed. The SageMaker HyperPod observability add-on complements these capabilities by providing detailed insights into scaling events, checkpoint performance, and training progression, helping teams monitor and optimize their elastic training deployments.
Conclusion
Elastic training on SageMaker HyperPod addresses the problem of wasted resources in AI clusters. Training jobs can now scale automatically as resources become available without requiring manual infrastructure adjustments. The technical architecture of elastic training maintains training quality throughout scaling transitions. By preserving the global batch size and learning rate across different data-parallel configurations, the system maintains consistent convergence properties regardless of the current scale.
You can expect three primary benefits. First, from an operational perspective, the reduction of manual reconfiguration cycles fundamentally changes how ML teams work. Engineers can focus on model innovation and development rather than infrastructure management, significantly improving team productivity and reducing operational overhead. Second, infrastructure efficiency sees dramatic improvements as training workloads dynamically consume available capacity, leading to substantial reductions in idle GPU hours and corresponding cost savings. Third, time-to-market accelerates considerably as training jobs automatically scale to utilize available resources, enabling faster model development and deployment cycles.
To get started, refer to the documentation guide. Sample implementations and recipes are available in the GitHub repository.

About the Authors
Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises, train and deploy foundation models efficiently on AWS. He has a background in microprocessor engineering and is passionate about computational optimization problems and improving the performance of AI workloads. You can connect with Roy on LinkedIn.
Anirudh Viswanathan is a Senior Product Manager, Technical, at AWS with the SageMaker team, where he focuses on Machine Learning. He holds a Master’s in Robotics from Carnegie Mellon University and an MBA from the Wharton School of Business. Anirudh is a named inventor on more than 50 AI/ML patents. He enjoys long-distance running, exploring art galleries, and attending Broadway shows. You can connect with Anirudh on LinkedIn.
Arun Kumar Lokanatha is a Senior ML Solutions Architect with Amazon SageMaker AI. He holds a Master’s degree from UIUC with a specialization in data science. He specializes in generative AI workloads, helping customers build and deploy LLMs using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Oleg Talalov is a Senior Software Development Engineer at AWS, working on the SageMaker HyperPod team, where he focuses on machine learning and high-performance computing infrastructure for ML training. He holds a Master’s degree from Peter the Great St. Petersburg Polytechnic University. Oleg is an inventor on multiple AI/ML technologies and enjoys cycling, swimming, and running. You can connect with Oleg on LinkedIn.
Qianlin Liang is a Software Development Engineer at AWS with the SageMaker team, where he focuses on AI systems. He holds a Ph.D. in Computer Science from the University of Massachusetts Amherst. His research develops system techniques for efficient and resilient machine learning. Outside of work, he enjoys running and photography. You can connect with Qianlin on LinkedIn.
Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. At AWS, Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
Anirban Roy is a Principal Engineer at AWS with the SageMaker team, primarily focusing on AI training infrastructure, resiliency, and observability. He holds a Master’s in Computer Science from the Indian Statistical Institute in Kolkata. Anirban is a seasoned distributed software system builder with more than 20 years of experience and multiple patents and publications. He enjoys road biking, reading non-fiction, gardening, and nature traveling. You can connect with Anirban on LinkedIn.
Arun Nagarajan is a Principal Engineer on the Amazon SageMaker AI team, where he currently focuses on distributed training across the entire stack. Since joining the SageMaker team during its launch year, Arun has contributed to multiple products within SageMaker AI, including real-time inference and MLOps solutions. When he’s not working on machine learning infrastructure, he enjoys exploring the outdoors in the Pacific Northwest and hitting the slopes for skiing.

Customize agent workflows with advanced orchestration techniques using …

Large Language Model (LLM) agents have revolutionized how we approach complex, multi-step tasks by combining the reasoning capabilities of foundation models with specialized tools and domain expertise. While single-agent systems using frameworks like ReAct work well for straightforward tasks, real-world challenges often require multiple specialized agents working in coordination. Think about planning a business trip: one agent is needed to research flights based on schedule constraints, another to find accommodations near meeting locations, and a third to coordinate ground transportation—each requiring different tools and domain knowledge. This multi-agent approach introduces a critical architectural challenge: orchestrating the flow of information between agents to ensure reliable, predictable outcomes. Without proper orchestration, agent interactions can become unpredictable, making systems difficult to debug, monitor, and scale in production environments. Agent orchestration addresses this challenge by defining explicit workflows that govern how agents communicate, when they execute, and how their outputs integrate into cohesive solutions. Rather than allowing agents to interact ad hoc, orchestration creates structured pathways that make reasoning transparent and information flow intentional.
Strands Agents is an open-source SDK designed specifically for building orchestrated artificial intelligence (AI) systems. It provides flexible agent abstractions, seamless tool integration, comprehensive observability, and orchestration components like GraphBuilder that enable developers to connect agents into directed workflows with precision and control.
In this post, we explore two powerful orchestration patterns implemented with Strands Agents. Using a common set of travel planning tools, we demonstrate how different orchestration strategies can solve the same problem through distinct reasoning approaches: ReWOO (Reasoning Without Observation), which separates planning, execution, and synthesis into discrete stages, and Reflexion, which implements iterative refinement through structured critique and improvement cycles. These examples will show you how Strands enables precise control over multi-agent workflows, resulting in more reliable, transparent, and maintainable AI systems.
Getting started with Strands Agents
Strands Agents is an open-source framework recently launched by AWS for building production-ready AI agents. It simplifies agent development by abstracting the agent loop into three core components:

Model Provider: The reasoning engine (like Claude on Amazon Bedrock)
System Prompt: Instructions that shape the agent’s role and constraints
Toolbelt: The set of APIs or functions the agent can call

This modular design lets users start with simple single-agent systems and scale up to sophisticated multi-agent architectures. Strands includes built-in support for async operations, session state management, and integrations with multiple providers including Amazon Bedrock, Anthropic, and Mistral. It also integrates seamlessly with AWS services like Lambda, Fargate, and AgentCore.
What makes Strands particularly powerful is its multi-agent orchestration capabilities. Users can compose agents in several ways: use one agent as a tool for another, pass control between agents through handoffs, or coordinate multiple agents working in parallel. The SDK’s GraphBuilder feature lets users connect agents into structured workflows, enabling them to collaborate on complex tasks in a controlled, predictable manner.
For production deployments, Strands provides enterprise-grade observability through OpenTelemetry integration. This provides distributed tracing across an entire agent system, making it easy to debug issues and monitor performance as users scale from prototypes to production workloads.
Fundamentals of Agent Orchestration with Strands
The ReAct pattern is the default approach for most AI agents today. It combines planning, tool invocation, and answer synthesis into a single agent loop. While this works for simple tasks, it creates problems for complex scenarios. The agent might call tools repeatedly without a clear strategy, mix evidence gathering with conclusions, or rush to an answer without verification. These issues become critical in applications requiring structured reasoning, compliance checks, or multi-step validation. This is where orchestration shines.
Instead of one agent doing everything, Strands enables the creation of specialized agents with distinct roles in solving the problem. For example, one agent might plan the approach, another executes the plan, and a third synthesizes the results. Users connect these agents in controlled workflows that match exact requirements. In Strands, orchestration patterns use a graph execution model. Think of it as a flowchart where:

Each node is a specialized agent
Edges define how information flows between agents
The structure makes reasoning steps visible and debuggable

Unlike ReAct’s hidden decision-making, graphs expose every step. Users can trace which agent produced what output, when it became available, and how the next agent used it. This transparency is crucial for building reliable systems. Strands provides four fundamental components for any orchestration pattern:

Nodes: Agents that encapsulate specific logic or expertise
Edges: Connections that define execution order and data flow
AgentResult: Standardized output format from each agent
GraphResult: Complete execution trace with timing, outputs, and paths taken

The GraphBuilder API lets users wire these components together to define which agents participate, how data flows between them, and where user input enters the system. At runtime, the graph executes deterministically and returns structured results.

Consider a document Q&A pipeline:

User Query → Retriever Agent → Summarizer Agent → Final Answer
builder = GraphBuilder()
builder.add_node(retriever, "retriever")
builder.add_node(summarizer, "summarizer")
builder.add_edge("retriever", "summarizer")
builder.set_entry_point("retriever")

The Retriever searches for relevant documents. The Summarizer condenses them. Each agent only sees the data it needs, when it needs it. The flow is explicit, predictable, and easy to debug. This same approach scales to complex patterns. Users can add branches for different reasoning paths, loops for iterative refinement, or parallel execution for exploring multiple strategies. The key is that control is maintained over how information flows through the system.
In the sections that follow, we implement our first pattern: ReWOO, which separates planning from execution to create more reliable agent workflows.
Dataset and default orchestration
Dataset details
We evaluated our system on the τ-Bench airline domain dataset (Yao et al., 2024), which features 300+ flight entries, 500 synthetic user profiles, 2,000+ pre-generated bookings, detailed airline policies, simulated APIs for reservation operations, and 50 structured real-world scenarios. This comprehensive benchmark provides a controlled yet realistic testbed for assessing how agents interpret policies, execute appropriate API calls, and maintain consistency across complex airline operations including upgrades, itinerary changes, and cancellations. While the original dataset presents each task as a multi-turn conversation, we’ve simplified them into single-turn queries for this tutorial to better showcase the orchestration patterns.
Architecture at a glance: Default orchestration with ReAct
ReAct (Reasoning + Acting) interleaves two phases inside a single agent loop. The agent reasons in natural language to decide the next step, invokes a tool if needed, observes the tool’s output, and continues reasoning with that observation until it can produce a final answer.
In Strands Agents, the ReAct baseline maps cleanly to a single Agent that owns the τ-Bench airline toolbelt – a list of airline tools (search flights, book/modify/cancel reservations, look up profiles, and so on). The tools are the Python functions provided in the τ-Bench dataset, converted to Strands tools using the @tool decorator.

tools = [
    book_reservation,
    calculate,
    cancel_reservation,
    get_reservation_details,
    get_user_details,
    list_all_airports,
    search_direct_flight,
    search_onestop_flight,
    send_certificate,
    think,
    transfer_to_human_agents,
    update_reservation_baggages,
    update_reservation_flights,
    update_reservation_passengers,
]

prompt = """
You are a helpful assistant for a travel website. Help the user answer any questions.

<instructions>
- Remember to check if the airport city is in the state mentioned by the user. For example, Houston is in Texas.
- Infer the U.S. state in which the airport city resides. For example, Houston is in Texas.
- You should not use made-up or placeholder arguments.
</instructions>

<policy>
{policy}
</policy>
"""

react_agent = Agent(model=model, tools=tools, system_prompt=prompt)
react_response = react_agent(user_query)

There is no explicit planner or critic; the policy that governs “think → act → observe → think …” lives inside the agent’s prompt and the model’s internal loop. This makes ReAct a natural baseline for tool-augmented systems because it requires minimal orchestration—one ‘tool-executor’ agent with a toolbelt—and it tends to be fast on simple tasks.
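For reference, the τ-Bench functions above are wrapped with the Strands @tool decorator. The following is a minimal, hypothetical stand-in for one such tool (the real implementation queries the benchmark’s flight database); only the decorator usage reflects the pattern described here.

from strands import tool

# Hypothetical, simplified stand-in for the τ-Bench airline function of the same name;
# the real tool looks up the benchmark's flight database instead of returning a stub.
@tool
def search_direct_flight(origin: str, destination: str, date: str) -> str:
    """Search for direct flights between two airports on a given date."""
    return f"No direct flights found from {origin} to {destination} on {date}."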

Architecture at a glance: ReWOO (Reasoning Without Observations)
ReWOO reframes “how tools are used” rather than “which tools exist.” We keep a single tool-executor for all airline APIs, but we enforce a plan → execute → synthesize separation around it. In Strands, this becomes a small, explicit graph where each node returns a typed result (AgentResult) and the runtime forwards those results downstream in a deterministic way. This leads to governance, observability, and repeatability.

Planner (plan only). Produces a strictly formatted plan.
Worker (execute only). Parses the plan, resolves arguments, calls tools, and accumulates evidence in a normalized structure. Decoupling execution from planning makes tool use predictable and policy-enforceable (the worker can only run what the plan authorizes).
Solver (synthesize only). Reads evidence—the results from the tools, not the tools directly—and composes the final answer. This keeps tool effects and decision-making auditable and avoids “hidden” follow-up calls in the last step.

Constructed with Strands’ GraphBuilder (nodes, edges, entry point), this becomes a deterministic DAG. The runtime hands each downstream node the original task plus the upstream node’s output—captured in AgentResult.

from strands.multiagent.graph import GraphBuilder

b = GraphBuilder()
b.add_node(planner_agent, "planner")
b.add_node(worker_agent, "worker")
b.add_node(solver_agent, "solver")
b.add_edge("planner", "worker")
b.add_edge("worker", "solver")
b.set_entry_point("planner")
graph = b.build()

Planner: plan-only agent with a strict grammar
The planner generates a declarative program describing tool usage, not an answer. The following are the important considerations when designing an effective planner prompt:

Enumerate the allowed set of tool names with arguments
Few-shot examples that demonstrate to the LLM how to plan an answer for a given user query.
Enforce the output shape. We used this:

Plan 1: <short intent>
#E1 = <tool_name>[key=value, …]

Plan 2: <short intent>
#E2 = <tool_name>[key=value, …]

#E4 = REPEAT(<analysis_or_count>) {
    <tool_a>[…]
    <tool_b>[…]
}

The plan is returned as an AgentResult. A strict plan is audit-ready and minimizes ambiguity. It also enables static checks (e.g., “only these tools are allowed; one per step”) before anything runs.
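To illustrate that kind of static check, here is a minimal sketch (not the system’s actual validator) that verifies each plan step references exactly one tool from an allowed set before execution; the regular expression and the tool subset shown are assumptions based on the grammar above.

import re

# Illustrative subset of the airline toolbelt; the real allowed set would list every tool.
ALLOWED_TOOLS = {
    "get_user_details", "list_all_airports", "search_direct_flight",
    "search_onestop_flight", "think", "book_reservation",
}

STEP_PATTERN = re.compile(r"#E(\d+)\s*=\s*(\w+)\[")

def validate_plan(plan_text: str) -> list[str]:
    """Return a list of violations; an empty list means the plan passes the static check."""
    violations = []
    for eid, tool_name in STEP_PATTERN.findall(plan_text):
        if tool_name not in ALLOWED_TOOLS and tool_name != "REPEAT":
            violations.append(f"#E{eid}: tool '{tool_name}' is not in the allowed set")
    return violations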
Example of a plan created by the planner agent for the following user query:
“My user id is mia_li_3668. I want to fly from New York to Seattle on May 20 (one way). I do not want to fly before 11am EST. I want to fly in economy. I prefer direct flights but one stopover is also fine. If there are multiple options, I prefer the one with the lowest price. I have 3 baggages. I do not want insurance. I want to use my two certificates to pay. If only one certificate can be used, I prefer using the larger one, and pay the rest with my 7447 card.”

DEBUG: PLANNER AGENT CALLED

Plan 1: Get user details to check available certificates
#E1 = get_user_details[user_id="mia_li_3668"]

Plan 2: Get list of airports to find the airport codes for New York and Seattle
#E2 = list_all_airports[]

Plan 3: Search for direct flights using airport codes from #E2 and date from given user question
#E3 = search_direct_flight[origin="JFK", destination="SEA", date="2024-05-20"]

Plan 4: If no suitable direct flights after 11am, search for one-stop flights
#E4 = search_onestop_flight[origin="JFK", destination="SEA", date="2024-05-20"]

Plan 5: Think about the flight selection, pricing, and payment options
#E5 = think["Analyze the flight options from #E3 and #E4:
- Filter flights departing after 11am EST
- Select cheapest suitable flight (direct preferred)
- Calculate baggage fees (3 bags total)
- Determine payment strategy using certificates from user profile
- Plan to use the largest certificate first
- Prepare to use 7447 card for remaining balance"]

Plan 6: Book the reservation with all the collected information
#E6 = book_reservation[user_id="mia_li_3668", origin="JFK", destination="SEA", flight_type="one_way", cabin="economy", flights=[selected_flight_from_E3_or_E4], passengers=[{"first_name":"Mia", "last_name":"Li"}], payment_methods=[largest_certificate, remaining_certificate_or_card_7447], total_baggages=3, nonfree_baggages=calculated_from_E5, insurance=false]

Worker: deterministic executor with argument and loop resolution
The worker executes only what the plan authorizes; argument resolution is data-driven. This makes behavior reproducible across runs and model versions. The worker treats the plan as an executable spec.

Unified plan parser: It parses both regular steps and REPEAT blocks, sorts by evidence ID, and executes them in order (a minimal parsing sketch follows this list).

Evidence ledger: Every step produces a structured record (#E{id} with description + results). Errors are captured as evidence instead of failing silently.

step_evidence[f'#E{eid}'] = {
    'evidence_id': f'#E{eid}',
    'description': f"Execute {tool} with {kwargs or 'no parameters'}",
    'results': result_text
}
all_evidence.update(step_evidence)

Context-aware dynamic argument resolution: Build a context from (a) the original task and (b) the previous N evidences. Fill placeholders (e.g., airport codes, reservation IDs) from that context—no brittle regex on raw strings. This can be done in a couple of different ways. One way is to use an LLM to infer the argument values from the built context. The second is to use regex matching over the built context to resolve argument values.
Dynamic tool dispatch with special cases: Tools are invoked directly using getattr.
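Here is a minimal sketch of the kind of parsing the unified plan parser performs under the grammar shown earlier; it is illustrative rather than the worker’s actual code and ignores REPEAT block structure for brevity.

import re

STEP_RE = re.compile(r"#E(?P<id>\d+)\s*=\s*(?P<tool>\w+)\[(?P<args>.*?)\]", re.DOTALL)

def parse_plan(plan_text: str) -> list[dict]:
    """Extract (#E id, tool, raw argument string) triples and sort them by evidence ID."""
    steps = []
    for match in STEP_RE.finditer(plan_text):
        steps.append({
            "evidence_id": int(match.group("id")),
            "tool": match.group("tool"),
            "raw_args": match.group("args").strip(),
        })
    # Execute in evidence-ID order so later steps can reference earlier evidence.
    return sorted(steps, key=lambda s: s["evidence_id"])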

Example of an executed step:

DEBUG: Processing step #E3

DEBUG: Tool name: search_direct_flight
DEBUG: Calling search_direct_flight with kwargs:
{
    'origin': 'JFK',
    'destination': 'SEA',
    'date': '2024-05-20'
}

DEBUG: Tool result:
{
    'toolUseId': 'tooluse_search_direct_flight_716684779',
    'status': 'success',
    'content': [
        {
            'text': '{"flights": [
                {
                    "flight_number": "HAT069",
                    "origin": "JFK",
                    "destination": "SEA",
                    "scheduled_departure_time_est": "06:00:00",
                    "scheduled_arrival_time_est": "12:00:00",
                    "status": "available",
                    "available_seats": {
                        "basic_economy": 17,
                        "economy": 12,
                        "business": 3
                    },
                    "prices": {
                        "basic_economy": 51,
                        "economy": 121,
                        "business": 239
                    },
                    "date": "2024-05-20"
                },
                {
                    "flight_number": "HAT083",
                    "origin": "JFK",
                    "destination": "SEA",
                    "scheduled_departure_time_est": "01:00:00",
                    "scheduled_arrival_time_est": "07:00:00",
                    "status": "available",
                    "available_seats": {
                        "basic_economy": 16,
                        "economy": 7,
                        "business": 3
                    },
                    "prices": {
                        "basic_economy": 87,
                        "economy": 100,
                        "business": 276
                    },
                    "date": "2024-05-20"
                }
            ]}'
        }
    ]
}

Solver: Builds the final answer and presents it to the user
The solver combines execution evidence from the worker with the original user query to generate the final response. It receives the structured evidence dictionary and synthesizes it into a natural language answer. The solver never calls tools. It does the following:

Evidence parsing – Reads the original task and the worker’s evidence from the worker agent node.
Plan reconstruction – Normalizes evidence into a compact, ordered “plan + evidence” text block.
Final answer generation – Uses an LLM with an appropriate prompt to produce the final answer, explicitly addressing constraints and trade-offs.

solve_prompt = """Solve the following task or problem.
To solve the problem, we have made step-by-step Plan and retrieved
corresponding Evidence to each Plan. Use them with caution since long
evidence might contain irrelevant information.

{plan}

Now solve the question or task according to provided Evidence above.
Respond with the answer directly with no extra words.

Task: {task}
Response:"""

Separating synthesis from execution yields clear decision logs and stable latency. It also makes it easy to swap synthesis prompts or models without touching planning/execution logic.
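A minimal sketch of how the worker’s evidence ledger might be normalized into the {plan} block consumed by the prompt above; the dictionary shape mirrors the evidence records shown earlier, while the helper itself is illustrative.

def build_plan_block(all_evidence: dict) -> str:
    """Normalize the worker's evidence ledger into an ordered 'plan + evidence' text block."""
    lines = []
    for eid in sorted(all_evidence, key=lambda k: int(k.lstrip("#E"))):
        record = all_evidence[eid]
        lines.append(f"Plan: {record['description']}")
        lines.append(f"Evidence {eid}: {record['results']}")
        lines.append("")
    return "\n".join(lines)

# Usage sketch: final_prompt = solve_prompt.format(plan=build_plan_block(all_evidence), task=user_query)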
Architecture at a glance: Reflexion (Self-Critiquing)
Reflexion is an orchestration pattern where an agent generates a candidate answer and a critique of that answer, then uses the critique to revise the answer in a bounded loop. The goal isn’t to “try again” blindly, but to target revisions based on explicit, machine-parsable feedback (e.g., violated constraints, missing checks, weak rationale). In other words, Reflexion turns model feedback into a structured control signal that governs one or more additional passes, stopping as soon as the answer meets the stated criteria.
Reflexion wraps the existing flight tool-executor with a deliberate draft → critique → (optional) revision loop. Instead of accepting the first output, the system generates a candidate answer, evaluates it against explicit criteria, and only revises when the critique says it should. The motivation is that this approach yields higher answer quality. The Reflexion graph has two nodes built with GraphBuilder.

Draft: Produces an initial answer and an initial reflection.
Revisor: Loops between improving the query, revising the answer, and reflecting on it.

Although the orchestration is modeled as a DAG, the revisor node encapsulates up to three revision cycles, invoking tools as needed. Each node returns an AgentResult; the runtime forwards the upstream result to the downstream node and records the full trace in GraphResult.
Draft: Generates initial answer and critique
The draft node uses the same airline tool-executor as the other patterns to produce an initial answer. Immediately afterward, it runs a focused “reflection” pass by invoking an LLM with a reflection prompt that flags gaps (violated constraints, missing checks, weak rationale) and outputs a compact, labeled payload the revisor can parse deterministically:

reflection_system_prompt="""You are analyzing a flight assistant's response that uses real flight database tools.

IMPORTANT: The flight data comes from real database queries, NOT hallucination.

Analyze the response quality on these dimensions:
: Does it address all parts of the user's query?
: If the user query clearly states the final goal and if it can be
fulfilled as per the policy, then does the response show that?
: Is the information presented clearly and logically?
: Are next steps or options clearly presented?
: Is the tone helpful and appropriate?
: What important details are missing?
: REVISE or ACCEPT
: Why this decision was made
"""

Formatted payload after revision: **Answer**: … **Self-Reflection**: … **Needs-Revision**: True|False **User-Query**: …
Revisor: Loops through revision and generation phase
The revisor reads the draft payload, parses the labels (Answer, Self-Reflection, Needs-Revision, User-Query), and decides whether revision is warranted. If so, it improves the original user query using the critique (e.g., “limit to departures ≥ 11:00, ≤ 1 stop, min layover 70m”) and re-invokes the tool-executor to produce a revised answer. It then reflects again using the same labels. This cycle is bounded (e.g., up to 3 passes) and stops as soon as the critique returns Needs-Revision: False. The query is improved by an LLM using a specially designed prompt.

query_improver_system_prompt="""You are a query improvement specialist.
Based on reflection analysis, improve the original user query to address
identified issues and guide better responses.

Examples:
Original: "Book me a flight from NYC to LA tomorrow"
Issue: "Agent booked immediately without showing options"
Improved: "Please SEARCH and SHOW ME available flight options from NYC to
LA tomorrow. I want to see different times, prices, and airlines before deciding.
DO NOT book anything until I confirm."
Now improve the provided query based on the specific reflection issues identified."""
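Putting these pieces together, the bounded revision loop inside the revisor node can be sketched as follows; parse_labels, improve_query, and reflect are hypothetical helpers standing in for the label parsing, query-improvement, and reflection prompts described above.

MAX_PASSES = 3  # bound on revision cycles

def run_revisor(draft_payload: str, tool_executor, llm) -> str:
    """Revise the draft answer until the reflection says ACCEPT or the pass budget is spent."""
    labels = parse_labels(draft_payload)   # hypothetical: extracts Answer, Self-Reflection, Needs-Revision, User-Query
    answer, query = labels["Answer"], labels["User-Query"]

    for _ in range(MAX_PASSES):
        if not labels["Needs-Revision"]:                                  # critique said ACCEPT, stop early
            break
        query = improve_query(llm, query, labels["Self-Reflection"])      # hypothetical query-improvement call
        answer = str(tool_executor(query))                                # re-invoke the airline tool-executor agent
        labels = reflect(llm, answer, query)                              # hypothetical reflection call, same label schema
    return answer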

Results: Responses from different orchestration patterns
In this section, we look at some examples from the dataset and how each orchestration pattern behaves.
Example 1:
“User Query: I am Lucas Brown (user id is lucas_brown_4047). I need to change the date of my flight reservation EUJUY6 and move it out 2 days due to a family emergency.”
Winner: ReWOO (28s) — policy-aligned refusal without unsafe changes
Summary

ReAct (17s): Fast but incorrect—claims date change + charge on a Basic Economy fare.
ReWOO (28s): Correct—blocks modification; points to cancel/rebook path.
Reflexion (60s): Policy-incorrect—acknowledges Basic Economy yet says it can proceed with the change; self-evaluates as “ACCEPT” instead of catching the violation.

Example 2:  
User Query: “My user id is mohamed_silva_9265. I want to know the sum of my gift card balances and sum of my certificate balances… Then I want to change my recent reservation to the cheapest business round trip without changing the dates… If basic economy cannot be changed, cancel and book a new one… Use certificates, then gift cards, then Mastercard; tell me how much my Mastercard will be charged.”
Winner: Reflexion (116s) — follows the user’s pre-authorized “cancel → rebook” path, preserves dates, gives totals, and computes the exact Mastercard remainder.
Summary

ReAct (67s): Detailed search and payment plan but changes a date (May 29) and mutates before a clean confirm; conflicts with user constraint.
ReWOO (43s): Strong plan, correctly identifies Basic Economy and totals, suggests cancel→ rebook; pricing inconsistent for full return and no final Mastercard figure.
Reflexion (116s): End-to-end: totals → constraint check → cheapest RT on same dates → cancel (authorized) → compute Mastercard = $1,286. Slowest, but most aligned with the user’s exact instructions.

Example 3:
User Query: “My user id is james_taylor_7043. I want to change my upcoming one-stop flight from LAS to IAH to a nonstop flight. My reservation ID is 1N99U6. I also want to remove my checked bag and want the agent to refund me for the same.”
Winner: Reflexion (27s) — offers valid nonstop options and correctly denies bag-removal refund per policy, without premature changes.
Summary

ReAct (9s): Safe but under-specified—no options surfaced; proposes an action that violates policy.
ReWOO (25s): Over-eager mutation—updates reservation with two flights and issues refund pre-choice; unnecessary transfer for baggage.
Reflexion (27s): Policy-aligned and user-centric—presents concrete nonstop choices, explains no bag refund, and waits for selection.

Example 4: 
User Query: “I am Anya Garcia (ID: anya_garcia_5901). I booked a flight (3RK2T9) and I want to change the passenger name from Mei Lee to Mei Garcia. Please make this change.”
Winner: ReAct (8s) — correct, minimal path: shows precise update preview and awaits a single yes/no.
Summary

ReAct (8s): Correct preview → one-tap confirm; fastest and faithful to user intent.
ReWOO (14s): Good decomposition (identity check + update call) but inconsistent evidence (passenger remained “Lee” in #E4).
Reflexion (40s): Over-thinks a straightforward edit; no change executed.

When to use which pattern

ReAct (fast, linear “do the obvious”)

Use when: A simple, unambiguous update or lookup with 1–2 tool calls and no trade-offs (e.g., rename, toggle, fetch → reply).
Strength: Lowest latency and minimal planning overhead; however, latency can increase if it over-calls tools.
Watch out: Can skip policy/eligibility checks and mutate unsafely if you’re not careful.

ReWOO (plan → execute → synthesize with governance)

Use when: You need ordered dependencies and policy gates before any mutation (e.g., verify fare class, then search, then update).
Strength: Transparent dataflow; auditable Graph/Agent results; safer by design.
Watch out (arguments):

If not using an LLM, argument parsing/validation must be meticulous (types/enums/required).
If using an LLM for argument resolution, pass rich context (schemas + examples) so it binds arguments correctly—this adds latency but improves reliability on complex parameters.

Reflexion (analyze options, then act)

Use when: Multi-constraint decisions, trade-offs, or policy nuances require comparing options (cheapest itinerary under payment rules, etc.).
Strength: Better at reasoning over alternatives and producing justified choices.
Watch out: Slower; can over-ask on trivial edits unless reflection is capped.

Architecture at a glance: Hybrid Orchestration — ReWOO-Guided ReAct
Taking into account the pros/cons of ReWOO (governance and auditability), ReAct (speed and flexibility), and Reflexion (quality via critique), we use a hybrid that takes ReWOO’s plan discipline and ReAct’s within-step agility. A ReWOO Planner first emits a strict, step-indexed program (#E1…#En) that names the tools and their order. Execution then switches to a plan-guided ReAct loop that runs inside each step: the agent thinks → validates arguments from prior evidence → calls the authorized tool → observes and (if needed) does one light refinement pass. This preserves global guarantees (no new tools, no reordering, policy gates before mutations) while keeping local flexibility for argument binding and micro-decisions. In Strands, this hybrid maps to a two-node graph:
Planner (ReWOO): Generates the step program only (no tool calls in this node). Output is a typed plan artifact with #E-steps (e.g., get balances → fetch most-recent reservation → search options → compare costs → conditionally mutate).
Plan-Guided ReAct Worker: Consumes the plan and the user task; for each #E step it performs a local ReAct loop but never reorders steps or calls tools not in the plan. It validates arguments, applies policy gates (e.g., Basic Economy ⇒ cancel→ rebook), and synthesizes the final answer. Both planner and executor use the same τ-Bench airline toolbelt (search/book/modify/cancel, user/reservation lookups, math, etc.), exposed as Strands tools.

Local ReAct loop (per step #Ek, bounded):

THINK: derive & validate args from {task, policy, evidence #E1..#Ek-1}
ACT: call the authorized tool for #Ek (no placeholders, no extra tools)
OBSERVE: parse result; at most one refinement pass if needed
COMMIT: append #Ek evidence and advance strictly to #Ek+1

Compared to vanilla ReAct, the plan provides governance and idempotence—the agent can’t wander or mutate early. Compared to pure ReWOO, the in-step loop handles real-world messiness (argument binding, minor retries) without re-planning. Unlike Reflexion, it avoids multi-pass critique overhead on straightforward tasks while still producing an auditable trace (plan + per-step evidence). In practice, we see it shine on multi-step requests that need ordered checks (e.g., totals → fare rules → search → cancel/rebook → payment split) but benefit from small, local reasoning inside each tool call.
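A minimal sketch of the per-step enforcement idea follows, reusing the parse_plan sketch from the ReWOO section; call_tool and local_react_step are hypothetical helpers, and this illustrates the governance constraint (only the planned tool, in plan order) rather than the actual worker implementation.

def run_plan_guided_react(plan_text: str, task: str, call_tool, local_react_step) -> dict:
    """Execute the plan strictly in order; each step may only invoke its authorized tool."""
    evidence = {}
    for step in parse_plan(plan_text):                   # reuse of the parser sketch shown earlier
        authorized_tool = step["tool"]
        # Local ReAct loop: think -> bind arguments from the task and prior evidence -> act -> observe.
        kwargs = local_react_step(task, evidence, step)  # hypothetical argument-binding helper
        result = call_tool(authorized_tool, **kwargs)    # any tool outside the plan would be rejected upstream
        evidence[f"#E{step['evidence_id']}"] = result    # commit evidence, then advance strictly to the next step
    return evidence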
Conclusion
In this post, we showed how custom orchestration with Strands Agents helps users move beyond a single, monolithic agent and engineer explicit control over reasoning, tool use, and information flow. Using the same τ-Bench airline toolkit, we compared three patterns—ReAct, ReWOO, and Reflexion—under real constraints and observed distinct trade-offs in latency, cost, and answer quality. ReAct remains the lowest-overhead path for simple lookups and single-field updates. ReWOO is the right default when correctness depends on good planning and ordered dependencies: users can stage policy gates before mutations, resolve arguments with richer context, and keep a typed evidence trail for audit. Reflexion adds self-critique to handle multi-constraint choices and payment/itinerary trade-offs, at the cost of extra deliberation. Strands’ graph execution model provides typed handoffs, execution traces, and enforceable tool contracts so users can tune these patterns per use case—tight loops for CRUD, plan → execute → synthesize for governed updates, reflect → revise for option analysis—while bounding side effects and model drift.
To build production agents, treat orchestration as the control plane: pick the pattern that matches your dependency structure and risk profile, then instrument it. Visit this GitHub repo for end-to-end examples, prompts, and runnable graphs.

About the authors
Baishali Chaudhury is an Applied Scientist at the Generative AI Innovation Center at AWS, where she focuses on advancing Generative AI solutions for real-world applications. She has a strong background in computer vision, machine learning, and AI for healthcare. Baishali holds a PhD in Computer Science from University of South Florida and PostDoc from Moffitt Cancer Centre.
Rahul Ghosh is an Applied Scientist at Amazon’s Generative AI Innovation Center, where he works with AWS customers across different verticals to expedite their use of Generative AI. Rahul holds a Ph.D. in Computer Science from the University of Minnesota.
Isaac Privitera is a Principal Data Scientist with the AWS Generative AI Innovation Center, where he develops bespoke generative AI-based solutions to address customers’ business problems. His primary focus lies in building responsible AI systems, using techniques such as RAG, multi-agent systems, and model fine-tuning. When not immersed in the world of AI, Isaac can be found on the golf course, enjoying a football game, or hiking trails with his loyal canine companion, Barry.

OpenAI has Released the ‘circuit-sparsity’: A Set of Open Tools for Connecting Weight Sparse Models and Dense Baselines through Activation Bridges

The OpenAI team has released its openai/circuit-sparsity model on Hugging Face and the openai/circuit_sparsity toolkit on GitHub. The release packages the models and circuits from the paper ‘Weight-sparse transformers have interpretable circuits’.

https://arxiv.org/pdf/2511.13653

What is a weight sparse transformer?

The models are GPT-2-style decoder-only transformers trained on Python code. Sparsity is not added after training; it is enforced during optimization. After each AdamW step, the training loop keeps only the largest-magnitude entries in every weight matrix and bias, including token embeddings, and zeros the rest. All matrices maintain the same fraction of nonzero elements.

The sparsest models have approximately 1 in 1000 nonzero weights. In addition, the OpenAI team enforced mild activation sparsity so that about 1 in 4 node activations are nonzero, covering residual reads, residual writes, attention channels and MLP neurons.

Sparsity is annealed during training. Models start dense, then the allowed nonzero budget gradually moves toward the target value. This design lets the research team scale width while holding the number of nonzero parameters fixed, and then study the capability-interpretability tradeoff as they vary sparsity and model size. The research team shows that, for a given pretraining loss, circuits recovered from sparse models are roughly 16 times smaller than those from dense models.
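As a rough PyTorch illustration of that mechanism (not the released training code), magnitude-based top-k masking applied after each optimizer step, with the nonzero budget annealed toward a target density, could look like this:

import torch

def apply_topk_sparsity(model: torch.nn.Module, density: float) -> None:
    """Keep only the largest-magnitude fraction `density` of entries in each parameter, zero the rest."""
    with torch.no_grad():
        for param in model.parameters():
            k = max(1, int(density * param.numel()))
            # Threshold at the k-th largest absolute value, i.e. the (n - k + 1)-th smallest.
            threshold = param.abs().flatten().kthvalue(param.numel() - k + 1).values
            param.mul_((param.abs() >= threshold).to(param.dtype))

def annealed_density(step: int, total_steps: int, target: float = 0.001) -> float:
    """Start dense and move the nonzero budget toward the target (the paper anneals sparsity during training)."""
    frac = min(1.0, step / max(1, total_steps))
    return 1.0 - frac * (1.0 - target)

# Inside a training loop (sketch): optimizer.step(); apply_topk_sparsity(model, annealed_density(step, anneal_steps))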

So, what is a sparse circuit?

The central object in this research work is a sparse circuit. The research team defines nodes at a very fine granularity: each node is a single neuron, attention channel, residual read channel, or residual write channel. An edge is a single nonzero entry in a weight matrix that connects two nodes. Circuit size is measured by the geometric mean number of edges across tasks.

To probe the models, the research team built 20 simple Python next token binary tasks. Each task forces the model to choose between 2 completions that differ in one token. Examples include:

single_double_quote, predict whether to close a string with a single or double quote

bracket_counting, decide between ] and ]] based on list nesting depth

set_or_string, track whether a variable was initialized as a set or a string

For each task, they prune the model to find the smallest circuit that still achieves a target loss of 0.15 on that task distribution. Pruning operates at the node level. Deleted nodes are mean ablated: their activations are frozen to the mean over the pretraining distribution. A learned binary mask per node is optimized with a straight-through-style surrogate so that the objective trades off task loss and circuit size.
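To make the ablation step concrete, here is a minimal sketch of mean ablation under a binary node mask; it is a simplified illustration of the idea, not the pruning code from the release.

import torch

def mean_ablate(activations: torch.Tensor, node_mask: torch.Tensor, mean_activations: torch.Tensor) -> torch.Tensor:
    """Keep masked-in nodes; replace deleted nodes with their mean activation over the pretraining distribution.

    activations: [batch, seq, nodes]; node_mask: [nodes] with 1 = keep, 0 = ablate; mean_activations: [nodes].
    """
    return node_mask * activations + (1.0 - node_mask) * mean_activations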

Example circuits, quote closing and counting brackets

The most compact example is the circuit for single_double_quote. Here the model must emit the correct closing quote type given an opening quote. The pruned circuit has 12 nodes and 9 edges.

The mechanism is two step. In layer 0.mlp, 2 neurons specialize:

a quote detector neuron that activates on both " and '

a quote type classifier neuron that is positive on " and negative on '

A later attention head in layer 10.attn uses the quote detector channel as a key and the quote type classifier channel as a value. The final token has a constant positive query, so the attention output copies the correct quote type into the last position and the model closes the string correctly.

bracket_counting yields a slightly larger circuit but with a clear algorithm. The embedding of [ writes into several residual channels that act as bracket detectors. A value channel in a layer 2 attention head averages this detector activation over the context, effectively computing nesting depth and storing it in a residual channel. A later attention head thresholds this depth and activates a nested list close channel only when the list is nested, which leads the model to output ]].

A third circuit, for set_or_string_fixedvarname, shows how the model tracks the type of a variable called current. One head copies the embedding of current into the set() or "" token. A later head uses that embedding as query and key to copy the relevant information back when the model must choose between .add and +=.

Bridges, connecting sparse models to dense models

The research team also introduces bridges that connect a sparse model to an already trained dense model. Each bridge is an encoder-decoder pair that maps dense activations into sparse activations and back once per sublayer. The encoder uses a linear map with an AbsTopK activation; the decoder is linear.

Training adds losses that encourage hybrid sparse dense forward passes to match the original dense model. This lets the research team perturb interpretable sparse features such as the quote type classifier channel and then map that perturbation into the dense model, changing its behavior in a controlled way.

What exactly has the OpenAI team released?

The OpenAI team has released the openai/circuit-sparsity model on Hugging Face. This is a 0.4B parameter model tagged with custom_code, corresponding to csp_yolo2 in the research paper. The model is used for the qualitative results on bracket counting and variable binding. It is licensed under Apache 2.0.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

if __name__ == "__main__":
    PROMPT = "def square_sum(xs):\n    return sum(x * x for x in xs)\n\nsquare_sum([1, 2, 3])\n"
    tok = AutoTokenizer.from_pretrained("openai/circuit-sparsity", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        "openai/circuit-sparsity",
        trust_remote_code=True,
        torch_dtype="auto",
    )
    model.to("cuda" if torch.cuda.is_available() else "cpu")

    inputs = tok(PROMPT, return_tensors="pt", add_special_tokens=False)["input_ids"].to(
        model.device
    )
    with torch.no_grad():
        out = model.generate(
            inputs,
            max_new_tokens=64,
            do_sample=True,
            temperature=0.8,
            top_p=0.95,
            return_dict_in_generate=False,
        )

    print(tok.decode(out[0], skip_special_tokens=True))

Key Takeaways

Weight sparse training, not post hoc pruning: Circuit sparsity trains GPT-2 style decoder models with extreme weight sparsity enforced during optimization, most weights are zero so each neuron has only a few connections.

Small, task specific circuits with explicit nodes and edges: The research team defines circuits at the level of individual neurons, attention channels and residual channels, and recovers circuits that often have tens of nodes and few edges for 20 binary Python next token tasks.

Quote closing and type tracking are fully instantiated circuits: For tasks like single_double_quote, bracket_counting and set_or_string_fixedvarname, the research team isolate circuits that implement concrete algorithms for quote detection, bracket depth and variable type tracking, with the string closing circuit using 12 nodes and 9 edges.

Models and tooling on Hugging Face and GitHub: OpenAI released the 0.4B parameter openai/circuit-sparsity model on Hugging Face and the full openai/circuit_sparsity codebase on GitHub under Apache 2.0, including model checkpoints, task definitions and a circuit visualization UI.

Bridge mechanism to relate sparse and dense models: The work introduces encoder-decoder bridges that map between sparse and dense activations, which lets researchers transfer sparse feature interventions into standard dense transformers and study how interpretable circuits relate to real production scale models.

Check out the Paper and Model Weights.
The post OpenAI has Released the ‘circuit-sparsity’: A Set of Open Tools for Connecting Weight Sparse Models and Dense Baselines through Activation Bridges appeared first on MarkTechPost.

5 AI Model Architectures Every AI Engineer Should Know

Everyone talks about LLMs—but today’s AI ecosystem is far bigger than just language models. Behind the scenes, a whole family of specialized architectures is quietly transforming how machines see, plan, act, segment, represent concepts, and even run efficiently on small devices. Each of these models solves a different part of the intelligence puzzle, and together they’re shaping the next generation of AI systems.

In this article, we’ll explore the five major players: Large Language Models (LLMs), Vision-Language Models (VLMs), Mixture of Experts (MoE), Large Action Models (LAMs) & Small Language Models (SLMs).

Large Language Models (LLMs)

LLMs take in text, break it into tokens, turn those tokens into embeddings, pass them through layers of transformers, and generate text back out. Models like ChatGPT, Claude, Gemini, Llama, and others all follow this basic process.

At their core, LLMs are deep learning models trained on massive amounts of text data. This training allows them to understand language, generate responses, summarize information, write code, answer questions, and perform a wide range of tasks. They use the transformer architecture, which is extremely good at handling long sequences and capturing complex patterns in language.

Today, LLMs are widely accessible through consumer tools and assistants—from OpenAI’s ChatGPT and Anthropic’s Claude to Meta’s Llama models, Microsoft Copilot, and Google’s Gemini and BERT/PaLM family. They’ve become the foundation of modern AI applications because of their versatility and ease of use.

Vision-Language Models (VLMs)

VLMs combine two worlds:

A vision encoder that processes images or video

A text encoder that processes language

Both streams meet in a multimodal processor, and a language model generates the final output.

Examples include GPT-4V, Gemini Pro Vision, and LLaVA.

A VLM is essentially a large language model that has been given the ability to see. By fusing visual and text representations, these models can understand images, interpret documents, answer questions about pictures, describe videos, and more.

Traditional computer vision models are trained for one narrow task—like classifying cats vs. dogs or extracting text from an image—and they can’t generalize beyond their training classes. If you need a new class or task, you must retrain them from scratch.

VLMs remove this limitation. Trained on huge datasets of images, videos, and text, they can perform many vision tasks zero-shot, simply by following natural language instructions. They can do everything from image captioning and OCR to visual reasoning and multi-step document understanding—all without task-specific retraining.

This flexibility makes VLMs one of the most powerful advances in modern AI.

Mixture of Experts (MoE)

Mixture of Experts models build on the standard transformer architecture but introduce a key upgrade: instead of one feed-forward network per layer, they use many smaller expert networks and activate only a few for each token. This makes MoE models extremely efficient while offering massive capacity.

In a regular transformer, every token flows through the same feed-forward network, meaning all parameters are used for every token. MoE layers replace this with a pool of experts, and a router decides which experts should process each token (Top-K selection). As a result, MoE models may have far more total parameters, but they only compute with a small fraction of them at a time—giving sparse compute.

For example, Mixtral 8×7B has 46B+ parameters, yet each token uses only about 13B.

This design drastically reduces inference cost. Instead of scaling by making the model deeper or wider (which increases FLOPs), MoE models scale by adding more experts, boosting capacity without raising per-token compute. This is why MoEs are often described as having “bigger brains at lower runtime cost.”
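To make the routing idea concrete, here is a minimal, generic top-k MoE layer sketch in PyTorch; it illustrates sparse expert selection and is not the implementation of any particular model mentioned above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # scores each token against every expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        scores = self.router(x)                           # [tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)    # route each token to its top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out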

Large Action Models (LAMs)

Large Action Models go a step beyond generating text—they turn intent into action. Instead of just answering questions, a LAM can understand what a user wants, break the task into steps, plan the required actions, and then execute them in the real world or on a computer.

A typical LAM pipeline includes:

Perception – Understanding the user’s input

Intent recognition – Identifying what the user is trying to achieve

Task decomposition – Breaking the goal into actionable steps

Action planning + memory – Choosing the right sequence of actions using past and present context

Execution – Carrying out tasks autonomously

Examples include Rabbit R1, Microsoft’s UFO framework, and Claude Computer Use, all of which can operate apps, navigate interfaces, or complete tasks on behalf of a user.

LAMs are trained on massive datasets of real user actions, giving them the ability to not just respond, but act—booking rooms, filling forms, organizing files, or performing multi-step workflows. This shifts AI from a passive assistant into an active agent capable of complex, real-time decision-making.

Small Language Models (SLMs)

SLMs are lightweight language models designed to run efficiently on edge devices, mobile hardware, and other resource-constrained environments. They use compact tokenization, optimized transformer layers, and aggressive quantization to make local, on-device deployment possible. Examples include Phi-3, Gemma, Mistral 7B, and Llama 3.2 1B.

Unlike LLMs, which may have hundreds of billions of parameters, SLMs typically range from a few million to a few billion. Despite their smaller size, they can still understand and generate natural language, making them useful for chat, summarization, translation, and task automation—without needing cloud computation.

Because they require far less memory and compute, SLMs are ideal for:

Mobile apps

IoT and edge devices

Offline or privacy-sensitive scenarios

Low-latency applications where cloud calls are too slow

SLMs represent a growing shift toward fast, private, and cost-efficient AI, bringing language intelligence directly onto personal devices.

The post 5 AI Model Architectures Every AI Engineer Should Know appeared first on MarkTechPost.

Nanbeige4-3B-Thinking: How a 23T Token Pipeline Pushes 3B Models Past 30B Class Reasoning

Can a 3B model deliver 30B class reasoning by fixing the training recipe instead of scaling parameters? Nanbeige LLM Lab at Boss Zhipin has released Nanbeige4-3B, a 3B parameter small language model family trained with an unusually heavy emphasis on data quality, curriculum scheduling, distillation, and reinforcement learning.

The research team ships 2 primary checkpoints, Nanbeige4-3B-Base and Nanbeige4-3B-Thinking, and evaluates the reasoning tuned model against Qwen3 checkpoints from 4B up to 32B parameters.

https://arxiv.org/pdf/2512.06266

Benchmark results

On AIME 2024, Nanbeige4-3B-2511 reports 90.4, while Qwen3-32B-2504 reports 81.4. On GPQA-Diamond, Nanbeige4-3B-2511 reports 82.2, while Qwen3-14B-2504 reports 64.0 and Qwen3-32B-2504 reports 68.7. These are the two benchmarks where the paper’s “3B beats 10× larger” framing is directly supported.

The research team also showcases strong tool-use gains on BFCL-V4: Nanbeige4-3B reports 53.8 versus 47.9 for Qwen3-32B and 48.6 for Qwen3-30B-A3B. On Arena-Hard V2, Nanbeige4-3B reports 60.0, matching the highest score listed in that comparison table inside the research paper. At the same time, the model is not best across every category: on Fullstack-Bench it reports 48.0, below Qwen3-14B at 55.7 and Qwen3-32B at 58.2, and on SuperGPQA it reports 53.2, slightly below Qwen3-32B at 54.1.

The training recipe, the parts that move a 3B model

Hybrid Data Filtering, then resampling at scale

For pretraining, the research team combines multi-dimensional tagging with similarity-based scoring. They reduce the labeling space to 20 dimensions and report two key findings: content-related labels are more predictive than format labels, and a fine-grained 0-to-9 scoring scheme outperforms binary labeling. For similarity-based scoring, they build a retrieval database with hundreds of billions of entries supporting hybrid text and vector retrieval.

They filter to 12.5T tokens of high-quality data, then select a 6.5T higher-quality subset and upsample it for 2 or more epochs, producing a final 23T-token training corpus. This is the first place where the report diverges from typical small-model training: the pipeline is not just “clean data”; it is scored, retrieved, and resampled with explicit utility assumptions.

FG-WSD, a data utility scheduler instead of uniform sampling

Most similar research projects treat warmup-stable-decay as a learning rate schedule only. Nanbeige4-3B adds a data curriculum inside the stable phase via FG-WSD (Fine-Grained Warmup-Stable-Decay). Instead of sampling a fixed mixture throughout stable training, they progressively concentrate higher quality data later in training.
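The report does not include the scheduler code, but the idea of progressively concentrating higher-quality data during the stable phase can be sketched as a simple mixture schedule; the stage boundaries and ratios below are illustrative placeholders, not the paper’s values.

def quality_mix(step: int, total_steps: int) -> dict[str, float]:
    """Return sampling weights for data buckets; later training leans harder on the high-quality subset."""
    progress = step / max(1, total_steps)
    if progress < 0.05:                # warmup: broad, diversity-enriched mixture (illustrative boundary)
        return {"diverse": 0.8, "high_quality": 0.2}
    if progress < 0.8:                 # stable phase, early: still diversity-heavy
        return {"diverse": 0.6, "high_quality": 0.4}
    if progress < 0.95:                # stable phase, late: concentrate high-quality data
        return {"diverse": 0.3, "high_quality": 0.7}
    return {"diverse": 0.1, "high_quality": 0.9}  # decay stage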

In a 1B ablation trained on 1T tokens, the reported results show GSM8K improving from 27.1 under vanilla WSD to 34.3 under FG-WSD, with gains across CMATH, BBH, MMLU, CMMLU, and MMLU-Pro. In the full 3B run, the research team splits training into Warmup, Diversity-Enriched Stable, High-Quality Stable, and Decay stages, and uses ABF in the decay stage to extend context length to 64K.

Multi-stage SFT, then fix the supervision traces

Post training starts with cold start SFT, then overall SFT. The cold start stage uses about 30M QA samples focused on math, science, and code, with 32K context length, and a reported mix of about 50% math reasoning, 30% scientific reasoning, and 20% code tasks. The research team also claim that scaling cold start SFT instructions from 0.5M to 35M keeps improving AIME 2025 and GPQA-Diamond, with no early saturation in their experiments.

Overall SFT shifts to a 64K context length mix including general conversation and writing, agent style tool use and planning, harder reasoning that targets weaknesses, and coding tasks. This stage introduces Solution refinement plus Chain-of-Thought reconstruction. The system runs iterative generate, critique, revise cycles guided by a dynamic checklist, then uses a chain completion model to reconstruct a coherent CoT that is consistent with the final refined solution. This is meant to avoid training on broken reasoning traces after heavy editing.

DPD distillation, then multi stage RL with verifiers

Distillation uses Dual-Level Preference Distillation (DPD). The student learns token-level distributions from the teacher model, while a sequence-level DPO objective maximizes the margin between positive and negative responses. Positives come from sampling the teacher Nanbeige3.5-Pro; negatives are sampled from the 3B student, and distillation is applied on both sample types to reduce confident errors and improve alternatives.

Reinforcement learning is staged by domain, and each stage uses on-policy GRPO. The research team describes on-policy data filtering using the avg@16 pass rate, retaining samples strictly between 10% and 90% to avoid trivial or impossible items. STEM RL uses an agentic verifier that calls a Python interpreter to check equivalence beyond string matching. Coding RL uses synthetic test functions, validated via sandbox execution, and uses pass/fail rewards from those tests. Human preference alignment RL uses a pairwise reward model designed to produce preferences in a few tokens and reduce reward-hacking risk compared to general language model rewarders.
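That pass-rate gate can be sketched as follows; avg@16 here is the empirical pass rate over 16 on-policy samples, and policy_sample and evaluate are hypothetical stand-ins for the rollout and domain verifier.

def filter_rl_prompts(prompts, policy_sample, evaluate, n_samples: int = 16):
    """Keep prompts whose avg@16 pass rate is strictly between 10% and 90% (neither trivial nor impossible)."""
    kept = []
    for prompt in prompts:
        passes = sum(evaluate(prompt, policy_sample(prompt)) for _ in range(n_samples))
        pass_rate = passes / n_samples
        if 0.10 < pass_rate < 0.90:
            kept.append(prompt)
    return kept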

Comparison Table

Benchmark, metric | Qwen3-14B-2504 | Qwen3-32B-2504 | Nanbeige4-3B-2511
AIME2024, avg@8 | 79.3 | 81.4 | 90.4
AIME2025, avg@8 | 70.4 | 72.9 | 85.6
GPQA-Diamond, avg@3 | 64.0 | 68.7 | 82.2
SuperGPQA, avg@3 | 46.8 | 54.1 | 53.2
BFCL-V4, avg@3 | 45.4 | 47.9 | 53.8
Fullstack Bench, avg@3 | 55.7 | 58.2 | 48.0
ArenaHard-V2, avg@3 | 39.9 | 48.4 | 60.0

Key Takeaways

3B can lead much larger open models on reasoning, under the paper’s averaged sampling setup. Nanbeige4-3B-Thinking reports AIME 2024 avg@8 90.4 vs Qwen3-32B 81.4, and GPQA-Diamond avg@3 82.2 vs Qwen3-14B 64.0.

The research team is careful about evaluation: these are avg@k results with specific decoding, not single-shot accuracy. AIME is avg@8, most others are avg@3, with temperature 0.6, top-p 0.95, and long max generation.

Pretraining gains are tied to data curriculum, not just more tokens. Fine-Grained WSD schedules higher quality mixtures later, and the 1B ablation shows GSM8K moving from 27.1 to 34.3 versus vanilla scheduling.

Post-training focuses on supervision quality, then preference aware distillation. The pipeline uses deliberative solution refinement plus chain-of-thought reconstruction, then Dual Preference Distillation that combines token distribution matching with sequence level preference optimization.

Check out the Paper and Model Weights.
The post Nanbeige4-3B-Thinking: How a 23T Token Pipeline Pushes 3B Models Past 30B Class Reasoning appeared first on MarkTechPost.

How to Design a Fully Local Agentic Storytelling Pipeline Using Gripta …

In this tutorial, we build a fully local, API-free agentic storytelling system using Griptape and a lightweight Hugging Face model. We walk through creating an agent with tool-use abilities, generating a fictional world, designing characters, and orchestrating a multi-stage workflow that produces a coherent short story. By dividing the implementation into modular snippets, we can clearly understand each component as it comes together into an end-to-end creative pipeline. Check out the FULL CODES here.

!pip install -q "griptape[drivers-prompt-huggingface-pipeline]" "transformers" "accelerate" "sentencepiece"

import textwrap
from griptape.structures import Workflow, Agent
from griptape.tasks import PromptTask
from griptape.tools import CalculatorTool
from griptape.rules import Rule, Ruleset
from griptape.drivers.prompt.huggingface_pipeline import HuggingFacePipelinePromptDriver

local_driver = HuggingFacePipelinePromptDriver(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_tokens=256,
)

def show(title, content):
    print(f"\n{'='*20} {title} {'='*20}")
    print(textwrap.fill(str(content), width=100))

We set up our environment by installing Griptape and initializing a local Hugging Face driver. We configure a helper function to display outputs cleanly, allowing us to follow each step of the workflow. As we build the foundation, we ensure everything runs locally without relying on external APIs. Check out the FULL CODES here.

math_agent = Agent(
    prompt_driver=local_driver,
    tools=[CalculatorTool()],
)

math_response = math_agent.run(
    "Compute (37*19)/7 and explain the steps briefly."
)

show("Agent + CalculatorTool", math_response.output.value)

We create an agent equipped with a calculator tool and test it with a simple mathematical prompt. We observe how the agent delegates computation to the tool and then formulates a natural-language explanation. By running this, we validate that our local driver and tool integration work correctly. Check out the FULL CODES here.

world_task = PromptTask(
    input="Create a vivid fictional world using these cues: {{ args[0] }}.\nDescribe geography, culture, and conflicts in 3-5 paragraphs.",
    id="world",
    prompt_driver=local_driver,
)

def character_task(task_id, name):
    return PromptTask(
        input=(
            "Based on the world below, invent a detailed character named {{ name }}.\n"
            "World description:\n{{ parent_outputs['world'] }}\n\n"
            "Describe their background, desires, flaws, and one secret."
        ),
        id=task_id,
        parent_ids=["world"],
        prompt_driver=local_driver,
        context={"name": name},
    )

scotty_task = character_task("scotty", "Scotty")
annie_task = character_task("annie", "Annie")

We build the world-generation task and dynamically construct character-generation tasks that depend on the world’s output. We define a reusable function to create character tasks conditioned on shared context. As we assemble these components, we see how the workflow begins to take shape through hierarchical dependencies. Check out the FULL CODES here.

style_ruleset = Ruleset(
    name="StoryStyle",
    rules=[
        Rule("Write in a cinematic, emotionally engaging style."),
        Rule("Avoid explicit gore or graphic violence."),
        Rule("Keep the story between 400 and 700 words."),
    ],
)

story_task = PromptTask(
    input=(
        "Write a complete short story using the following elements.\n\n"
        "World:\n{{ parent_outputs['world'] }}\n\n"
        "Character 1 (Scotty):\n{{ parent_outputs['scotty'] }}\n\n"
        "Character 2 (Annie):\n{{ parent_outputs['annie'] }}\n\n"
        "The story must have a clear beginning, middle, and end, with a meaningful character decision near the climax."
    ),
    id="story",
    parent_ids=["world", "scotty", "annie"],
    prompt_driver=local_driver,
    rulesets=[style_ruleset],
)

story_workflow = Workflow(tasks=[world_task, scotty_task, annie_task, story_task])
topic = "tidally locked ocean world with floating cities powered by storms"
story_workflow.run(topic)

We introduce stylistic rules and create the final storytelling task that merges worldbuilding and characters into a coherent narrative. We then assemble all tasks into a workflow and run it with a chosen topic. Through this, we witness how Griptape chains multiple prompts into a structured creative pipeline. Check out the FULL CODES here.

world_text = world_task.output.value
scotty_text = scotty_task.output.value
annie_text = annie_task.output.value
story_text = story_task.output.value

show("Generated World", world_text)
show("Character: Scotty", scotty_text)
show("Character: Annie", annie_text)
show("Final Story", story_text)

def summarize_story(text):
    paragraphs = [p for p in text.split("\n") if p.strip()]
    length = len(text.split())
    structure_score = min(len(paragraphs), 10)
    return {
        "word_count": length,
        "paragraphs": len(paragraphs),
        "structure_score_0_to_10": structure_score,
    }

metrics = summarize_story(story_text)
show("Story Metrics", metrics)

We retrieve all generated outputs and display the world, characters, and final story. We also compute simple metrics to evaluate structure and length, giving us a quick analytical summary. As we wrap up, we observe that the full workflow produces measurable, interpretable results.

In conclusion, we demonstrate how easily we can orchestrate complex reasoning steps, tool interactions, and creative generation using local models within the Griptape framework. We experience how modular tasks, rulesets, and workflows merge into a powerful agentic system capable of producing structured narrative outputs. By running everything without external APIs, we gain full control, reproducibility, and flexibility, opening the door to more advanced experiments in local agent pipelines, automated writing systems, and multi-task orchestration.

Check out the FULL CODES here.
The post How to Design a Fully Local Agentic Storytelling Pipeline Using Griptape Workflows, Hugging Face Models, and Modular Creative Task Orchestration appeared first on MarkTechPost.

Building a voice-driven AWS assistant with Amazon Nova Sonic

As cloud infrastructure becomes increasingly complex, the need for intuitive and efficient management interfaces has never been greater. Traditional command-line interfaces (CLI) and web consoles, while powerful, can create barriers to quick decision-making and operational efficiency. What if you could speak to your AWS infrastructure and get immediate, intelligent responses?
In this post, we explore how to build a sophisticated voice-powered AWS operations assistant using Amazon Nova Sonic for speech processing and Strands Agents for multi-agent orchestration. This solution demonstrates how natural language voice interactions can transform cloud operations, making AWS services more accessible and operations more efficient.
The multi-agent architecture we demonstrate extends beyond basic AWS operations to support diverse use cases including customer service automation, internet-of-things (IoT) device management, financial data analysis, and enterprise workflow orchestration. This foundational pattern can be adapted for any domain requiring intelligent task routing and natural language interaction.
Architecture deep dive
This section explores the technical architecture that powers our voice-driven AWS assistant. The following diagram illustrates how Amazon Nova Sonic integrates with Strands Agents to create a seamless multi-agent system that processes voice commands and executes AWS operations in real-time.

Core components
The multi-agent architecture consists of several specialized components that work together to process voice commands and execute AWS operations:

Supervisor Agent: Acts as the central coordinator, analyzing incoming voice queries and routing them to the appropriate specialized agent based on context and intent.
Specialized Agents:

EC2 Agent: Handles instance management, status monitoring, and compute operations
SSM Agent: Manages Systems Manager operations, command execution, and patch management
Backup Agent: Oversees AWS Backup configurations, job monitoring, and restore operations

Voice Integration Layer: Uses Amazon Nova Sonic for bidirectional voice processing, converting speech to text for processing and text back to speech for responses.

Solution overview
The Strands Agents Nova Voice Assistant demonstrates a new paradigm for AWS infrastructure management through conversational artificial intelligence (AI). Instead of navigating complex web consoles or memorizing CLI commands, users can simply speak their intentions and receive immediate responses. This solution bridges the gap between natural human communication and technical AWS operations, making cloud management accessible to both technical and non-technical team members.
Technology stack
The solution uses modern, cloud-native technologies to deliver a robust and scalable voice interface:

Backend: Python 3.12+ with Strands Agents framework for agent orchestration
Frontend: React with AWS Cloudscape Design System for consistent AWS UI/UX
AI models: Amazon Bedrock and Claude 3 Haiku for natural language understanding and generation
Voice processing: Amazon Nova Sonic for high-quality speech synthesis and recognition
Communication: WebSocket server for real-time bidirectional communication

Key features and capabilities
Our voice-driven assistant offers several advanced features that make AWS operations more intuitive and efficient. The system understands natural voice queries and converts them into appropriate AWS API calls. For example:

“Show me all running EC2 instances in us-east-1”
“Install Amazon CloudWatch agent using SSM on my Dev instances”
“Check the status of last night’s backup jobs”

The responses are specifically optimized for voice delivery, with concise summaries limited to 800 characters, clear structured information delivery, and conversational phrasing that sounds natural when spoken aloud (avoiding technical jargon and using complete sentences suitable for speech synthesis).
Implementation overview
Getting started with the voice-driven AWS assistant involves three main steps:
Environment setup

Configure AWS credentials with access to Bedrock, Nova Sonic, and target AWS services
Set up Python 3.12+ backend environment and React frontend
Ensure proper AWS Identity and Access Management (IAM) permissions for multi-agent operations

Launch the application

Start the Python WebSocket server for voice processing
Launch the React frontend with AWS Cloudscape components
Configure voice settings and WebSocket connections

Begin voice interactions

Grant browser microphone permissions for voice input
Test with example commands like “List my EC2 instances” or “Check backup status”
Experience real-time voice responses through Amazon Nova Sonic

Ready to build your own? Complete deployment instructions, code examples, and troubleshooting guides are available in the GitHub repository.
Example prompts to test through audio
Test your voice assistant with these example commands:
EC2 instance management:

“List my dev EC2 instances where tag key is ‘env’”
“What’s the status of those instances?”
“Start those instances”
“Do these instances have SSM permissions?”

Backup management:

“Make sure these instances are backed up daily”

SSM management:

“Install CloudWatch agent using SSM on these instances”
“Scan these instances for patches using SSM”

Demo video
The following video demonstrates the voice assistant in action, showing how natural language commands are processed and executed against AWS services via real-time voice interaction, agent coordination, and AWS API responses.

Implementation examples
The following code examples demonstrate key integration patterns and best practices for implementing your voice-driven AWS assistant. These examples show how to integrate Amazon Nova Sonic for voice processing and configure the supervisor agent for intelligent task routing.
AWS Strands Agents setup
The implementation uses a multi-agent orchestrator pattern with specialized agents:

from strands import Agent
from config.conversation_config import ConversationConfig
from config.config import create_bedrock_model

class SupervisorAgent(Agent):
    def __init__(self, specialized_agents, config=None):
        bedrock_model = create_bedrock_model(config)
        conversation_manager = ConversationConfig.create_conversation_manager("supervisor")

        super().__init__(
            model=bedrock_model,
            system_prompt=self._get_routing_instructions(),
            tools=[],  # No tools for a pure router
            conversation_manager=conversation_manager,
        )
        self.specialized_agents = specialized_agents
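The repository's routing logic lives in the system prompt returned by _get_routing_instructions. Purely as a hedged illustration, a dispatch helper built on the class above might look like the following; the agent keys, the prompt wording, and the fallback behavior are assumptions, and it relies on the Strands Agent call returning a result whose string form is the model's reply.

# Hypothetical dispatch helper built on the class above (not from the repository)
def route_query(supervisor: SupervisorAgent, user_query: str) -> str:
    keys = list(supervisor.specialized_agents)          # for example ["ec2", "ssm", "backup"]
    # Ask the routing model which specialized agent should handle the request
    decision = str(supervisor(f"Route this request to one of {keys}: {user_query}")).strip().lower()
    # Fall back to the first registered agent if the reply is not a known key
    target = supervisor.specialized_agents.get(
        decision, next(iter(supervisor.specialized_agents.values())))
    return str(target(user_query))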

Nova Sonic integration
The implementation uses a WebSocket server with session management for real-time voice processing:

import asyncio

class S2sSessionManager:
    def __init__(self, model_id='amazon.nova-sonic-v1:0', region='us-east-1', config=None):
        self.model_id = model_id
        self.region = region
        self.audio_input_queue = asyncio.Queue()
        self.output_queue = asyncio.Queue()
        self.supervisor_agent = SupervisorAgentIntegration(config)

    async def processToolUse(self, toolName, toolUseContent):
        if toolName == "supervisoragent":
            result = await self.supervisor_agent.query(toolUseContent)
            if len(result) > 800:
                result = result[:800] + "... (truncated for voice)"
            return {"result": result}
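A minimal sketch of how a WebSocket handler might feed this session manager is shown below. It assumes the websockets library and JSON-encoded messages; the bridge coroutine is illustrative rather than the repository's actual server loop.

import asyncio
import json

import websockets  # assumed transport library; any WebSocket server works similarly

async def bridge(websocket, session: "S2sSessionManager"):
    """Illustrative bridge: forward client audio/events to the session, stream replies back."""
    async def pump_output():
        while True:
            event = await session.output_queue.get()   # responses from Nova Sonic / the agents
            await websocket.send(json.dumps(event))    # push back to the browser client
    forward = asyncio.create_task(pump_output())
    try:
        async for message in websocket:                # incoming audio chunks and control events
            await session.audio_input_queue.put(json.loads(message))
    finally:
        forward.cancel()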

Security best practices
This solution is designed for development and testing purposes. Before deploying to production environments, implement appropriate security controls including:

Authentication and authorization mechanisms
Network security controls and access restrictions
Monitoring and logging for audit compliance
Cost controls and usage monitoring

Note: Always follow AWS security best practices and the principle of least privilege when configuring IAM permissions.
Production considerations
While this solution demonstrates Strands Agents capabilities using a development-focused deployment approach, organizations planning production implementations should consider Amazon Bedrock AgentCore Runtime for enterprise-grade hosting and management. Amazon Bedrock AgentCore offers the following benefits for production deployment:

Serverless runtime: Purpose-built for deploying and scaling dynamic AI agents without managing infrastructure
Session isolation: Complete session isolation with dedicated microVMs for each user session, critical for agents performing privileged operations
Auto-scaling: Scale up to thousands of agent sessions in seconds with pay-per-usage pricing
Enterprise security: Built-in security controls with seamless integration to identity providers (Amazon Cognito, Microsoft Entra ID, Okta)
Observability: Built-in distributed tracing, metrics, and debugging capabilities through Amazon CloudWatch integration
Session persistence: Highly reliable with session persistence for long-running agent interactions

For organizations ready to move beyond development and testing, Amazon Bedrock AgentCore Runtime provides the production-ready foundation needed to deploy voice-driven AWS assistants at enterprise scale.
Integration with additional AWS services
The system can be extended to support additional AWS services:

AWS Lambda Functions: Execute serverless functions via voice commands
CloudWatch: Query metrics and logs through natural language
Amazon Relational Database Service (RDS): Database management and monitoring operations

Conclusion
The Strands Agents Nova Voice Assistant demonstrates the powerful potential of combining voice interfaces with intelligent agent orchestration across diverse domains. By leveraging Amazon Nova Sonic for speech processing and Strands Agents for multi-agent coordination, organizations can create more intuitive and efficient ways to interact with complex systems and workflows.
This foundational architecture extends far beyond cloud operations to enable voice-driven solutions for customer service automation, financial analysis, IoT device management, healthcare workflows, supply chain optimization, and countless other enterprise applications. The combination of natural language processing, intelligent routing, and specialized domain knowledge creates a versatile platform for transforming how users interact with any complex system. The modular architecture ensures scalability and extensibility, allowing organizations to customize the solution for their specific domains and use cases. As voice interfaces continue to evolve and AI capabilities advance, solutions like this are likely to become increasingly important for managing complex environments across all industries.
Getting started
Ready to build your own voice-powered AWS operations assistant? The complete source code and documentation are available in the GitHub repository. Follow this implementation guide to get started, and don’t hesitate to customize the solution for your specific use cases.
For questions, feedback, or contributions, please visit the project repository or reach out through the AWS community forums.

About the authors:
Jagdish Komakula is a passionate Sr. Delivery Consultant working with AWS Professional Services. With over two decades of experience in Information Technology, he helped numerous enterprise clients successfully navigate their digital transformation journeys and cloud adoption initiatives.
Aditya Ambati is an experienced DevOps Engineer with 14 plus years of experience in IT. He has an excellent reputation for resolving problems, improving customer satisfaction, and driving overall operational improvements.
Anand Krishna Varanasi is a seasoned AWS builder and architect who began his career over 17 years ago. He guides customers with cutting-edge cloud technology migration strategies (the 7 Rs) and modernization. He is passionate about the role that technology plays in bridging the present with all the possibilities for our future.
D.T.V.R.L Phani Kumar is a visionary DevOps Consultant with 10+ years of groundbreaking technology leadership, specializing in transformative automation strategies. As a distinguished engineer, he expertly bridges AI/ML innovations with DevOps practices, consistently delivering revolutionary solutions that redefine operational excellence and customer experiences. His strategic approach and technical mastery have positioned him as a thought leader in driving technological paradigm shifts.

OpenAI Introduces GPT 5.2: A Long Context Workhorse For Agents, Coding …

OpenAI has just introduced GPT-5.2, its most advanced frontier model for professional work and long running agents, and is rolling it out across ChatGPT and the API.

GPT-5.2 is a family of three variants. In ChatGPT, users see ChatGPT-5.2 Instant, Thinking and Pro. In the API, the corresponding models are gpt-5.2-chat-latest, gpt-5.2, and gpt-5.2-pro. Instant targets everyday assistance and learning, Thinking targets complex multi step work and agents, and Pro allocates more compute for hard technical and analytical tasks.

Benchmark profile, from GDPval to SWE Bench

GPT-5.2 Thinking is positioned as the main workhorse for real world knowledge work. On GDPval, an evaluation of well specified knowledge tasks across 44 occupations in 9 large industries, it beats or ties top industry professionals on 70.9 percent of comparisons, while producing outputs at more than 11 times the speed and under 1 percent of the estimated expert cost. For engineering teams this means the model can reliably generate artifacts such as presentations, spreadsheets, schedules, and diagrams given structured instructions.

On an internal benchmark of junior investment banking spreadsheet modeling tasks, average scores rise from 59.1 percent with GPT-5.1 to 68.4 percent with GPT-5.2 Thinking and 71.7 percent with GPT-5.2 Pro. These tasks include three statement models and leveraged buyout models with constraints on formatting and citations, which is representative of many structured enterprise workflows.

In software engineering, GPT-5.2 Thinking reaches 55.6 percent on SWE-Bench Pro and 80.0 percent on SWE-bench Verified. SWE-Bench Pro evaluates repository level patch generation over multiple languages, while SWE-bench Verified focuses on Python.

Long context and agentic workflows

Long context is a core design target. GPT-5.2 Thinking sets a new state of the art on OpenAI MRCRv2, a benchmark that inserts multiple identical ‘needle’ queries into long dialogue “haystacks” and measures whether the model can reproduce the correct answer. It is the first model reported to reach near 100 percent accuracy on the 4 needle MRCR variant out to 256k tokens.

For workloads that exceed even that context, GPT-5.2 Thinking integrates with the Responses /compact endpoint, which performs context compaction to extend the effective window for tool heavy, long running jobs. This is relevant if you are building agents that iteratively call tools over many steps and need to maintain state beyond the raw token limit.

On tool usage, GPT-5.2 Thinking reaches 98.7 percent on Tau2-bench Telecom, a multi turn customer support benchmark where the model must orchestrate tool calls across a realistic workflow. The official examples from the OpenAI release post show scenarios like a traveler with a delayed flight, missed connection, lost bag and medical seating requirement, where GPT-5.2 manages rebooking, special assistance seating and compensation in a consistent sequence while GPT-5.1 leaves steps unfinished.

Vision, science and math

Vision quality also moves up. GPT-5.2 Thinking roughly halves error rates on chart reasoning and user interface understanding benchmarks like CharXiv Reasoning and ScreenSpot Pro when a Python tool is enabled. The model shows improved spatial understanding of images, for example when labeling motherboard components with approximate bounding boxes, GPT-5.2 identifies more regions with tighter placement than GPT-5.1.

For scientific workloads, GPT-5.2 Pro scores 93.2 percent and GPT-5.2 Thinking 92.4 percent on GPQA Diamond, and GPT-5.2 Thinking solves 40.3 percent of FrontierMath Tier 1 to Tier 3 problems with Python tools enabled. These benchmarks cover graduate level physics, chemistry, biology and expert mathematics, and OpenAI highlights early use where GPT-5.2 Pro contributed to a proof in statistical learning theory under human verification.

Comparison Table

Model | Primary positioning | Context window / max output | Knowledge cutoff | Notable benchmarks (Thinking / Pro vs GPT-5.1 Thinking)
GPT-5.1 | Flagship model for coding and agentic tasks with configurable reasoning effort | 400,000 tokens context, 128,000 max output | 2024-09-30 | SWE-Bench Pro 50.8 percent, SWE-bench Verified 76.3 percent, ARC-AGI-1 72.8 percent, ARC-AGI-2 17.6 percent
GPT-5.2 (Thinking) | New flagship model for coding and agentic tasks across industries and for long running agents | 400,000 tokens context, 128,000 max output | 2025-08-31 | GDPval wins or ties 70.9 percent vs industry professionals, SWE-Bench Pro 55.6 percent, SWE-bench Verified 80.0 percent, ARC-AGI-1 86.2 percent, ARC-AGI-2 52.9 percent
GPT-5.2 Pro | Higher compute version of GPT-5.2 for the hardest reasoning and scientific workloads, produces smarter and more precise responses | 400,000 tokens context, 128,000 max output | 2025-08-31 | GPQA Diamond 93.2 percent vs 92.4 percent for GPT-5.2 Thinking and 88.1 percent for GPT-5.1 Thinking, ARC-AGI-1 90.5 percent and ARC-AGI-2 54.2 percent

Key Takeaways

GPT-5.2 Thinking is the new default workhorse model: It replaces GPT-5.1 Thinking as the main model for coding, knowledge work and agents, while keeping the same 400k context and 128k max output, but with clearly higher benchmark performance across GDPval, SWE-Bench, ARC-AGI and scientific QA.

Substantial accuracy jump over GPT-5.1 at similar scale: On key benchmarks, GPT-5.2 Thinking moves from 50.8 percent to 55.6 percent on SWE-Bench Pro and from 76.3 percent to 80.0 percent on SWE-bench Verified, and from 72.8 percent to 86.2 percent on ARC-AGI-1 and from 17.6 percent to 52.9 percent on ARC-AGI-2, while keeping token limits comparable.

GPT-5.2 Pro is targeted at high end reasoning and science: GPT-5.2 Pro is a higher compute variant that mainly improves hard reasoning and scientific tasks, for example reaching 93.2 percent on GPQA Diamond versus 92.4 percent for GPT-5.2 Thinking and 88.1 percent for GPT-5.1 Thinking, and higher scores on ARC-AGI tiers.

The post OpenAI Introduces GPT 5.2: A Long Context Workhorse For Agents, Coding And Knowledge Work appeared first on MarkTechPost.

CopilotKit v1.50 Brings AG-UI Agents Directly Into Your App With the N …

Agent frameworks are now good at reasoning and tools, but most teams still write custom code to turn agent graphs into robust user interfaces with shared state, streaming output and interrupts. CopilotKit targets this last mile. It is an open source framework for building AI copilots and in-app agents directly in your app, with real time context and UI control. ( Check out the CopilotKit GitHub)

The release of CopilotKit v1.50 rebuilds the project natively on the Agent User Interaction Protocol (AG-UI). The key idea is simple: let AG-UI define all traffic between agents and UIs as a typed event stream, exposed to any app through a single hook, useAgent.

useAgent, one React hook per AG-UI agent

AG-UI defines how an agent backend and a frontend exchange a single ordered sequence of JSON encoded events. These events include messages, tool calls, state updates and lifecycle signals, and they can stream over any transport, such as HTTP, WebSockets, or even WebRTC.

CopilotKit v1.50 uses this protocol as the native transport layer. Instead of separate adapters for each framework, everything now communicates directly via AG-UI. This is made accessible by the new useAgent, a React hook that provides programmatic control of any AG-UI agent. It subscribes to the event stream, keeps a local model of messages and shared state, and exposes a small API for sending user input and UI intents.

At a high level, a React component does three things:

Call useAgent with connection details for the backend agent.

Read current state, such as message list, streaming deltas and agent status flags.

Call useAgent methods from the hook to send user messages, trigger tools or update shared state.

Because the hook only depends on AG-UI, the same UI code can work with different agent frameworks, as long as they expose an AG-UI endpoint.

Context messaging and shared state

AG-UI assumes that agentic apps are stateful. The protocol standardizes how context moves between UI and agent. 

On the frontend, CopilotKit already lets developers register app data as context, for example with hooks that make parts of React state readable to the agent. In the AG-UI model this becomes explicit. State snapshots and state patch events keep the backend and the UI in sync. The agent sees a consistent view of the application, and the UI can render the same state without custom synchronization logic.

For an early career engineer this removes a common chore: you no longer push props into prompts manually on every call. Instead, you update the state, the AG-UI client encodes those updates as events, and the backend agent consumes the same state through its AG-UI library.

AG-UI, protocol layer between agents and users

AG-UI is defined as an open, lightweight protocol that standardizes how agents connect to user facing applications. It focuses on event semantics rather than transport. Core SDKs provide strongly typed event models and clients in TypeScript, Python and other languages.

The JavaScript package @ag-ui/core implements the streaming event based architecture on the client side. It exposes message and state models, run input types and event utilities, and currently records about 178,751 weekly downloads on npm for version 0.0.41. On the Python side, the ag-ui-protocol package provides the canonical event models, with around 619,035 downloads in the last week and about 2,172,180 in the last month.

CopilotKit v1.50 builds directly on these components. Frontend code uses CopilotKit React primitives, but under the hood the connection to the backend is an AG-UI client that sends and receives standard events.

First party integrations across the 3 hyperscalers

The AG-UI overview lists Microsoft Agent Framework, Google Agent Development Kit (ADK), and AWS Strands Agents as supported frameworks, each with dedicated documentation and demos. These are first party integrations maintained by the protocol and framework owners.

Microsoft published a tutorial that shows how to build both server and client applications using AG-UI with Agent Framework in .NET or Python. Google documents AG-UI under the Agentic UI section of the ADK docs, and CopilotKit provides a full guide on building an ADK, AG-UI, and CopilotKit stack. AWS Strands exposes AG-UI integration through official tutorials and a CopilotKit quickstart, which wires a Strands agent backend to a React client in one scaffolded project.

For a React team this means that useAgent can attach to agents defined in any of these frameworks, as long as the backend exposes an AG-UI endpoint. The frontend code stays the same, while the agent logic and hosting environment can change.

Ecosystem growth around CopilotKit and AG-UI

CopilotKit presents itself as the agentic framework for in-app copilots, with more than 20,000 GitHub stars, and it is trusted by over 100,000 developers.

AG-UI itself has moved from a protocol proposal to a shared layer across multiple frameworks. Its partnerships and integrations include LangGraph, CrewAI, Mastra, Pydantic AI, Agno, LlamaIndex and others, plus SDKs in Kotlin, Go, Java, Rust and more. This cross-framework adoption is what makes a generic hook like useAgent viable, because it can rely on a consistent event model.

Key Takeaways

CopilotKit v1.50 standardizes its frontend layer on AG-UI, so all agent to UI communication is a single event stream instead of custom links per backend.

The new useAgent React hook lets a component connect to any AG-UI compatible agent, and exposes messages, streaming tokens, tools and shared state through a typed interface.

AG-UI formalizes context messaging and shared state as replicated stores with event sourced deltas, so both agent and UI share a consistent application view without manual prompt wiring.

AG-UI has first party integrations with Microsoft Agent Framework, Google Agent Development Kit and AWS Strands Agents, which means the same CopilotKit UI code can target agents across all 3 major clouds.

CopilotKit and AG-UI show strong ecosystem traction, with high GitHub adoption and significant weekly downloads for @ag-ui/core on npm and ag-ui-protocol on PyPI, which signals that the protocol is becoming a common layer for agentic applications.

If you’re interested in using CopilotKit in a production product or business, you can schedule time with the team here: Scheduling link
The post CopilotKit v1.50 Brings AG-UI Agents Directly Into Your App With the New useAgent Hook appeared first on MarkTechPost.

The Machine Learning Divide: Marktechpost’s Latest ML Global Impact …

Los Angeles, December 11, 2025 — Marktechpost has released the ML Global Impact Report 2025 (AIResearchTrends.com). This educational report's analysis includes over 5,000 articles from more than 125 countries, all published within the Nature family of journals between January 1 and September 30, 2025. The scope of the report is strictly confined to this specific body of work and is not a comprehensive assessment of global research.

The ML Global Impact Report 2025 focuses on three core questions:

In which disciplines has ML become part of the standard methodological toolkit, and where is adoption still sparse?

Which kinds of problems are most likely to rely on ML, such as high-dimensional imaging, sequence data, or complex physical simulations?

How do ML usage patterns differ by geography and research ecosystem, based on the global footprint of these selected 5,000 papers?

ML has most frequently become part of the standard methodological toolkit within the disciplines of applied sciences and health research, where it is often employed as a critical step within a larger experimental workflow rather than being the main subject of research itself. The analysis of the papers indicates that ML’s adoption is concentrated in these domains, with the tools serving to augment existing research pipelines. The report aims to distinguish these areas of common use from other fields where the integration of machine learning remains less frequent.

The kinds of problems most likely to rely on machine learning are those involving complex data analysis tasks, such as high-dimensional imaging, sequence data analysis, and intricate physical simulations. The report tracks the specific task types, including prediction, classification, segmentation, sequence modeling, feature extraction, and simulation, to understand where ML is being applied. This categorization highlights the utility of machine learning across different stages of the research process, from initial data processing to final output generation.

ML usage patterns show a distinct geographical separation between the origins of the tools and the heavy users of the technology. The majority of machine learning tools cited in the corpus originate from organizations based in the United States, which maintains many widely used frameworks and libraries. In contrast, China is identified as the largest contributor to the research papers, accounting for about 40% of all ML-tagged papers, significantly more than the United States' contribution of around 18%. The report also highlights the global ecosystem by citing frequently used non-US tools, such as Scikit-learn (France), U-Net (Germany), and CatBoost (Russia), along with tools originating from Canada, including the GAN and RNN families.

Overall, the ML Global Impact Report 2025 provides deep insights into the global research ecosystem, highlighting that machine learning has become a standard methodological tool primarily within applied sciences and health research. The analysis reveals a concentration of ML usage on complex data challenges, such as high-dimensional imaging and physical simulations. A core finding is the clear geographical split between the origin of ML tools, many maintained by US organizations, and the heaviest users of the technology, with China accounting for a significantly higher number of ML-tagged research papers in the analyzed corpus. These patterns are specific to the 5,000+ Nature family articles analyzed, underscoring the report's focused view on current research workflows.
The post The Machine Learning Divide: Marktechpost’s Latest ML Global Impact Report Reveals Geographic Asymmetry Between ML Tool Origins and Research Adoption appeared first on MarkTechPost.

How Harmonic Security improved their data-leakage detection system wit …

This post was written with Bryan Woolgar-O’Neil, Jamie Cockrill and Adrian Cunliffe from Harmonic Security
Organizations face increasing challenges protecting sensitive data while supporting third-party generative AI tools. Harmonic Security, a cybersecurity company, developed an AI governance and control layer that spots sensitive data in line as employees use AI, giving security teams the power to keep PII, source code, and payroll information safe while the business accelerates.
The following screenshot demonstrates Harmonic Security’s software tool, highlighting the different data leakage detection types, including Employee PII, Employee Financial Information, and Source Code.

Harmonic Security’s solution is also now available on AWS Marketplace, enabling organizations to deploy enterprise-grade data leakage protection with seamless AWS integration. The platform provides prompt-level visibility into GenAI usage, real-time coaching at the point of risk, and detection of high-risk AI applications—all powered by the optimized models described in this post.
The initial version of their system was effective, but with a detection latency of 1–2 seconds, there was an opportunity to further enhance its capabilities and improve the overall user experience. To achieve this, Harmonic Security partnered with the AWS Generative AI Innovation Center to optimize their system with four key objectives:

Reduce detection latency to under 500 milliseconds at the 95th percentile
Maintain detection accuracy across monitored data types
Continue to support EU data residency compliance
Enable scalable architecture for production loads

This post walks through how Harmonic Security used Amazon SageMaker AI, Amazon Bedrock, and Amazon Nova Pro to fine-tune a ModernBERT model, achieving low-latency, accurate, and scalable data leakage detection.
Solution overview
Harmonic Security's initial data leakage detection system relied on an 8 billion (8B) parameter model that effectively identified sensitive data but incurred 1–2 seconds of latency, close to the threshold of impacting user experience. To achieve sub-500 millisecond latency while maintaining accuracy, we developed two classification approaches using a fine-tuned ModernBERT model.
First, a binary classification model was prioritized to detect Mergers & Acquisitions (M&A) content, a critical category for helping prevent sensitive data leaks. We initially focused on binary classification because it was the simplest approach that would seamlessly integrate with their current system, which invokes multiple binary classification models in parallel. Second, as an extension, we explored a multi-label classification model to detect multiple sensitive data types (such as billing information, financial projections, and employment records) in a single pass, aiming to reduce the computational overhead of running multiple parallel binary classifiers. Although the multi-label approach showed promise for future scalability, Harmonic Security decided to stick with the binary classification model for the initial version.

The solution uses the following key services:

Amazon SageMaker AI – For fine-tuning and deploying the model
Amazon Bedrock – For accessing industry-leading large language models (LLMs)
Amazon Nova Pro – A highly capable multimodal model that balances accuracy, speed, and cost

The following diagram illustrates the solution architecture for low-latency inference and scalability.

The architecture consists of the following components:

Model artifacts are stored in Amazon Simple Storage Service (Amazon S3)
A custom container with inference code is hosted in Amazon Elastic Container Registry (Amazon ECR)
A SageMaker endpoint uses ml.g5.4xlarge instances for GPU-accelerated inference
Amazon CloudWatch monitors invocations, triggering auto scaling to adjust instances (1–5) based on an 830 requests per minute (RPM) threshold.

The solution supports the following features:

Sub-500 milliseconds inference latency
EU AWS Region deployment support
Automatic scaling between 1–5 instances based on demand
Cost optimization during low-usage periods

Synthetic data generation
High-quality training data for sensitive information (such as M&A documents and financial data) is scarce. We used Meta Llama 3.3 70B Instruct and Amazon Nova Pro to generate synthetic data, expanding upon Harmonic’s existing dataset that included examples of data in the following categories: M&A, billing information, financial projection, employment records, sales pipeline, and investment portfolio. The following diagram provides a high-level overview of the synthetic data generation process.

Data generation framework
The synthetic data generation framework consists of the following steps (a sketch of the example-selection step follows the list):

Smart example selection – K-means clustering on sentence embeddings supports diverse example selection
Adaptive prompts – Prompts incorporate domain knowledge, with temperature (0.7–0.85) and top-p sampling adjusted per category
Near-miss augmentation – Negative examples resembling positive cases to improve precision
Validation – An LLM-as-a-judge approach using Amazon Nova Pro and Meta Llama 3 validates examples for relevance and quality
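As a rough illustration of the example-selection step referenced above, the following sketch clusters sentence embeddings and picks one representative per cluster. The embedding model name, the cluster count, and the use of sentence-transformers are assumptions for illustration, not Harmonic Security's production settings.

import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer  # assumed embedding backend

def select_diverse_examples(texts, k=8, model_name="all-MiniLM-L6-v2"):
    """Pick k examples that cover distinct regions of the embedding space."""
    embeddings = SentenceTransformer(model_name).encode(texts)
    kmeans = KMeans(n_clusters=k, n_init="auto", random_state=0).fit(embeddings)
    selected = []
    for center in kmeans.cluster_centers_:
        # Take the example closest to each cluster center as that cluster's representative
        idx = int(np.argmin(np.linalg.norm(embeddings - center, axis=1)))
        selected.append(texts[idx])
    return selected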

Binary classification
For the binary M&A classification task, we generated three distinct types of examples:

Positive examples – These contained explicit M&A information while maintaining realistic document structures and finance-specific language patterns. They included key indicators like “merger,” “acquisition,” “deal terms,” and “synergy estimates.”
Negative examples – We created domain-relevant content that deliberately avoided M&A characteristics while remaining contextually appropriate for business communications.
Near-miss examples – These resembled positive examples but fell just outside the classification boundary. For instance, documents discussing strategic partnerships or joint ventures that didn’t constitute actual M&A activity.

The generation process maintained careful proportions between these example types, with particular emphasis on near-miss examples to address precision requirements.
Multi-label classification
For the more complex multi-label classification task across four sensitive information categories, we developed a sophisticated generation strategy:

Single-label examples – We generated examples containing information relevant to exactly one category to establish clear category-specific features
Multi-label examples – We created examples spanning multiple categories with controlled distributions, covering various combinations (2–4 labels)
Category-specific requirements – For each category, we defined mandatory elements to maintain explicit rather than implied associations:

Financial projections – Forward-looking revenue and growth data
Investment portfolio – Details about holdings and performance metrics
Billing and payment information – Invoices and supplier accounts
Sales pipeline – Opportunities and projected revenue

Our multi-label generation prioritized realistic co-occurrence patterns between categories while maintaining sufficient representation of individual categories and their combinations. As a result, synthetic data increased training examples by 10 times (binary) and 15 times (multi-label). It also improved class balance because we generated the data with a more balanced label distribution.
Model fine-tuning
We fine-tuned ModernBERT models on SageMaker to achieve low latency and high accuracy. Compared with decoder-only models such as Meta Llama 3.2 3B and Google Gemma 2 2B, ModernBERT's compact size (149M and 395M parameters) translated into lower latency while still delivering higher accuracy. We therefore selected ModernBERT over fine-tuning those alternatives. In addition, ModernBERT is one of the few BERT-based models that supports context lengths of up to 8,192 tokens, which was a key requirement for our project.
Binary classification model
Our first fine-tuned model used ModernBERT-base, and we focused on binary classification of M&A content. We approached this task methodically (a fine-tuning sketch follows the list):

Data preparation – We enriched our M&A dataset with the synthetically generated data
Framework selection – We used the Hugging Face transformers library with the Trainer API in a PyTorch environment, running on SageMaker
Training process – Our process included:

Stratified sampling to maintain label distribution across training and evaluation sets
Specialized tokenization with sequence lengths up to 3,000 tokens to match what the client had in production
Binary cross-entropy loss optimization
Early stopping based on F1 score to prevent overfitting.
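The sketch below shows the general shape of such a run with the Hugging Face Trainer, as referenced above. The toy dataset, hyperparameter values, and epoch count are placeholders rather than Harmonic Security's production configuration, and the real pipeline mixes curated and synthetic M&A examples.

import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy stand-in data for illustration only
train_ds = Dataset.from_dict({"text": ["Acquisition term sheet and deal synergies", "Team offsite agenda"],
                              "label": [1, 0]})
eval_ds = train_ds

def tokenize(batch):
    # Long sequence length to match production inputs (up to ~3,000 tokens)
    return tokenizer(batch["text"], truncation=True, max_length=3000)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds)}

args = TrainingArguments(
    output_dir="modernbert-ma-binary",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",          # early stopping tracks F1, as described above
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds,
                  tokenizer=tokenizer, compute_metrics=compute_metrics,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
trainer.train()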

The result was a fine-tuned model that could distinguish M&A content from non-sensitive information with a higher F1 score than the 8B parameter model.
Multi-label classification model
For our second model, we tackled the more complex challenge of multi-label classification (detecting multiple sensitive data types simultaneously within single text passages). We fine-tuned a ModernBERT-large model to identify various sensitive data types like billing information, employment records, and financial projections in a single pass. This required the following (a focal-loss trainer sketch follows below):

Multi-hot label encoding – We converted our categories into vector format for simultaneous prediction.
Focal loss implementation – Instead of standard cross-entropy loss, we implemented a custom FocalLossTrainer class. Unlike static weighted loss functions, Focal Loss adaptively down-weights straightforward examples during training. This helps the model concentrate on challenging cases, significantly improving performance for less frequent or harder-to-detect classes.
Specialized configuration – We added configurable class thresholds (for example, 0.1 to 0.8) for each class probability to determine label assignment, because we observed varying performance at different decision boundaries.

This approach enabled our system to identify multiple sensitive data types in a single inference pass.
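A simplified version of such a focal-loss trainer might look like the following. The gamma default and the loss formulation shown here are a common focal-loss variant and stand in for, rather than reproduce, Harmonic Security's FocalLossTrainer.

import torch
from transformers import Trainer

class FocalLossTrainer(Trainer):
    """Simplified multi-label focal loss (illustrative stand-in)."""
    def __init__(self, *args, gamma: float = 2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.gamma = gamma

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels").float()             # multi-hot vector, one slot per category
        outputs = model(**inputs)
        bce = torch.nn.functional.binary_cross_entropy_with_logits(
            outputs.logits, labels, reduction="none")
        p_t = torch.exp(-bce)                             # probability assigned to the true label
        loss = ((1.0 - p_t) ** self.gamma * bce).mean()   # down-weight easy, well-classified cases
        return (loss, outputs) if return_outputs else loss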
Hyperparameter optimization
To find the optimal configuration for our models, we used Optuna to optimize key parameters. Optuna is an open-source hyperparameter optimization (HPO) framework that helps find the best hyperparameters for a given machine learning (ML) model by running many experiments (called trials). It uses a Bayesian algorithm called Tree-structured Parzen Estimator (TPE) to choose promising hyperparameter combinations based on past results.
The search space explored numerous combinations of key hyperparameters, as listed in the following table.

Hyperparameter | Range
Learning rate | 5e-6–5e-5
Weight decay | 0.01–0.5
Warmup ratio | 0.0–0.2
Dropout rates | 0.1–0.5
Batch size | 16, 24, 32
Gradient accumulation steps | 1, 4
Focal loss gamma (multi-label only) | 1.0–3.0
Class threshold (multi-label only) | 0.1–0.8

To optimize computational resources, we implemented pruning logic to stop under-performing trials early and discard less promising configurations. As seen in the following Optuna HPO history plot, trial 42 produced the best parameters (highest F1 score) for binary classification, whereas trial 32 was the best for multi-label.
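A condensed sketch of what the Optuna study could look like, mirroring the search space in the table above. The trial count is an assumption, and train_and_evaluate is a placeholder to be replaced with the real training and evaluation loop.

import optuna

def train_and_evaluate(params: dict, trial: optuna.Trial) -> float:
    # Placeholder: fine-tune ModernBERT with `params`, report intermediate F1 scores via
    # trial.report(...) so the pruner can stop weak trials early, and return the eval F1.
    return 0.0  # replace with the real training and evaluation loop

def objective(trial: optuna.Trial) -> float:
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 5e-6, 5e-5, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 0.01, 0.5),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.0, 0.2),
        "dropout": trial.suggest_float("dropout", 0.1, 0.5),
        "batch_size": trial.suggest_categorical("batch_size", [16, 24, 32]),
        "grad_accum_steps": trial.suggest_categorical("grad_accum_steps", [1, 4]),
    }
    return train_and_evaluate(params, trial)

# TPE picks promising combinations from past trials; the median pruner stops weak ones early
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(),
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=50)
print(study.best_trial.params)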

Moreover, our analysis showed that dropout and learning rate were the most important hyperparameters, accounting for 48% and 21%, respectively, of the F1 score variance for the binary classification model. This explained why the model overfit quickly during earlier runs and underscores the importance of regularization.

After the optimization experiments, we discovered the following:

We were able to identify the optimal hyperparameters for each task
The models converged faster during training
The final performance metrics showed measurable improvements over configurations we tested manually

This allowed our models to achieve a high F1 score efficiently by running hyperparameter tuning in an automated fashion, which is crucial for production deployment.
Load testing and autoscaling policy
After fine-tuning and deploying the optimized model to a SageMaker real-time endpoint, we performed load testing to validate the performance and autoscaling under pressure to meet Harmonic Security’s latency, throughput, and elasticity needs. The objectives of the load testing were:

Validate the latency SLA of an average under 500 milliseconds and a P95 of approximately 1 second under varying loads
Determine throughput capacity (maximum RPM) on ml.g5.4xlarge instances within the latency SLA
Inform the auto scaling policy design

The methodology involved the following:

Traffic simulation – Locust simulated concurrent user traffic with varying text lengths (50–9,999 characters); a minimal load script sketch follows below
Load pattern – Stepped ramp-up tests (60–2,000 RPM, 60 seconds per step) to identify bottlenecks and stress-test limits

As shown in the following graph, we found that the maximum throughput under a latency of 1 second was 1,185 RPM, so we decided to set the auto scaling threshold to 70% of that at 830 RPM.
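The load script referenced above might look roughly like this Locust sketch; the endpoint name, AWS Region, payload, and wait times are assumptions for illustration.

import json
import time

import boto3
from locust import User, between, events, task

class SageMakerClassifierUser(User):
    wait_time = between(0.05, 0.5)

    def on_start(self):
        self.client = boto3.client("sagemaker-runtime", region_name="eu-west-1")  # assumed Region

    @task
    def classify(self):
        payload = json.dumps({"text": "Quarterly revenue projection discussion " * 40})
        start = time.perf_counter()
        exc = None
        try:
            self.client.invoke_endpoint(EndpointName="modernbert-classifier",  # assumed name
                                        ContentType="application/json", Body=payload)
        except Exception as e:
            exc = e
        # Record the call in Locust's statistics so throughput and latency charts include it
        events.request.fire(request_type="sagemaker", name="invoke_endpoint",
                            response_time=(time.perf_counter() - start) * 1000,
                            response_length=len(payload), exception=exc, context=None)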

Based on the performance observed during load testing, we configured a target-tracking auto scaling policy for the SageMaker endpoint using Application Auto Scaling. The following figure illustrates this policy workflow.

The key parameters defined were as follows (a configuration sketch follows this list):

Metric – SageMakerVariantInvocationsPerInstance (830 invocations/instance/minute)
Min/Max Instances – 1–5
Cooldown – Scale-out 300 seconds, scale-in 600 seconds
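The configuration sketch referenced above uses boto3 and Application Auto Scaling. The endpoint and variant names are assumptions, while the metric, target value, capacity bounds, and cooldowns follow the parameters listed.

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/modernbert-classifier/variant/AllTraffic"  # assumed endpoint/variant names

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=5,
)

autoscaling.put_scaling_policy(
    PolicyName="modernbert-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 830.0,  # invocations per instance per minute, from load testing
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 300,
        "ScaleInCooldown": 600,
    },
)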

This target-tracking policy adjusts instances based on traffic, maintaining performance and cost-efficiency. The following table summarizes our findings.

Model | Requests per Minute
8B model | 800
ModernBERT with auto scaling (5 instances) | 1,185–5,925
Additional capacity (ModernBERT vs. 8B model) | 48%–640%

Results
This section showcases the significant impact of the fine-tuning and optimization efforts on Harmonic Security’s data leakage detection system, with a primary focus on achieving substantial latency reductions. Absolute latency improvements are detailed first, underscoring the success in meeting the sub-500 millisecond target, followed by an overview of performance enhancements. The following subsections provide detailed results for binary M&A classification and multi-label classification across multiple sensitive data types.
Binary classification
We evaluated the fine-tuned ModernBERT-base model for binary M&A classification against the baseline 8B model introduced in the solution overview. The most significant achievement was the reduction in latency, addressing the initial 1–2 second delay that risked disrupting user experience. The resulting sub-500 millisecond latency is detailed in the following table.

Model | median_ms | p95_ms | p99_ms | p100_ms
ModernBERT-base-v2 | 46.03 | 81.19 | 102.37 | 183.11
8B model | 189.15 | 259.99 | 286.63 | 346.36
Difference | -75.66% | -68.77% | -64.28% | -47.13%

Building on this latency breakthrough, the following performance metrics reflect percentage improvements in accuracy and F1 score.

Model | Accuracy Improvement | F1 Improvement
ModernBERT-base-v2 | +1.56% | +2.26%
8B model | (baseline) | (baseline)

These results highlight that ModernBERT-base-v2 delivers a substantial latency reduction, complemented by modest accuracy and F1 improvements of 1.56% and 2.26%, respectively, aligning with Harmonic Security's objectives to enhance data leakage detection without impacting user experience.
Multi-label classification
We evaluated the fine-tuned ModernBERT-large model for multi-label classification against the baseline 8B model, with latency reduction as the cornerstone of this approach. The most significant advancement was a substantial decrease in latency across all evaluated categories, achieving sub-500 millisecond responsiveness and addressing the previous 1–2 second bottleneck. The latency results shown in the following table underscore this critical improvement.

Dataset | Model | median_ms | p95_ms | p99_ms
Billing and payment | 8B model | 198 | 238 | 321
Billing and payment | ModernBERT-large | 158 | 199 | 246
Billing and payment | Difference | -20.13% | -16.62% | -23.60%
Sales pipeline | 8B model | 194 | 265 | 341
Sales pipeline | ModernBERT-large | 162 | 243 | 293
Sales pipeline | Difference | -16.63% | -8.31% | -13.97%
Financial projections | 8B model | 384 | 510 | 556
Financial projections | ModernBERT-large | 160 | 275 | 310
Financial projections | Difference | -58.24% | -46.04% | -44.19%
Investment portfolio | 8B model | 397 | 498 | 703
Investment portfolio | ModernBERT-large | 160 | 259 | 292
Investment portfolio | Difference | -59.69% | -47.86% | -58.46%

This approach also delivered a second key benefit: a reduction in computational parallelism by consolidating multiple classifications into a single pass. However, the multi-label model encountered challenges in maintaining consistent accuracy across all classes. Although categories like Financial Projections and Investment Portfolio showed promising accuracy gains, others such as Billing and Payment and Sales Pipeline experienced significant accuracy declines. This indicates that, despite its latency and parallelism advantages, the approach requires further development to maintain reliable accuracy across data types.
Conclusion
In this post, we explored how Harmonic Security collaborated with the AWS Generative AI Innovation Center to optimize their data leakage detection system, achieving the following results:
Key performance improvements:

Latency reduction: From 1–2 seconds to under 500 milliseconds (76% reduction at median)
Throughput increase: 48%–640% additional capacity with auto scaling
Accuracy gains: +1.56% for binary classification, with maintained precision across categories

By using SageMaker, Amazon Bedrock, and Amazon Nova Pro, Harmonic Security fine-tuned ModernBERT models that deliver sub-500 millisecond inference in production, meeting stringent performance goals while supporting EU compliance and establishing a scalable architecture.
This partnership showcases how tailored AI solutions can tackle critical cybersecurity challenges without hindering productivity. Harmonic Security’s solution is now available on AWS Marketplace, enabling organizations to adopt AI tools safely while protecting sensitive data in real time. Looking ahead, these high-speed models have the potential to add further controls for additional AI workflows.
To learn more, consider the following next steps:

Try Harmonic Security – Deploy the solution directly from AWS Marketplace to protect your organization’s GenAI usage
Explore AWS services – Dive into SageMaker, Amazon Bedrock, and Amazon Nova Pro to build advanced AI-driven security solutions. Visit the AWS Generative AI page for resources and tutorials.
Deep dive into fine-tuning – Explore the AWS Machine Learning Blog for in-depth guides on fine-tuning LLMs for specialized use cases.
Stay updated – Subscribe to the AWS Podcast for weekly insights on AI innovations and practical applications.
Connect with experts – Join the AWS Partner Network to collaborate with experts and scale your AI initiatives.
Attend AWS events – Register for AWS re:Invent to explore cutting-edge AI advancements and network with industry leaders.

By adopting these steps, organizations can harness AI-driven cybersecurity to maintain robust data protection and seamless user experiences across diverse workflows.

About the authors
Babs Khalidson is a Deep Learning Architect at the AWS Generative AI Innovation Centre in London, where he specializes in fine-tuning large language models, building AI agents, and model deployment solutions. He has over 6 years of experience in artificial intelligence and machine learning across finance and cloud computing, with expertise spanning from research to production deployment.
Vushesh Babu Adhikari is a Data Scientist at the AWS Generative AI Innovation Center in London with extensive expertise in developing generative AI solutions across diverse industries. He has over 7 years of experience spanning finance, telecom, and information technology, with specialized expertise in machine learning and artificial intelligence.
Zainab Afolabi is a Senior Data Scientist at the AWS Generative AI Innovation Centre in London, where she leverages her extensive expertise to develop transformative AI solutions across diverse industries. She has over nine years of specialized experience in artificial intelligence and machine learning, as well as a passion for translating complex technical concepts into practical business applications.
Nuno Castro is a Sr. Applied Science Manager at the AWS Generative AI Innovation Center. He leads generative AI customer engagements, helping AWS customers find the most impactful use case from ideation and prototype through to production. He has 19 years of experience in the field in industries such as finance, manufacturing, and travel, and has led ML teams for 11 years.
Christelle Xu is a Senior Generative AI Strategist who leads model customization and optimization strategy across EMEA within the AWS Generative AI Innovation Center, working with customers to deliver scalable Generative AI solutions, focusing on continued pre-training, fine-tuning, reinforcement learning, and training and inference optimization. She holds a Master’s degree in Statistics from the University of Geneva and a Bachelor’s degree from Brigham Young University.
Manuel Gomez is a Solutions Architect at AWS supporting generative AI startups across the UK and Ireland. He works with model producers, fine-tuning platforms, and agentic AI applications to design secure and scalable architectures. Before AWS, he worked in startups and consulting, and he has a background in industrial technologies and IoT. He is particularly interested in how multi-modal AI can be applied to real industry problems.
Bryan Woolgar-O’Neil is the co-founder & CTO at Harmonic Security. With over 20 years of software development experience, the last 10 were dedicated to building the Threat Intelligence company Digital Shadows, which was acquired by Reliaquest in 2022. His expertise lies in developing products based on cutting-edge software, focusing on making sense of large volumes of data.
Jamie Cockrill is the Director of Machine Learning at Harmonic Security, where he leads a team focused on building, training, and refining Harmonic’s Small Language Models.
Adrian Cunliffe is a Senior Machine Learning Engineer at Harmonic Security, where he focuses on scaling Harmonic’s Machine Learning engine that powers Harmonic’s proprietary models.

How Swisscom builds enterprise agentic AI for customer support and sal …

This post was written with Arun Sittampalam and Maxime Darcot from Swisscom.
As we navigate the constantly shifting AI ecosystem, enterprises face challenges in translating AI's potential into scalable, production-ready solutions. Swisscom, Switzerland's leading telecommunications provider with an estimated $19B in revenue (2025) and over $37B in market capitalization as of June 2025, exemplifies how organizations can successfully navigate this complexity while maintaining their commitment to sustainability and excellence.
Recognized as the Most Sustainable Company in the Telecom industry for 3 consecutive years by World Finance magazine, Swisscom has established itself as an innovation leader committed to achieving net-zero greenhouse gas emissions by 2035 in alignment with the Paris Climate Agreement. This sustainability-first approach extends to their AI strategy, where they're breaking through what they call the "automation ceiling," the point at which traditional automation approaches fail to meet modern business demands.
In this post, we'll show how Swisscom implemented Amazon Bedrock AgentCore to build and scale their enterprise AI agents for customer support and sales operations. As an early adopter of Amazon Bedrock in the AWS Europe (Zurich) Region, Swisscom leads in enterprise AI implementation with their Chatbot Builder system and various AI initiatives. Their successful deployments include conversational AI powered by Rasa and fine-tuned LLMs on Amazon SageMaker, and the Swisscom myAI assistant, built to meet Swiss data protection standards.
Solution overview: Swisscom’s agentic AI enabler framework
The challenge of enterprise-wide scaling of AI agents lies in managing siloed agentic solutions while facilitating cross-departmental coordination. Swisscom addresses this through Model Context Protocol (MCP) servers and the Agent2Agent protocol (A2A), for seamless agent communication across domains. Operating under Switzerland’s strict data protection laws, they’ve developed a framework that balances compliance requirements with efficient scaling capabilities, helping prevent redundant efforts while maintaining high security standards.
Swisscom’s multi-agent architecture: System design and implementation challenges
Swisscom's vision for enterprise-level agentic AI focuses on addressing fundamental challenges that organizations face when scaling AI solutions. They recognize that successful implementation requires more than just innovative technology; it demands a comprehensive approach to infrastructure and operations. One of the key challenges lies in orchestrating AI agents across different departments and systems while maintaining security and efficiency.
To illustrate these challenges in practice, let’s examine a common customer service scenario where an agent is tasked with helping a customer restore their Internet router connectivity. There are three potential causes for the connectivity loss: 1) a billing issue, 2) a network outage, or 3) a configuration mismatch known as a pairing issue. These issues typically reside in departments different from where the assigned agent operates, highlighting the need for seamless cross-departmental coordination.
The architecture diagram below illustrates the vision and associated challenges for a generic customer agent without Amazon Bedrock AgentCore. The shared VPC setup of Swisscom is explained in more detail in the blog post, Automated networking with shared VPCs at Swisscom.

This architecture includes the following components:

A customer-facing generic agent deployed as a containerized runtime within a shared VPC, requiring both foundation model invocation capabilities and robust session management.
For task completion, the agent requires access to other agents and MCP servers. These resources are typically distributed across multiple AWS accounts and are deployed as containerized runtimes within the shared VPC.
Internal application access primarily occurs through SAIL (Service and Interface Library), Swisscom’s central system for API hosting and service integration. Corporate network resources are accessible via AWS Direct Connect, with a VPC Transit Gateway facilitating secure cross-network communication.
Security compliance is paramount: each interaction requires temporary access tokens that authenticate both the agent and the customer context. This bidirectional validation is essential to the system components – agents, MCP servers, and tools must verify incoming tokens for service requests.
Gaining long-term insights from the stored sessions, such as customer preferences, demands a sophisticated analysis.

To build the solution mentioned above at scale, Swisscom identified several critical challenges that needed to be addressed:

Security and Authentication:

How to implement secure, transitive authentication and authorization that enforces least-privilege access based on intersecting permissions (customer, agent, department)?
How to enable controlled resource sharing across departments, cloud systems, and on-premises networks?

Integration and Interoperability:

How to make MCP servers and other agents centrally available to other use cases?
How to integrate and maintain compatibility with existing agentic use cases across Swisscom’s infrastructure?

Customer Intelligence and Operations:

How to effectively capture and utilize customer insights across multiple agentic interactions?
How to implement standardized evaluation and observability practices across the agents?

How Amazon Bedrock AgentCore addresses the challenges
Amazon Bedrock AgentCore provides Swisscom with a comprehensive solution that addresses their enterprise-scale agentic AI challenges.

AgentCore Runtime: Enables Swisscom's developers to focus on building agents while the system handles secure, cost-efficient hosting and automatic scaling through Docker container deployment that maintains session-level isolation. Hosting the runtime in the shared VPC allows access to internal APIs.
AgentCore Identity: Seamlessly integrates with Swisscom’s existing identity provider, managing both inbound and outbound authentication, alleviating the need for custom token exchange servers and simplifying secure interactions between agents, tools, and data sources.
AgentCore Memory: Delivers a robust solution for managing both session-based and long-term memory storage with custom memory strategies. This is particularly valuable for B2C operations where understanding customer context across interactions is crucial. Keeping each user’s data separate also supports security and compliance efforts.
Strands Agents Framework: Demonstrates high adoption among Swisscom’s developers due to its simplified agent construction, faster development cycles, seamless integration with Bedrock AgentCore services, and built-in capabilities for tracing, evaluation, and OpenTelemetry logging.

This solution does the following:

The client sends a request to the Strands agent running on AgentCore Runtime, passing an authentication token from the Swisscom IdP.
The client’s token is validated and a new token for the agent’s downstream tool usage is generated and passed back to the agent.
The agent invokes the foundation model on Bedrock and stores the sessions in the AgentCore Memory. The traffic traverses the VPC endpoints for Bedrock and Bedrock AgentCore, keeping the traffic private.
The agent accesses internal APIs, MCP & A2A servers inside the shared VPC, authenticating with the temporary token from AgentCore Identity.

With the flexibility to use a subset of Amazon Bedrock AgentCore features and their Amazon VPC integration, Swisscom could stay secure while adopting the Bedrock AgentCore services that fit their specific needs, for example integrating with existing agents on Amazon EKS. Amazon Bedrock AgentCore integrates with VPC to facilitate secure communication between agents and internal resources; a minimal agent sketch follows.
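The following is a minimal, hedged Strands Agents sketch of a customer-facing agent with one internal tool. The tool, URL, and token helper are purely illustrative, and in Swisscom's setup model access, session storage, and token exchange are handled by AgentCore Runtime, Memory, and Identity rather than by hand-rolled code.

import requests
from strands import Agent, tool

def fetch_temporary_token() -> str:
    # Placeholder: in production a short-lived token would come from AgentCore Identity
    return "example-temporary-token"

@tool
def check_network_outage(region: str) -> str:
    """Illustrative tool: query an internal API (behind SAIL) for outages in a region."""
    resp = requests.get(f"https://internal.example/outages/{region}",   # hypothetical URL
                        headers={"Authorization": f"Bearer {fetch_temporary_token()}"},
                        timeout=5)
    return resp.text

agent = Agent(
    system_prompt="You help Swisscom customers troubleshoot router connectivity issues.",
    tools=[check_network_outage],
)
print(agent("My internet is down in Zurich, can you check for outages?"))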
Results and benefits: Real-world implementation with self-service use case
Swisscom partnered with AWS to implement Amazon Bedrock AgentCore for two B2C use cases: 1) generating personalized sales pitches, and 2) providing automated customer support for technical issues like self-service troubleshooting. Both agents are being integrated into SAM, Swisscom's existing generative AI-powered customer chatbot system, necessitating high-performance agent-to-agent communication protocols due to the high volume of Swisscom customers and strict latency requirements. Throughout the development process, the team created an agent for each use case designed to be shared across the organization through MCP and A2A.
Amazon Bedrock AgentCore has proven instrumental in these implementations. By using Bedrock AgentCore Memory long-term insights, Swisscom can track and analyze customer interactions across different touchpoints, continuously improving the customer experience across domains. AgentCore Identity facilitates robust security, implementing precise access controls that limit agents to only those resources authorized for the specific customer interaction. The scalability of AgentCore Runtime allows these agents to efficiently handle thousands of requests per month each, maintaining low latency while optimizing costs.
The adoption of Strands Agents framework has been particularly valuable in this journey:

Development teams achieved their first business stakeholder demos within 3-4 weeks, despite having no prior experience with Strands Agents.
One project team migrated from their LangGraph implementation to Strands Agents, citing reduced complexity and faster development cycles.
The framework’s native OpenTelemetry integration supported seamless export of performance traces to Swisscom’s existing observability infrastructure, maintaining consistency with enterprise-wide monitoring standards.
The Strands evaluation test cases allowed teams to quickly put an evaluation pipeline together without additional tools, enabling fast validation of the PoC.

Conclusion: Enterprise AI at scale – Key insights and strategic implications
Swisscom’s implementation of Amazon Bedrock AgentCore demonstrates how enterprises can successfully navigate the complexities of production-ready Agentic AI while maintaining regulatory compliance and operational excellence. Swisscom’s journey offers 3 critical insights:

Architectural foundation matters: By addressing the fundamental challenges of secure cross-org authentication, standardized agent orchestration, and comprehensive observability, Swisscom established a scalable foundation that accelerates deployment rather than constraining it. The integration of AgentCore Runtime, Identity, and Memory services accelerated the infrastructure setup so teams could focus on business value.
Framework selection drives velocity: The adoption of the Strands Agents framework exemplifies how the right development tools can dramatically reduce time-to-value. The fact that teams reached stakeholder demos within 3-4 weeks, coupled with successful migrations from alternative frameworks, validates the importance of developer experience in enterprise AI adoption.
Compliance as an enabler: Swisscom proved that regulatory compliance need not impede innovation. The system’s ability to scale while maintaining data sovereignty and user privacy has been particularly valuable in the Swiss market, where regulatory compliance is paramount.

As enterprises increasingly recognize AI agents as fundamental to competitive advantage, Swisscom’s implementation provides a proven reference architecture. Their success with high-volume B2C applications—from personalized sales assistance to automated technical support—illustrates that agentic AI can deliver measurable business outcomes at scale when built on appropriate infrastructure. This implementation serves as a blueprint for organizations seeking to deploy enterprise-scale AI solutions, showing how careful architectural planning and the right technology choices can lead to successful outcomes in both customer service and sales operations.
Next steps and looking ahead
The future roadmap focuses on three key areas: agent sharing, cross-domain integration, and governance. A centralized agent registry will facilitate discovery and reuse across the organization, supported by standardized documentation and shared best practices. Cross-domain integration will enable seamless collaboration between different business units, with clear standards for agent communication and interoperability. The implementation of robust governance mechanisms, including version control, usage monitoring, and regular security audits, will facilitate sustainable growth of the system while maintaining compliance with enterprise standards. This comprehensive approach will help drive continuous improvement based on real-world usage patterns and feedback.
Check out these additional links for related agentic AI information:

Transforming network operations with AI: How Swisscom built a network assistant using Amazon Bedrock
Introducing Amazon Bedrock AgentCore: Securely deploy and operate AI agents at any scale
Amazon Bedrock AgentCore Runtime, Browser, and Code Interpreter add support for VPC, AWS PrivateLink, CloudFormation, and tagging
Secure ingress connectivity to Amazon Bedrock AgentCore Gateway using interface VPC endpoints

About the authors
Arun Sittampalam, Director of Product Management AI at Swisscom, leads the company’s transformation toward Agentic AI, designing frameworks that scale large language model (LLM)–driven agents across enterprise environments. His team is building Swisscom’s agentic platform, integrating Amazon Bedrock, AgentCore and internal orchestration frameworks to empower Swisscom’s AI product teams to build and scale intelligent agents faster. Arun focuses on operationalizing multi-agent architectures that deliver automation, reliability, and scalability.
Maxime is a System and Security Architect at Swisscom, responsible for the architecture of Conversational and Agentic AI enablement. Originally a data scientist, he has 10 years of experience developing, deploying, and maintaining NLP solutions that have helped millions of Swisscom customers.
Julian Grüber is a Data Science Consultant at Amazon Web Services. He partners with strategic customers to scale GenAI solutions that unlock business value, working at both the use case and enterprise architecture level. Drawing on his background in applied mathematics, machine learning, business, and cloud infrastructure, Julian bridges technical depth with business outcomes to address complex AI/ML challenges.
Marco Fischer is a Senior Solutions Architect at Amazon Web Services. He works with leading telecom operators to design and deploy scalable, production-ready solutions. With over two decades of experience spanning software engineering, architecture, and cloud infrastructure, Marco combines deep technical expertise with a passion for solving complex enterprise challenges.
Akarsha Sehwag is a Generative AI Data Scientist for Amazon Bedrock AgentCore GTM team. With over six years of expertise in AI/ML, she has built production-ready enterprise solutions across diverse customer segments in Generative AI, Deep Learning and Computer Vision domains. Outside of work, she likes to hike, bike or play Badminton.
Ruben Merz is a Principal Solutions Architect at AWS, specializing in digital sovereignty, AI, and networking solutions for enterprise customers. With deep expertise in distributed systems and networking, he architects secure, compliant cloud solutions that help organizations navigate complex regulatory requirements while accelerating their digital transformation journeys.

Scaling MLflow for enterprise AI: What’s New in SageMaker AI with ML …

Today we’re announcing Amazon SageMaker AI with MLflow, now including a serverless capability that dynamically manages infrastructure provisioning, scaling, and operations for artificial intelligence and machine learning (AI/ML) development tasks. It scales resources up during intensive experimentation and down to zero when not in use, reducing operational overhead. It introduces enterprise-scale features including seamless access management with cross-account sharing, automated version upgrades, and integration with SageMaker AI capabilities like model customization and pipelines. With no administrator configuration needed and at no additional cost, data scientists can immediately begin tracking experiments, implementing observability, and evaluating model performance without infrastructure delays. This makes it straightforward to scale MLflow workloads across your organization while maintaining security and governance.
In this post, we explore how these new capabilities help you run large MLflow workloads—from generative AI agents to large language model (LLM) experimentation—with improved performance, automation, and security using SageMaker AI with MLflow.
Enterprise scale features in SageMaker AI with MLflow
The new MLflow serverless capability in SageMaker AI delivers enterprise-grade management with automatic scaling, default provisioning, seamless version upgrades, simplified AWS Identity and Access Management (IAM) authorization, resource sharing through AWS Resource Access Manager (AWS RAM), and integration with both Amazon SageMaker Pipelines and model customization. The term MLflow Apps replaces the previous MLflow tracking servers terminology, reflecting the simplified, application-focused approach. You can access the new MLflow Apps page in Amazon SageMaker Studio, as shown in the following screenshot.

A default MLflow App is automatically provisioned when you create a SageMaker Studio domain, streamlining the setup process. It’s enterprise-ready out of the box, requiring no additional provisioning or configuration. The MLflow App scales elastically with your usage, alleviating the need for manual capacity planning. Your training, tracking, and experimentation workloads can get the resources they need automatically, simplifying operations while maintaining performance.
Administrators can define a maintenance window during the creation of the MLflow App, during which in-place version upgrades of the MLflow App take place. This helps keep the MLflow App standardized, secure, and continuously up to date, minimizing manual maintenance overhead. MLflow version 3.4 is supported with this launch and, as shown in the following screenshot, extends coverage to ML, generative AI application, and agent workloads.
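As a hedged illustration of that generative AI and agent support, the sketch below uses MLflow 3.x tracing against an MLflow App; the tracking URI format shown is an assumed placeholder, and the traced function is a stand-in for a real model or agent call.

```python
# Hedged sketch: trace a generative AI call with MLflow 3.x tracing.
# The tracking URI is an assumed placeholder for your MLflow App.
import mlflow

mlflow.set_tracking_uri("arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/my-app")  # placeholder
mlflow.set_experiment("agent-observability-demo")


@mlflow.trace  # records inputs, outputs, and latency as a trace
def answer(question: str) -> str:
    # Call your model or agent here; a canned reply keeps the sketch self-contained.
    return f"Echo: {question}"


answer("What changed in the latest release?")
```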

Simplified identity management with MLflow Apps
We’ve simplified access control and IAM permissions for ML teams with the new MLflow App. A streamlined set of permissions, such as sagemaker:CallMlflowAppApi, now covers common MLflow operations, from creating and searching experiments to updating trace information, making access control more straightforward to enforce.
By enabling simplified IAM permissions boundaries, users and platform administrators can standardize IAM roles across teams, personas, and projects, facilitating consistent and auditable access to MLflow experiments and metadata. For complete IAM permission and policy configurations, see Set up IAM permissions for MLflow Apps.
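As a hedged sketch of how an administrator might grant that permission with boto3, the snippet below attaches an inline policy to an existing role; the role name is a placeholder, and in practice the resource should be scoped to your MLflow App ARN (see the linked documentation for the authoritative policy shapes).

```python
# Hedged sketch: grant a data science role the simplified MLflow App permission.
# The role name is a placeholder; scope Resource to your MLflow App ARN in production.
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:CallMlflowAppApi",
            "Resource": "*",  # replace with your MLflow App ARN
        }
    ],
}

iam.put_role_policy(
    RoleName="DataScientistRole",      # placeholder role name
    PolicyName="MlflowAppAccess",
    PolicyDocument=json.dumps(policy),
)
```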
Cross-account sharing of MLflow Apps using AWS RAM
Administrators want to centrally manage their MLflow infrastructure while provisioning access across different AWS accounts. MLflow Apps support AWS cross-account sharing for collaborative enterprise AI development. Using AWS RAM, this feature helps AI platform administrators share an MLflow App seamlessly with data scientists in consumer AWS accounts, as illustrated in the following diagram.

Platform administrators can maintain a centralized, governed SageMaker domain that provisions and manages the MLflow App, and data scientists in separate consuming accounts can launch and interact with the MLflow App securely. Combined with the new simplified IAM permissions, this lets enterprises launch and manage an MLflow App from a centralized administrative AWS account. Using the shared MLflow App, a downstream data scientist consumer can log their MLflow experimentation and generative AI workloads while maintaining governance, auditability, and compliance from a single platform administrator control plane. To learn more about cross-account sharing, see Getting Started with AWS RAM.
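The administrator side of this sharing can be scripted with the AWS RAM API, as in the hedged sketch below; the MLflow App ARN and consumer account ID are placeholders, and the same workflow is available in the AWS RAM console.

```python
# Hedged sketch: share an MLflow App from the platform account with a consumer
# account using AWS RAM. The resource ARN and account ID are placeholders.
import boto3

ram = boto3.client("ram")

response = ram.create_resource_share(
    name="shared-mlflow-app",
    resourceArns=[
        "arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/central-mlflow"  # placeholder ARN
    ],
    principals=["444455556666"],    # consumer (data science) AWS account ID, placeholder
    allowExternalPrincipals=False,  # keep sharing within the AWS organization
)
print(response["resourceShare"]["resourceShareArn"])
```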
SageMaker Pipelines and MLflow integration
SageMaker Pipelines, a serverless workflow orchestration service purpose-built for MLOps and LLMOps automation, is now integrated with MLflow. You can seamlessly build, execute, and monitor repeatable end-to-end ML workflows with an intuitive drag-and-drop UI or the Python SDK. From a SageMaker pipeline, a default MLflow App will be created if one doesn’t already exist, an MLflow experiment name can be defined, and metrics, parameters, and artifacts are logged to the MLflow App as defined in your SageMaker pipeline code. The following screenshot shows an example ML pipeline using MLflow.
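To make the integration concrete, here is a hedged sketch of the kind of logging a pipeline step can perform; the tracking URI and experiment name are placeholders that your pipeline code would supply.

```python
# Hedged sketch: metrics, parameters, and an artifact logged from a pipeline step.
# The tracking URI and experiment name are placeholders.
import mlflow

mlflow.set_tracking_uri("arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/my-app")  # placeholder
mlflow.set_experiment("pipeline-training-experiment")

with mlflow.start_run(run_name="train-step"):
    mlflow.log_param("learning_rate", 3e-4)
    mlflow.log_metric("validation_accuracy", 0.91)
    mlflow.log_dict({"features": ["tenure", "plan", "usage"]}, "feature_config.json")
```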

SageMaker model customization and MLflow integration
By default, SageMaker model customization integrates with MLflow, providing automatic linking between model customization jobs and MLflow experiments. When you run model customization fine-tuning jobs, the default MLflow App is used, an experiment is selected, and metrics, parameters, and artifacts are logged for you automatically. On the SageMaker model customization job page, you can view metrics sourced from MLflow and drill into additional metrics within the MLflow UI, as shown in the following screenshot.

Conclusion
These features make the new MLflow Apps in SageMaker AI ready for enterprise-scale ML and generative AI workloads with minimal administrative burden. You can get started with the examples provided in the GitHub samples repository and AWS workshop.
MLflow Apps are generally available in the AWS Regions where SageMaker Studio is available, except the China and AWS GovCloud (US) Regions. We invite you to explore the new capability and experience the enhanced efficiency and control it brings to your ML projects. Get started now by visiting the SageMaker AI with MLflow product detail page and Accelerate generative AI development using managed MLflow on Amazon SageMaker AI, and send your feedback to AWS re:Post for SageMaker or through your usual AWS support contacts.

About the authors
Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, generative AI applications like agents, and scaling generative AI use cases. He also focuses on go-to-market strategies helping AWS build and align products to solve industry challenges in the generative AI space. You can connect with Sandeep on LinkedIn to learn about generative AI solutions.
Rahul Easwar is a Senior Product Manager at AWS, leading managed MLflow and Partner AI Apps within the Amazon SageMaker AIOps team. With over 20 years of experience spanning startups to enterprise technology, he leverages his entrepreneurial background and MBA from Chicago Booth to build scalable ML platforms that simplify AI adoption for organizations worldwide. Connect with Rahul on LinkedIn to learn more about his work in ML platforms and enterprise AI solutions.
Jessica Liao is a Senior UX Designer at AWS who leads design for MLflow, model governance, and inference within Amazon SageMaker AI, shaping how data scientists evaluate, govern, and deploy models. She brings expertise in handling complex problems and driving human-centered innovation from her experience designing DNA life science systems, which she now applies to make machine learning tools more accessible and intuitive through cross-functional collaboration.

Mistral AI Ships Devstral 2 Coding Models And Mistral Vibe CLI For Age …

Mistral AI has introduced Devstral 2, a next-generation coding model family for software engineering agents, together with Mistral Vibe CLI, an open source command line coding assistant that runs inside the terminal or in IDEs that support the Agent Communication Protocol.

https://mistral.ai/news/devstral-2-vibe-cli

Devstral 2 and Devstral Small 2: model sizes, context, and benchmarks

Devstral 2 is a 123B parameter dense transformer with a 256K token context window. It reaches 72.2 percent on SWE-bench Verified, which places it among the strongest open weight models for software engineering tasks. The model is released as open weights under a modified MIT license and is currently free to use via the Mistral API.
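
Since the model is reachable through the Mistral API, a quick way to try it is the mistralai Python SDK, as in the hedged sketch below; the model identifier is an assumption, so check Mistral’s model catalog for the current name.

```python
# Hedged sketch: call a Devstral model through the Mistral API with the mistralai SDK (v1.x).
# The model identifier is an assumed placeholder; verify it against Mistral's model catalog.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="devstral-2",  # hypothetical identifier
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a linked list."}
    ],
)
print(response.choices[0].message.content)
```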

Devstral Small 2 is a 24B parameter model with the same 256K context window. It scores 68.0 percent on SWE-bench Verified and sits in the range of models that are up to 5 times larger in parameter count. It is released under the Apache 2.0 license, which is a standard permissive license for production use.

Both models are described as open source and permissively licensed, and are positioned as state-of-the-art coding models for agentic workloads. Mistral reports that Devstral 2 is up to 7 times more cost-efficient than Claude Sonnet on real-world coding tasks at similar quality, which is important for continuous agent workloads.

In terms of model size relative to frontier systems, Devstral 2 and Devstral Small 2 are 5 times and 28 times smaller than DeepSeek V3.2, and 8 times and 41 times smaller than Kimi K2.

Built for production grade coding workflows

Devstral 2 is designed for software engineering agents that need to explore repositories, track dependencies and orchestrate edits across many files while maintaining architecture level context. The model can detect failures, retry with corrections and support tasks such as bug fixing or modernization of legacy systems at repository scale.

Mistral states that Devstral 2 can be fine-tuned to favor specific programming languages or to optimize for very large enterprise codebases. Devstral Small 2 brings the same design goals to a smaller footprint that is suitable for local deployment, tight feedback loops, and fully private runtimes. It also supports image inputs and can drive multimodal agents that must reason over both code and visual artifacts such as diagrams or screenshots.

Human evaluations against DeepSeek V3.2 and Claude Sonnet 4.5

To test real-world coding behavior, Mistral evaluated Devstral 2 against DeepSeek V3.2 and Claude Sonnet 4.5 using tasks scaffolded through the Cline agent tool. In these human evaluations, Devstral 2 shows a clear advantage over DeepSeek V3.2, with a 42.8 percent win rate versus a 28.6 percent loss rate.

Mistral Vibe CLI, a terminal native coding agent

Mistral Vibe CLI is an open source command line coding assistant written in Python and powered by Devstral models. It explores, modifies, and executes changes across a codebase using natural language in the terminal, or inside IDEs that support the Agent Communication Protocol, such as Zed, where it is available as an extension. The project is released under the Apache 2.0 license on GitHub.

Vibe CLI provides a chat-style interface on top of several key tools:

Project-aware context: it scans the file structure and Git status to build a working view of the repository.

Smart references: it supports @ autocomplete for files, ! for shell commands, and slash commands for configuration changes.

Multi-file orchestration: it reasons over the full codebase, not only the active buffer, to coordinate architecture-level changes and reduce pull request cycle time.

Persistent history, autocompletion, and themes tuned for daily use in the terminal.

Developers configure Vibe CLI through a simple config.toml file, where they can point to Devstral 2 via the Mistral API or to other local or remote models. The tool supports programmatic runs, auto-approval toggles for tool execution, and granular permissions so that risky operations in sensitive repositories require confirmation.

Key Takeaways

Devstral 2 is a 123B parameter dense coding model with a 256K context window; it reaches 72.2 percent on SWE-bench Verified and is released as open weights under a modified MIT license.

Devstral Small 2 has 24B parameters with the same 256K context window; it scores 68.0 percent on SWE-bench Verified and uses an Apache 2.0 license for easier production adoption.

Both Devstral models are optimized for agentic coding workloads; they are designed to explore full repositories, track dependencies, and apply multi-file edits with failure detection and retries.

Mistral Vibe CLI is an open source, Python-based, terminal-native coding agent that connects to Devstral; it provides project-aware context, smart references, and multi-file orchestration through a chat-style interface in the terminal or in IDEs that support the Agent Communication Protocol.

The post Mistral AI Ships Devstral 2 Coding Models And Mistral Vibe CLI For Agentic, Terminal Native Development appeared first on MarkTechPost.