Thinking Machines Lab Makes Tinker Generally Available: Adds Kimi K2 Thinking And Qwen3-VL Vision Input

Thinking Machines Lab has moved its Tinker training API into general availability and added three major capabilities: support for the Kimi K2 Thinking reasoning model, OpenAI-compatible sampling, and image input through Qwen3-VL vision language models. For AI engineers, this turns Tinker into a practical way to fine-tune frontier models without building distributed training infrastructure.

What Tinker Actually Does

Tinker is a training API focused on large language model fine-tuning that hides the heavy lifting of distributed training. You write a simple Python loop that runs on a CPU-only machine. You define the data or RL environment, the loss, and the training logic. The Tinker service maps that loop onto a cluster of GPUs and executes the exact computation you specify.

The API exposes a small set of primitives, such as forward_backward to compute gradients, optim_step to update weights, sample to generate outputs, and functions for saving and loading state. This keeps the training logic explicit for people who want to implement supervised learning, reinforcement learning, or preference optimization, but do not want to manage GPU failures and scheduling.
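To make the shape of such a loop concrete, here is a minimal sketch built only from the primitives named above. The client construction and argument names are illustrative assumptions, not the documented Tinker API, so treat this as pseudocode in Python form.

# Hypothetical sketch of a supervised fine-tuning loop using the primitives named above
# (forward_backward, optim_step, sample, plus state saving). Names beyond those primitives
# are assumptions for illustration only.
import tinker  # assumed package name

client = tinker.TrainingClient(base_model="Qwen/Qwen3-8B")  # hypothetical constructor

for step, batch in enumerate(my_dataloader):          # your own data iterator
    client.forward_backward(batch)                     # compute loss and gradients on the cluster
    client.optim_step()                                # apply the optimizer update
    if step % 100 == 0:
        preview = client.sample(prompt="Test prompt")  # spot-check generations mid-training
        client.save_state(f"checkpoint-{step}")        # persist adapter state for later sampling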

Tinker uses low-rank adaptation (LoRA) rather than full fine-tuning for all supported models. LoRA trains small adapter matrices on top of frozen base weights, which reduces memory and makes it practical to run repeated experiments on large mixture of experts models in the same cluster.
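As a rough illustration of why LoRA is cheap, the update to a frozen weight matrix can be expressed as the product of two small matrices. The NumPy sketch below shows the generic LoRA math with illustrative sizes; it is not Tinker-specific code.

import numpy as np

d, r = 4096, 16                      # hidden size and LoRA rank (illustrative values)
W = np.random.randn(d, d)            # frozen base weight, never updated
A = np.random.randn(r, d) * 0.01     # trainable low-rank factor
B = np.zeros((d, r))                 # trainable low-rank factor, initialized to zero

W_effective = W + B @ A              # adapted weight used at inference time
print(W.size, A.size + B.size)       # ~16.8M frozen parameters vs ~131k trainable parameters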

General Availability and Kimi K2 Thinking

The flagship change in the December 2025 update is that Tinker no longer has a waitlist. Anyone can sign up, see the current model lineup and pricing, and run cookbook examples directly.

On the model side, users can now fine tune moonshotai/Kimi-K2-Thinking on Tinker. Kimi K2 Thinking is a reasoning model with about 1 trillion total parameters in a mixture of experts architecture. It is designed for long chains of thought and heavy tool use, and it is currently the largest model in the Tinker catalog.

In the Tinker model lineup, Kimi K2 Thinking appears as a Reasoning MoE model, alongside Qwen3 dense and mixture of experts variants, Llama-3 generation models, and DeepSeek-V3.1. Reasoning models always produce internal chains of thought before the visible answer, while instruction models focus on latency and direct responses.

OpenAI Compatible Sampling While Training

Tinker already had a native sampling interface through its SamplingClient. The typical inference pattern builds a ModelInput from token ids, passes SamplingParams, and calls sample to get a future that resolves to outputs.
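A minimal sketch of that native pattern follows. Only ModelInput, SamplingParams, and sample are named by the source; the helper used to obtain the sampling client and the exact constructor arguments are assumptions.

# Hypothetical sketch of the native sampling path described above.
import tinker  # assumed package name

sampling_client = training_client.create_sampling_client()   # assumed helper on a training client
model_input = tinker.ModelInput.from_ints(tokenizer.encode("The capital of France is"))
params = tinker.SamplingParams(max_tokens=20, temperature=0.0)
future = sampling_client.sample(model_input, sampling_params=params)
print(future.result())   # the future resolves to the generated outputs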

The new release adds a second path that mirrors the OpenAI completions interface. A model checkpoint on Tinker can be referenced through a URI like:

response = openai_client.completions.create(
    model="tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080",
    prompt="The capital of France is",
    max_tokens=20,
    temperature=0.0,
    stop=["\n"],
)

Vision Input With Qwen3-VL On Tinker

The second major capability is image input. Tinker now exposes two Qwen3-VL vision language models, Qwen/Qwen3-VL-30B-A3B-Instruct and Qwen/Qwen3-VL-235B-A22B-Instruct. They are listed in the Tinker model lineup as Vision MoE models and are available for training and sampling through the same API surface.

To send an image into a model, you construct a ModelInput that interleaves an ImageChunk with text chunks. The research blog uses the following minimal example:

model_input = tinker.ModelInput(chunks=[
    tinker.types.ImageChunk(data=image_data, format="png"),
    tinker.types.EncodedTextChunk(tokens=tokenizer.encode("What is this?")),
])

Here image_data is raw bytes and format identifies the encoding, for example png or jpeg. You can use the same representation for supervised learning and for RL fine tuning, which keeps multimodal pipelines consistent at the API level. Vision inputs are fully supported in Tinker’s LoRA training setup.

https://thinkingmachines.ai/blog/tinker-general-availability/

Qwen3-VL Versus DINOv2 On Image Classification

To show what the new vision path can do, the Tinker team fine-tuned Qwen3-VL-235B-A22B-Instruct as an image classifier. They used four standard datasets:

Caltech 101

Stanford Cars

Oxford Flowers

Oxford Pets

Because Qwen3-VL is a language model with visual input, classification is framed as text generation. The model receives an image and generates the class name as a text sequence.
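In practice, this framing means each training example pairs an image-plus-prompt input with the class name as the target text. The following is a hedged sketch that reuses the ImageChunk pattern shown earlier; the prompt wording and supervision setup are assumptions, not the team's actual code.

# Hedged sketch: frame classification as generating the class name as text.
prompt = "What kind of object is in this image? Answer with the class name only."
model_input = tinker.ModelInput(chunks=[
    tinker.types.ImageChunk(data=image_bytes, format="jpeg"),
    tinker.types.EncodedTextChunk(tokens=tokenizer.encode(prompt)),
])
target_tokens = tokenizer.encode("beagle")  # ground-truth class name, e.g. an Oxford Pets breed
# Supervised fine-tuning then maximizes the likelihood of target_tokens given model_input,
# and accuracy is measured by whether the generated text matches the ground-truth class name.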

As a baseline, they fine tuned a DINOv2 base model. DINOv2 is a self supervised vision transformer that encodes images into embeddings and is often used as a backbone for vision tasks. For this experiment, a classification head is attached on top of DINOv2 to predict a distribution over the N labels in each dataset.

Both Qwen3-VL-235B-A22B-Instruct and DINOv2 base are trained using LoRA adapters within Tinker. The focus is data efficiency. The experiment sweeps the number of labeled examples per class, starting from only 1 sample per class and increasing. For each setting, the team measures classification accuracy.

Key Takeaways

Tinker is now generally available, so anyone can sign up and fine tune open weight LLMs through a Python training loop while Tinker handles the distributed training backend.

The platform supports Kimi K2 Thinking, a 1 trillion parameter mixture of experts reasoning model from Moonshot AI, and exposes it as a fine tunable reasoning model in the Tinker lineup.

Tinker adds an OpenAI compatible inference interface, which lets you sample from in-training checkpoints using a tinker://… model URI through standard OpenAI style clients and tooling.

Vision input is enabled through Qwen3-VL models, Qwen3-VL 30B and Qwen3-VL 235B, so developers can build multimodal training pipelines that combine ImageChunk inputs with text using the same LoRA based API.

Thinking Machines demonstrates that Qwen3-VL 235B, fine tuned on Tinker, achieves stronger few shot image classification performance than a DINOv2 base baseline on datasets such as Caltech 101, Stanford Cars, Oxford Flowers, and Oxford Pets, highlighting the data efficiency of large vision language models.

The post Thinking Machines Lab Makes Tinker Generally Available: Adds Kimi K2 Thinking And Qwen3-VL Vision Input appeared first on MarkTechPost.

Tracking and managing assets used in AI development with Amazon SageM …

Building custom foundation models requires coordinating multiple assets across the development lifecycle such as data assets, compute infrastructure, model architecture and frameworks, lineage, and production deployments. Data scientists create and refine training datasets, develop custom evaluators to assess model quality and safety, and iterate through fine-tuning configurations to optimize performance. As these workflows scale across teams and environments, tracking which specific dataset versions, evaluator configurations, and hyperparameters produced each model becomes challenging. Teams often rely on manual documentation in notebooks or spreadsheets, making it difficult to reproduce successful experiments or understand the lineage of production models.
This challenge intensifies in enterprise environments with multiple AWS accounts for development, staging, and production. As models move through deployment pipelines, maintaining visibility into their training data, evaluation criteria, and configurations requires significant coordination. Without automated tracking, teams lose the ability to trace deployed models back to their origins or share assets consistently across experiments. Amazon SageMaker AI supports tracking and managing assets used in generative AI development. With Amazon SageMaker AI, you can register and version models, datasets, and custom evaluators, and automatically capture relationships and lineage as you fine-tune, evaluate, and deploy generative AI models. This reduces manual tracking overhead and provides complete visibility into how models were created, from base foundation model through production deployment.
In this post, we’ll explore the new capabilities and core concepts that help organizations track and manage model development and deployment lifecycles. We will show you how the features are configured to train models with automatic end-to-end lineage, from dataset upload and versioning to model fine-tuning, evaluation, and seamless endpoint deployment.
Managing dataset versions across experiments
As you refine training data for model customization, you typically create multiple versions of datasets. You can register datasets and create new versions as your data evolves, with each version tracked independently. When you register a dataset in SageMaker AI, you provide the S3 location and metadata describing the dataset. As you refine your data—whether adding more examples, improving quality, or adjusting for specific use cases—you can create new versions of the same dataset. Each version, as shown in the following image, maintains its own metadata and S3 location so you can track the evolution of your training data over time.

When you use a dataset for fine-tuning, Amazon SageMaker AI automatically links the specific dataset version to the resulting model. This supports the comparison between models trained with different dataset versions and helps you understand which data refinements led to better performance. You can also reuse the same dataset version across multiple experiments for consistency when testing different hyperparameters or fine-tuning techniques.
Creating reusable custom evaluators
Evaluating custom models often requires domain-specific quality, safety, or performance criteria. A custom evaluator consists of Lambda function code that receives input data and returns evaluation results including scores and validation status. You can define evaluators for various purposes—checking response quality, assessing safety and toxicity, validating output format, or measuring task-specific accuracy. You can track custom evaluators using AWS Lambda functions that implement your evaluation logic, then version and reuse these evaluators across models and datasets, as shown in the following image.
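The evaluation contract described above (input in, scores and validation status out) maps naturally onto a Lambda handler. The sketch below is illustrative only; the event and response field names are assumptions rather than a documented schema.

import json

def lambda_handler(event, context):
    # The event is assumed to carry the prompt and the model response to evaluate.
    prompt = event.get("prompt", "")
    response_text = event.get("model_response", "")

    # Toy quality check: penalize empty or very short answers.
    score = min(len(response_text) / 200.0, 1.0)
    passed = bool(response_text.strip()) and "I cannot" not in response_text

    return {
        "score": round(score, 3),                     # numeric quality score
        "passed": passed,                             # validation status
        "details": {"length": len(response_text)},    # any extra evaluation metadata
    }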

Automatic lineage tracking throughout the development lifecycle
SageMaker AI lineage tracking capability automatically captures relationships between assets as you build and evaluate models. When you create a fine-tuning job, Amazon SageMaker AI links the training job to input datasets, base foundation models, and output models. When you run evaluation jobs, it connects evaluations to the models being assessed and the evaluators used. This automatic lineage capture means you don’t need to manually document which assets were used for each experiment. You can view the complete lineage for a model, showing its base foundation model, training datasets with specific versions, hyperparameters, evaluation results, and deployment locations, as shown in the image below.

With the lineage view, you can trace any deployed models back to their origins. For example, if you need to understand why a production model behaves in a certain way, you can see exactly which training data, fine-tuning configuration, and evaluation criteria were used. This is particularly valuable for governance, reproducibility, and debugging purposes. You can also use lineage information to reproduce experiments. By identifying the exact dataset version, evaluator version, and configuration used for a successful model, you can recreate the training process with confidence that you’re using identical inputs.
Integrating with MLflow for experiment tracking
The model customization capabilities of Amazon SageMaker AI are integrated by default with SageMaker AI MLflow Apps, providing automatic linking between model training jobs and MLflow experiments. When you run model customization jobs, all the necessary MLflow actions are performed automatically: the default SageMaker AI MLflow App is used, an MLflow experiment is selected for you, and all the metrics, parameters, and artifacts are logged for you. From the SageMaker AI Studio model page, you can see metrics sourced from MLflow (as shown in the following image) and view the full metrics within the associated MLflow experiment.

With MLflow integration it is straightforward to compare multiple model candidates. You can use MLflow to visualize performance metrics across experiments, identify the best-performing model, then use the lineage to understand which specific datasets and evaluators produced that result. This helps you make informed decisions about which models to promote to production based on both quantitative metrics and asset provenance.
Getting started with tracking and managing generative AI assets
By bringing together these model customization assets and processes (dataset versioning, evaluator tracking, model performance, and model deployment), you can turn scattered model assets into a traceable, reproducible, and production-ready workflow with automatic end-to-end lineage. This capability is now available in supported AWS Regions. You can access it through Amazon SageMaker AI Studio and the SageMaker Python SDK.
To get started:

Open Amazon SageMaker AI Studio and navigate to the Models section.
Customize the JumpStart base models to create a model.
Navigate to the Assets section to manage datasets and evaluators.
Register your first dataset by providing an S3 location and metadata.
Create a custom evaluator using an existing Lambda function or create a new one.
Use registered datasets in your fine-tuning jobs—lineage is captured automatically.
View lineage for the model to see complete relationships.

For more information, visit the Amazon SageMaker AI documentation.

About the authors
Amit Modi is the product leader for SageMaker AI MLOps, ML Governance, and Inference at AWS. With over a decade of B2B experience, he builds scalable products and teams that drive innovation and deliver value to customers globally.
Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, GenAI applications like Agents, and scaling GenAI use cases. He also focuses on go-to-market strategies, helping AWS build and align products to solve industry challenges in the generative AI space. You can connect with Sandeep on LinkedIn to learn about GenAI solutions.

Track machine learning experiments with MLflow on Amazon SageMaker usi …

A user can conduct machine learning (ML) data experiments in data environments, such as Snowflake, using the Snowpark library. However, tracking these experiments across diverse environments can be challenging due to the difficulty in maintaining a central repository to monitor experiment metadata, parameters, hyperparameters, models, results, and other pertinent information. In this post, we demonstrate how to integrate Amazon SageMaker managed MLflow as a central repository to log these experiments and provide a unified system for monitoring their progress.
Amazon SageMaker managed MLflow offers fully managed services for experiment tracking, model packaging, and model registry. The SageMaker Model Registry streamlines model versioning and deployment, facilitating seamless transitions from development to production. Additionally, integration with Amazon S3, AWS Glue, and SageMaker Feature Store enhances data management and model traceability. The key benefits of using MLflow with SageMaker are that it allows organizations to standardize ML workflows, improve collaboration, and accelerate artificial intelligence (AI)/ML adoption with a more secure and scalable infrastructure. In this post, we show how to integrate Amazon SageMaker managed MLflow with Snowflake.
Snowpark lets you use Python, Scala, or Java to create custom data pipelines for efficient data manipulation and preparation when storing training data in Snowflake. Users can conduct experiments in Snowpark and track them in Amazon SageMaker managed MLflow. This integration allows data scientists to run transformations and feature engineering in Snowflake and utilize the managed infrastructure within SageMaker for training and deployment, facilitating more seamless workflow orchestration and more secure data handling.
Solution overview
The integration leverages Snowpark for Python, a client-side library that allows Python code to interact with Snowflake from Python kernels, such as SageMaker’s Jupyter notebooks. One workflow could include data preparation in Snowflake, along with feature engineering and model training within Snowpark. Amazon SageMaker managed MLflow can then be used for experiment tracking and model registry integrated with the capabilities of SageMaker.

Figure 1: Architecture diagram

Capture key details with MLflow Tracking
MLflow Tracking is important in the integration between SageMaker, Snowpark, and Snowflake by providing a centralized environment for logging and managing the entire machine learning lifecycle. As Snowpark processes data from Snowflake and trains models, MLflow Tracking can be used to capture key details including model parameters, hyperparameters, metrics, and artifacts. This allows data scientists to monitor experiments, compare different model versions, and verify reproducibility. With MLflow’s versioning and logging capabilities, teams can seamlessly trace the results back to the specific dataset and transformations used, making it simpler to track the performance of models over time and maintain a transparent and efficient ML workflow.
This approach offers several benefits. It allows for scalable and managed MLflow tracker in SageMaker, while utilizing the processing capabilities of Snowpark for model inference within the Snowflake environment, creating a unified data system. The workflow remains within the Snowflake environment, which enhances data security and governance. Additionally, this setup helps to reduce cost by utilizing the elastic compute power of Snowflake for inference without maintaining a separate infrastructure for model serving.
Prerequisites
Create/configure the following resources and confirm access to the aforementioned resources prior to establishing Amazon SageMaker MLflow:

A Snowflake account
An S3 bucket to track experiments in MLflow
An Amazon SageMaker Studio account
An AWS Identity and Access Management (IAM) role that is an Amazon SageMaker Domain Execution Role in the AWS account.
A new user with permission to access the S3 bucket created above; follow these steps.

Confirm access to an AWS account through the AWS Management Console and AWS Command Line Interface (AWS CLI). The AWS Identity and Access Management (IAM) user must have permissions to make the necessary AWS service calls and manage AWS resources mentioned in this post. While providing permissions to the IAM user, follow the principle of least-privilege.

Configure access to the Amazon S3 bucket created above following these steps.
Follow these steps to set up external access for Snowflake Notebooks.

Steps to call SageMaker’s MLflow Tracking Server from Snowflake
We now establish the Snowflake environment and connect it to the Amazon SageMaker MLflow Tracking Server that we previously set up.

Follow these steps to create an Amazon SageMaker Managed MLflow Tracking Server in Amazon SageMaker Studio.
Log in to Snowflake as an admin user.
Create a new Notebook in Snowflake

Projects > Notebooks > +Notebook
Change role to a non-admin role
Give a name, select a database (DB), schema, warehouse, and select ‘Run on container’
Notebook settings > External access> toggle on to allow all integration

Install libraries

!pip install sagemaker-mlflow

Run the MLflow code, replacing the ARN values in the code below:

import mlflow
import boto3
import logging
import os

# Assume the role that has access to the MLflow tracking server
sts = boto3.client("sts")
assumed = sts.assume_role(
    RoleArn="<AWS-ROLE-ARN>",
    RoleSessionName="sf-session"
)
creds = assumed["Credentials"]

# Export the assumed-role credentials so the sagemaker-mlflow plugin picks them up
# (one common approach; adapt to your own credential-handling standards)
os.environ["AWS_ACCESS_KEY_ID"] = creds["AccessKeyId"]
os.environ["AWS_SECRET_ACCESS_KEY"] = creds["SecretAccessKey"]
os.environ["AWS_SESSION_TOKEN"] = creds["SessionToken"]

arn = "<ml-flow-arn>"

try:
    mlflow.set_tracking_uri(arn)
    mlflow.set_experiment("Default")
    with mlflow.start_run():
        mlflow.log_param("test_size", 0.2)
        mlflow.log_param("random_state", 42)
        mlflow.log_param("model_type", "LinearRegression")
except Exception as e:
    logging.error(f"Failed to set tracking URI: {e}")
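Within the same mlflow.start_run() context you can also log metrics and artifacts so that model quality shows up alongside the parameters. The metric names and file below are illustrative, using standard MLflow APIs.

# Inside the same `with mlflow.start_run():` block shown above
mlflow.log_metric("rmse", 0.42)                  # illustrative evaluation metric
mlflow.log_metric("r2", 0.87)
with open("feature_list.txt", "w") as f:         # illustrative artifact
    f.write("tenure, monthly_charges, contract_type")
mlflow.log_artifact("feature_list.txt")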

Figure 3: Install sagemaker-mlflow library

Figure 4: Configure MLflow and do experiments.

On a successful run, the experiment can be tracked on Amazon SageMaker:

Figure 5: Track experiments in SageMaker MLflow

To see the details of an experiment, choose the respective “Run name”:

Figure 6: Experience detailed experiment insights

Clean up
Follow these steps to clean up the resources that we configured in this post and help avoid ongoing costs.

Delete the SageMaker Studio account by following these steps; this deletes the MLflow tracking server as well
Delete the S3 bucket with its contents
Drop the Snowflake notebook
Verify that the Amazon SageMaker account is deleted

Conclusion
In this post, we explored how Amazon SageMaker managed MLflow can provide a comprehensive solution for managing a machine learning lifecycle. The integration with Snowflake through Snowpark further enhances this solution, helping to enable seamless data processing and model deployment workflows.
To get started, follow the step-by-step instructions provided above to set up MLflow Tracking Server in Amazon SageMaker Studio and integrate it with Snowflake. Remember to follow AWS security best practices by implementing proper IAM roles and permissions and securing all credentials appropriately.
The code samples and instructions in this post serve as a starting point. They can be adapted to match specific use cases and requirements while maintaining security and scalability best practices.

About the authors
Ankit Mathur is a Solutions Architect at AWS focused on modern data platforms, AI-driven analytics, and AWS–Partner integrations. He helps customers and partners design secure, scalable architectures that deliver measurable business outcomes.
Mark Hoover is a Senior Solutions Architect at AWS where he is focused on helping customers build their ideas in the cloud. He has partnered with many enterprise clients to translate complex business strategies into innovative solutions that drive long-term growth.

Governance by design: The essential guide for successful AI scaling

Picture this: Your enterprise has just deployed its first generative AI application. The initial results are promising, but as you plan to scale across departments, critical questions emerge. How will you enforce consistent security, prevent model bias, and maintain control as AI applications multiply?
It turns out you’re not alone. A McKinsey survey spanning 750+ leaders across 38 countries reveals both challenges and opportunities when building a governance strategy. While organizations are committing significant resources—most planning to invest over $1 million in responsible AI—implementation hurdles persist. Knowledge gaps represent the primary barrier for over 50% of respondents, with 40% citing regulatory uncertainty.
Yet companies with established responsible AI programs report substantial benefits: 42% see improved business efficiency, while 34% experience increased consumer trust. These results point to why robust risk management is fundamental to realizing AI’s full potential.
Responsible AI: A non-negotiable from day one
At the AWS Generative AI Innovation Center, we’ve observed that organizations achieving the strongest results embed governance into their DNA from the start. This aligns with the AWS commitment to responsible AI development, evidenced by our recent launch of the AWS Well-Architected Responsible AI Lens, a comprehensive framework for implementing responsible practices throughout the development lifecycle.
The Innovation Center has consistently applied these principles by embracing a responsible by design philosophy, carefully scoping use cases, and following science-backed guidance. This approach led to our AI Risk Intelligence (AIRI) solution, which transforms these best practices into actionable, automated governance controls—making responsible AI implementation both attainable and scalable.
Four tips for responsible and secure generative AI deployments
Drawing from our experience helping more than one thousand organizations across industries and geographies, here are key strategies for integrating robust governance and security controls into the development, review, and deployment of AI applications through an automated and seamless process.
1 – Adopt a governance-by-design mindset
At the Innovation Center, we work daily with organizations at the forefront of generative and agentic AI adoption. We’ve observed a consistent pattern: while the promise of generative AI captivates business leaders, they often struggle to chart a path toward responsible and secure implementation. The organizations achieving the most impressive results establish a governance-by-design mindset from the start—treating AI risk management and responsible AI considerations as foundational elements rather than compliance checkboxes. This approach transforms governance from a perceived barrier into a strategic advantage for faster innovation while maintaining appropriate controls. By embedding governance into the development process itself, these organizations can scale their AI initiatives more confidently and securely.
2 – Align technology, business, and governance
The primary mission of the Innovation Center is helping customers develop and deploy AI solutions to meet business needs, while leveraging the most optimal AWS services. However, technical exploration must go hand-in-hand with governance planning. Think of it like conducting an orchestra—you wouldn’t coordinate a symphony without understanding how each instrument works and how they harmonize together. Similarly, effective AI governance requires a deep understanding of the underlying technology before implementing controls. We help organizations establish clear connections between technology capabilities, business objectives, and governance requirements from the start, making sure these three elements work in concert.
3 – Embed security as the governance gateway
After establishing a governance-by-design mindset and aligning business, technology, and governance objectives, the next crucial step is implementation. We’ve found that security serves as the most effective entry point for operationalizing comprehensive AI governance. Security not only provides vital protection but also supports responsible innovation by building trust into the foundation of AI systems. The approach used by the Innovation Center emphasizes security-by-design throughout the implementation journey, from basic infrastructure protection to sophisticated threat detection in complex workflows.
To support this approach, we help customers leverage capabilities like the AWS Security Agent, which automates security validation across the development lifecycle. This frontier agent conducts customized security reviews and penetration testing based on centrally defined standards, helping organizations scale their security expertise to match development velocity.
This security-first approach anchors a broader set of governance controls. The AWS Responsible AI framework unites fairness, explainability, privacy and security, safety, controllability, veracity and robustness, governance, and transparency into a cohesive approach. As AI systems integrate deeper into business processes and autonomous decision-making, automating these controls while maintaining rigorous oversight becomes crucial for scaling successfully.
4 – Automate governance at enterprise scale
With the foundational elements in place—mindset, alignment, and security controls—organizations need a way to systematically scale their governance efforts. This is where the AIRI solution comes in. Rather than creating new processes, it operationalizes the principles and controls we’ve discussed through automation, in a phased approach.

The solution’s architecture integrates seamlessly with existing workflows through a three-step process: user input, automated assessment, and actionable insights. It analyzes everything from source code to system documentation, using advanced techniques like automated document processing and LLM-based evaluations to conduct comprehensive risk assessments. Most importantly, it performs dynamic testing of generative AI systems, checking for semantic consistency and potential vulnerabilities while adapting to each organization’s specific requirements and industry standards.

From theory to practice
The true measure of effective AI governance is how it evolves with an organization while maintaining rigorous standards at scale. When implemented successfully, automated governance enables teams to focus on innovation, confident that their AI systems operate within appropriate guardrails. A compelling example comes from our collaboration with Ryanair, Europe’s largest airline group. As they scale towards 300 million passengers by 2034, Ryanair needed responsible AI governance for their cabin crew application, which provides frontline staff with crucial operational information. Using Amazon Bedrock, the Innovation Center conducted an AI-powered evaluation. This established transparent, data-driven risk management where risks were previously difficult to quantify—creating a model for responsible AI governance that Ryanair can now expand across their AI portfolio.
This implementation demonstrates the broader impact of systematic AI governance. Organizations using this framework consistently report accelerated paths to production, reduced manual work, and enhanced risk management capabilities. Most importantly, they’ve achieved strong cross-functional alignment, from technology to legal to security teams—all working from clear, measurable objectives.
A foundation for innovation
Responsible AI governance isn’t a constraint—it’s a catalyst. By embedding governance into the fabric of AI development, organizations can innovate with confidence, knowing they have the controls to scale securely and responsibly. The example above demonstrates how automated governance transforms theoretical frameworks into practical solutions that drive business value while maintaining trust.
Learn more about the AWS Generative AI Innovation Center and how we’re helping organizations of different sizes implement responsible AI to complement their business objectives.

About the Authors
Segolene Dessertine-Panhard is the global tech lead for Responsible AI and AI governance initiatives at the AWS Generative AI Innovation Center. In this role, she supports AWS customers in scaling their generative AI strategies by implementing robust governance processes and effective AI and cybersecurity risk management systems, leveraging AWS capabilities and state-of-the-art scientific models. Prior to joining AWS in 2018, she was a full-time professor of Finance at New York University’s Tandon School of Engineering. She also served for several years as an independent consultant in financial disputes and regulatory investigations. She holds a Ph.D. from Paris Sorbonne University.
Sri Elaprolu serves as Director of the AWS Generative AI Innovation Center, where he leverages nearly three decades of technology leadership experience to drive artificial intelligence and machine learning innovation. In this role, he leads a global team of machine learning scientists and engineers who develop and deploy advanced generative and agentic AI solutions for enterprise and government organizations facing complex business challenges. Throughout his nearly 13-year tenure at AWS, Sri has held progressively senior positions, including leadership of ML science teams that partnered with high-profile organizations such as the NFL, Cerner, and NASA. These collaborations enabled AWS customers to harness AI and ML technologies for transformative business and operational outcomes. Prior to joining AWS, he spent 14 years at Northrop Grumman, where he successfully managed product development and software engineering teams. Sri holds a Master’s degree in Engineering Science and an MBA with a concentration in general management, providing him with both the technical depth and business acumen essential for his current leadership role.
Randi Larson connects AI innovation with executive strategy for the AWS Generative AI Innovation Center, shaping how organizations understand and translate technical breakthroughs into business value. She hosts the Innovation Center’s podcast series and combines strategic storytelling with data-driven insight through global keynotes and executive interviews on AI transformation. Before Amazon, Randi refined her analytical precision as a Bloomberg journalist and consultant to economic institutions, think tanks, and family offices on financial technology initiatives. Randi holds an MBA from Duke University’s Fuqua School of Business and a B.S. in Journalism and Spanish from Boston University.

How Tata Power CoE built a scalable AI-powered solar panel inspection …

This post is co-written with Vikram Bansal from Tata Power, and Gaurav Kankaria, Omkar Dhavalikar from Oneture.
The global adoption of solar energy is rapidly increasing as organizations and individuals transition to renewable energy sources. India is on the brink of a solar energy revolution, with a national goal to empower 10 million households with rooftop solar installations by 2027. However, as the number of installations surges into the millions, a critical need has emerged: ensuring each solar panel system is properly installed and maintained. Traditional manual inspection methods—which involve physical site visits, visual assessments, and paper-based documentation—have become a significant bottleneck. They’re prone to human error, inconsistent, and can create substantial time delays. To address these challenges, Tata Power Center of Technology Excellence (CoE) collaborated with Oneture Technologies as their AI analytics partner to develop an AI-powered solar panel installation inspection solution using Amazon SageMaker AI, Amazon Bedrock and other AWS services.

In this post, we explore how Tata Power CoE and Oneture Technologies use AWS services to automate the inspection process end-to-end.
Challenges
As Tata Power scales up their solar panel installations, several key challenges emerge with the current process:
Time-consuming manual inspection: Traditional inspection processes require engineers to visually inspect every panel and manually document their findings. This approach is time-consuming and susceptible to human error. Engineers must carefully examine multiple aspects of the installation, from panel alignment to wiring connections, making the process lengthy and mentally taxing.
Limited scalability: The current manual inspection process cannot keep pace with the rapidly increasing volume of installations, creating a widening gap between inspection capacity and demand. As Tata Power aims to handle millions of new installations, the limitations of manual processes become increasingly apparent, potentially creating bottlenecks in installations.
Inconsistent quality standards: The deployment of multiple inspection teams across various locations makes it difficult to maintain uniform quality standards. Different teams might interpret and apply quality guidelines differently, resulting in variations in how assessments are conducted and documented. This lack of standardization makes it difficult to achieve consistent quality across all installations.
Increasing customer escalations: Inconsistent installation quality and delays in completion results in a growing number of customer complaints and escalations. These issues directly affect customers’ experience, with customers expressing dissatisfaction over varying quality standards and extended waiting periods.
Solution overview
Implementing an AI-powered inspection system to perform more than 22 distinct checks across six different solar installation components required complex technical solutions. The inspection criteria ranged from simple visual verifications to sophisticated quality assessments requiring specialized approaches for detecting tiny defects, verifying placement accuracy, and evaluating installation completeness. The absence of a standard operating procedure (SOP) for capturing images resulted in variation in angles, lighting, object distance, and background clutter across the dataset, which further complicated the process. Some criteria had abundant training data, while others had limited and imbalanced datasets, making model generalization difficult. Certain installation criteria demanded accurate distance measurements, such as verifying whether components were installed at the correct height or maintaining proper spacing between elements. Traditional computer vision models proved inadequate for these metric-based evaluations without the support of specialized sensors or depth estimation capabilities. The diversity of inspection requirements demanded a sophisticated multi-model approach, because no single computer vision model could adequately address all inspection criteria. An essential aspect lay in carefully mapping each inspection criterion to its most appropriate AI model type, ranging from object detection for component presence verification to semantic segmentation for detailed analysis, and incorporating generative AI-based reasoning for complex interpretative tasks.
To address these challenges, Tata Power CoE collaborated with Oneture to create a secure, scalable, and intelligent inspection platform using AWS services. Before technical development, the team conducted extensive field research to understand real-world installation conditions. This approach revealed key operational realities: installations occurred in tight spaces with poor lighting conditions, equipment varied significantly across sites, and image quality was often compromised by environmental factors (demonstrated in the following image). One crucial insight emerged during these field observations: certain inspection requirements, particularly measurements like the gap between inverters and walls, demanded sophisticated spatial analysis capabilities that went beyond basic object detection.

Figure 1: Example image of solar panel components

The solution includes SageMaker AI for training and inference at scale, Amazon SageMaker Ground Truth for data labeling, Amazon Bedrock for image understanding and recommendations, Amazon Rekognition for OCR, and additional AWS services. The following diagram illustrates the solution architecture.

Figure 2: Solution Architecture

Data labeling with Amazon SageMaker Ground Truth
The foundation of accurate AI-powered inspections lies in high-quality training data. To help achieve comprehensive model coverage, the team collected more than 20,000 images, capturing a wide range of real-world scenarios including varying lighting conditions and different hardware conditions. They chose SageMaker Ground Truth as their data labeling solution, using its capabilities to create custom annotation workflows and manage the labeling process efficiently. SageMaker Ground Truth proved instrumental in maintaining data quality through its human-in-the-loop workflow features. Its built-in validation mechanisms, including stratified and random sampling, helped achieve dataset robustness. Tata Power’s quality assurance experts conducted direct reviews of labeled data through the SageMaker Ground Truth interface, providing an additional layer of validation. This meticulous attention to data quality was crucial, because even minor visual misclassifications could potentially trigger incorrect warranty claims or installation rejections.
Model training with Amazon SageMaker AI
To select and train the right model, the team used the comprehensive ML capabilities of SageMaker AI to streamline both experimentation and production deployment. SageMaker AI provided an ideal environment for rapid prototyping—the team could quickly spin up Jupyter Notebook instances, which they used to evaluate various architectures for object detection, pattern classification, OCR, and spatial estimation tasks. Through this experimentation, they selected YOLOv5x6 as their primary model, which proved particularly effective at identifying small solar panel components within high-resolution installation images. The training process, initially spanning 1.5 months, was optimized through parallel experimentation and automated workflows, resulting in streamlined, 2-day iteration cycles. Through more than 100 training jobs, the team uncovered crucial insights that significantly improved model performance. They found that increasing input image resolution enhanced small object detection accuracy, while implementing pre-processing checks for image quality factors like brightness and blurriness helped maintain consistent results. Edge cases were strategically handled by generative AI models, allowing the computer vision models to focus on mainstream scenarios. By analyzing inspection criteria overlap, the team successfully consolidated the original 22 inspection points into 10 efficient models, optimizing both processing time and costs.
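For context, a YOLOv5 training job of this kind is typically launched through a SageMaker estimator. The following is a hedged sketch with illustrative script names, hyperparameters, and instance choices, not Tata Power's actual configuration.

import sagemaker
from sagemaker.pytorch import PyTorch

role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"  # placeholder execution role

estimator = PyTorch(
    entry_point="train_yolo.py",          # illustrative training script
    source_dir="src",
    role=role,
    framework_version="2.1",
    py_version="py310",
    instance_type="ml.g5.2xlarge",        # illustrative GPU instance
    instance_count=1,
    hyperparameters={
        "img-size": 1280,                 # higher resolution helps small-object detection
        "epochs": 50,
        "batch-size": 8,
    },
)

estimator.fit({"train": "s3://<bucket>/labeled-images/train",
               "validation": "s3://<bucket>/labeled-images/val"})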
Amazon SageMaker Pipelines enabled rapid feedback loops from field performance data and seamless incorporation of learnings through a federated learning approach. The team could quickly adjust hyperparameters, fine-tune confidence thresholds, and evaluate model performance using metrics like F1-score and Intersection over Union (IoU), all while maintaining advanced accuracy standards. This streamlined approach transformed a complex, multi-faceted training process into an agile, production-ready solution capable of meeting stringent quality requirements at scale.

Figure 3: F1-Confidence Curve

Model inference at scale with Amazon SageMaker AI
Deploying the model presented unique requirements for Tata Power, particularly when handling high-resolution images captured in remote locations with unreliable network connectivity. While SageMaker AI real-time inference is powerful, it comes with specific limitations that didn’t align with Tata Power’s requirements, such as a 60-second timeout for endpoint invocation and a 6 MB maximum payload size. These constraints could potentially impact the processing of high-resolution inspection images and complex inference logic.
To address these operational constraints, the team implemented SageMaker AI asynchronous inference, which proved to be an ideal solution for their distributed inspection workflow. Asynchronous inference's ability to handle large payload sizes accommodated the high-resolution inspection images without compression, helping to ensure that no detail was lost in the analysis process. The endpoints automatically scaled based on incoming request volume, optimizing both performance and cost efficiency.
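Invoking an asynchronous endpoint differs from real-time inference in that the payload is referenced by an S3 location rather than sent inline. Below is a minimal sketch; the endpoint name and S3 paths are placeholders.

import boto3

smr = boto3.client("sagemaker-runtime")

response = smr.invoke_endpoint_async(
    EndpointName="solar-inspection-async",                          # placeholder endpoint name
    InputLocation="s3://<bucket>/inspections/site-123/panel.jpg",   # high-resolution image in S3
    ContentType="application/x-image",
)

# The call returns immediately; results are written to the endpoint's configured
# S3 output path, referenced here for later retrieval.
print(response["OutputLocation"])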
Maintaining model accuracy with SageMaker Pipelines
To help ensure sustained model performance in production, the team implemented an automated retraining system using SageMaker AI. This system continuously monitored model predictions, automatically triggering data collection when confidence scores fell below defined thresholds. This approach to model maintenance helped combat model drift and ensure that the system remained accurate as field conditions evolved. The retraining pipeline, built on SageMaker Pipelines, automated the entire process from data collection to production deployment. When new training data was collected, the pipeline orchestrated a sequence of steps: data validation, model retraining, performance evaluation in a staging environment, and finally, controlled deployment to production through continuous integration and delivery (CI/CD) integration.
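A retraining pipeline of this shape can be expressed with SageMaker Pipelines. The sketch below is a simplified outline with assumed step names, reusing the estimator and role from the earlier sketch; it is not the team's production pipeline.

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep
from sagemaker.inputs import TrainingInput

# `estimator` and `role` are assumed to come from the earlier estimator sketch
train_step = TrainingStep(
    name="RetrainYolo",
    estimator=estimator,
    inputs={"train": TrainingInput("s3://<bucket>/new-labeled-data/")},
)

pipeline = Pipeline(name="solar-inspection-retraining", steps=[train_step])
pipeline.upsert(role_arn=role)   # create or update the pipeline definition
execution = pipeline.start()     # trigger a retraining run, e.g. from a low-confidence alarm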
OCR with Amazon Rekognition
While custom machine learning models powered much of Tata Power’s inspection platform, the CoE team recognized that some tasks could be solved more efficiently with Amazon Rekognition, for example reading ohm meter values during inspections, as shown in the following figure.

Figure 4: Ohm Meter

By integrating the OCR capabilities of Amazon Rekognition, the team avoided the time-consuming process of developing and training custom OCR models, while still achieving the advanced accuracy levels required for production use.
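Reading a meter value with Amazon Rekognition comes down to a single DetectText call. A minimal sketch follows, with placeholder bucket and key names.

import boto3

rekognition = boto3.client("rekognition")

result = rekognition.detect_text(
    Image={"S3Object": {"Bucket": "<bucket>", "Name": "inspections/site-123/ohm-meter.jpg"}}
)

# Keep only full detected lines (Rekognition also returns individual words)
lines = [d["DetectedText"] for d in result["TextDetections"] if d["Type"] == "LINE"]
print(lines)   # e.g. the displayed resistance reading, subject to downstream parsing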
Enhancing the inspection process with Amazon Bedrock
While computer vision models delivered advanced accuracy for most inspection points, they had limitations with specific scenarios involving extremely small object sizes in the image, variable camera angles, and partially obscured elements. To address these limitations, the team implemented Amazon Bedrock to enhance the inspection process, focusing on six critical criteria that required additional intelligence beyond traditional computer vision. Amazon Bedrock enabled a critical pre-check phase before initiating computer vision inference operations. This pre-inference system evaluates three key image quality parameters: visibility clarity, object obstruction status, and capture angle suitability. When images fail to meet these quality benchmarks, the system automatically triggers one of two response pathways—either flagging the image for immediate recapture or routing it through specialized generative AI reasoning processes. This intelligent pre-screening mechanism optimizes computational efficiency by preventing unnecessary inference cycles on suboptimal images, while helping to ensure high-quality input for accurate inspection results.
To close the loop, Amazon Bedrock Knowledge Bases provides real-time, contextual guidance from internal guideline documents. This automated feedback loop accelerates the inspection cycle and improves installation quality by providing instant, actionable recommendations at the point of inspection.
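Retrieving guideline passages from an Amazon Bedrock knowledge base at the point of inspection can be done with the Retrieve API. The sketch below uses a placeholder knowledge base ID and an illustrative query.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

kb_response = agent_runtime.retrieve(
    knowledgeBaseId="<knowledge-base-id>",   # placeholder
    retrievalQuery={"text": "minimum clearance between inverter and wall"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 3}},
)

for item in kb_response["retrievalResults"]:
    print(item["content"]["text"])   # guideline snippet surfaced to the field engineer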
The mobile app
The mobile app provides an intuitive interface designed specifically for on-site use, so that engineers can efficiently complete installation inspections through a streamlined workflow. With this app, field engineers can capture installation photos, receive immediate analysis results, and validate AI findings, all through a single interface.
Results and impact
The implementation of the AI-powered automated inspection tool delivered measurable improvements across Tata Power’s solar installation operations.

The solution achieves more than 90% AI/ML accuracy across most of the points with object detection precision of 95%, enabling near real-time feedback to channel partners instead of delayed offline reviews.
Automated quality checks now instantly verify most installations, significantly reducing manual inspection needs. AI model training continues to improve accuracy in detecting missing checkpoints.
Re-inspection rates have dropped by more than 80%. These efficiency gains led to faster site handovers, directly improving customer satisfaction metrics.
The automated system’s ability to provide immediate feedback enhanced channel partner productivity and satisfaction, creating a more streamlined installation process from initial setup to final customer handover.

Conclusion
In this post, we explained how Tata Power CoE, Oneture Technologies, and AWS transformed traditional manual inspection processes into efficient, AI-powered solutions. By using Amazon SageMaker AI, Amazon Bedrock, and Amazon Rekognition, the team successfully automated solar panel installation inspections, achieving more than 90% accuracy while cutting re-inspection rates by 80%. See the following resources to learn more:

Visit the AWS Community to discover how our builder communities are using Amazon SageMaker AI and Amazon Bedrock in their solutions.
Learn more about Amazon SageMaker AI
Learn more about Amazon Bedrock

About the authors

Vikram Bansal is a business-focused technology leader with over 20 years of experience in enterprise architecture and delivery. During the last two decades, he has led multiple strategic digital initiatives and large-scale transformation programs across telecom (OSS/BSS), media and entertainment, and the power and utility sector (energy distribution, renewables). His expertise spans enterprise application modernization, data and analytics platforms, and end-to-end digital transformation delivery.

Gaurav H Kankaria is a passionate technologist and ISB alumnus with nearly a decade of experience in data science, analytics, and the AWS Cloud. As an AWS Partner Ambassador and certified expert across multiple specialties, he is known for simplifying complex cloud concepts and driving impactful AI/ML solutions.

Omkar Dhavalikar is the AI/ML Lead at Oneture Technologies, where he helps enterprises design and implement cost-effective machine learning solutions on AWS. He specializes in crafting innovative, AI-driven approaches to solve complex business problems with speed, scalability, and impact.

Chetan Makvana is an Enterprise Solutions Architect at Amazon Web Services. He helps enterprise customers design scalable, resilient, secure, and cost effective enterprise-grade solutions using AWS services. He is a technology enthusiast and a builder with interests in generative AI, serverless, app modernization, and DevOps.

Unlocking video understanding with TwelveLabs Marengo on Amazon Bedroc …

Media and entertainment, advertising, education, and enterprise training content combines visual, audio, and motion elements to tell stories and convey information, making it far more complex than text where individual words have clear meanings. This creates unique challenges for AI systems that need to understand video content. Video content is multidimensional, combining visual elements (scenes, objects, actions), temporal dynamics (motion, transitions), audio components (dialogue, music, sound effects), and text overlays (subtitles, captions). This complexity creates significant business challenges as organizations struggle to search through video archives, locate specific scenes, categorize content automatically and extract insights from their media assets for effective decision-making.
The TwelveLabs Marengo model addresses this problem with a multi-vector architecture that creates separate embeddings for different content modalities. Instead of forcing all information into one vector, the model generates specialized representations. This approach preserves the rich, multifaceted nature of video data, enabling more accurate analysis across visual, temporal, and audio dimensions.
Amazon Bedrock has expanded its capabilities to support the TwelveLabs Marengo Embed 3.0 model with real-time text and image processing through synchronous inference. With this integration businesses can implement faster video search functionality using natural language queries, while also supporting interactive product discovery through sophisticated image similarity matching.
In this post, we’ll show how the TwelveLabs Marengo embedding model, available on Amazon Bedrock, enhances video understanding through multimodal AI. We’ll build a video semantic search and analysis solution using embeddings from the Marengo model with Amazon OpenSearch Serverless as the vector database, for semantic search capabilities that go beyond simple metadata matching to deliver intelligent content discovery.
Understanding video embeddings
Embeddings are dense vector representations that capture the semantic meaning of data in a high-dimensional space. Think of them as numerical fingerprints that encode the essence of content in a way machines can understand and compare. For text, embeddings might capture that “king” and “queen” are related concepts, or that “Paris” and “France” have a geographical relationship. For images, embeddings can understand that a golden retriever and labrador are both dogs, even if they look different. The following heat map shows the semantic similarity scores between these sentence fragments: “two people having a conversation,” “a man and a woman talking,” and “cats and dogs are lovely animals.”
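To make the similarity notion concrete, the heat map values are typically cosine similarities between embedding vectors. The small generic sketch below uses placeholder vectors standing in for the three sentence fragments, not real Marengo output.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings standing in for the three sentence fragments
phrases = {
    "two people having a conversation": np.array([0.8, 0.1, 0.2]),
    "a man and a woman talking":        np.array([0.7, 0.2, 0.3]),
    "cats and dogs are lovely animals": np.array([0.1, 0.9, 0.1]),
}

for name_a, vec_a in phrases.items():
    for name_b, vec_b in phrases.items():
        print(f"{name_a!r} vs {name_b!r}: {cosine_similarity(vec_a, vec_b):.2f}")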
Video embeddings challenges
Video presents unique challenges because it’s inherently multimodal:

Visual information: Objects, scenes, people, actions, and visual aesthetics
Audio information: Speech, music, sound effects, and ambient noise
Textual information: Captions, on-screen text, and transcribed speech

Traditional single-vector approaches compress all this rich information into one representation, often losing important nuances. This is where the approach by TwelveLabs Marengo is unique in addressing this challenge effectively.
TwelveLabs Marengo: A multimodal embedding model
The Marengo 3.0 model generates multiple specialized vectors, each capturing different aspects of the video content. A typical movie or TV show combines visual and auditory elements to create a unified storytelling experience. Marengo’s multi-vector architecture provides significant advantages for understanding this complex video content. Each vector captures a specific modality, avoiding information loss from compressing diverse data types into single representations. This enables flexible searches targeting specific content aspects—visual-only, audio-only, or combined queries. Specialized vectors deliver superior accuracy in complex multimodal scenarios while maintaining efficient scalability for large enterprise video datasets.
Solution overview: Marengo model capabilities
In the following section, we’ll demonstrate the power of Marengo’s embedding technology through code samples. The examples illustrate how Marengo processes different types of content and delivers exceptional search accuracy. The complete code sample can be found in this GitHub repository.
Prerequisites
Before we begin, verify you have:

An AWS account with appropriate permissions
Access to Amazon Bedrock (with the TwelveLabs Marengo model enabled)
Access to create an OpenSearch Serverless collection and index
Basic familiarity with vector databases and embeddings

Sample video
Netflix Open Content is open source content available under the Creative Commons Attribution 4.0 International license. We will use one of the videos, called Meridian, to demonstrate the TwelveLabs Marengo model on Amazon Bedrock.

Create a video embedding
Amazon Bedrock uses an asynchronous API for Marengo video embedding generation. The following Python code snippet shows an example of invoking the API with a video from an S3 bucket location. Please refer to the documentation for the complete supported functionality.

import boto3

bedrock_client = boto3.client("bedrock-runtime")
model_id = "us.twelvelabs.marengo-embed-3-0-v1:0"
video_s3_uri = "<s3 bucket location for the video>"         # Replace with your S3 URI
aws_account_id = "<the AWS account owner for the bucket>"   # Replace with the bucket owner ID
s3_bucket_name = "<s3 bucket name>"                         # Replace with the output S3 bucket name
s3_output_prefix = "<output prefix>"                        # Replace with the output prefix

response = bedrock_client.start_async_invoke(
    modelId=model_id,
    modelInput={
        "inputType": "video",
        "video": {
            "mediaSource": {
                "s3Location": {
                    "uri": video_s3_uri,
                    "bucketOwner": aws_account_id
                }
            }
        }
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": f"s3://{s3_bucket_name}/{s3_output_prefix}"
        }
    }
)

The example above produces 280 individual embeddings from a single video, one for each segment, enabling precise temporal search and analysis. The multi-vector output for the video can contain the following embedding types:

[
    {"embedding": [0.053192138671875, ...], "embeddingOption": "visual", "embeddingScope": "clip", "startSec": 0.0, "endSec": 4.3},
    {"embedding": [0.053192138645645, ...], "embeddingOption": "transcription", "embeddingScope": "clip", "startSec": 3.9, "endSec": 6.5},
    {"embedding": [0.3235554443524, ...], "embeddingOption": "audio", "embeddingScope": "clip", "startSec": 4.9, "endSec": 7.5}
]

visual – visual embeddings of the video
transcription – embeddings of the transcribed text
audio – embeddings of the audio in the video

When processing audio or video content, you can set how long each clip segment should be for embedding creation. By default, video clips are automatically divided at natural scene changes (shot boundaries). Audio clips are split into even segments that are as close to 10 seconds as possible—for example, a 50-second audio file becomes 5 segments of 10 seconds each, while a 16-second file becomes 2 segments of 8 seconds each. By default, a single Marengo video embedding API call generates visual-text, visual-image, and audio embeddings. You can also change the default setting to only output specific embedding types. Use the following code snippet to generate embeddings for a video with configurable options with the Amazon Bedrock API:

# mediaSource accepts either base64String or s3Location (exactly one), and
# segmentation accepts either the dynamic or the fixed method (exactly one).
response = bedrock_client.start_async_invoke(
    modelId=model_id,
    modelInput={
        "inputType": "video",
        "video": {
            "mediaSource": {
                "s3Location": {
                    "uri": "s3://amzn-s3-demo-bucket/video/clip.mp4",
                    "bucketOwner": "123456789012"
                }
            },
            "startSec": 0,
            "endSec": 6,
            "segmentation": {
                "method": "fixed",          # "dynamic" or "fixed", exactly one
                "fixed": {
                    "durationSec": 6
                }
                # For dynamic segmentation, use instead:
                # "method": "dynamic",
                # "dynamic": {"minDurationSec": 4}
            },
            "embeddingOption": [            # optional, default is all three
                "visual",
                "audio",
                "transcription"
            ],
            "embeddingScope": [             # optional, one or both
                "clip",
                "asset"
            ]
        },
        "inferenceId": "some inference id"
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": f"s3://{s3_bucket_name}/{s3_output_prefix}"
        }
    }
)

Vector database: Amazon OpenSearch Serverless
In our example, we use Amazon OpenSearch Serverless as the vector database for storing the text, image, audio, and video embeddings generated from the given video by the Marengo model. As a vector database, OpenSearch Serverless lets you quickly find similar content using semantic search without managing servers or infrastructure. The following code snippet demonstrates how to create an Amazon OpenSearch Serverless collection:

import pprint

import boto3

pp = pprint.PrettyPrinter(indent=2)
aoss_client = boto3.Session().client("opensearchserverless")

try:
    collection = aoss_client.create_collection(
        name=collection_name, type="VECTORSEARCH"
    )
    collection_id = collection["createCollectionDetail"]["id"]
    collection_arn = collection["createCollectionDetail"]["arn"]
except aoss_client.exceptions.ConflictException:
    # The collection already exists, so look it up instead
    collection = aoss_client.batch_get_collection(
        names=[collection_name]
    )["collectionDetails"][0]
    pp.pprint(collection)
    collection_id = collection["id"]
    collection_arn = collection["arn"]

Once the OpenSearch Serverless collection is created, we create an index whose mapping includes a vector field:

from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# knn_vector is the OpenSearch vector field type (dense_vector is Elasticsearch-specific)
index_mapping = {
    "settings": {
        "index": {"knn": True}
    },
    "mappings": {
        "properties": {
            "video_id": {"type": "keyword"},
            "segment_id": {"type": "integer"},
            "start_time": {"type": "float"},
            "end_time": {"type": "float"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "engine": "nmslib",
                    "space_type": "cosinesimil"
                }
            },
            "metadata": {"type": "object"}
        }
    }
}

credentials = boto3.Session().get_credentials()
awsauth = AWSV4SignerAuth(credentials, region_name, "aoss")
oss_client = OpenSearch(
    hosts=[{"host": host, "port": 443}],   # host is the collection endpoint without https://
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)
response = oss_client.indices.create(index=index_name, body=index_mapping)

Index Marengo embeddings
The following code snippet demonstrates how to ingest the embedding output from the Marengo model into the OpenSearch index:

documents = []
for i, segment in enumerate(video_embeddings):
    document = {
        "embedding": segment["embedding"],
        "start_time": segment["startSec"],
        "end_time": segment["endSec"],
        "video_id": video_id,
        "segment_id": i,
        "embedding_option": segment.get("embeddingOption", "visual")
    }
    documents.append(document)

# Bulk index documents: an action line followed by the document for each entry
bulk_data = []
for doc in documents:
    bulk_data.append({"index": {"_index": index_name}})
    bulk_data.append(doc)

# Convert to newline-delimited JSON (the bulk API format)
bulk_body = "\n".join(json.dumps(item) for item in bulk_data) + "\n"
response = oss_client.bulk(body=bulk_body, index=index_name)

Cross-modal semantic search
With Marengo's multi-vector design, you can search across different modalities in ways that are not possible with single-vector models. By creating separate but aligned embeddings for visual, audio, motion, and contextual elements, you can search videos using an input type of your choice. For example, "jazz music playing" returns video clips of musicians performing, jazz audio tracks, and concert hall scenes from one text query.
The following examples showcase Marengo’s exceptional search capabilities across different modalities:
Text search
Here’s a code snippet that demonstrates the cross modal semantic search capability using text:

text_query = "a person smoking in a room"
top_k = 5   # number of nearest segments to return

model_input = {
    "inputType": "text",
    "text": {
        "inputText": text_query
    }
}
response = bedrock_client.invoke_model(
    modelId="us.twelvelabs.marengo-embed-3-0-v1:0",
    body=json.dumps(model_input))

result = json.loads(response["body"].read())
query_embedding = result["data"][0]["embedding"]

# Search OpenSearch index
search_body = {
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": top_k
            }
        }
    },
    "size": top_k,
    "_source": ["start_time", "end_time", "video_id", "segment_id"]
}

response = oss_client.search(index=index_name, body=search_body)

print(f"\n✅ Found {len(response['hits']['hits'])} matching segments:")

results = []
for hit in response["hits"]["hits"]:
    result = {
        "score": hit["_score"],
        "video_id": hit["_source"]["video_id"],
        "segment_id": hit["_source"]["segment_id"],
        "start_time": hit["_source"]["start_time"],
        "end_time": hit["_source"]["end_time"]
    }
    results.append(result)

The top search result from the text query: “a person smoking in a room” yields the following video clip:

Image search
The following code snippet demonstrates the cross modal semantic search capability for a given image:

# s3_images_path and image_path_basename identify the query image uploaded to S3
s3_image_uri = f"s3://{s3_bucket_name}/{s3_images_path}/{image_path_basename}"

model_input = {
    "inputType": "image",
    "image": {
        "mediaSource": {
            "s3Location": {
                "uri": s3_image_uri,
                "bucketOwner": aws_account_id
            }
        }
    }
}
response = bedrock_client.invoke_model(
    modelId="us.twelvelabs.marengo-embed-3-0-v1:0",
    body=json.dumps(model_input),
)

result = json.loads(response["body"].read())
query_embedding = result["data"][0]["embedding"]

# Search OpenSearch index (top_k defined in the text search example above)
search_body = {
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": top_k
            }
        }
    },
    "size": top_k,
    "_source": ["start_time", "end_time", "video_id", "segment_id"]
}

response = oss_client.search(index=index_name, body=search_body)

print(f"\n✅ Found {len(response['hits']['hits'])} matching segments:")

results = []
for hit in response["hits"]["hits"]:
    result = {
        "score": hit["_score"],
        "video_id": hit["_source"]["video_id"],
        "segment_id": hit["_source"]["segment_id"],
        "start_time": hit["_source"]["start_time"],
        "end_time": hit["_source"]["end_time"]
    }
    results.append(result)

The top search result from the image above yields the following video clip:

In addition to semantic search over the video using text and images, the Marengo model can also search videos using audio embeddings that focus on dialogue and speech. The audio search capabilities help users find videos based on specific speakers, dialogue content, or spoken topics. This creates a comprehensive video search experience that combines text, image, and audio for video understanding.
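
The following is a minimal sketch of such an audio query, assuming Marengo accepts an audio input type whose mediaSource mirrors the video and image requests shown above; confirm the exact field names against the model documentation. The resulting embedding is searched with the same k-NN request body used in the text and image examples.

# Hypothetical audio query: the "audio" inputType and its field names are assumed
# to mirror the image request shape shown above.
s3_audio_uri = "<s3 location of the query audio clip>"  # Replace with your S3 URI

model_input = {
    "inputType": "audio",
    "audio": {
        "mediaSource": {
            "s3Location": {
                "uri": s3_audio_uri,
                "bucketOwner": aws_account_id
            }
        }
    }
}
response = bedrock_client.invoke_model(
    modelId="us.twelvelabs.marengo-embed-3-0-v1:0",
    body=json.dumps(model_input),
)
query_embedding = json.loads(response["body"].read())["data"][0]["embedding"]

# Reuse the same k-NN search body as in the text and image examples to find
# segments whose audio embeddings are closest to the query.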
Conclusion
The combination of TwelveLabs Marengo and Amazon Bedrock opens up exciting new possibilities for video understanding through its multi-vector, multimodal approach. Throughout this post, we’ve explored practical examples like image-to-video search with temporal precision and detailed text-to-video matching. With just a single Bedrock API call, we transformed one video file into 336 searchable segments that respond to text, visual, and audio queries. These capabilities create opportunities for natural language content discovery, streamlined media asset management, and other applications that can help organizations better understand and utilize their video content at scale.
As video continues to dominate digital experiences, models like Marengo provide a solid foundation for building more intelligent video analysis systems. Check out the sample code and discover how multimodal video understanding can transform your applications.

About the authors
Wei Teh is a Machine Learning Solutions Architect at AWS. He is passionate about helping customers achieve their business objectives using cutting-edge machine learning solutions. Outside of work, he enjoys outdoor activities like camping, fishing, and hiking with his family.
Lana Zhang is a Senior Specialist Solutions Architect for Generative AI at AWS within the Worldwide Specialist Organization. She specializes in AI/ML, with a focus on use cases such as AI voice assistants and multimodal understanding. She works closely with customers across diverse industries, including media and entertainment, gaming, sports, advertising, financial services, and healthcare, to help them transform their business solutions through AI.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

How to Design a Gemini-Powered Self-Correcting Multi-Agent AI System with Semantic Routing, Symbolic Guardrails, and Reflexive Orchestration

In this tutorial, we explore how we design and run a full agentic AI orchestration pipeline powered by semantic routing, symbolic guardrails, and self-correction loops using Gemini. We walk through how we structure agents, dispatch tasks, enforce constraints, and refine outputs using a clean, modular architecture. As we progress through each snippet, we see how the system intelligently chooses the right agent, validates its output, and improves itself through iterative reflection. Check out the Full Codes here.

import os
import json
import time
import typing
from dataclasses import dataclass, field, asdict
from google import genai
from google.genai import types

API_KEY = os.environ.get("GEMINI_API_KEY", "API Key")
client = genai.Client(api_key=API_KEY)

@dataclass
class AgentMessage:
    source: str
    target: str
    content: str
    metadata: dict
    # default_factory so each message gets its own creation timestamp
    timestamp: float = field(default_factory=time.time)

We set up our core environment by importing essential libraries, defining the API key, and initializing the Gemini client. We also establish the AgentMessage structure, which acts as the shared communication format between agents. Check out the Full Codes here.

class CognitiveEngine:
    @staticmethod
    def generate(prompt: str, system_instruction: str, json_mode: bool = False) -> str:
        config = types.GenerateContentConfig(
            system_instruction=system_instruction,
            temperature=0.1,
            response_mime_type="application/json" if json_mode else "text/plain"
        )
        try:
            response = client.models.generate_content(
                model="gemini-2.0-flash",
                contents=prompt,
                config=config
            )
            return response.text
        except Exception as e:
            raise ConnectionError(f"Gemini API Error: {e}")

class SemanticRouter:
    def __init__(self, agents_registry: dict):
        self.registry = agents_registry

    def route(self, user_query: str) -> str:
        prompt = f"""
You are a Master Dispatcher. Analyze the user request and map it to the ONE best agent.
AVAILABLE AGENTS:
{json.dumps(self.registry, indent=2)}
USER REQUEST: "{user_query}"
Return ONLY a JSON object: {{"selected_agent": "agent_name", "reasoning": "brief reason"}}
"""
        response_text = CognitiveEngine.generate(prompt, "You are a routing system.", json_mode=True)
        try:
            decision = json.loads(response_text)
            print(f" [Router] Selected: {decision['selected_agent']} (Reason: {decision['reasoning']})")
            return decision['selected_agent']
        except (json.JSONDecodeError, KeyError):
            return "general_agent"

We build the cognitive layer using Gemini, allowing us to generate both text and JSON outputs depending on the instruction. We also implement the semantic router, which analyzes queries and selects the most suitable agent. Check out the Full Codes here.

class Agent:
    def __init__(self, name: str, instruction: str):
        self.name = name
        self.instruction = instruction

    def execute(self, message: AgentMessage) -> str:
        return CognitiveEngine.generate(
            prompt=f"Input: {message.content}",
            system_instruction=self.instruction
        )

class Orchestrator:
    def __init__(self):
        self.agents_info = {
            "analyst_bot": "Analyzes data, logic, and math. Returns structured JSON summaries.",
            "creative_bot": "Writes poems, stories, and creative text. Returns plain text.",
            "coder_bot": "Writes Python code snippets."
        }
        self.workers = {
            "analyst_bot": Agent("analyst_bot", "You are a Data Analyst. Output strict JSON."),
            "creative_bot": Agent("creative_bot", "You are a Creative Writer."),
            "coder_bot": Agent("coder_bot", "You are a Python Expert. Return only code.")
        }
        self.router = SemanticRouter(self.agents_info)

We construct the worker agents and the central orchestrator. Each agent receives a clear role, analyst, creative, or coder, and we configure the orchestrator to manage them. As we review this section, we see how we define the agent ecosystem and prepare it for intelligent task delegation. Check out the Full Codes here.

    # The following methods continue the Orchestrator class defined above
    def validate_constraint(self, content: str, constraint_type: str) -> tuple[bool, str]:
        if constraint_type == "json_only":
            try:
                json.loads(content)
                return True, "Valid JSON"
            except json.JSONDecodeError:
                return False, "Output was not valid JSON."
        if constraint_type == "no_markdown":
            if "```" in content:
                return False, "Output contains Markdown code blocks, which are forbidden."
            return True, "Valid Text"
        return True, "Pass"

    def run_task(self, user_input: str, constraint: str = None, max_retries: int = 2):
        print(f"\n--- New Task: {user_input} ---")
        target_name = self.router.route(user_input)
        worker = self.workers.get(target_name)
        current_input = user_input
        history = []
        for attempt in range(max_retries + 1):
            try:
                msg = AgentMessage(source="User", target=target_name, content=current_input, metadata={})
                print(f" [Exec] {worker.name} working... (Attempt {attempt+1})")
                result = worker.execute(msg)
                if constraint:
                    is_valid, error_msg = self.validate_constraint(result, constraint)
                    if not is_valid:
                        print(f" [Guardrail] VIOLATION: {error_msg}")
                        current_input = (
                            f"Your previous answer failed a check.\n"
                            f"Original Request: {user_input}\n"
                            f"Your Answer: {result}\n"
                            f"Error: {error_msg}\n"
                            f"FIX IT immediately."
                        )
                        continue
                print(f" [Success] Final Output:\n{result[:100]}...")
                return result
            except Exception as e:
                print(f" [System Error] {e}")
                time.sleep(1)
        print(" [Failed] Max retries reached or self-correction failed.")
        return None

We implement symbolic guardrails and a self-correction loop to enforce constraints like strict JSON or no Markdown. We run iterative refinement whenever outputs violate requirements, allowing our agents to fix their own mistakes. Check out the Full Codes here.

if __name__ == "__main__":
    orchestrator = Orchestrator()
    orchestrator.run_task(
        "Compare the GDP of France and Germany in 2023.",
        constraint="json_only"
    )
    orchestrator.run_task(
        "Write a Python function for Fibonacci numbers.",
        constraint="no_markdown"
    )

We execute two complete scenarios, showcasing routing, agent execution, and constraint validation in action. We run a JSON-enforced analytical task and a coding task with Markdown restrictions to observe the reflexive behavior. 

In conclusion, we now see how multiple components, routing, worker agents, guardrails, and self-correction, come together to create a reliable and intelligent agentic system. We witness how each part contributes to robust task execution, ensuring that outputs remain accurate, aligned, and constraint-aware. As we reflect on the architecture, we recognize how easily we can expand it with new agents, richer constraints, or more advanced reasoning strategies.
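
As one concrete illustration of that extensibility, the short sketch below registers a hypothetical summarizer_bot with the Orchestrator defined earlier; the agent name, instruction, and task are assumptions used only to show the registration pattern.

# A minimal sketch of adding a new agent; "summarizer_bot" and its instruction
# are hypothetical and only illustrate the registration pattern used above.
orchestrator = Orchestrator()
orchestrator.agents_info["summarizer_bot"] = "Summarizes long documents into short bullet points."
orchestrator.workers["summarizer_bot"] = Agent(
    "summarizer_bot", "You are a Summarizer. Return at most five concise bullet points."
)

orchestrator.run_task(
    "Summarize the key ideas behind semantic routing in two sentences.",
    constraint="no_markdown",
)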

Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks.
The post How to Design a Gemini-Powered Self-Correcting Multi-Agent AI System with Semantic Routing, Symbolic Guardrails, and Reflexive Orchestration appeared first on MarkTechPost.

Checkpointless training on Amazon SageMaker HyperPod: Production-scale …

Foundation model training has reached an inflection point where traditional checkpoint-based recovery methods are becoming a bottleneck to efficiency and cost-effectiveness. As models grow to trillions of parameters and training clusters expand to thousands of AI accelerators, even minor disruptions can result in significant costs and delays.
In this post, we introduce checkpointless training on Amazon SageMaker HyperPod, a paradigm shift in model training that reduces the need for traditional checkpointing by enabling peer-to-peer state recovery. Results from production-scale validation show an 80–93% reduction in recovery time (from 15–30 minutes or more to under 2 minutes) and up to 95% training goodput on clusters with thousands of AI accelerators.
Understanding goodput
Foundation model training is one of the most resource-intensive processes in AI, often involving millions of dollars in compute spend across thousands of AI accelerators running for days to months. Because of the inherent all-or-none distributed synchrony across all ranks, even the loss of a single rank because of a software or hardware fault brings the training workload to a complete halt. To mitigate such localized faults, the industry has relied on checkpoint-based recovery: periodically saving training states (checkpoints) to a durable store based on a user-defined checkpoint interval. When a fault occurs, the training workload resumes by restoring from the latest saved checkpoint. This traditional restart-to-recover model has become increasingly untenable as model sizes grow from billions to trillions of parameters and training workloads grow from hundreds to thousands of AI accelerators.
This challenge of maintaining efficient training operations at scale has led to the concept of goodput—the actual useful work accomplished in an AI training system compared to its theoretical maximum capacity. In foundation model training, goodput is impacted by system failures and recovery overhead. The gap between the system’s theoretical maximum throughput and its actual productive output (goodput) grows larger with: increased frequency of failures (which rises with cluster size), longer recovery times (which scale with model size and cluster size), and higher costs of idle resources during recovery. This definition helps frame why measuring and optimizing goodput becomes increasingly crucial as AI training scales to larger clusters and more complex models, where even small inefficiencies can result in significant financial and time costs.
Consider a pre-training workload on a HyperPod cluster with 256 P5 instances that checkpoints every 20 minutes. Each disruption incurs two costs: an average of 10 minutes of lost work plus 10 minutes for recovery. At an example price of $55 per instance-hour, each disruption costs $4,693 in compute time. For a month-long training run, daily disruptions would accumulate to roughly $141,000 in extra costs and delay completion by 10 hours.
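
To make the arithmetic explicit, the following back-of-the-envelope sketch reproduces these figures from the numbers above (256 instances, $55 per instance-hour, 10 minutes of lost work plus 10 minutes of recovery, one disruption per day for 30 days).

instances = 256
price_per_instance_hour = 55.0        # example P5 price used in this post
lost_work_min = 10                    # average work lost since the last 20-minute checkpoint
recovery_min = 10                     # time to restart and reload the checkpoint

downtime_hours = (lost_work_min + recovery_min) / 60
cost_per_disruption = instances * price_per_instance_hour * downtime_hours
print(f"Cost per disruption: ${cost_per_disruption:,.0f}")   # about $4,693

monthly_overhead = cost_per_disruption * 30                  # one disruption per day for a month
print(f"Monthly overhead: ${monthly_overhead:,.0f}")         # about $141,000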

As cluster sizes grow, the probability and frequency of failures can increase.

As training spans thousands of nodes, disruptions caused by faults become increasingly frequent. Meanwhile, recovery becomes slower because the workload reinitialization overhead grows linearly with cluster size. The cumulative impact of large-scale AI training failures can reach millions of dollars annually and translates directly to delayed time-to-market, slower model iteration cycles, and competitive disadvantage. Every hour of idle GPU time is an hour not spent advancing model capabilities.
Checkpoint-based recovery
Checkpoint-based recovery in distributed training is far more complex and time-consuming than commonly understood. When a failure occurs in traditional distributed training, the restart process involves far more than loading the last checkpoint. Understanding what happens during recovery reveals why it takes so long and why the entire cluster must sit idle.
The all-or-none cascade
A single failure—one GPU error, one network timeout, or one hardware fault—can trigger a complete training cluster shutdown. Because distributed training treats all processes as tightly coupled, any single failure necessitates a complete restart. When any process fails, the orchestration system (for example, TorchElastic or Kubernetes) must terminate every process across the job and restart from scratch. Each restart requires navigating a complex, multi-stage recovery process where every stage is sequential and blocking:

Stage 1: Training job restart – The training job orchestrator detects a failure and terminates all processes on all nodes, followed by a cluster-wide restart of the training job.
Stage 2: Process and network initialization – Every process must re-execute the training script from the beginning. That includes rank initialization, loading Python modules from a durable store such as Network File System (NFS) or object storage, and establishing the training topology and communication backend through peer discovery and process group creation. The process group initialization alone can take tens of minutes on large clusters.
Stage 3: Checkpoint retrieval – Each process must first identify the last completely saved checkpoint, then retrieve it from persistent storage (for example, NFS or object storage) and load multiple state dictionaries: the model’s parameters and buffers, the optimizer’s internal state (momentum, variance, and so on), the learning rate scheduler, and training loop metadata (epoch, batch number). This step can take tens of minutes or longer depending on cluster and model size.
Stage 4: Data loader initialization – The data-loading ranks have the additional responsibility of initializing the data buffers. That includes retrieving the data checkpoint from durable storage such as Amazon FSx or Amazon Simple Storage Service (Amazon S3) and prefetching the training data to start the training loop. Data checkpointing is an essential step to avoid processing the same data samples multiple times or skipping samples upon training disruption. Depending on the data mix strategy, data locality, and bandwidth, this can take a few minutes.
Stage 5: First step overhead – After the checkpoint and training data are retrieved and loaded, there is additional overhead to run the first training step, which we call first step overhead (FSO). This first step typically spends time on memory allocation, creating and setting up the CUDA context for communication with the GPUs, compiling parts of the CUDA graph, and so on.
Stage 6: Lost steps overhead – Only after all previous stages complete successfully can the training loop resume its regular progress. Because training resumes from the last saved model checkpoint, all the steps computed between that checkpoint and the fault are lost and must be recomputed; we call this lost steps overhead (LSO). Following the recomputation phase, the training job resumes productive work that directly contributes to goodput.

How checkpointless training eliminates these bottlenecks
The six stages outlined above represent the fundamental bottlenecks in checkpoint-based recovery: job termination and restart, process and network initialization, checkpoint retrieval, data loader initialization, first step overhead, and lost steps recomputation. Each stage is sequential and blocking, and training recovery can take minutes to several hours for large models. Critically, the entire cluster must wait for every stage to complete before training can resume.
Checkpointless training eliminates this cascade. Checkpointless training preserves model state coherence across the distributed cluster, eliminating the need for periodic snapshots. When failures occur, the system quickly recovers by using healthy peers, avoiding both storage I/O operations and full process restarts typically required by traditional checkpointing approaches.

Checkpointless training architecture

Checkpointless training is built on five components that work together to eliminate the traditional checkpoint-restart bottlenecks. Each component addresses a specific bottleneck in the recovery process, and together they enable automatic detection and recovery of infrastructure faults in minutes with zero manual intervention, even with thousands of AI accelerators.
Component 1: TCPStore-less/root-less NCCL and Gloo initialization (optimizing stage 2)
In a typical distributed training setup (for example, using torch.distributed), all ranks must initialize a process group. The process group creates a communication layer, allowing all processes (or ranks, that is, individual nodes) to be aware of each other and exchange information. A TCPStore is often used as a rendezvous point where all ranks check in to discover each other’s connection information. When thousands of ranks try to contact a designated root server (typically rank 0) simultaneously, it becomes a bottleneck. This leads to a flood of simultaneous network requests to a single root server that can cause network congestion, increase latency by tens of minutes, and further slow the communication process.
Checkpointless training eliminates this centralized dependency. Instead of funneling all connection requests through a single root server, the system uses a symmetric address pattern where each rank independently computes peer connection information using a global group counter. Ranks connect directly to each other using predetermined port assignments, avoiding the TCPStore bottleneck. Process group initialization drops from tens of minutes to seconds, even on clusters with thousands of nodes. The system also eliminates the single-point-of-failure risk inherent in root-based initialization.
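
The exact mechanism is internal to the HyperPod checkpointless training container, but the symmetric-address idea can be sketched as follows: every rank derives every peer's endpoint from information it already has (an ordered host list, a base port, and a shared group counter), so no central rendezvous store is needed. The host names, port, and counter below are illustrative assumptions.

# Conceptual sketch only: derives peer endpoints deterministically instead of
# rendezvousing through a central TCPStore. Names and ports are illustrative.
from typing import List, Tuple

def peer_endpoints(hosts: List[str], base_port: int, group_counter: int) -> List[Tuple[str, int]]:
    """Every rank computes the same (host, port) list independently.

    Bumping group_counter after each (re)initialization keeps old and new
    process groups from colliding on the same ports.
    """
    port = base_port + group_counter
    return [(host, port) for host in hosts]

hosts = ["node-0", "node-1", "node-2", "node-3"]   # assumed ordered host list
endpoints = peer_endpoints(hosts, base_port=29500, group_counter=1)
# Rank i listens on endpoints[i] and dials the other entries directly.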
Component 2: Memory-mapped data loading (optimizing stage 4)
One of the hidden costs in traditional recovery is reloading training data. When a process restarts, it must reload batches from disk, rebuild data loader state, and carefully position itself to avoid processing duplicate samples or skipping data. On large-scale training runs, this data loading can add minutes to every recovery cycle.
Checkpointless training uses memory-mapped data loading to maintain cached data across accelerators. Training data is mapped into shared memory regions that persist even when individual processes fail. When a node recovers, it doesn’t reload data from disk but reconnects to the existing memory-mapped cache. The data loader state is preserved, helping to ensure that training continues from the correct position without duplicate or skipped samples. MMAP also reduces host CPU memory usage by maintaining only one copy of data per node (compared to eight copies with traditional data loaders on 8-GPU nodes), and training can resume immediately using cached batches while the data loader concurrently prefetches the next data in the background.

Memory-mapped data loading workflow
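
As a rough illustration of the caching idea (the production implementation is the MMAPDataModule shown in Tier 2 later in this post), a node-local memory-mapped file in shared memory lets a recovering process reattach to already-tokenized data instead of re-reading it from remote storage. The path, dtype, and shapes below are illustrative assumptions.

# Conceptual sketch of the memory-mapped cache idea; paths and shapes are illustrative.
import numpy as np

CACHE_PATH = "/dev/shm/train_tokens.mmap"   # shared memory, survives process restarts on the node
NUM_SAMPLES, SEQ_LEN = 100_000, 4096

def write_cache(tokens: np.ndarray) -> None:
    # One copy per node, shared by all local ranks
    cache = np.memmap(CACHE_PATH, dtype=np.int32, mode="w+", shape=(NUM_SAMPLES, SEQ_LEN))
    cache[:] = tokens
    cache.flush()

def attach_cache() -> np.memmap:
    # A recovering process reattaches instead of re-downloading and re-tokenizing data
    return np.memmap(CACHE_PATH, dtype=np.int32, mode="r", shape=(NUM_SAMPLES, SEQ_LEN))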

Component 3: In-process recovery (optimizing stage 1, 2, and 5)
Traditional checkpoint-based recovery treats failures as job-level events: a single GPU error triggers termination of the entire distributed training job. Every process across the cluster must be killed and restarted, even though only one component failed.
Checkpointless training uses in-process recovery to isolate failures at the process level. When a GPU or process fails, only the failed process executes an in-process recovery to rejoin the training loop within seconds, overcoming recoverable or transient errors. Healthy processes continue running without interruption. The failed process stays alive (avoiding full process teardown), preserving the CUDA context, compiler cache, and GPU state, hence eliminating minutes of reinitialization overhead. In cases where the error is non-recoverable (such as hardware failure), the system automatically swaps the faulty component with a pre-warmed hot spare, enabling training to continue without disruptions.
This eliminates the need for full cluster termination and restart, dramatically reducing recovery overhead.
Component 4: Peer-to-peer state replication (optimizing stage 3 and 6)
Checkpoint-based recovery requires loading model and optimizer state from persistent storage (such as Amazon S3 or FSx for Lustre). For models with billions to trillions of parameters, this means transferring tens to hundreds of gigabytes over the network, deserializing state dictionaries, and reconstructing optimizer buffers, which can take tens of minutes and creates a massive I/O bottleneck.
The most critical innovation in checkpointless training is continuous peer-to-peer state replication. Instead of periodically saving model state to centralized storage, each GPU maintains redundant copies of its model shards on peer GPUs. When a failure occurs, the recovering process doesn’t load from Amazon S3. It copies state directly from a healthy peer over the high-speed Elastic Fabric Adapter (EFA) network interconnect. This peer-to-peer architecture eliminates the I/O bottleneck that dominates traditional checkpoint recovery. State transfer happens in seconds, compared to minutes for loading multi-gigabyte checkpoints from storage. The recovering node pulls only the specific shards it needs, further reducing transfer time.
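
Conceptually, peer-to-peer recovery amounts to the recovering rank receiving its shard of the model and optimizer state from a peer that holds a redundant copy. The sketch below illustrates that flow with plain torch.distributed broadcasts inside an assumed replication process group; the real system's replication groups, peer selection, and transport over EFA are handled by HyperPod and are not shown.

# Conceptual sketch: a recovering rank pulls sharded state from a healthy peer
# that holds a redundant copy, instead of reading a checkpoint from storage.
import torch
import torch.distributed as dist

def restore_from_peer(model, optimizer, replica_group, src_rank: int) -> None:
    # Model shards: broadcast each parameter tensor from the healthy source rank
    for param in model.parameters():
        dist.broadcast(param.data, src=src_rank, group=replica_group)
    # Optimizer state: broadcast tensor entries (momentum, variance, ...) the same way
    for state in optimizer.state.values():
        for value in state.values():
            if torch.is_tensor(value):
                dist.broadcast(value, src=src_rank, group=replica_group)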
Component 5: SageMaker HyperPod training operator (optimizing all stages)
The SageMaker HyperPod training operator orchestrates the checkpointless training components, serving as the coordination layer that ties together initialization, data loading, checkpointless recovery, and checkpoint fallback mechanisms. It maintains a centralized control plane with a global view of training process health across the entire cluster, coordinating fault detection, recovery decisions, and cluster-wide synchronization.
The operator implements intelligent recovery escalation: it first attempts in-process restart for failed components, and if that’s not feasible (for example, because of container crashes or node failures), it escalates to process-level recovery. During a process-level recovery, instead of restarting the entire job when failures occur, the operator restarts only training processes, keeping the containers alive. As a result, the recovery times are faster than a job-level restart, which requires tearing down and recreating the training infrastructure, involving pod rescheduling, container pulls, environment initialization, and re-loading from checkpoints. When failures occur, the operator broadcasts coordinated stop signals to prevent cascading timeouts and integrates with the SageMaker HyperPod health-monitoring agent to automatically detect hardware issues and trigger recovery without manual intervention.

Getting started with checkpointless training
This section guides you through setting up and configuring checkpointless training on SageMaker HyperPod to reduce fault recovery from hours to minutes.
Prerequisites
Before integrating checkpointless training into your training workload, verify that your environment meets the following requirements:
Infrastructure requirements:

Amazon SageMaker HyperPod cluster orchestrated by Amazon Elastic Kubernetes Service (Amazon EKS)
HyperPod training operator v1.2 or later installed on the cluster
Recommended instance types: ml.p5.48xlarge, ml.p5e.48xlarge, ml.p5en.48xlarge, ml.p6-b200.48xlarge, or ml.p6e-gb200.36xlarge
Minimum cluster size: Two nodes for peer-to-peer checkpointless recovery

Software requirements:

Supported frameworks: NeMo, PyTorch, PyTorch Lightning
Training data formats: JSON, JSONGZ (compressed JSON), or ARROW
Amazon Elastic Container Registry (Amazon ECR) repository for container images. Use the HyperPod checkpointless training container—required for rootless NCCL initialization (Tier 1) and peer-to-peer checkpointless recovery (Tier 4)

658645717510.dkr.ecr.<region>.amazonaws.com/sagemaker-hyperpod/pytorch-training:2.3.0-checkpointless

Checkpointless training workflow
Checkpointless training is designed for incremental adoption. You can start with basic capabilities and progressively enable advanced features as your training scales. The integration is organized into four tiers, each building on the previous one:
Tier 1: NCCL initialization optimization
NCCL initialization optimization eliminates the centralized root process bottleneck during initialization. Nodes discover and connect to peers independently using infrastructure signals. This enables faster process group initialization (seconds instead of minutes) and elimination of single-point-of-failure during startup.
Integration steps: Enable an environment variable as part of the job specification and verify that the job runs with the checkpointless training container.

# kubernetes job spec
env:
  - name: HPCT_USE_CONN_DATA   # Enable rootless initialization
    value: "1"
  - name: TORCH_SKIP_TCPSTORE  # Enable TCPStore removal
    value: "1"

Tier 2: Memory-mapped data loading
Memory-mapped data loading keeps training data cached in shared memory across process restarts, eliminating data reload overhead during recovery. This enables instant data access during recovery, with no need to reload or re-shuffle data when a process restarts.
Integration steps: Augment the existing data loader with a memory-mapped cache:

from hyperpod_checkpointless_training.dataloader.mmap_data_module import MMAPDataModule
from hyperpod_checkpointless_training.dataloader.config import CacheResumeMMAPConfig

base_data_module = MY_DATA_MODULE(...)  # Your own data module

mmap_config = CacheResumeMMAPConfig(
    cache_dir=self.cfg.mmap.cache_dir,
)

mmap_dm = MMAPDataModule(
    data_module=base_data_module,
    mmap_config=mmap_config,
)

Tier 3: In-process recovery
In-process recovery isolates failures to individual processes instead of requiring full job restarts. Failed processes recover independently while healthy processes continue training. It enables sub-minute recovery from process-level failures. Healthy processes stay alive, while failed processes recover independently.
Integration steps:

from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck
from hyperpod_checkpointless_training.inprocess.wrap import HPCallWrapper, HPWrapper
from hyperpod_checkpointless_training.inprocess.train_utils import HPAgentK8sAPIFactory

@HPWrapper(
    health_check=CudaHealthCheck(),
    hp_api_factory=HPAgentK8sAPIFactory(),
    abort_timeout=60.0,
)
def re_executable_codeblock():
    # The re-executable code block defined by the user, usually the main function or training loop
    ...

Tier 4: Checkpointless (peer-to-peer recovery) (NeMo integration)
Checkpointless recovery enables complete peer-to-peer state replication and recovery. Failed processes recover model and optimizer state directly from healthy peers without loading from storage. This step enables elimination of checkpoint loading. Failed processes recover model and optimizer state from healthy replicas over the high-speed EFA interconnect.
Integration steps:

from typing import Optional

from hyperpod_checkpointless_training.inprocess.train_utils import wait_rank

wait_rank()

def main():
    @HPWrapper(
        health_check=CudaHealthCheck(),
        hp_api_factory=HPAgentK8sAPIFactory(),
        abort_timeout=60.0,
        checkpoint_manager=PEFTCheckpointManager(enable_offload=True),
        abort=CheckpointlessAbortManager.get_default_checkpointless_abort(),
        finalize=CheckpointlessFinalizeCleanup(),
    )
    def run_main(cfg, caller: Optional[HPCallWrapper] = None):

        trainer = Trainer(
            strategy=CheckpointlessMegatronStrategy(...,
                num_distributed_optimizer_instances=2),
            callbacks=[..., CheckpointlessCallback(...)],
        )
        trainer.fresume = resume
        trainer._checkpoint_connector = CheckpointlessCompatibleConnector(trainer)
        trainer.wrapper = caller

wait_rank: All ranks wait for the rank information from the HyperPod training operator infrastructure.
HPWrapper: Python function wrapper that enables restart capabilities for a restart code block (RCB). The implementation uses a context manager instead of a Python decorator because the call wrapper lacks information about the number of RCBs it should monitor.
CudaHealthCheck: Helps ensure that the CUDA context for the current process is in a healthy state. It synchronizes with the GPU and uses the device corresponding to LOCAL_RANK environment variable, or the main thread’s default CUDA device if LOCAL_RANK was not specified in the environment.
HPAgentK8sAPIFactory: This is the API that checkpointless training will use to understand the training status from the other pods in a K8s training cluster. It also provides an infrastructure-level barrier, which makes sure every rank can successfully perform the abort and restart.
CheckpointManager: Manages in-memory checkpoints and peer-to-peer recovery for checkpointless fault tolerance.
We recommend starting with Tier 1 and validating it in your environment. Add Tier 2 when data loading overhead becomes a bottleneck. Adopt Tier 3 and Tier 4 for maximum resilience on the largest training clusters.
For NeMo users and HyperPod recipe users, Tier 4 is available out-of-the-box with minimal configuration changes for Llama and GPT open source recipes. NeMo examples for Llama and GPT open source models can be found in SageMaker HyperPod checkpointless training.
Performance results
Checkpointless training has been validated at production scale across multiple cluster configurations. The latest Amazon Nova models were trained using this technology on tens of thousands of AI accelerators.
In this section, we demonstrate results from extensive testing across a range of cluster sizes, spanning 16 GPUs to 2,304 GPUs. Checkpointless training demonstrated significant improvements in recovery time, consistently reducing downtime by 80–93% compared to traditional checkpoint-based recovery.

Cluster (H100s) | Model | Traditional recovery | Checkpointless recovery | Improvement
2,304 GPUs | Internal model | 15–30 minutes | Less than 2 minutes | ~87–93% faster
256 GPUs | Llama-3 70B (pre-training) | 4 min, 52 sec | 47 seconds | ~84% faster
16 GPUs | Llama-3 70B (fine-tuning) | 5 min, 10 sec | 50 seconds | ~84% faster

These recovery time improvements have a direct relationship to ML goodput, defined as the percentage of time your cluster spends making forward progress on training rather than sitting idle during failures. As clusters scale to thousands of nodes, failure frequency increases proportionally. At the same time, traditional checkpoint-based recovery times also increase with cluster size due to growing coordination overhead. This creates a compounding problem: more frequent failures combined with longer recovery times rapidly erode goodput at scale.
Checkpointless training makes optimizations across the entire recovery stack, enabling more than 95% goodput even on clusters with thousands of AI accelerators. Based on our internal studies, we consistently observed goodput upwards of 95% across massive-scale deployments that exceeded 2,300 GPUs.
We also verified that model training accuracy is not impacted by checkpointless training. Specifically, we measured checksum matching for traditional checkpoint-based training and checkpointless training, and at every training step verified a bit-wise match on training loss. The following is a plot of the training loss for a Llama-3 70B pre-training workload on 32 ml.p5.48xlarge instances for both traditional checkpointing and checkpointless training.

Conclusion
Foundation model training has reached an inflection point. As clusters scale to thousands of AI accelerators and training runs extend to months, the traditional checkpoint-based recovery paradigm is increasingly becoming a bottleneck. A single GPU failure that previously would have caused minutes of downtime now triggers tens of minutes of cluster-wide idle time on thousands of AI accelerators, with cumulative costs reaching millions of dollars annually.
Checkpointless training rethinks this paradigm entirely by treating failures as local, recoverable events rather than cluster-wide catastrophes. Failed processes recover state from healthy peers in seconds, enabling the rest of the cluster to continue making forward progress. The shift is fundamental: from How do we restart quickly? to How do we avoid stopping at all?
This technology has enabled more than 95% goodput when training on SageMaker HyperPod. Our internal studies on 2,304 GPUs show recovery times dropped from 15–30 minutes to under 90 seconds, translating to over 80% reduction in idle GPU time per failure.
To get started, explore What is Amazon SageMaker AI?. Sample implementations and recipes are available in the AWS GitHub HyperPod checkpointless training and SageMaker HyperPod recipes repositories.

About the Authors
Anirudh Viswanathan is a Senior Product Manager, Technical, at AWS with the SageMaker team, where he focuses on Machine Learning. He holds a Master’s in Robotics from Carnegie Mellon University and an MBA from the Wharton School of Business. Anirudh is a named inventor on more than 50 AI/ML patents. He enjoys long-distance running, exploring art galleries, and attending Broadway shows. You can connect with Anirudh on LinkedIn.
Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises to train and deploy foundation models efficiently on AWS. He has a background in Microprocessor Engineering passionate about computational optimization problems and improving the performance of AI workloads. You can connect with Roy on LinkedIn.
Fei Wu is a Senior Software Developer at AWS with the SageMaker team. Fei’s focus is on ML systems and distributed training techniques. He holds a PhD in Electrical Engineering from Stony Brook University. When outside of work, Fei enjoys playing basketball and watching movies. You can connect with Fei on LinkedIn.
Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. At AWS, Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
Anirban Roy is a Principal Engineer at AWS with the SageMaker team, primarily focusing on AI training infrastructure, resiliency, and observability. He holds a Master’s in Computer Science from the Indian Statistical Institute in Kolkata. Anirban is a seasoned distributed software system builder with more than 20 years of experience and multiple patents and publications. He enjoys road biking, reading non-fiction, gardening, and nature traveling. You can connect with Anirban on LinkedIn.
Arun Nagarajan is a Principal Engineer on the Amazon SageMaker AI team, where he currently focuses on distributed training across the entire stack. Since joining the SageMaker team during its launch year, Arun has contributed to multiple products within SageMaker AI, including real-time inference and MLOps solutions. When he’s not working on machine learning infrastructure, he enjoys exploring the outdoors in the Pacific Northwest and hitting the slopes for skiing.

Adaptive infrastructure for foundation model training with elastic tra …

Modern AI infrastructure serves multiple concurrent workloads on the same cluster, from foundation model (FM) pre-training and fine-tuning to production inference and evaluation. In this shared environment, the demand for AI accelerators fluctuates continuously as inference workloads scale with traffic patterns and experiments complete and release resources. Despite this dynamic availability of AI accelerators, traditional training workloads remain locked into their initial compute allocation, unable to take advantage of idle compute capacity without manual intervention.
Amazon SageMaker HyperPod now supports elastic training, enabling your machine learning (ML) workloads to automatically scale based on resource availability. In this post, we demonstrate how elastic training helps you maximize GPU utilization, reduce costs, and accelerate model development through dynamic resource adaptation, while maintaining training quality and minimizing manual intervention.
How static allocation impacts infrastructure utilization
Consider a 256 GPU cluster running both training and inference workloads. During off-peak hours at night, inference may release 96 GPUs. That leaves 96 GPUs sitting idle and available to speed up training. Traditional training jobs run at a fixed scale; such jobs can’t absorb idle compute capacity. As a result, a single training job that starts with 32 GPUs gets locked at this initial configuration, while 96 additional GPUs remain idle; this translates to 2,304 wasted GPU-hours per day, representing thousands of dollars spent daily on underutilized infrastructure investment. The problem is compounded as the cluster size scales.
Scaling distributed training dynamically is technically complex. Even with infrastructure that supports elasticity, you need to halt jobs, reconfigure resources, adjust parallelization, and reshard checkpoints. This complexity is compounded by the need to maintain training progress and model accuracy throughout these transitions. Despite underlying support from SageMaker HyperPod with Amazon EKS and frameworks like PyTorch and NeMo, manual intervention can still consume hours of ML engineering time. The need to repeatedly adjust training runs based on accelerator availability distracts teams from their actual work in developing models.
Resource sharing and workload preemption add another layer of complexity. Current systems lack the ability to gracefully handle partial resource requests from higher-priority workloads. Consider a scenario where a critical fine-tuning job requires 8 GPUs from a cluster where a pre-training workload occupies all 32 GPUs. Today’s systems force a binary choice: either stop the entire pre-training job or deny resources to the higher-priority workload, even though 24 GPUs would suffice for continued pre-training at reduced scale. This limitation leads organizations to over-provision infrastructure to avoid resource contention, resulting in larger queues of pending jobs, increased costs, and reduced cluster efficiency.
Solution overview
SageMaker HyperPod now offers elastic training. Training workloads can automatically scale up to utilize available accelerators and gracefully contract when resources are needed elsewhere, all while maintaining training quality. SageMaker HyperPod manages the complex orchestration of checkpoint management, rank reassignment, and process coordination, minimizing manual intervention and helping teams focus on model development rather than infrastructure management.
The SageMaker HyperPod training operator integrates with the Kubernetes control plane and resource scheduler to make scaling decisions. It monitors pod lifecycle events, node availability, and scheduler priority signals, which lets it detect scaling opportunities almost instantly, whether from newly available resources or new requests from higher-priority workloads. The operator evaluates potential scaling actions against configured policies (minimum and maximum node boundaries, scaling frequency limits) before initiating any transition.

Elastic Training Scaling Event Workflow
Elastic training adds or removes data parallel replicas while keeping the global batch size constant. When resources become available, new replicas join and speed up throughput without affecting convergence. When a higher-priority workload needs resources, the system removes replicas instead of killing the entire job. Training continues at reduced capacity.
When a scaling event occurs, the operator broadcasts a synchronization signal to all ranks. Each process completes its current step and saves state using PyTorch Distributed Checkpoint (DCP). As new replicas join or existing replicas depart, the operator recalculates rank assignments and initiates process restarts across the training job. DCP then loads and redistributes the checkpoint data to match the new replica count, making sure each worker has the correct model and optimizer state. Training resumes with adjusted replicas, and the constant global batch size makes sure convergence remains unaffected.
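
To make the constant global batch size concrete, the following sketch shows the bookkeeping this implies: as the number of data parallel replicas changes, the gradient accumulation steps per replica are recomputed so the effective batch per optimizer step stays fixed. The batch sizes and GPU counts below are illustrative assumptions, not values from the HyperPod implementation.

# Keep the effective global batch size fixed as data parallel replicas change.
GLOBAL_BATCH = 1024          # sequences per optimizer step (held constant)
MICRO_BATCH = 8              # sequences per GPU per forward/backward pass

def grad_accum_steps(num_replicas: int, gpus_per_replica: int = 8) -> int:
    dp_world = num_replicas * gpus_per_replica
    assert GLOBAL_BATCH % (MICRO_BATCH * dp_world) == 0, "choose divisible scale points"
    return GLOBAL_BATCH // (MICRO_BATCH * dp_world)

print(grad_accum_steps(2))   # 8 accumulation steps at 2 replicas (16 GPUs)
print(grad_accum_steps(8))   # 2 accumulation steps at 8 replicas (64 GPUs)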
For clusters using Kueue (including SageMaker HyperPod task governance), elastic training implements intelligent workload management through multiple admission requests. The operator first requests minimum required resources with high priority, then incrementally requests additional capacity with lower priority. This approach enables partial preemption: when higher-priority workloads need resources, only the lower-priority replicas are revoked, allowing training to continue on the guaranteed baseline rather than terminating completely.

Getting started with elastic training
In the following sections, we guide you through setting up and configuring elastic training on SageMaker HyperPod.
Prerequisites
Before integrating elastic training in your training workload, ensure your environment meets the following requirements:

SageMaker HyperPod cluster orchestrated by Amazon EKS with Kubernetes v1.32 and above. For information on creating a SageMaker HyperPod EKS cluster, see Creating a SageMaker HyperPod cluster with Amazon EKS orchestration.
HyperPod training operator v1.2 and above installed on the cluster.
SageMaker HyperPod task governance v1.3.1 and above for job queuing, prioritization, and scheduling.

Configure namespace isolation and resource controls
If you use cluster auto scaling (like Karpenter), set namespace-level ResourceQuotas. Without them, elastic training’s resource requests can trigger unlimited node provisioning. ResourceQuotas limit the maximum resources that jobs can request while still allowing elastic behavior within defined boundaries.
The following code is an example ResourceQuota for a namespace limited to 8 ml.p5.48xlarge instances (each instance has 8 NVIDIA H100 GPUs, 192 vCPUs, and 640 GiB memory, so 8 instances = 64 GPUs, 1,536 vCPUs, and 5,120 GiB memory):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: training-quota
  namespace: team-ml
spec:
  hard:
    nvidia.com/gpu: "64"
    vpc.amazonaws.com/efa: "256"
    requests.cpu: "1536"
    requests.memory: "5120Gi"
    limits.cpu: "1536"
    limits.memory: "5120Gi"

We recommend organizing workloads into separate namespaces per team or project, with AWS Identity and Access Management (IAM) role-based access control (RBAC) mappings to support proper access control and resource isolation.
Build HyperPod training container
The HyperPod training operator uses a custom PyTorch launcher from the HyperPod Elastic Agent Python package to detect scaling events, coordinate checkpoint operations, and manage the rendezvous process when the world size changes. Install the elastic agent, then replace torchrun with hyperpodrun in your launch command. For more details, see HyperPod elastic agent.
The following code is an example training container configuration:

FROM <YOUR-BASE-IMAGE>

RUN pip install hyperpod-elastic-agent   # install the HyperPod elastic agent
ENTRYPOINT ["entrypoint.sh"]

# entrypoint.sh ...
hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
    --rdzv-backend hyperpod

Enable elastic scaling in training code:
Complete the following steps to enable elastic scaling in your training code:

Add the HyperPod elastic agent import to your training script to detect when scaling events occur:

from hyperpod_elastic_agent.elastic_event_handler import elastic_event_detected

Modify your training loop to check for elastic events after each training batch. When a scaling event is detected, your training process needs to save a checkpoint and exit gracefully, allowing the operator to restart the job with a new world size:

def train_epoch(model, dataloader, optimizer, args):
    for batch_idx, batch_data in enumerate(dataloader):
        # Forward and backward pass
        loss = model(batch_data).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Check if we should checkpoint (periodic or scaling event)
        should_checkpoint = (batch_idx + 1) % args.checkpoint_freq == 0
        elastic_event = elastic_event_detected()  # Returns True when scaling is needed

        # Save checkpoint if scaling the job up or down
        if should_checkpoint or elastic_event:
            save_checkpoint(model, optimizer, scheduler,
                            checkpoint_dir=args.checkpoint_dir,
                            step=global_step)

        if elastic_event:
            # Exit gracefully; the operator will restart with the new world size
            print("Elastic scaling event detected. Checkpoint saved.")
            return

The key pattern here is checking for elastic_event_detected() during your training loop and returning from the training function after saving a checkpoint. This allows the training operator to coordinate the scaling transition across all workers.

Finally, implement checkpoint save and load functions using PyTorch DCP. DCP is essential for elastic training because it automatically reshards model and optimizer states when your job resumes with a different number of replicas:

import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

def save_checkpoint(model, optimizer, lr_scheduler, user_content, checkpoint_path):
    """Save checkpoint using DCP for elastic training."""
    state_dict = {
        "model": model,
        "optimizer": optimizer,
        "lr_scheduler": lr_scheduler,
        **user_content
    }

    dcp.save(
        state_dict=state_dict,
        storage_writer=dcp.FileSystemWriter(checkpoint_path)
    )

def load_checkpoint(model, optimizer, lr_scheduler, checkpoint_path):
    """Load checkpoint using DCP with automatic resharding."""
    state_dict = {
        "model": model,
        "optimizer": optimizer,
        "lr_scheduler": lr_scheduler
    }

    dcp.load(
        state_dict=state_dict,
        storage_reader=dcp.FileSystemReader(checkpoint_path)
    )

    return model, optimizer, lr_scheduler

For single-epoch training scenarios where each data sample must be seen exactly once, you must persist your dataloader state across scaling events. Without this, when your job resumes with a different world size, previously processed samples may be repeated or skipped, affecting training quality. A stateful dataloader saves and restores the dataloader’s position during checkpointing, making sure training continues from the exact point where it stopped. For implementation details, refer to the stateful dataloader guide in the documentation.
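
A minimal sketch of that pattern is shown below, assuming the torchdata StatefulDataLoader; the file name and checkpoint keys are illustrative, and in practice the dataloader state would be saved and restored alongside the DCP checkpoint shown above.

# Sketch: persist dataloader position across scaling events so samples are
# neither repeated nor skipped. Assumes torchdata's StatefulDataLoader;
# the file name and keys are illustrative.
import torch
from torchdata.stateful_dataloader import StatefulDataLoader

def save_dataloader_state(dataloader: StatefulDataLoader, path: str) -> None:
    # Position, shuffling state, and worker state captured by the dataloader
    torch.save({"dataloader": dataloader.state_dict()}, f"{path}/dataloader_state.pt")

def resume_dataloader(dataloader: StatefulDataLoader, path: str) -> None:
    state = torch.load(f"{path}/dataloader_state.pt")
    dataloader.load_state_dict(state["dataloader"])  # continue from the exact sample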
Submit elastic training job
With your training container built and code instrumented, you’re ready to submit an elastic training job. The job specification defines how your training workload scales in response to cluster resource availability through the elasticPolicy configuration.
Create a HyperPodPyTorchJob specification that defines your elastic scaling behavior using the following code:

apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  name: elastic-training-job
spec:
  elasticPolicy:
    minReplicas: 2                # Minimum replicas to keep training running
    maxReplicas: 8                # Maximum replicas for scale-up
    replicaIncrementStep: 2       # Scale in fixed increments of 2 nodes
    # Alternative: use replicaDiscreteValues: [2, 4, 8] for specific scale points
    gracefulShutdownTimeoutInSeconds: 600   # Time allowed for checkpoint save
    scalingTimeoutInSeconds: 60             # Delay before initiating scale-up
    faultyScaleDownTimeoutInSeconds: 30     # Wait time before scaling down on failures
  replicaSpecs:
    - name: worker
      replicas: 2                 # Initial replica count
      maxReplicas: 8              # Must match elasticPolicy.maxReplicas
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-training-container>
              command: ["hyperpodrun"]
              args:
                - "--nnodes=2"
                - "--nproc-per-node=8"
                - "--rdzv-backend=hyperpod"
                - "train.py"
              resources:
                requests:
                  nvidia.com/gpu: 8
                  vpc.amazonaws.com/efa: 32
                limits:
                  nvidia.com/gpu: 8
                  vpc.amazonaws.com/efa: 32

The elasticPolicy configuration controls how your training job responds to resource changes:

minReplicas and maxReplicas: These define the scaling boundaries. Your job will always maintain at least minReplicas and never exceed maxReplicas, maintaining predictable resource usage.
replicaIncrementStep vs. replicaDiscreteValues: Choose one approach for scaling granularity. Use replicaIncrementStep for uniform scaling (for example, a step of 2 means scaling to 2, 4, 6, 8 nodes). Use replicaDiscreteValues: [2, 4, 8] to specify exact allowed configurations. This is useful when certain world sizes work better for your model’s parallelization strategy.
gracefulShutdownTimeoutInSeconds: This gives your training process time to complete checkpointing before the operator forces a shutdown. Set this based on your checkpoint size and storage performance.
scalingTimeoutInSeconds: This introduces a stabilization delay before scale-up to prevent thrashing when resources fluctuate rapidly. The operator waits this duration after detecting available resources before triggering a scale-up event.
faultyScaleDownTimeoutInSeconds: When pods fail or crash, the operator waits this duration for recovery before scaling down. This prevents unnecessary scale-downs due to transient failures.

Elastic training incorporates anti-thrashing mechanisms to maintain stability in environments with rapidly fluctuating resource availability. These protections include enforced minimum stability periods between scaling events and an exponential backoff strategy for frequent transitions. By preventing excessive fluctuations, the system makes sure training jobs can make meaningful progress at each scale point rather than being overwhelmed by frequent checkpoint operations. You can tune these anti-thrashing policies in the elastic policy configuration to balance responsive scaling against training stability for your specific cluster dynamics and workload requirements.
You can then submit the job using kubectl or the SageMaker HyperPod CLI, as covered in documentation:

kubectl apply -f elastic-job.yaml

Using SageMaker HyperPod recipes
We have created SageMaker HyperPod recipes for elastic training for publicly available FMs, including Llama and GPT-OSS. These recipes provide pre-validated configurations that handle parallelization strategy, hyperparameter adjustments, and checkpoint management automatically, requiring only YAML configuration changes to specify the elastic policy with no code modifications. Teams simply specify minimum and maximum node boundaries in their job specification, and the system manages all scaling coordination as cluster resources fluctuate.

# Enable elastic training in an existing recipe
python launcher.py \
    recipes=llama/llama3_1_8b_sft \
    recipes.elastic_policy.is_elastic=true \
    recipes.elastic_policy.min_nodes=2 \
    recipes.elastic_policy.max_nodes=8

Recipes also support scale-specific configurations through the scale_config field, so you can define different hyperparameters (batch size, learning rate) for each world size. This is particularly useful when scaling requires adjusting batch distribution or enabling uneven batch sizes. For detailed examples, see the SageMaker HyperPod Recipes repository.
Performance results
To demonstrate elastic training’s impact, we fine-tuned a Llama-3 70B model on the TAT-QA dataset using a SageMaker HyperPod cluster with up to 8 ml.p5.48xlarge instances. This benchmark illustrates how elastic training performs in practice when dynamically scaling in response to resource availability, simulating a realistic environment where training and inference workloads share cluster capacity.
We evaluated elastic training across two key dimensions: training throughput and model convergence during scaling transitions. We observed a consistent improvement in throughput across scaling configurations from 1 node to 8 nodes, as shown in the following figures. Training throughput improved from 2,000 tokens/second at 1 node to about 14,000 tokens/second at 8 nodes. Throughout the training run, the loss continued to decrease as the model converged.

Training throughput with Elastic Training
Model convergence with Elastic Training
Integration with SageMaker HyperPod capabilities
Beyond its core scaling capabilities, elastic training takes advantage of the integration with the infrastructure capabilities of SageMaker HyperPod. Task governance policies automatically trigger scaling events when workload priorities shift, enabling training to yield resources to higher-priority inference or evaluation workloads. Support for SageMaker Training Plans allows training to opportunistically scale using cost-optimized capacity types while maintaining resilience through automatic scale-down when spot instances are reclaimed. The SageMaker HyperPod observability add-on complements these capabilities by providing detailed insights into scaling events, checkpoint performance, and training progression, helping teams monitor and optimize their elastic training deployments.
Conclusion
Elastic training on SageMaker HyperPod addresses the problem of wasted resources in AI clusters. Training jobs can now scale automatically as resources become available without requiring manual infrastructure adjustments. The technical architecture of elastic training maintains training quality throughout scaling transitions. By preserving the global batch size and learning rate across different data-parallel configurations, the system maintains consistent convergence properties regardless of the current scale.
You can expect three primary benefits. First, from an operational perspective, the reduction of manual reconfiguration cycles fundamentally changes how ML teams work. Engineers can focus on model innovation and development rather than infrastructure management, significantly improving team productivity and reducing operational overhead. Second, infrastructure efficiency sees dramatic improvements as training workloads dynamically consume available capacity, leading to substantial reductions in idle GPU hours and corresponding cost savings. Third, time-to-market accelerates considerably as training jobs automatically scale to utilize available resources, enabling faster model development and deployment cycles.
To get started, refer to the documentation guide. Sample implementations and recipes are available in the GitHub repository.

About the Authors
Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises, train and deploy foundation models efficiently on AWS. He has a background in microprocessor engineering and is passionate about computational optimization problems and improving the performance of AI workloads. You can connect with Roy on LinkedIn.
Anirudh Viswanathan is a Senior Product Manager, Technical, at AWS with the SageMaker team, where he focuses on Machine Learning. He holds a Master’s in Robotics from Carnegie Mellon University and an MBA from the Wharton School of Business. Anirudh is a named inventor on more than 50 AI/ML patents. He enjoys long-distance running, exploring art galleries, and attending Broadway shows. You can connect with Anirudh on LinkedIn.
Arun Kumar Lokanatha is a Senior ML Solutions Architect with Amazon SageMaker AI. He holds a Master's degree from UIUC with a specialization in data science. He specializes in generative AI workloads, helping customers build and deploy LLMs using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Oleg Talalov is a Senior Software Development Engineer at AWS, working on the SageMaker HyperPod team, where he focuses on Machine Learning and high-performance computing infrastructure for ML training. He holds a Master's degree from Peter the Great St. Petersburg Polytechnic University. Oleg is an inventor on multiple AI/ML technologies and enjoys cycling, swimming, and running. You can connect with Oleg on LinkedIn.
Qianlin Liang is a Software Development Engineer at AWS with the SageMaker team, where he focuses on AI systems. He holds a Ph.D. in Computer Science from the University of Massachusetts Amherst. His research develops system techniques for efficient and resilient machine learning. Outside of work, he enjoys running and photography. You can connect with Qianlin on LinkedIn.
Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. At AWS, Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
Anirban Roy is a Principal Engineer at AWS with the SageMaker team, primarily focusing on AI training infra, resiliency and observability. He holds a Master’s in Computer Science from Indian Statistical Institute in Kolkata. Anirban is a seasoned distributed software system builder with more than 20 years of experience and multiple patents and publications. He enjoys road biking, reading non-fiction, gardening and nature traveling. You can connect with Anirban on LinkedIn
Arun Nagarajan is a Principal Engineer on the Amazon SageMaker AI team, where he currently focuses on distributed training across the entire stack. Since joining the SageMaker team during its launch year, Arun has contributed to multiple products within SageMaker AI, including real-time inference and MLOps solutions. When he’s not working on machine learning infrastructure, he enjoys exploring the outdoors in the Pacific Northwest and hitting the slopes for skiing.

Customize agent workflows with advanced orchestration techniques using …

Large Language Model (LLM) agents have revolutionized how we approach complex, multi-step tasks by combining the reasoning capabilities of foundation models with specialized tools and domain expertise. While single-agent systems using frameworks like ReAct work well for straightforward tasks, real-world challenges often require multiple specialized agents working in coordination. Think about planning a business trip: one agent is needed to research flights based on schedule constraints, another to find accommodations near meeting locations, and a third to coordinate ground transportation—each requiring different tools and domain knowledge. This multi-agent approach introduces a critical architectural challenge: orchestrating the flow of information between agents to ensure reliable, predictable outcomes. Without proper orchestration, agent interactions can become unpredictable, making systems difficult to debug, monitor, and scale in production environments. Agent orchestration addresses this challenge by defining explicit workflows that govern how agents communicate, when they execute, and how their outputs integrate into cohesive solutions. Rather than allowing agents to interact ad hoc, orchestration creates structured pathways that make reasoning transparent and information flow intentional.
Strands Agents is an open-source SDK designed specifically for building orchestrated artificial intelligence (AI) systems. It provides flexible agent abstractions, seamless tool integration, comprehensive observability, and orchestration components like GraphBuilder that enable developers to connect agents into directed workflows with precision and control.
In this post, we explore two powerful orchestration patterns implemented with Strands Agents. Using a common set of travel planning tools, we demonstrate how different orchestration strategies can solve the same problem through distinct reasoning approaches: ReWOO (Reasoning Without Observation), which separates planning, execution, and synthesis into discrete stages, and Reflexion, which implements iterative refinement through structured critique and improvement cycles. These examples will show you how Strands enables precise control over multi-agent workflows, resulting in more reliable, transparent, and maintainable AI systems.
Getting started with Strands Agents
Strands Agents is an open-source framework recently launched by AWS for building production-ready AI agents. It simplifies agent development by abstracting the agent loop into three core components:

Model Provider: The reasoning engine (like Claude on Amazon Bedrock)
System Prompt: Instructions that shape the agent’s role and constraints
Toolbelt: The set of APIs or functions the agent can call

This modular design lets users start with simple single-agent systems and scale up to sophisticated multi-agent architectures. Strands includes built-in support for async operations, session state management, and integrations with multiple providers including Amazon Bedrock, Anthropic, and Mistral. It also integrates seamlessly with AWS services like Lambda, Fargate, and AgentCore.
What makes Strands particularly powerful is its multi-agent orchestration capabilities. Users can compose agents in several ways: use one agent as a tool for another, pass control between agents through handoffs, or coordinate multiple agents working in parallel. The SDK’s GraphBuilder feature lets users connect agents into structured workflows, enabling them to collaborate on complex tasks in a controlled, predictable manner.
For production deployments, Strands provides enterprise-grade observability through OpenTelemetry integration. This provides distributed tracing across an entire agent system, making it easy to debug issues and monitor performance as users scale from prototypes to production workloads.
Fundamentals of Agent Orchestration with Strands
The  ReAct pattern is the default approach for most AI agents today. It combines planning, tool invocation, and answer synthesis into a single agent loop. While this works for simple tasks, it creates problems for complex scenarios. The agent might call tools repeatedly without a clear strategy, mix evidence gathering with conclusions, or rush to an answer without verification. These issues become critical in applications requiring structured reasoning, compliance checks, or multi-step validation. This is where orchestration shines.
Instead of one agent doing everything, Strands enables the creation of specialized agents with distinct roles in solving the problem. For example, one agent might plan the approach, another executes the plan, and a third synthesizes the results. Users connect these agents in controlled workflows that match exact requirements. In Strands, orchestration patterns use a graph execution model. Think of it as a flowchart where:

Each node is a specialized agent
Edges define how information flows between agents
The structure makes reasoning steps visible and debuggable

Unlike ReAct’s hidden decision-making, graphs expose every step. Users can trace which agent produced what output, when it became available, and how the next agent used it. This transparency is crucial for building reliable systems. Strands provides four fundamental components for any orchestration pattern:

Nodes: Agents that encapsulate specific logic or expertise
Edges: Connections that define execution order and data flow
AgentResult: Standardized output format from each agent
GraphResult: Complete execution trace with timing, outputs, and paths taken

The GraphBuilder API lets users wire these components together to define which agents participate, how data flows between them, and where user input enters the system. At runtime, the graph executes deterministically and returns structured results.

Consider a document Q&A pipeline:

User Query → Retriever Agent → Summarizer Agent → Final Answer
builder = GraphBuilder()
builder.add_node(retriever, "retriever")
builder.add_node(summarizer, "summarizer")
builder.add_edge("retriever", "summarizer")
builder.set_entry_point("retriever")

The Retriever searches for relevant documents. The Summarizer condenses them. Each agent only sees the data it needs, when it needs it. The flow is explicit, predictable, and easy to debug. This same approach scales to complex patterns. Users can add branches for different reasoning paths, loops for iterative refinement, or parallel execution for exploring multiple strategies. The key is that control is maintained over how information flows through the system.
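
A minimal end-to-end sketch of that pipeline is shown below. It assumes default model settings, placeholder prompts, and that the built graph is invoked directly with the task (the exact shape of the returned GraphResult may vary across SDK versions):

from strands import Agent
from strands.multiagent.graph import GraphBuilder

# Placeholder agents; in practice the retriever would carry a document search tool
retriever = Agent(system_prompt="Retrieve passages relevant to the user's question.")
summarizer = Agent(system_prompt="Summarize the retrieved passages into a direct answer.")

builder = GraphBuilder()
builder.add_node(retriever, "retriever")
builder.add_node(summarizer, "summarizer")
builder.add_edge("retriever", "summarizer")
builder.set_entry_point("retriever")

graph = builder.build()
result = graph("What does the report say about Q3 revenue?")  # GraphResult with per-node outputs
print(result)
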
In the sections that follow, we implement our first pattern: ReWOO, which separates planning from execution to create more reliable agent workflows.
Dataset and default orchestration
Dataset details
We evaluated our system on the τ-Bench airline domain dataset (Yao et al., 2024), which features 300+ flight entries, 500 synthetic user profiles, 2,000+ pre-generated bookings, detailed airline policies, simulated APIs for reservation operations, and 50 structured real-world scenarios. This comprehensive benchmark provides a controlled yet realistic testbed for assessing how agents interpret policies, execute appropriate API calls, and maintain consistency across complex airline operations including upgrades, itinerary changes, and cancellations. While the original dataset presents each task as a multi-turn conversation, we’ve simplified them into single turn queries for this tutorial to better showcase the orchestration patterns.
Architecture at a glance: Default orchestration with ReAct
ReAct (Reasoning + Acting) interleaves two phases inside a single agent loop. The agent reasons in natural language to decide the next step, invokes a tool if needed, observes the tool’s output, and continues reasoning with that observation until it can produce a final answer.
In Strands Agents, the ReAct baseline maps cleanly to a single Agent that owns the τ-Bench airline toolbelt – a list of airline tools (search flights, book/modify/cancel reservations, look up profiles, etc.). The tools are the Python functions provided in the τ-Bench dataset, converted to Strands tools using the @tool decorator.

tools = [
    book_reservation,
    calculate,
    cancel_reservation,
    get_reservation_details,
    get_user_details,
    list_all_airports,
    search_direct_flight,
    search_onestop_flight,
    send_certificate,
    think,
    transfer_to_human_agents,
    update_reservation_baggages,
    update_reservation_flights,
    update_reservation_passengers,
]

prompt = """
You are a helpful assistant for a travel website. Help the user answer any questions.

<instructions>
- Remember to check if the airport city is in the state mentioned by the user. For example, Houston is in Texas.
- Infer the U.S. state in which the airport city resides. For example, Houston is in Texas.
- You should not use made-up or placeholder arguments.
</instructions>

<policy>
{policy}
</policy>
"""

react_agent = Agent(model=model, tools=tools, system_prompt=prompt)
react_response = react_agent(user_query)

There is no explicit planner or critic; the policy that governs “think → act → observe → think …” lives inside the agent’s prompt and the model’s internal loop. This makes ReAct a natural baseline for tool-augmented systems because it requires minimal orchestration—one agent ‘tool-executor’ with a toolbelt—and it tends to be fast in simple tasks.

Architecture at a glance: ReWOO (Reasoning Without Observations)
ReWOO reframes “how tools are used” rather than “which tools exist.” We keep a single tool-executor for all airline APIs, but we enforce a plan → execute → synthesize separation around it. In Strands, this becomes a small, explicit graph where each node returns a typed result (AgentResult) and the runtime forwards those results downstream in a deterministic way. This leads to governance, observability, and repeatability.

Planner (plan only). Produces a strictly formatted plan.
Worker (execute only). Parses the plan, resolves arguments, calls tools, and accumulates evidence in a normalized structure. Decoupling execution from planning makes tool use predictable and policy-enforceable (the worker can only run what the plan authorizes).
Solver (synthesize only). Reads evidence—results from the tools, not the tools directly—then composes the final answer. This keeps tool effects and decision-making auditable and avoids “hidden” follow-up calls in the last step.

Constructed with Strands’ GraphBuilder (nodes, edges, entry point), this becomes a deterministic DAG. The runtime hands each downstream node the original task plus the upstream node’s output—captured in AgentResult.

from strands.multiagent.graph import GraphBuilder

b = GraphBuilder()
b.add_node(planner_agent, “planner”)
b.add_node(worker_agent,  “worker”)
b.add_node(solver_agent,  “solver”)
b.add_edge(“planner”, “worker”)
b.add_edge(“worker”,  “solver”)
b.set_entry_point(“planner”)
graph = b.build()

Planner: plan-only agent with a strict grammar
The planner generates a declarative program describing tool usage, not an answer. The following features are important when designing an effective planner prompt:

Enumerate the allowed set of tool names with arguments
Few-shot examples that show the LLM how to plan an answer for a given user query.
Enforce the output shape. We used this:

Plan 1: <short intent>
#E1 = <tool_name>[key=value, …]

Plan 2: <short intent>
#E2 = <tool_name>[key=value, …]

#E4 = REPEAT(<analysis_or_count>) {
    <tool_a>[…]
    <tool_b>[…]
}

The plan is returned as an AgentResult. A strict plan is audit-ready and minimizes ambiguity. It also enables static checks (e.g., “only these tools are allowed; one per step”) before anything runs.
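
As an illustration, a static check over the plan text might look like the following sketch (the regex and the allowed-tool subset are ours for illustration, not part of the published implementation):

import re

ALLOWED_TOOLS = {
    "get_user_details", "list_all_airports", "search_direct_flight",
    "search_onestop_flight", "think", "book_reservation",
}
STEP_RE = re.compile(r"#E(\d+)\s*=\s*(\w+)\[")

def validate_plan(plan_text):
    """Reject plans that reference unknown tools or skip evidence ids, before anything runs."""
    steps = STEP_RE.findall(plan_text)
    errors = [
        f"#E{eid}: tool '{tool}' is not in the allowed toolbelt"
        for eid, tool in steps
        if tool not in ALLOWED_TOOLS
    ]
    ids = sorted(int(eid) for eid, _ in steps)
    if ids != list(range(1, len(ids) + 1)):
        errors.append(f"evidence ids are not contiguous: {ids}")
    return errors
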
Example of a plan created by the planner agent for the following user query:
“My user id is mia_li_3668. I want to fly from New York to Seattle on May 20 (one way). I do not want to fly before 11am EST. I want to fly in economy. I prefer direct flights but one stopover is also fine. If there are multiple options, I prefer the one with the lowest price. I have 3 baggages. I do not want insurance. I want to use my two certificates to pay. If only one certificate can be used, I prefer using the larger one, and pay the rest with my 7447 card.”

DEBUG: PLANNER AGENT CALLED

Plan 1: Get user details to check available certificates
#E1 = get_user_details[user_id="mia_li_3668"]

Plan 2: Get list of airports to find the airport codes for New York and Seattle
#E2 = list_all_airports[]

Plan 3: Search for direct flights using airport codes from #E2 and date from given user question
#E3 = search_direct_flight[origin="JFK", destination="SEA", date="2024-05-20"]

Plan 4: If no suitable direct flights after 11am, search for one-stop flights
#E4 = search_onestop_flight[origin="JFK", destination="SEA", date="2024-05-20"]

Plan 5: Think about the flight selection, pricing, and payment options
#E5 = think["Analyze the flight options from #E3 and #E4:
- Filter flights departing after 11am EST
- Select cheapest suitable flight (direct preferred)
- Calculate baggage fees (3 bags total)
- Determine payment strategy using certificates from user profile
- Plan to use the largest certificate first
- Prepare to use 7447 card for remaining balance"]

Plan 6: Book the reservation with all the collected information
#E6 = book_reservation[user_id="mia_li_3668", origin="JFK", destination="SEA", flight_type="one_way", cabin="economy", flights=[selected_flight_from_E3_or_E4], passengers=[{"first_name":"Mia", "last_name":"Li"}], payment_methods=[largest_certificate, remaining_certificate_or_card_7447], total_baggages=3, nonfree_baggages=calculated_from_E5, insurance=false]

Worker: deterministic executor with argument and loop resolution
The worker executes only what the plan authorizes; argument resolution is data-driven. This makes behavior reproducible across runs and model versions. The worker treats the plan as an executable spec.

Unified plan parser: It parses both regular steps and REPEAT blocks, sorts them by evidence ID, and executes them in order.

Evidence ledger: Every step produces a structured record (#E{id} with description + results). Errors are captured as evidence instead of failing silently.

step_evidence[f'#E{eid}'] = {
    'evidence_id': f'#E{eid}',
    'description': f"Execute {tool} with {kwargs or 'no parameters'}",
    'results': result_text,
}
all_evidence.update(step_evidence)

Context-aware dynamic argument resolution: Build a context from (a) the original task and (b) the previous N pieces of evidence. Fill placeholders (e.g., airport codes, reservation IDs) from that context—no brittle regex on raw strings. This can be done in a couple of ways: one is to use an LLM to infer the argument values from the built context; the other is to resolve argument values with regex matching.
Dynamic tool dispatch with special cases: Tools are invoked directly using getattr, as sketched below.
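
A simplified sketch of this parse-and-dispatch loop follows; REPEAT blocks are omitted, and resolve_args and airline_tools are hypothetical stand-ins for the context-aware resolution and toolbelt described above:

import re

# Single-line steps only; multi-line arguments (e.g., think[...]) need a fuller parser
STEP_RE = re.compile(r"#E(\d+)\s*=\s*(\w+)\[(.*)\]")

def execute_plan(plan_text, airline_tools, resolve_args):
    """Run every plan-authorized step in evidence-id order and collect an evidence ledger."""
    all_evidence = {}
    for eid, tool_name, raw_args in sorted(STEP_RE.findall(plan_text), key=lambda s: int(s[0])):
        kwargs = resolve_args(raw_args, all_evidence)     # context-aware argument resolution
        tool = getattr(airline_tools, tool_name, None)    # dynamic dispatch by plan-authorized name
        if tool is None:
            result_text = f"ERROR: unknown tool {tool_name}"
        else:
            try:
                result_text = str(tool(**kwargs))
            except Exception as exc:                      # errors become evidence, not silent failures
                result_text = f"ERROR: {exc}"
        all_evidence[f"#E{eid}"] = {
            "evidence_id": f"#E{eid}",
            "description": f"Execute {tool_name} with {kwargs or 'no parameters'}",
            "results": result_text,
        }
    return all_evidence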

Example of an executed step:

DEBUG: Processing step #E3

DEBUG: Tool name: search_direct_flight
DEBUG: Calling search_direct_flight with kwargs:
{
    'origin': 'JFK',
    'destination': 'SEA',
    'date': '2024-05-20'
}

DEBUG: Tool result:
{
    'toolUseId': 'tooluse_search_direct_flight_716684779',
    'status': 'success',
    'content': [
        {
            'text': '{"flights": [
                {
                    "flight_number": "HAT069",
                    "origin": "JFK",
                    "destination": "SEA",
                    "scheduled_departure_time_est": "06:00:00",
                    "scheduled_arrival_time_est": "12:00:00",
                    "status": "available",
                    "available_seats": {
                        "basic_economy": 17,
                        "economy": 12,
                        "business": 3
                    },
                    "prices": {
                        "basic_economy": 51,
                        "economy": 121,
                        "business": 239
                    },
                    "date": "2024-05-20"
                },
                {
                    "flight_number": "HAT083",
                    "origin": "JFK",
                    "destination": "SEA",
                    "scheduled_departure_time_est": "01:00:00",
                    "scheduled_arrival_time_est": "07:00:00",
                    "status": "available",
                    "available_seats": {
                        "basic_economy": 16,
                        "economy": 7,
                        "business": 3
                    },
                    "prices": {
                        "basic_economy": 87,
                        "economy": 100,
                        "business": 276
                    },
                    "date": "2024-05-20"
                }
            ]}'
        }
    ]
}

Solver: Builds the final answer and presents to the user
Solver combines execution evidence from Worker with the original user query to generate the final response. It receives the structured evidence dictionary and synthesizes it into a natural language answer. The solver never calls tools. It does the following:

Evidence parsing – Reads the original task and the worker's evidence from the worker agent node.
Plan reconstruction – Normalizes evidence into a compact, ordered "plan + evidence" text block.
Final answer generation – Uses an LLM with an appropriate prompt to produce the final answer, explicitly addressing constraints and trade-offs.

solve_prompt = """Solve the following task or problem.
To solve the problem, we have made step-by-step Plan and retrieved
corresponding Evidence to each Plan. Use them with caution since long
evidence might contain irrelevant information.

{plan}

Now solve the question or task according to provided Evidence above.
Respond with the answer directly with no extra words.

Task: {task}
Response:"""

Separating synthesis from execution yields clear decision logs and stable latency. It also makes it easy to swap synthesis prompts or models without touching planning/execution logic.
Architecture at a glance: Reflexion (Self-Critiquing)
Reflexion is an orchestration pattern where an agent generates a candidate answer and a critique of that answer, then uses the critique to revise the answer in a bounded loop. The goal isn’t to “try again” blindly, but to target revisions based on explicit, machine-parsable feedback (e.g., violated constraints, missing checks, weak rationale). In other words, Reflexion turns model feedback into a structured control signal that governs one or more additional passes, stopping as soon as the answer meets the stated criteria.
Reflexion wraps the existing flight tool-executor with a deliberate draft → critique → (optional) revision loop. Instead of accepting the first output, the system generates a candidate answer, evaluates it against explicit criteria, and only revises when the critique says it should. The motivation is that this method would give higher answer quality. The Reflexion graph has 2 nodes built with GraphBuilder.

Draft. Produces an initial answer and an initial reflection.
Revisor. Loops between improving the query, revising the answer, and reflecting on it.

Although the orchestration is modeled as a DAG, the revisor node encapsulates up to three revision cycles, invoking tools as needed. Each node returns an AgentResult; the runtime forwards the upstream result to the downstream node and records the full trace in GraphResult.
Draft: Generates initial answer and critique
The draft node uses the same airline tool-executor as the other patterns to produce an initial answer. Immediately afterward, it runs a focused "reflection" pass by invoking an LLM with a reflection prompt that flags gaps (violated constraints, missing checks, weak rationale) and outputs a compact, labeled payload the revisor can parse deterministically:

reflection_system_prompt = """You are analyzing a flight assistant's response that uses real flight database tools.

IMPORTANT: The flight data comes from real database queries, NOT hallucination.

Analyze the response quality on these dimensions:
: Does it address all parts of the user's query?
: If the user query clearly states the final goal and if it can be
fulfilled as per the policy, then does the response show that?
: Is the information presented clearly and logically?
: Are next steps or options clearly presented?
: Is the tone helpful and appropriate?
: What important details are missing?
: REVISE or ACCEPT
: Why this decision was made
"""

Formatted payload after revision: **Answer**: … **Self-Reflection**: … **Needs-Revision**: True|False **User-Query**: …
Revisor: Loops through revision and generation phase
The revisor reads the draft payload, parses the labels (Answer, Self-Reflection, Needs-Revision, User-Query), and decides whether revision is warranted. If so, it improves the original user query using the critique (e.g., “limit to departures ≥ 11:00, ≤ 1 stop, min layover 70m”) and re-invokes the tool-executor to produce a revised answer. It then reflects again using the same labels. This cycle is bounded (e.g., up to 3 passes) and stops as soon as the critique returns Needs-Revision: False. The query is improved by an LLM using a specially designed prompt.

query_improver_system_prompt = """You are a query improvement specialist.
Based on reflection analysis, improve the original user query to address
identified issues and guide better responses.

Examples:
Original: "Book me a flight from NYC to LA tomorrow"
Issue: "Agent booked immediately without showing options"
Improved: "Please SEARCH and SHOW ME available flight options from NYC to
LA tomorrow. I want to see different times, prices, and airlines before deciding.
DO NOT book anything until I confirm."

Now improve the provided query based on the specific reflection issues identified."""
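
Putting the pieces together, the bounded loop can be sketched as follows; tool_executor, reflect, parse_labels, and improve_query are hypothetical stand-ins for the draft agent, the reflection prompt, the label parser, and the query improver described above:

MAX_PASSES = 3

def reflexion_answer(user_query, tool_executor, reflect, parse_labels, improve_query):
    """Draft an answer, critique it, and revise only while the critique asks for revision."""
    query = user_query
    answer = tool_executor(query)                     # draft pass with the airline toolbelt
    for _ in range(MAX_PASSES):
        labels = parse_labels(reflect(answer, user_query))
        if not labels["needs_revision"]:              # critique returned ACCEPT
            break
        query = improve_query(query, labels["self_reflection"])
        answer = tool_executor(query)                 # revised pass
    return answer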

Results: Responses from different orchestration patterns
In this section, we look at some examples from the dataset and how each orchestration pattern behaves.
Example 1:
User Query: "I am Lucas Brown (user id is lucas_brown_4047). I need to change the date of my flight reservation EUJUY6 and move it out 2 days due to a family emergency."
Winner: ReWOO (28s) — policy-aligned refusal without unsafe changes
Summary

ReAct (17s): Fast but incorrect—claims date change + charge on a Basic Economy fare.
ReWOO (28s): Correct—blocks modification; points to cancel/rebook path.
Reflexion (60s): Policy-incorrect—acknowledges Basic Economy yet says it can proceed with the change; self-evaluates as “ACCEPT” instead of catching the violation.

Example 2:  
User Query: “My user id is mohamed_silva_9265. I want to know the sum of my gift card balances and sum of my certificate balances… Then I want to change my recent reservation to the cheapest business round trip without changing the dates… If basic economy cannot be changed, cancel and book a new one… Use certificates, then gift cards, then Mastercard; tell me how much my Mastercard will be charged.”
Winner: Reflexion (116s) — follows the user’s pre-authorized “cancel → rebook” path, preserves dates, gives totals, and computes the exact Mastercard remainder.
Summary

ReAct (67s): Detailed search and payment plan but changes a date (May 29) and mutates before a clean confirm; conflicts with user constraint.
ReWOO (43s): Strong plan, correctly identifies Basic Economy and totals, suggests cancel→ rebook; pricing inconsistent for full return and no final Mastercard figure.
Reflexion (116s): End-to-end: totals → constraint check → cheapest RT on same dates → cancel (authorized) → compute Mastercard = $1,286. Slowest, but most aligned with the user’s exact instructions.

Example 3:
User Query: “My user id is james_taylor_7043. I want to change my upcoming one-stop flight from LAS to IAH to a nonstop flight. My reservation ID is 1N99U6. I also want to remove my checked bag and want the agent to refund me for the same.”
Winner: Reflexion (27s) — offers valid nonstop options and correctly denies bag-removal refund per policy, without premature changes.
Summary

ReAct (9s): Safe but under-specified—no options surfaced; proposes an action that violates policy.
ReWOO (25s): Over-eager mutation—updates reservation with two flights and issues refund pre-choice; unnecessary transfer for baggage.
Reflexion (27s): Policy-aligned and user-centric—presents concrete nonstop choices, explains no bag refund, and waits for selection.

Example 4: 
User Query: “I am Anya Garcia (ID: anya_garcia_5901). I booked a flight (3RK2T9) and I want to change the passenger name from Mei Lee to Mei Garcia. Please make this change.“
Winner: ReAct (8s) — correct, minimal path: shows precise update preview and awaits a single yes/no.
Summary

ReAct (8s): Correct preview → one-tap confirm; fastest and faithful to user intent.
ReWOO (14s): Good decomposition (identity check + update call) but inconsistent evidence (passenger remained “Lee” in #E4).
Reflexion (40s): Over-thinks a straightforward edit; no change executed.

When to use which pattern

ReAct (fast, linear “do the obvious”)

Use when: A simple, unambiguous update or lookup with 1–2 tool calls and no trade-offs (e.g., rename, toggle, fetch → reply).
Strength: Lowest latency and minimal planning overhead, although latency can increase if the agent over-calls tools.
Watch out: Can skip policy/eligibility checks and mutate unsafely if you’re not careful.

ReWOO (plan → execute → synthesize with governance)

Use when: You need ordered dependencies and policy gates before any mutation (e.g., verify fare class, then search, then update).
Strength: Transparent dataflow; auditable Graph/Agent results; safer by design.
Watch out (arguments):

If not using an LLM, argument parsing/validation must be meticulous (types/enums/required).
If using an LLM for argument resolution, pass rich context (schemas + examples) to bind correctly—adds latency but improves reliability on complex params.

Reflexion (analyze options, then act)

Use when: Multi-constraint decisions, trade-offs, or policy nuances require comparing options (cheapest itinerary under payment rules, etc.).
Strength: Better at reasoning over alternatives and producing justified choices.
Watch out: Slower; can over-ask on trivial edits unless reflection is capped.

Architecture at a glance: Hybrid Orchestration — ReWOO-Guided ReAct
Taking into account the pros/cons of ReWOO (governance and auditability), ReAct (speed and flexibility), and Reflexion (quality via critique), we use a hybrid that takes ReWOO’s plan discipline and ReAct’s within-step agility. A ReWOO Planner first emits a strict, step-indexed program (#E1…#En) that names the tools and their order. Execution then switches to a plan-guided ReAct loop that runs inside each step: the agent thinks → validates arguments from prior evidence → calls the authorized tool → observes and (if needed) does one light refinement pass. This preserves global guarantees (no new tools, no reordering, policy gates before mutations) while keeping local flexibility for argument binding and micro-decisions. In Strands, this hybrid maps to a two-node graph:
Planner (ReWOO): Generates the step program only (no tool calls in this node). Output is a typed plan artifact with #E-steps (e.g., get balances → fetch most-recent reservation → search options → compare costs → conditionally mutate).
Plan-Guided ReAct Worker: Consumes the plan and the user task; for each #E step it performs a local ReAct loop but never reorders steps or calls tools not in the plan. It validates arguments, applies policy gates (e.g., Basic Economy ⇒ cancel→ rebook), and synthesizes the final answer. Both planner and executor use the same τ-Bench airline toolbelt (search/book/modify/cancel, user/reservation lookups, math, etc.), exposed as Strands tools.

Local ReAct loop (per step #Ek, bounded; see the sketch after this list):

THINK: derive & validate args from {task, policy, evidence #E1..#Ek-1}
ACT: call the authorized tool for #Ek (no placeholders, no extra tools)
OBSERVE: parse result; at most one refinement pass if needed
COMMIT: append #Ek evidence and advance strictly to #Ek+1
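
A compact sketch of this plan-guided worker is shown below; plan_steps, tools, resolve_args, and refine_once are illustrative names for the parsed plan, the authorized toolbelt, argument binding, and the single refinement pass:

def run_plan_guided_react(plan_steps, tools, resolve_args, refine_once):
    """Execute ReWOO plan steps in strict order with a bounded ReAct loop inside each step."""
    evidence = {}
    for eid, tool_name, raw_args in plan_steps:             # no reordering of steps
        if tool_name not in tools:                          # no tools outside the plan
            evidence[eid] = {"error": f"tool {tool_name} is not authorized by the plan"}
            continue
        kwargs = resolve_args(raw_args, evidence)           # THINK: bind args from prior evidence
        result = tools[tool_name](**kwargs)                 # ACT: call the authorized tool
        result = refine_once(result, kwargs)                # OBSERVE: at most one refinement pass
        evidence[eid] = {"args": kwargs, "result": result}  # COMMIT: append and advance
    return evidence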

Compared to vanilla ReAct, the plan provides governance and idempotence—the agent can’t wander or mutate early. Compared to pure ReWOO, the in-step loop handles real-world messiness (argument binding, minor retries) without re-planning. Unlike Reflexion, it avoids multi-pass critique overhead on straightforward tasks while still producing an auditable trace (plan + per-step evidence). In practice, we see it shine on multi-step requests that need ordered checks (e.g., totals → fare rules → search → cancel/rebook → payment split) but benefit from small, local reasoning inside each tool call.
Conclusion
In this post, we showed how custom orchestration on Amazon Strands helps users move beyond a single, monolithic agent and engineer explicit control over reasoning, tool use, and information flow. Using the same τ-Bench airline toolkit, we compared three patterns—ReAct, ReWOO, and Reflexion—under real constraints and observed distinct trade-offs in latency, cost, and answer quality. ReAct remains the lowest-overhead path for simple lookups and single-field updates. ReWOO is the right default when correctness depends on good planning and ordered dependencies: users can stage policy gates before mutations, resolve arguments with richer context, and keep a typed evidence trail for audit. Reflexion adds self-critique to handle multi-constraint choices and payment/itinerary trade-offs, at the cost of extra deliberation. Strands’ graph execution model provides typed handoffs, execution traces, and enforceable tool contracts so users can tune these patterns per use case—tight loops for CRUD, plan→ execute→ synthesize for governed updates, reflect→ revise for option analysis—while bounding side effects and model drift.
To build production agents, treat orchestration as the control plane: pick the pattern that matches your dependency structure and risk profile, then instrument it. Visit this GitHub repo for end-to-end examples, prompts, and runnable graphs.

About the authors
Baishali Chaudhury is an Applied Scientist at the Generative AI Innovation Center at AWS, where she focuses on advancing Generative AI solutions for real-world applications. She has a strong background in computer vision, machine learning, and AI for healthcare. Baishali holds a PhD in Computer Science from University of South Florida and PostDoc from Moffitt Cancer Centre.
Rahul Ghosh is an Applied Scientist at Amazon’s Generative AI Innovation Center, where he works with AWS customers across different verticals to expedite their use of Generative AI. Rahul holds a Ph.D. in Computer Science from the University of Minnesota.
Isaac Privitera is a Principal Data Scientist with the AWS Generative AI Innovation Center, where he develops bespoke generative AI-based solutions to address customers’ business problems. His primary focus lies in building responsible AI systems, using techniques such as RAG, multi-agent systems, and model fine-tuning. When not immersed in the world of AI, Isaac can be found on the golf course, enjoying a football game, or hiking trails with his loyal canine companion, Barry.

OpenAI has Released the ‘circuit-sparsity’: A Set of Open Tools fo …

The OpenAI team has released its openai/circuit-sparsity model on Hugging Face and the openai/circuit_sparsity toolkit on GitHub. The release packages the models and circuits from the paper 'Weight-sparse transformers have interpretable circuits'.

https://arxiv.org/pdf/2511.13653

What is a weight sparse transformer?

The models are GPT-2 style decoder-only transformers trained on Python code. Sparsity is not added after training; it is enforced during optimization. After each AdamW step, the training loop keeps only the largest-magnitude entries in every weight matrix and bias, including token embeddings, and zeros the rest. All matrices maintain the same fraction of nonzero elements.
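
A toy illustration of that projection step in PyTorch is shown below; it is our sketch, not the released training code, and it ignores details such as per-matrix budgets and distributed sharding:

import torch

def project_to_topk_(param, keep_fraction):
    # Keep only the largest-magnitude entries of a weight tensor in place, zeroing the rest
    k = max(1, int(keep_fraction * param.numel()))
    threshold = torch.topk(param.detach().abs().flatten(), k).values.min()
    param.data.mul_((param.detach().abs() >= threshold).to(param.dtype))

# Sketch of one training step:
# optimizer.step()
# with torch.no_grad():
#     for p in model.parameters():
#         project_to_topk_(p, keep_fraction=current_budget)  # budget annealed toward ~1/1000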

The sparsest models have approximately 1 in 1000 nonzero weights. In addition, the OpenAI team enforced mild activation sparsity so that about 1 in 4 node activations are nonzero, covering residual reads, residual writes, attention channels and MLP neurons.

Sparsity is annealed during training. Models start dense, then the allowed nonzero budget gradually moves toward the target value. This design lets the research team scale width while holding the number of nonzero parameters fixed, and then study the capability interpretability tradeoff as they vary sparsity and model size. The research team show that, for a given pretraining loss, circuits recovered from sparse models are roughly 16 times smaller than those from dense models.

https://arxiv.org/pdf/2511.13653

So, what is a sparse circuit?

The central object in this work is a sparse circuit. The research team defines nodes at a very fine granularity: each node is a single neuron, attention channel, residual read channel, or residual write channel. An edge is a single nonzero entry in a weight matrix that connects two nodes. Circuit size is measured by the geometric mean number of edges across tasks.

To probe the models, the research team built 20 simple Python next token binary tasks. Each task forces the model to choose between 2 completions that differ in one token. Examples include:

single_double_quote, predict whether to close a string with a single or double quote

bracket_counting, decide between ] and ]] based on list nesting depth

set_or_string, track whether a variable was initialized as a set or a string

For each task, they prune the model to find the smallest circuit that still achieves a target loss of 0.15 on that task distribution. Pruning operates at the node level. Deleted nodes are mean ablated: their activations are frozen to the mean over the pretraining distribution. A learned binary mask per node is optimized with a straight-through style surrogate so that the objective trades off task loss against circuit size.
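
A minimal sketch of such a learned node mask with a straight-through surrogate is shown below (our illustration, not the released pruning code):

import torch

class NodeMask(torch.nn.Module):
    """Per-node gate: hard 0/1 decisions in the forward pass, sigmoid gradients in the backward pass."""
    def __init__(self, n_nodes):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(n_nodes))

    def forward(self, activations, mean_activations):
        soft = torch.sigmoid(self.logits)
        hard = (soft > 0.5).float()
        gate = hard + soft - soft.detach()          # straight-through surrogate
        # kept nodes pass through, deleted nodes are frozen to their pretraining mean
        return gate * activations + (1 - gate) * mean_activations

# Objective sketch: task loss plus a penalty on the expected number of kept nodes
# loss = task_loss + lam * torch.sigmoid(mask.logits).sum()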

https://arxiv.org/pdf/2511.13653

Example circuits, quote closing and counting brackets

The most compact example is the circuit for single_double_quote. Here the model must emit the correct closing quote type given an opening quote. The pruned circuit has 12 nodes and 9 edges.

The mechanism is two step. In layer 0.mlp, 2 neurons specialize:

a quote detector neuron that activates on both ” and ‘

a quote type classifier neuron that is positive on ” and negative on ‘

A later attention head in layer 10.attn uses the quote detector channel as a key and the quote type classifier channel as a value. The final token has a constant positive query, so the attention output copies the correct quote type into the last position and the model closes the string correctly.

https://arxiv.org/pdf/2511.13653

bracket_counting yields a slightly larger circuit but with a clear algorithm. The embedding of [ writes into several residual channels that act as bracket detectors. A value channel in a layer 2 attention head averages this detector activation over the context, effectively computing nesting depth and storing it in a residual channel. A later attention head thresholds this depth and activates a nested list close channel only when the list is nested, which leads the model to output ]].

A third circuit, for set_or_string_fixedvarname, shows how the model tracks the type of a variable called current. One head copies the embedding of current into the set() or "" token. A later head uses that embedding as query and key to copy the relevant information back when the model must choose between .add and +=.

https://arxiv.org/pdf/2511.13653

Bridges, connecting sparse models to dense models

The research team also introduces bridges that connect a sparse model to an already trained dense model. Each bridge is an encoder-decoder pair that maps dense activations into sparse activations and back once per sublayer. The encoder uses a linear map with an AbsTopK activation; the decoder is linear.

Training adds losses that encourage hybrid sparse-dense forward passes to match the original dense model. This lets the research team perturb interpretable sparse features, such as the quote type classifier channel, and then map that perturbation into the dense model, changing its behavior in a controlled way.
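
A rough sketch of one such bridge is shown below; the dimensions, the k parameter, and the class names are illustrative rather than taken from the released code:

import torch

def abs_topk(x, k):
    # Keep the k largest-magnitude entries along the last dimension, zero the rest
    threshold = torch.topk(x.abs(), k, dim=-1).values[..., -1:]
    return torch.where(x.abs() >= threshold, x, torch.zeros_like(x))

class Bridge(torch.nn.Module):
    """Per-sublayer map between dense activations and sparse activations."""
    def __init__(self, d_dense, d_sparse, k):
        super().__init__()
        self.encoder = torch.nn.Linear(d_dense, d_sparse)
        self.decoder = torch.nn.Linear(d_sparse, d_dense)
        self.k = k

    def to_sparse(self, dense_act):
        return abs_topk(self.encoder(dense_act), self.k)

    def to_dense(self, sparse_act):
        return self.decoder(sparse_act)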

https://arxiv.org/pdf/2511.13653

What Exactly has the OpenAI Team Released?

The OpenAI team has released the openai/circuit-sparsity model on Hugging Face. This is a 0.4B parameter model tagged with custom_code, corresponding to csp_yolo2 in the research paper. The model is used for the qualitative results on bracket counting and variable binding. It is licensed under Apache 2.0.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

if __name__ == "__main__":
    PROMPT = "def square_sum(xs):\n    return sum(x * x for x in xs)\n\nsquare_sum([1, 2, 3])\n"
    tok = AutoTokenizer.from_pretrained("openai/circuit-sparsity", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        "openai/circuit-sparsity",
        trust_remote_code=True,
        torch_dtype="auto",
    )
    model.to("cuda" if torch.cuda.is_available() else "cpu")

    inputs = tok(PROMPT, return_tensors="pt", add_special_tokens=False)["input_ids"].to(
        model.device
    )
    with torch.no_grad():
        out = model.generate(
            inputs,
            max_new_tokens=64,
            do_sample=True,
            temperature=0.8,
            top_p=0.95,
            return_dict_in_generate=False,
        )

    print(tok.decode(out[0], skip_special_tokens=True))

Key Takeaways

Weight sparse training, not post hoc pruning: Circuit sparsity trains GPT-2 style decoder models with extreme weight sparsity enforced during optimization, most weights are zero so each neuron has only a few connections.

Small, task specific circuits with explicit nodes and edges: The research team defines circuits at the level of individual neurons, attention channels and residual channels, and recovers circuits that often have tens of nodes and few edges for 20 binary Python next token tasks.

Quote closing and type tracking are fully instantiated circuits: For tasks like single_double_quote, bracket_counting and set_or_string_fixedvarname, the research team isolate circuits that implement concrete algorithms for quote detection, bracket depth and variable type tracking, with the string closing circuit using 12 nodes and 9 edges.

Models and tooling on Hugging Face and GitHub: OpenAI released the 0.4B parameter openai/circuit-sparsity model on Hugging Face and the full openai/circuit_sparsity codebase on GitHub under Apache 2.0, including model checkpoints, task definitions and a circuit visualization UI.

Bridge mechanism to relate sparse and dense models: The work introduces encoder-decoder bridges that map between sparse and dense activations, which lets researchers transfer sparse feature interventions into standard dense transformers and study how interpretable circuits relate to real production scale models.

Check out the Paper and Model Weights.
The post OpenAI has Released the ‘circuit-sparsity’: A Set of Open Tools for Connecting Weight Sparse Models and Dense Baselines through Activation Bridges appeared first on MarkTechPost.

5 AI Model Architectures Every AI Engineer Should Know

Everyone talks about LLMs—but today’s AI ecosystem is far bigger than just language models. Behind the scenes, a whole family of specialized architectures is quietly transforming how machines see, plan, act, segment, represent concepts, and even run efficiently on small devices. Each of these models solves a different part of the intelligence puzzle, and together they’re shaping the next generation of AI systems.

In this article, we’ll explore the five major players: Large Language Models (LLMs), Vision-Language Models (VLMs), Mixture of Experts (MoE), Large Action Models (LAMs) & Small Language Models (SLMs).

Large Language Models (LLMs)

LLMs take in text, break it into tokens, turn those tokens into embeddings, pass them through layers of transformers, and generate text back out. Models like ChatGPT, Claude, Gemini, Llama, and others all follow this basic process.

At their core, LLMs are deep learning models trained on massive amounts of text data. This training allows them to understand language, generate responses, summarize information, write code, answer questions, and perform a wide range of tasks. They use the transformer architecture, which is extremely good at handling long sequences and capturing complex patterns in language.

Today, LLMs are widely accessible through consumer tools and assistants—from OpenAI’s ChatGPT and Anthropic’s Claude to Meta’s Llama models, Microsoft Copilot, and Google’s Gemini and BERT/PaLM family. They’ve become the foundation of modern AI applications because of their versatility and ease of use.

Vision-Language Models (VLMs)

VLMs combine two worlds:

A vision encoder that processes images or video

A text encoder that processes language

Both streams meet in a multimodal processor, and a language model generates the final output.

Examples include GPT-4V, Gemini Pro Vision, and LLaVA.

A VLM is essentially a large language model that has been given the ability to see. By fusing visual and text representations, these models can understand images, interpret documents, answer questions about pictures, describe videos, and more.

Traditional computer vision models are trained for one narrow task—like classifying cats vs. dogs or extracting text from an image—and they can’t generalize beyond their training classes. If you need a new class or task, you must retrain them from scratch.

VLMs remove this limitation. Trained on huge datasets of images, videos, and text, they can perform many vision tasks zero-shot, simply by following natural language instructions. They can do everything from image captioning and OCR to visual reasoning and multi-step document understanding—all without task-specific retraining.

This flexibility makes VLMs one of the most powerful advances in modern AI.

Mixture of Experts (MoE)

Mixture of Experts models build on the standard transformer architecture but introduce a key upgrade: instead of one feed-forward network per layer, they use many smaller expert networks and activate only a few for each token. This makes MoE models extremely efficient while offering massive capacity.

In a regular transformer, every token flows through the same feed-forward network, meaning all parameters are used for every token. MoE layers replace this with a pool of experts, and a router decides which experts should process each token (Top-K selection). As a result, MoE models may have far more total parameters, but they only compute with a small fraction of them at a time—giving sparse compute.

For example, Mixtral 8×7B has 46B+ parameters, yet each token uses only about 13B.

This design drastically reduces inference cost. Instead of scaling by making the model deeper or wider (which increases FLOPs), MoE models scale by adding more experts, boosting capacity without raising per-token compute. This is why MoEs are often described as having “bigger brains at lower runtime cost.”
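
To make the routing idea concrete, here is a toy sketch of a top-k MoE layer in PyTorch (illustrative only; production implementations add load-balancing losses and fused expert kernels):

import torch
import torch.nn.functional as F

class TopKMoELayer(torch.nn.Module):
    """Toy MoE feed-forward layer: a router picks k experts per token, and only those experts run."""
    def __init__(self, d_model, d_ff, n_experts=8, k=2):
        super().__init__()
        self.router = torch.nn.Linear(d_model, n_experts)
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(d_model, d_ff), torch.nn.GELU(), torch.nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                    # x: (num_tokens, d_model)
        weights, idx = torch.topk(self.router(x), self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                     # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out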

Large Action Models (LAMs)

Large Action Models go a step beyond generating text—they turn intent into action. Instead of just answering questions, a LAM can understand what a user wants, break the task into steps, plan the required actions, and then execute them in the real world or on a computer.

A typical LAM pipeline includes:

Perception – Understanding the user’s input

Intent recognition – Identifying what the user is trying to achieve

Task decomposition – Breaking the goal into actionable steps

Action planning + memory – Choosing the right sequence of actions using past and present context

Execution – Carrying out tasks autonomously

Examples include Rabbit R1, Microsoft’s UFO framework, and Claude Computer Use, all of which can operate apps, navigate interfaces, or complete tasks on behalf of a user.

LAMs are trained on massive datasets of real user actions, giving them the ability to not just respond, but act—booking rooms, filling forms, organizing files, or performing multi-step workflows. This shifts AI from a passive assistant into an active agent capable of complex, real-time decision-making.

Small Language Models (SLMs)

SLMs are lightweight language models designed to run efficiently on edge devices, mobile hardware, and other resource-constrained environments. They use compact tokenization, optimized transformer layers, and aggressive quantization to make local, on-device deployment possible. Examples include Phi-3, Gemma, Mistral 7B, and Llama 3.2 1B.

Unlike LLMs, which may have hundreds of billions of parameters, SLMs typically range from a few million to a few billion. Despite their smaller size, they can still understand and generate natural language, making them useful for chat, summarization, translation, and task automation—without needing cloud computation.

Because they require far less memory and compute, SLMs are ideal for:

Mobile apps

IoT and edge devices

Offline or privacy-sensitive scenarios

Low-latency applications where cloud calls are too slow

SLMs represent a growing shift toward fast, private, and cost-efficient AI, bringing language intelligence directly onto personal devices.

The post 5 AI Model Architectures Every AI Engineer Should Know appeared first on MarkTechPost.

Nanbeige4-3B-Thinking: How a 23T Token Pipeline Pushes 3B Models Past …

Can a 3B model deliver 30B class reasoning by fixing the training recipe instead of scaling parameters? Nanbeige LLM Lab at Boss Zhipin has released Nanbeige4-3B, a 3B parameter small language model family trained with an unusually heavy emphasis on data quality, curriculum scheduling, distillation, and reinforcement learning.

The research team ships 2 primary checkpoints, Nanbeige4-3B-Base and Nanbeige4-3B-Thinking, and evaluates the reasoning tuned model against Qwen3 checkpoints from 4B up to 32B parameters.

https://arxiv.org/pdf/2512.06266

Benchmark results

On AIME 2024, Nanbeige4-3B-2511 reports 90.4, while Qwen3-32B-2504 reports 81.4. On GPQA-Diamond, Nanbeige4-3B-2511 reports 82.2, while Qwen3-14B-2504 reports 64.0 and Qwen3-32B-2504 reports 68.7. These are the 2 benchmarks where the paper's “3B beats 10× larger” framing is directly supported.

The research team also shows strong tool use gains on BFCL-V4: Nanbeige4-3B reports 53.8 versus 47.9 for Qwen3-32B and 48.6 for Qwen3-30B-A3B. On Arena-Hard V2, Nanbeige4-3B reports 60.0, matching the highest score listed in that comparison table in the research paper. At the same time, the model is not best across every category: on Fullstack-Bench it reports 48.0, below Qwen3-14B at 55.7 and Qwen3-32B at 58.2, and on SuperGPQA it reports 53.2, slightly below Qwen3-32B at 54.1.

https://arxiv.org/pdf/2512.06266

The training recipe, the parts that move a 3B model

Hybrid Data Filtering, then resampling at scale

For pretraining, the research team combines multi-dimensional tagging with similarity-based scoring. They reduce their labeling space to 20 dimensions and report 2 key findings: content-related labels are more predictive than format labels, and a fine-grained 0 to 9 scoring scheme outperforms binary labeling. For similarity-based scoring, they build a retrieval database with hundreds of billions of entries supporting hybrid text and vector retrieval.

They filter to 12.5T tokens of high-quality data, then select a 6.5T higher-quality subset and upsample it for 2 or more epochs, producing a final 23T token training corpus. This is the first place where the report diverges from typical small model training: the pipeline is not just "clean data", it is scored, retrieved, and resampled with explicit utility assumptions.

FG-WSD, a data utility scheduler instead of uniform sampling

Most comparable projects treat warmup-stable-decay as a learning-rate schedule only. Nanbeige4-3B adds a data curriculum inside the stable phase via FG-WSD, Fine-Grained Warmup-Stable-Decay: instead of sampling a fixed mixture throughout stable training, the schedule progressively concentrates higher-quality data later in training.
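A hedged sketch of such a staged data mixture is below; the stage boundaries and mixture weights are invented for illustration and are not the paper's actual schedule:

def mixture_for_step(step: int, total_steps: int) -> dict:
    """Return data-source sampling weights for the current training step."""
    frac = step / total_steps
    if frac < 0.05:                                    # warmup
        return {"diverse_web": 0.8, "high_quality": 0.2}
    if frac < 0.60:                                    # diversity-enriched stable
        return {"diverse_web": 0.6, "high_quality": 0.4}
    if frac < 0.90:                                    # high-quality stable
        return {"diverse_web": 0.3, "high_quality": 0.7}
    return {"diverse_web": 0.1, "high_quality": 0.9}   # decay

for step in (0, 5_000, 8_500, 9_900):
    print(step, mixture_for_step(step, 10_000))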

https://arxiv.org/pdf/2512.06266

In a 1B ablation trained on 1T tokens, the paper's ablation table shows GSM8K improving from 27.1 under vanilla WSD to 34.3 under FG-WSD, with gains across CMATH, BBH, MMLU, CMMLU, and MMLU-Pro. In the full 3B run, the research team splits training into Warmup, Diversity-Enriched Stable, High-Quality Stable, and Decay stages, and uses ABF in the decay stage to extend the context length to 64K.

https://arxiv.org/pdf/2512.06266

Multi-stage SFT, then fix the supervision traces

Post-training starts with cold-start SFT, then overall SFT. The cold-start stage uses about 30M QA samples focused on math, science, and code, with a 32K context length and a reported mix of roughly 50% math reasoning, 30% scientific reasoning, and 20% code tasks. The research team also claims that scaling cold-start SFT instructions from 0.5M to 35M keeps improving AIME 2025 and GPQA-Diamond, with no early saturation in their experiments.

https://arxiv.org/pdf/2512.06266

Overall SFT shifts to a 64K-context mix that includes general conversation and writing, agent-style tool use and planning, harder reasoning that targets known weaknesses, and coding tasks. This stage introduces solution refinement plus chain-of-thought reconstruction: the system runs iterative generate, critique, and revise cycles guided by a dynamic checklist, then uses a chain-completion model to reconstruct a coherent CoT that is consistent with the final refined solution. This is meant to avoid training on reasoning traces that no longer match the answer after heavy editing.
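A conceptual sketch of that loop is shown below; generate, critique, and reconstruct_cot are placeholders for model calls, not components from the paper:

def refine_with_cot(problem, generate, critique, reconstruct_cot, max_rounds=3):
    """Generate-critique-revise, then rebuild a CoT consistent with the final answer."""
    solution = generate(problem)
    for _ in range(max_rounds):
        issues = critique(problem, solution)   # checklist-guided critique
        if not issues:
            break
        solution = generate(problem, feedback=issues)
    # Reconstruct a coherent chain of thought that matches the edited solution,
    # so training never sees reasoning that disagrees with the answer.
    cot = reconstruct_cot(problem, solution)
    return {"problem": problem, "cot": cot, "solution": solution}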

https://arxiv.org/pdf/2512.06266

DPD distillation, then multi stage RL with verifiers

Distillation uses Dual-Level Preference Distillation (DPD). The student learns token-level distributions from the teacher model, while a sequence-level DPO objective maximizes the margin between positive and negative responses. Positives are sampled from the teacher, Nanbeige3.5-Pro, while negatives are sampled from the 3B student, and distillation is applied to both sample types to reduce confident errors and improve alternative responses.
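A hedged sketch of such a combined objective, token-level KL to the teacher plus a DPO-style sequence margin, is below; the shapes, beta, and weighting are assumptions for illustration, not the paper's exact loss:

import torch
import torch.nn.functional as F

def dpd_style_loss(student_logits, teacher_logits,
                   logp_student_pos, logp_student_neg,
                   logp_ref_pos, logp_ref_neg,
                   beta=0.1, alpha=1.0):
    # Token-level distribution matching: KL between student and teacher next-token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    # Sequence-level DPO-style margin between teacher-sampled positives and student-sampled negatives.
    margin = (logp_student_pos - logp_student_neg) - (logp_ref_pos - logp_ref_neg)
    dpo = -F.logsigmoid(beta * margin).mean()
    return kl + alpha * dpo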

Reinforcement learning is staged by domain, and each stage uses on-policy GRPO. The research team describes on-policy data filtering based on avg@16 pass rate, retaining only samples whose rate falls strictly between 10% and 90% to avoid trivial or impossible items. STEM RL uses an agentic verifier that calls a Python interpreter to check answer equivalence beyond string matching. Coding RL uses synthetic test functions, validated via sandbox execution, with pass/fail rewards from those tests. Human preference alignment RL uses a pairwise reward model designed to emit a preference in a few tokens, reducing the reward hacking risk compared to general language-model rewarders.
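The pass-rate gate is straightforward to sketch; verify and sample below are placeholders for the domain verifier and the policy sampler, and only the thresholds follow the description above:

def filter_rl_prompts(prompts, sample, verify, k=16, low=0.10, high=0.90):
    """Keep prompts whose avg@k pass rate is strictly between low and high."""
    kept = []
    for prompt in prompts:
        passes = sum(bool(verify(prompt, sample(prompt))) for _ in range(k))
        rate = passes / k
        if low < rate < high:   # drop trivial (always solved) and impossible items
            kept.append((prompt, rate))
    return kept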

https://arxiv.org/pdf/2512.06266

Comparison Table

Benchmark (metric)        | Qwen3-14B-2504 | Qwen3-32B-2504 | Nanbeige4-3B-2511
AIME 2024 (avg@8)         | 79.3           | 81.4           | 90.4
AIME 2025 (avg@8)         | 70.4           | 72.9           | 85.6
GPQA-Diamond (avg@3)      | 64.0           | 68.7           | 82.2
SuperGPQA (avg@3)         | 46.8           | 54.1           | 53.2
BFCL-V4 (avg@3)           | 45.4           | 47.9           | 53.8
Fullstack Bench (avg@3)   | 55.7           | 58.2           | 48.0
ArenaHard-V2 (avg@3)      | 39.9           | 48.4           | 60.0

Key Takeaways

3B can lead much larger open models on reasoning, under the paper’s averaged sampling setup. Nanbeige4-3B-Thinking reports AIME 2024 avg@8 90.4 vs Qwen3-32B 81.4, and GPQA-Diamond avg@3 82.2 vs Qwen3-14B 64.0.

The research team is careful about evaluation: these are avg@k results under specific decoding, not single-shot accuracy. AIME is avg@8 and most other benchmarks are avg@3, with temperature 0.6, top-p 0.95, and a long maximum generation length.

Pretraining gains are tied to data curriculum, not just more tokens. Fine-Grained WSD schedules higher quality mixtures later, and the 1B ablation shows GSM8K moving from 27.1 to 34.3 versus vanilla scheduling.

Post-training focuses on supervision quality, then preference-aware distillation. The pipeline uses deliberative solution refinement plus chain-of-thought reconstruction, then Dual-Level Preference Distillation (DPD), which combines token-level distribution matching with sequence-level preference optimization.

Check out the Paper and Model Weights.
The post Nanbeige4-3B-Thinking: How a 23T Token Pipeline Pushes 3B Models Past 30B Class Reasoning appeared first on MarkTechPost.

How to Design a Fully Local Agentic Storytelling Pipeline Using Griptape Workflows, Hugging Face Models, and Modular Creative Task Orchestration

In this tutorial, we build a fully local, API-free agentic storytelling system using Griptape and a lightweight Hugging Face model. We walk through creating an agent with tool-use abilities, generating a fictional world, designing characters, and orchestrating a multi-stage workflow that produces a coherent short story. By dividing the implementation into modular snippets, we can clearly understand each component as it comes together into an end-to-end creative pipeline. Check out the FULL CODES here.

!pip install -q "griptape[drivers-prompt-huggingface-pipeline]" "transformers" "accelerate" "sentencepiece"

import textwrap
from griptape.structures import Workflow, Agent
from griptape.tasks import PromptTask
from griptape.tools import CalculatorTool
from griptape.rules import Rule, Ruleset
from griptape.drivers.prompt.huggingface_pipeline import HuggingFacePipelinePromptDriver

local_driver = HuggingFacePipelinePromptDriver(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_tokens=256,
)

def show(title, content):
    print(f"\n{'='*20} {title} {'='*20}")
    print(textwrap.fill(str(content), width=100))

We set up our environment by installing Griptape and initializing a local Hugging Face driver. We configure a helper function to display outputs cleanly, allowing us to follow each step of the workflow. As we build the foundation, we ensure everything runs locally without relying on external APIs. Check out the FULL CODES here.

math_agent = Agent(
    prompt_driver=local_driver,
    tools=[CalculatorTool()],
)

math_response = math_agent.run(
    "Compute (37*19)/7 and explain the steps briefly."
)

show("Agent + CalculatorTool", math_response.output.value)

We create an agent equipped with a calculator tool and test it with a simple mathematical prompt. We observe how the agent delegates computation to the tool and then formulates a natural-language explanation. By running this, we validate that our local driver and tool integration work correctly. Check out the FULL CODES here.

world_task = PromptTask(
    input="Create a vivid fictional world using these cues: {{ args[0] }}.\nDescribe geography, culture, and conflicts in 3–5 paragraphs.",
    id="world",
    prompt_driver=local_driver,
)

def character_task(task_id, name):
    return PromptTask(
        input=(
            "Based on the world below, invent a detailed character named {{ name }}.\n"
            "World description:\n{{ parent_outputs['world'] }}\n\n"
            "Describe their background, desires, flaws, and one secret."
        ),
        id=task_id,
        parent_ids=["world"],
        prompt_driver=local_driver,
        context={"name": name},
    )

scotty_task = character_task("scotty", "Scotty")
annie_task = character_task("annie", "Annie")

We build the world-generation task and dynamically construct character-generation tasks that depend on the world’s output. We define a reusable function to create character tasks conditioned on shared context. As we assemble these components, we see how the workflow begins to take shape through hierarchical dependencies. Check out the FULL CODES here.

style_ruleset = Ruleset(
    name="StoryStyle",
    rules=[
        Rule("Write in a cinematic, emotionally engaging style."),
        Rule("Avoid explicit gore or graphic violence."),
        Rule("Keep the story between 400 and 700 words."),
    ],
)

story_task = PromptTask(
    input=(
        "Write a complete short story using the following elements.\n\n"
        "World:\n{{ parent_outputs['world'] }}\n\n"
        "Character 1 (Scotty):\n{{ parent_outputs['scotty'] }}\n\n"
        "Character 2 (Annie):\n{{ parent_outputs['annie'] }}\n\n"
        "The story must have a clear beginning, middle, and end, with a meaningful character decision near the climax."
    ),
    id="story",
    parent_ids=["world", "scotty", "annie"],
    prompt_driver=local_driver,
    rulesets=[style_ruleset],
)

story_workflow = Workflow(tasks=[world_task, scotty_task, annie_task, story_task])
topic = "tidally locked ocean world with floating cities powered by storms"
story_workflow.run(topic)

We introduce stylistic rules and create the final storytelling task that merges worldbuilding and characters into a coherent narrative. We then assemble all tasks into a workflow and run it with a chosen topic. Through this, we witness how Griptape chains multiple prompts into a structured creative pipeline. Check out the FULL CODES here.

world_text = world_task.output.value
scotty_text = scotty_task.output.value
annie_text = annie_task.output.value
story_text = story_task.output.value

show("Generated World", world_text)
show("Character: Scotty", scotty_text)
show("Character: Annie", annie_text)
show("Final Story", story_text)

def summarize_story(text):
    paragraphs = [p for p in text.split("\n") if p.strip()]
    length = len(text.split())
    structure_score = min(len(paragraphs), 10)
    return {
        "word_count": length,
        "paragraphs": len(paragraphs),
        "structure_score_0_to_10": structure_score,
    }

metrics = summarize_story(story_text)
show("Story Metrics", metrics)

We retrieve all generated outputs and display the world, characters, and final story. We also compute simple metrics to evaluate structure and length, giving us a quick analytical summary. As we wrap up, we observe that the full workflow produces measurable, interpretable results.

In conclusion, we demonstrate how easily we can orchestrate complex reasoning steps, tool interactions, and creative generation using local models within the Griptape framework. We experience how modular tasks, rulesets, and workflows merge into a powerful agentic system capable of producing structured narrative outputs. By running everything without external APIs, we gain full control, reproducibility, and flexibility, opening the door to more advanced experiments in local agent pipelines, automated writing systems, and multi-task orchestration.

Check out the FULL CODES here.
The post How to Design a Fully Local Agentic Storytelling Pipeline Using Griptape Workflows, Hugging Face Models, and Modular Creative Task Orchestration appeared first on MarkTechPost.

Building a voice-driven AWS assistant with Amazon Nova Sonic

As cloud infrastructure becomes increasingly complex, the need for intuitive and efficient management interfaces has never been greater. Traditional command-line interfaces (CLI) and web consoles, while powerful, can create barriers to quick decision-making and operational efficiency. What if you could speak to your AWS infrastructure and get immediate, intelligent responses?
In this post, we explore how to build a sophisticated voice-powered AWS operations assistant using Amazon Nova Sonic for speech processing and Strands Agents for multi-agent orchestration. This solution demonstrates how natural language voice interactions can transform cloud operations, making AWS services more accessible and operations more efficient.
The multi-agent architecture we demonstrate extends beyond basic AWS operations to support diverse use cases including customer service automation, internet-of-things (IoT) device management, financial data analysis, and enterprise workflow orchestration. This foundational pattern can be adapted for any domain requiring intelligent task routing and natural language interaction.
Architecture deep dive
This section explores the technical architecture that powers our voice-driven AWS assistant. The following diagram illustrates how Amazon Nova Sonic integrates with Strands Agents to create a seamless multi-agent system that processes voice commands and executes AWS operations in real-time.

Core components
The multi-agent architecture consists of several specialized components that work together to process voice commands and execute AWS operations:

Supervisor Agent: Acts as the central coordinator, analyzing incoming voice queries and routing them to the appropriate specialized agent based on context and intent.
Specialized Agents:

EC2 Agent: Handles instance management, status monitoring, and compute operations
SSM Agent: Manages Systems Manager operations, command execution, and patch management
Backup Agent: Oversees AWS Backup configurations, job monitoring, and restore operations

Voice Integration Layer: Uses Amazon Nova Sonic for bidirectional voice processing, converting speech to text for processing and text back to speech for responses.

Solution overview
The Strands Agents Nova Voice Assistant demonstrates a new paradigm for AWS infrastructure management through conversational artificial intelligence (AI). Instead of navigating complex web consoles or memorizing CLI commands, users can simply speak their intentions and receive immediate responses. This solution bridges the gap between natural human communication and technical AWS operations, making cloud management accessible to both technical and non-technical team members.
Technology stack
The solution uses modern, cloud-native technologies to deliver a robust and scalable voice interface:

Backend: Python 3.12+ with Strands Agents framework for agent orchestration
Frontend: React with AWS Cloudscape Design System for consistent AWS UI/UX
AI models: Amazon Bedrock with Claude 3 Haiku for natural language understanding and generation
Voice processing: Amazon Nova Sonic for high-quality speech synthesis and recognition
Communication: WebSocket server for real-time bidirectional communication

Key features and capabilities
Our voice-driven assistant offers several advanced features that make AWS operations more intuitive and efficient. The system understands natural voice queries and converts them into appropriate AWS API calls. For example:

“Show me all running EC2 instances in us-east-1”
“Install Amazon CloudWatch agent using SSM on my Dev instances”
“Check the status of last night’s backup jobs”

The responses are specifically optimized for voice delivery, with concise summaries limited to 800 characters, clear structured information delivery, and conversational phrasing that sounds natural when spoken aloud (avoiding technical jargon and using complete sentences suitable for speech synthesis).
Implementation overview
Getting started with the voice-driven AWS assistant involves three main steps:
Environment setup

Configure AWS credentials with access to Bedrock, Nova Sonic, and target AWS services
Set up Python 3.12+ backend environment and React frontend
Ensure proper AWS Identity and Access Management (IAM) permissions for multi-agent operations

Launch the application

Start the Python WebSocket server for voice processing
Launch the React frontend with AWS Cloudscape components
Configure voice settings and WebSocket connections

Begin voice interactions

Grant browser microphone permissions for voice input
Test with example commands like “List my EC2 instances” or “Check backup status”
Experience real-time voice responses through Amazon Nova Sonic

Ready to build your own? Complete deployment instructions, code examples, and troubleshooting guides are available in the GitHub repository.
Example prompts to test through audio
Test your voice assistant with these example commands:
EC2 instance management:

“List my dev EC2 instances where tag key is ‘env’”
“What’s the status of those instances?”
“Start those instances”
“Do these instances have SSM permissions?”

Backup management:

“Make sure these instances are backed up daily”

SSM management:

“Install CloudWatch agent using SSM on these instances”
“Scan these instances for patches using SSM”

Demo video
The following video demonstrates the voice assistant in action, showing how natural language commands are processed and executed against AWS services via real-time voice interaction, agent coordination, and AWS API responses.

Implementation examples
The following code examples demonstrate key integration patterns and best practices for implementing your voice-driven AWS assistant. These examples show how to integrate Amazon Nova Sonic for voice processing and configure the supervisor agent for intelligent task routing.
AWS Strands Agents setup
The implementation uses a multi-agent orchestrator pattern with specialized agents:

from strands import Agent
from config.conversation_config import ConversationConfig
from config.config import create_bedrock_model

class SupervisorAgent(Agent):
    def __init__(self, specialized_agents, config=None):
        bedrock_model = create_bedrock_model(config)
        conversation_manager = ConversationConfig.create_conversation_manager("supervisor")

        super().__init__(
            model=bedrock_model,
            system_prompt=self._get_routing_instructions(),
            tools=[],  # No tools for pure router
            conversation_manager=conversation_manager,
        )
        self.specialized_agents = specialized_agents
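
The repository contains the actual routing instructions and dispatch logic. As a hedged, simplified sketch of the pattern (not the repository's code), intent-based dispatch can be expressed as a small helper, where classify_intent and the agents mapping are placeholders for the real components:

def route_query(query: str, classify_intent, agents: dict):
    """Route a voice query to the specialized agent matching its intent."""
    intent = classify_intent(query)            # e.g. "ec2", "ssm", or "backup"
    agent = agents.get(intent)
    if agent is None:
        return "Sorry, I can't handle that request yet."
    return agent(query)                        # each agent is a callable wrapper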

Nova Sonic integration
The implementation uses a WebSocket server with session management for real-time voice processing:

import asyncio

class S2sSessionManager:
    def __init__(self, model_id='amazon.nova-sonic-v1:0', region='us-east-1', config=None):
        self.model_id = model_id
        self.region = region
        self.audio_input_queue = asyncio.Queue()
        self.output_queue = asyncio.Queue()
        self.supervisor_agent = SupervisorAgentIntegration(config)

    async def processToolUse(self, toolName, toolUseContent):
        if toolName == "supervisoragent":
            # `content` is the user query extracted from toolUseContent (elided in this excerpt)
            result = await self.supervisor_agent.query(content)
            if len(result) > 800:
                result = result[:800] + "… (truncated for voice)"
            return {"result": result}

Security best practices
This solution is designed for development and testing purposes. Before deploying to production environments, implement appropriate security controls including:

Authentication and authorization mechanisms
Network security controls and access restrictions
Monitoring and logging for audit compliance
Cost controls and usage monitoring

Note: Always follow AWS security best practices and the principle of least privilege when configuring IAM permissions.
Production considerations
While this solution demonstrates Strands Agents capabilities using a development-focused deployment approach, organizations planning production implementations should consider Amazon Bedrock AgentCore Runtime for enterprise-grade hosting and management. Amazon Bedrock AgentCore offers several benefits for production deployment:

Serverless runtime: Purpose-built for deploying and scaling dynamic AI agents without managing infrastructure
Session isolation: Complete session isolation with dedicated microVMs for each user session, critical for agents performing privileged operations
Auto-scaling: Scale up to thousands of agent sessions in seconds with pay-per-usage pricing
Enterprise security: Built-in security controls with seamless integration to identity providers (Amazon Cognito, Microsoft Entra ID, Okta)
Observability: Built-in distributed tracing, metrics, and debugging capabilities through Amazon CloudWatch integration
Session persistence: Highly reliable with session persistence for long-running agent interactions

For organizations ready to move beyond development and testing, Amazon Bedrock AgentCore Runtime provides the production-ready foundation needed to deploy voice-driven AWS assistants at enterprise scale.
Integration with additional AWS services
The system can be extended to support additional AWS services:

AWS Lambda Functions: Execute serverless functions via voice commands
CloudWatch: Query metrics and logs through natural language
Amazon Relational Database Service (RDS): Database management and monitoring operations

Conclusion
The Strands Agents Nova Voice Assistant demonstrates the powerful potential of combining voice interfaces with intelligent agent orchestration across diverse domains. By leveraging Amazon Nova Sonic for speech processing and Strands Agents for multi-agent coordination, organizations can create more intuitive and efficient ways to interact with complex systems and workflows.
This foundational architecture extends far beyond cloud operations to enable voice-driven solutions for customer service automation, financial analysis, IoT device management, healthcare workflows, supply chain optimization, and countless other enterprise applications. The combination of natural language processing, intelligent routing, and specialized domain knowledge creates a versatile platform for transforming how users interact with any complex system. The modular architecture ensures scalability and extensibility, allowing organizations to customize the solution for their specific domains and use cases. As voice interfaces continue to evolve and AI capabilities advance, solutions like this are likely to become increasingly important for managing complex environments across all industries.
Getting started
Ready to build your own voice-powered AWS operations assistant? The complete source code and documentation are available in the GitHub repository. Follow this implementation guide to get started, and don’t hesitate to customize the solution for your specific use cases.
For questions, feedback, or contributions, please visit the project repository or reach out through the AWS community forums.

About the authors:
Jagdish Komakula is a passionate Sr. Delivery Consultant working with AWS Professional Services. With over two decades of experience in Information Technology, he helped numerous enterprise clients successfully navigate their digital transformation journeys and cloud adoption initiatives.
Aditya Ambati is an experienced DevOps Engineer with 14 plus years of experience in IT. He has an excellent reputation for resolving problems, improving customer satisfaction, and driving overall operational improvements.
Anand Krishna Varanasi is a seasoned AWS builder and architect who began his career over 17 years ago. He guides customers with cutting-edge cloud technology migration strategies (the 7 Rs) and modernization. He is passionate about the role that technology plays in bridging the present with all the possibilities for our future.
D.T.V.R.L Phani Kumar is a visionary DevOps Consultant with 10+ years of groundbreaking technology leadership, specializing in transformative automation strategies. As a distinguished engineer, he expertly bridges AI/ML innovations with DevOps practices, consistently delivering revolutionary solutions that redefine operational excellence and customer experiences. His strategic approach and technical mastery have positioned him as a thought leader in driving technological paradigm shifts.