Mistral AI Releases OCR 3: A Smaller Optical Character Recognition (OCR) Model for Structured Document AI at Scale

Mistral AI has released Mistral OCR 3, its latest optical character recognition service that powers the company’s Document AI stack. The model, named mistral-ocr-2512, is built to extract interleaved text and images from PDFs and other documents while preserving structure, and it does this at an aggressive price of $2 per 1,000 pages, with a 50% discount when used through the Batch API.

What Is Mistral OCR 3 Optimized For?

Mistral OCR 3 targets typical enterprise document workloads. The model is tuned for forms, scanned documents, complex tables, and handwriting. It is evaluated on internal benchmarks drawn from real business use cases, where it achieves a 74% overall win rate over Mistral OCR 2 across these document categories using a fuzzy match metric against ground truth.

The model outputs markdown that preserves document layout, and when table formatting is enabled, it enriches the output with HTML based table representations. This combination gives downstream systems both the content and the structural information that is needed for retrieval pipelines, analytics, and agent workflows.

Role in Mistral Document AI

OCR 3 sits inside Mistral Document AI, the company’s document processing capability that combines OCR with structured data extraction and Document QnA.

It now powers the Document AI Playground in Mistral AI Studio. In this interface, users upload PDFs or images and get back either clean text or structured JSON without writing code. The same underlying OCR pipeline is accessible via the public API, which allows teams to move from interactive exploration to production workloads without changing the core model.

Inputs, Outputs, And Structure

The OCR processor accepts multiple document formats through a single API. The document field can point to:

document_url for PDFs, pptx, docx and more

image_url for image types such as png, jpeg or avif

Uploaded or base64 encoded PDFs or images through the same schema

This is documented in the OCR Processor section of Mistral’s Document AI docs.

The response is a JSON object with a pages array. Each page contains an index, a markdown string, a list of images, a list of tables when table_format="html" is used, detected hyperlinks, optional header and footer fields when header or footer extraction is enabled, and a dimensions object with page size. There is also a document_annotation field for structured annotations and a usage_info block for accounting information.

When images and HTML tables are extracted, the markdown includes placeholders such as ![img-0.jpeg](img-0.jpeg) and [tbl-3.html](tbl-3.html). These placeholders are mapped back to actual content using the images and tables arrays in the response, which simplifies downstream reconstruction.
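To make the request and response shape concrete, here is a minimal sketch using the mistralai Python SDK's OCR interface; the document URL is a placeholder, and options such as HTML table output or header and footer extraction are not shown here and should be checked against the Document AI docs before use:

# Minimal sketch of an OCR call with the mistralai Python SDK.
# The document URL is hypothetical; extra options (HTML tables, headers/footers) are omitted.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.ocr.process(
    model="mistral-ocr-2512",
    document={
        "type": "document_url",
        "document_url": "https://example.com/sample-invoice.pdf",  # hypothetical document
    },
    include_image_base64=True,  # return embedded images alongside the markdown
)

for page in response.pages:
    print(page.index)
    print(page.markdown[:200])              # markdown with ![img-0.jpeg](img-0.jpeg) style placeholders
    print([img.id for img in page.images])  # ids map the placeholders back to extracted images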

Upgrades Over Mistral OCR 2

Mistral OCR 3 introduces several concrete upgrades relative to OCR 2. The public release notes emphasize four main areas.

Handwriting: Mistral OCR 3 more accurately interprets cursive, mixed content annotations, and handwritten text placed on top of printed templates.

Forms: It improves detection of boxes, labels, and handwritten entries in dense layouts such as invoices, receipts, compliance forms, and government documents.

Scanned and complex documents: The model is more robust to compression artifacts, skew, distortion, low DPI, and background noise in scanned pages.

Complex tables: It reconstructs table structures with headers, merged cells, multi-row blocks, and column hierarchies, and it can return HTML tables with proper colspan and rowspan tags so that layout is preserved.

https://mistral.ai/news/mistral-ocr-3

Pricing, Batch Inference, And Annotations

The OCR 3 model card lists pricing at $2 per 1,000 pages for standard OCR and $3 per 1,000 annotated pages when structured annotations are used.

Mistral also exposes OCR 3 through its Batch Inference API /v1/batch, which is documented under the batching section of the platform. Batch processing halves the effective OCR price to $1 per 1,000 pages by applying a 50% discount for jobs that run through the batch pipeline.

The model integrates with two important features on the same endpoint, Structured Annotations and BBox Extraction. These allow developers to attach schema-driven labels to regions of a document and get bounding boxes for text and other elements, which is useful when mapping content into downstream systems or UI overlays.

Key Takeaways

Model and role: Mistral OCR 3, named mistral-ocr-2512, is the new OCR service that powers Mistral’s Document AI stack for page-based document understanding.

Accuracy gains: On internal benchmarks covering forms, scanned documents, complex tables, and handwriting, OCR 3 achieves a 74% overall win rate over Mistral OCR 2, and Mistral positions it as state of the art against both traditional and AI native OCR systems.

Structured outputs for RAG: The service extracts interleaved text and embedded images and returns markdown enriched with HTML reconstructed tables, preserving layout and table structure so outputs can feed directly into RAG, agents, and search pipelines with minimal extra parsing.

API and document formats: Developers access OCR 3 via the /v1/ocr endpoint or SDK, passing PDFs as document_url and images such as png or jpeg as image_url, and can enable options like HTML table output, header or footer extraction, and base64 images in the response.

Pricing and batch processing: OCR 3 is priced at $2 per 1,000 pages and $3 per 1,000 annotated pages, and when used through the Batch API the effective price for standard OCR drops to $1 per 1,000 pages for large scale processing.


How to Build a High-Performance Distributed Task Routing System Using Kombu with Topic Exchanges and Concurrent Workers

In this tutorial, we build a fully functional event-driven workflow using Kombu, treating messaging as a core architectural capability. We walk through step by step the setup of exchanges, routing keys, background workers, and concurrent producers, allowing us to observe a real distributed system. As we implement each component, we see how clean message flow, asynchronous processing, and routing patterns give us the same power that production microservices rely on every day. Check out the FULL CODES.

!pip install kombu

import threading
import time
import logging
import uuid
import datetime
import sys

from kombu import Connection, Exchange, Queue, Producer, Consumer
from kombu.mixins import ConsumerMixin

logging.basicConfig(
    level=logging.INFO,
    format='%(message)s',
    handlers=[logging.StreamHandler(sys.stdout)],
    force=True
)
logger = logging.getLogger(__name__)

BROKER_URL = "memory://localhost/"

We begin by installing Kombu, importing dependencies, and configuring logging so we can clearly see every message flowing through the system. We also set the in-memory broker URL, allowing us to run everything locally in Colab without needing RabbitMQ. This setup forms the foundation for our distributed messaging workflow. Check out the FULL CODES.

media_exchange = Exchange('media_exchange', type='topic', durable=True)

task_queues = [
    Queue('video_queue', media_exchange, routing_key='video.#'),
    Queue('audit_queue', media_exchange, routing_key='#'),
]

We define a topic exchange to flexibly route messages using wildcard patterns. We also create two queues: one dedicated to video-related tasks and another audit queue that listens to everything. Using topic routing, we can precisely control how messages flow across the system. Check out the FULL CODES.
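To see the wildcard semantics in isolation, the short standalone sketch below (an illustration, not Kombu's internal matcher) converts the two binding patterns into regular expressions and shows where each routing key would land:

# Standalone illustration of AMQP topic semantics (not Kombu's internal matcher):
# '*' matches exactly one dot-separated word, '#' matches any remaining words.
import re

def topic_to_regex(pattern):
    parts = []
    for word in pattern.split('.'):
        if word == '#':
            parts.append('.*')        # any remaining words (simplified)
        elif word == '*':
            parts.append('[^.]+')     # exactly one word
        else:
            parts.append(re.escape(word))
    return re.compile('^' + r'\.'.join(parts) + '$')

bindings = {'video_queue': 'video.#', 'audit_queue': '#'}

for key in ['video.upload', 'user.login']:
    matched = [q for q, pat in bindings.items() if topic_to_regex(pat).match(key)]
    print(key, '->', matched)
# video.upload -> ['video_queue', 'audit_queue']
# user.login -> ['audit_queue']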

class Worker(ConsumerMixin):
    def __init__(self, connection, queues):
        self.connection = connection
        self.queues = queues
        self.should_stop = False

    def get_consumers(self, Consumer, channel):
        return [
            Consumer(queues=self.queues,
                     callbacks=[self.on_message],
                     accept=['json'],
                     prefetch_count=1)
        ]

    def on_message(self, body, message):
        routing_key = message.delivery_info['routing_key']
        payload_id = body.get('id', 'unknown')

        logger.info(f"\n RECEIVED MSG via key: [{routing_key}]")
        logger.info(f" Payload ID: {payload_id}")

        try:
            if 'video' in routing_key:
                self.process_video(body)
            elif 'audit' in routing_key:
                logger.info(" [Audit] Logging event...")

            message.ack()
            logger.info(" ACKNOWLEDGED")

        except Exception as e:
            logger.error(f" ERROR: {e}")

    def process_video(self, body):
        logger.info(" [Processor] Transcoding video (Simulating work...)")
        time.sleep(0.5)

We implement a custom worker using Kombu’s ConsumerMixin to run it in a background thread. In the message callback, we inspect the routing key, invoke the appropriate processing function, and acknowledge the message. This worker architecture gives us clean, concurrent message consumption with full control. Check out the FULL CODES.

def publish_messages(connection):
    producer = Producer(connection)

    tasks = [
        ('video.upload', {'file': 'movie.mp4'}),
        ('user.login', {'user': 'admin'}),
    ]

    logger.info("\n PRODUCER: Starting to publish messages...")

    for r_key, data in tasks:
        data['id'] = str(uuid.uuid4())[:8]

        logger.info(f" SENDING: {r_key} -> {data}")

        producer.publish(
            data,
            exchange=media_exchange,
            routing_key=r_key,
            serializer='json'
        )
        time.sleep(1.5)

    logger.info(" PRODUCER: Done.")

We now build a producer that sends structured JSON payloads into the exchange with different routing keys. We generate unique IDs for each event and observe how they are routed to the appropriate queues. This mirrors real-world microservice event publishing, where producers and consumers remain decoupled. Check out the FULL CODES.

def run_example():
    with Connection(BROKER_URL) as conn:
        worker = Worker(conn, task_queues)
        worker_thread = threading.Thread(target=worker.run)
        worker_thread.daemon = True
        worker_thread.start()

        logger.info(" SYSTEM: Worker thread started.")
        time.sleep(1)

        try:
            publish_messages(conn)
            time.sleep(2)
        except KeyboardInterrupt:
            pass
        finally:
            worker.should_stop = True
            logger.info("\n SYSTEM: Execution complete.")

if __name__ == "__main__":
    run_example()

We start the worker in a background thread and fire the producer in the main thread. This structure gives us a mini distributed system running in Colab. By observing the logs, we see messages published → routed → consumed → acknowledged, completing the full event-processing lifecycle.

In conclusion, we orchestrated a dynamic, distributed task-routing pipeline that processes real-time events with clarity and precision. We witnessed how Kombu abstracts away the complexity of messaging systems while still giving us fine-grained control over routing, consumption, and worker concurrency. As we see messages move from producer to exchange to queue to worker, we gained a deeper appreciation for the elegance of event-driven system design, and we are now well-equipped to scale this foundation into robust microservices, background processors, and enterprise-grade workflows.


Google Introduces T5Gemma 2: Encoder Decoder Models with Multimodal Inputs via SigLIP and 128K Context

Google has released T5Gemma 2, a family of open encoder-decoder Transformer checkpoints built by adapting Gemma 3 pretrained weights into an encoder-decoder layout, then continuing pretraining with the UL2 objective. The release is pretrained only, intended for developers to post-train for specific tasks, and Google explicitly notes it is not releasing post-trained or IT checkpoints for this drop.

T5Gemma 2 is positioned as an encoder-decoder counterpart to Gemma 3 that keeps the same low level building blocks, then adds 2 structural changes aimed at small model efficiency. The models inherit Gemma 3 features that matter for deployment, notably multimodality, long context up to 128K tokens, and broad multilingual coverage, with the blog stating over 140 languages.

https://arxiv.org/pdf/2512.14856

What did Google actually release?

The release includes 3 pretrained sizes, 270M-270M, 1B-1B, and 4B-4B, where the notation means the encoder and decoder are the same size. The research team reports approximate totals excluding the vision encoder, about 370M, 1.7B, and 7B parameters. The multimodal accounting lists a 417M parameter vision encoder, along with encoder and decoder parameters broken into embedding and non embedding components.

The adaptation: encoder-decoder without training from scratch

T5Gemma 2 follows the same adaptation idea introduced in T5Gemma: initialize an encoder-decoder model from a decoder-only checkpoint, then adapt with UL2. In the figure above, the research team shows encoder and decoder parameters initialized from the pretrained decoder-only model, then pretrained with UL2, with images first converted by SigLIP into 256 tokens.

This matters because the encoder-decoder layout splits the workload: the encoder can read the full input bidirectionally, while the decoder focuses on autoregressive generation. The research team argues this separation can help long context tasks where the model must retrieve relevant evidence from a large input before generating.

Two efficiency changes that are easy to miss but affect small models

First, T5Gemma 2 uses tied word embeddings across the encoder input embedding, the decoder input embedding, and the decoder output or softmax embedding. This reduces parameter redundancy, and the paper references an ablation showing little quality change while reducing embedding parameters.
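As a rough illustration of what weight tying means in practice, here is a toy PyTorch sketch with made-up dimensions (not the T5Gemma 2 code), where one embedding matrix serves the encoder input, the decoder input, and the output projection:

# Toy sketch of tied word embeddings: one matrix serves encoder input, decoder input,
# and the output (softmax) projection. Dimensions are illustrative, not T5Gemma 2's.
import torch
import torch.nn as nn

vocab_size, d_model = 32_000, 640

shared = nn.Embedding(vocab_size, d_model)

encoder_embed = shared                      # encoder input embedding
decoder_embed = shared                      # decoder input embedding
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = shared.weight              # output projection reuses the same parameters

tokens = torch.randint(0, vocab_size, (2, 8))
logits = lm_head(decoder_embed(tokens))     # one parameter matrix plays all three roles
print(logits.shape)                         # torch.Size([2, 8, 32000])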

Second, it introduces merged attention in the decoder. Instead of separate self-attention and cross-attention sublayers, the decoder performs a single attention operation where K and V are formed by concatenating encoder outputs and decoder states, and masking preserves causal visibility for decoder tokens. This also eases initialization, because it narrows differences between the adapted decoder and the original Gemma style decoder stack, and the paper reports parameter savings with a small average quality drop in their ablations.
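The sketch below illustrates the idea in its simplest single-head form, with made-up shapes and without RoPE, grouped-query attention, or other Gemma specifics, so it should be read as a conceptual illustration rather than the actual layer:

# Conceptual sketch of "merged" decoder attention: one attention call whose keys/values
# concatenate encoder outputs with decoder states. The mask lets every decoder token see
# all encoder tokens but only earlier decoder tokens. Single head, illustrative shapes.
import torch
import torch.nn.functional as F

d = 64
enc = torch.randn(1, 10, d)   # encoder outputs (full bidirectional context)
dec = torch.randn(1, 6, d)    # decoder hidden states

q = dec                                    # queries come from the decoder only
kv = torch.cat([enc, dec], dim=1)          # keys/values: encoder outputs + decoder states

# Visibility mask: True = keep. Encoder part fully visible, decoder part causal.
enc_vis = torch.ones(6, 10, dtype=torch.bool)
dec_vis = torch.tril(torch.ones(6, 6, dtype=torch.bool))
mask = torch.cat([enc_vis, dec_vis], dim=1)            # (dec_len, enc_len + dec_len)

scores = q @ kv.transpose(-1, -2) / d ** 0.5
scores = scores.masked_fill(~mask, float('-inf'))
out = F.softmax(scores, dim=-1) @ kv                   # (1, 6, d)
print(out.shape)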

https://arxiv.org/pdf/2512.14856

Multimodality, image understanding is encoder side, not decoder side

T5Gemma 2 is multimodal by reusing Gemma 3’s vision encoder and keeping it frozen during training. Vision tokens are always fed to the encoder, and encoder tokens have full visibility to each other in self attention. This is a pragmatic encoder-decoder design: the encoder fuses image tokens with text tokens into contextual representations, and the decoder can then attend to those representations while generating text.

On the tooling side, T5Gemma 2 is placed under an image-text-to-text pipeline, which matches the design: image in, text prompt in, text out. That pipeline example is the fastest way to validate the end to end multimodal path, including dtype choices like bfloat16 and automatic device mapping.
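A hedged sketch of that pipeline usage follows; the checkpoint name is a placeholder to be replaced with the actual T5Gemma 2 model id on Hugging Face, and the call signature follows the generic transformers image-text-to-text pipeline rather than anything model specific:

# Hedged sketch of the image-text-to-text path. The model id and image URL are placeholders;
# the dtype and device_map options mirror the choices mentioned in the post.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/t5gemma-2-<size>",   # placeholder, substitute the real checkpoint id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

result = pipe(
    images="https://example.com/chart.png",   # hypothetical image URL
    text="Describe this image in one sentence.",
    max_new_tokens=64,
)
print(result[0]["generated_text"])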

Long context to 128K: what enables it

Google researchers attribute the 128K context window to Gemma 3’s alternating local and global attention mechanism. The Gemma 3 team describes a repeating 5 to 1 pattern, 5 local sliding window attention layers followed by 1 global attention layer, with a local window size of 1024. This design reduces KV cache growth relative to making every layer global, which is one reason long context becomes feasible at smaller footprints.
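A quick back-of-envelope calculation shows why this matters; it assumes every layer stores the same per-token KV footprint and ignores grouped-query attention and other optimizations, so treat the exact numbers as illustrative:

# Back-of-envelope arithmetic for the 5:1 interleaved attention pattern, assuming every
# layer stores the same per-token KV footprint and ignoring GQA and other optimizations.
context = 128 * 1024        # 128K-token context
window = 1024               # local sliding-window size

# Per 6-layer block: 5 local layers cache at most `window` tokens, 1 global layer caches all.
interleaved = 5 * min(window, context) + 1 * context
all_global = 6 * context

print(f"KV cache vs. all-global layers: {interleaved / all_global:.1%}")
# -> roughly 17%, i.e. about a 6x reduction at 128K context under these assumptions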

In the T5Gemma 2 paper, the research team also mentions adopting positional interpolation methods for long context, and they pretrain on sequences of up to 16K input tokens paired with 16K target outputs, then evaluate long context performance up to 128K on benchmarks including RULER and MRCR. The detailed pretraining results table includes 32K and 128K evaluations, showing the long context deltas they claim over Gemma 3 at the same scale.

https://arxiv.org/pdf/2512.14856

Training setup and what “pretrained only” implies for users

The research team states the models are pretrained on 2T tokens and describes a training setup that includes a batch size of 4.2M tokens, cosine learning rate decay with 100 warmup steps, global gradient clipping at 1.0, and checkpoint averaging over the last 5 checkpoints.

Key Takeaways

T5Gemma 2 is an encoder decoder family adapted from Gemma 3 and continued with UL2: it reuses Gemma 3 pretrained weights, then applies the same UL2 based adaptation recipe used in T5Gemma.

Google released pretrained checkpoints only; no post trained or instruction tuned variants are included in this drop, so downstream use requires your own post training and evaluation.

Multimodal input is handled by a SigLIP vision encoder that outputs 256 image tokens and stays frozen; those vision tokens go into the encoder, and the decoder generates text.

Two parameter efficiency changes are central: tied word embeddings share the encoder, decoder, and output embeddings, and merged attention unifies decoder self attention and cross attention into a single module.

Long context up to 128K is enabled by Gemma 3’s interleaved attention design, a repeating pattern of 5 local sliding window layers with window size 1024 followed by 1 global layer, and T5Gemma 2 inherits this mechanism.


Introducing SOCI indexing for Amazon SageMaker Studio: Faster container startup times

Today, we are excited to introduce a new feature for SageMaker Studio: SOCI (Seekable Open Container Initiative) indexing. SOCI supports lazy loading of container images, where only the necessary parts of an image are downloaded initially rather than the entire container.
SageMaker Studio serves as a web Integrated Development Environment (IDE) for end-to-end machine learning (ML) development, so users can build, train, deploy, and manage both traditional ML models and foundation models (FM) for the complete ML workflow.
Each SageMaker Studio application runs inside a container that packages the required libraries, frameworks, and dependencies for consistent execution across workloads and user sessions. This containerized architecture allows SageMaker Studio to support a wide range of ML frameworks such as TensorFlow, PyTorch, scikit-learn, and more while maintaining strong environment isolation. Although SageMaker Studio provides containers for the most common ML environments, data scientists may need to tailor these environments for specific use cases by adding or removing packages, configuring custom environment variables, or installing specialized dependencies. SageMaker Studio supports this customization through Lifecycle Configurations (LCCs), which allow users to run bash scripts at the startup of a Studio IDE space. However, repeatedly customizing environments using LCCs can become time-consuming and difficult to maintain at scale. To address this, SageMaker Studio supports building and registering custom container images with preconfigured libraries and frameworks. These reusable custom images reduce setup friction and improve reproducibility for consistency across projects, so data scientists can focus on model development rather than environment management.
As ML workloads become increasingly complex, the container images that power these environments have grown in size, leading to longer startup times that can delay productivity and interrupt development workflows. Data scientists, ML engineers, and developers may have longer wait times for their environments to initialize, particularly when switching between different frameworks or when using images with extensive pre-installed libraries and dependencies. This startup latency becomes a significant bottleneck in iterative ML development where quick experimentation and rapid prototyping are essential.
Instead of downloading the entire container image upfront, SOCI creates an index that allows the system to fetch only the specific files and layers needed to start the application, with additional components loaded on-demand as required. This significantly reduces container startup times from minutes to seconds, allowing your SageMaker Studio environments to launch faster and get you working on your ML projects sooner, ultimately improving developer productivity and reducing time-to-insight for ML experiments.
Prerequisites
To use SOCI indexing with SageMaker Studio, you need:

An AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage SageMaker and ECR resources. For details, refer to Create an AWS account.
If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain.
A private Amazon Elastic Container Registry (ECR) repository to store your container images with SOCI indexes.
Verify you have AWS CLI version 2.0 or higher installed to interact with these services and manage your SOCI-indexed images.

SageMaker Studio SOCI Indexing – Feature overview
SOCI (Seekable Open Container Initiative), originally open sourced by AWS, addresses container startup delays in SageMaker Studio through selective image loading. This technology creates a specialized index that maps the internal structure of container images for granular access to individual files without downloading the entire container archive first. Traditional container images are stored as ordered lists of layers in gzipped tar files, which typically require complete download before accessing any content. SOCI overcomes this limitation by generating a separate index stored as an OCI Artifact that links to the original container image through OCI Reference Types. This design preserves the original container images, maintains consistent image digests, and ensures signature validity—critical factors for AI/ML environments with strict security requirements.
For SageMaker Studio users, SOCI indexing is implemented through an integration with the Finch container runtime, which translates to a 35-70% reduction in container startup times across all instance types using Bring Your Own Image (BYOI). This implementation extends beyond current optimization strategies, which are limited to specific first-party image and instance type combinations, providing faster app launch times in SageMaker AI Studio and SageMaker Unified Studio environments.
Creating a SOCI index
To create and manage SOCI indices, you can use several container management tools, each offering different advantages depending on your development environment and preferences:

Finch CLI is a Docker-compatible command-line tool developed by AWS that provides native support for building and pushing SOCI indices. It offers a familiar Docker-like interface while including built-in SOCI functionality, making it straightforward to create indexed images without additional tooling.
nerdctl serves as an alternative container CLI for containerd, the industry-standard container runtime. It provides Docker-compatible commands while offering direct integration with containerd features, including SOCI support for lazy loading capabilities.
Docker + SOCI CLI combines the widely used Docker toolchain with the dedicated SOCI command-line interface. This approach allows you to leverage existing Docker workflows while adding SOCI indexing capabilities through a separate CLI tool, providing flexibility for teams already invested in Docker-based development processes.

In the standard SageMaker Studio workflow, launching a machine learning environment requires downloading the complete container image before any application can start. When a user initiates a new SageMaker Studio session, the system must pull the entire image containing frameworks like TensorFlow, PyTorch, scikit-learn, Jupyter, and associated dependencies from the container registry. This process is sequential and time consuming—the container runtime downloads each compressed layer, extracts the complete filesystem to local storage, and only then can the application begin initialization. For typical ML images ranging from 2-5 GB, this results in startup times of 3-5 minutes, creating significant friction in iterative development workflows where data scientists frequently switch between different environments or restart sessions.
The SOCI-enhanced workflow transforms container startup by enabling intelligent, on-demand file retrieval. Instead of downloading entire images, SOCI creates a searchable index that maps the precise location of every file within the compressed container layers. When launching a SageMaker Studio application, the system downloads only the SOCI index (typically 10-20 MB) and the minimal set of files required for application startup—usually 5-10% of the total image size. The container begins running immediately while a background process continues downloading remaining files as the application requests them. This lazy loading approach reduces initial startup times from a few minutes to seconds, allowing users to begin productive work almost immediately while the environment completes initialization transparently in the background.
Converting the image to SOCI
You can convert your existing image into a SOCI image and push it to your private ECR using the following commands:

#!/bin/bash
# Download and install soci-snapshotter, containerd, and nerdctl
sudo yum install soci-snapshotter
sudo yum install containerd jq
sudo systemctl start soci-snapshotter
sudo systemctl restart containerd
sudo yum install nerdctl

# Set your registry variables
REGISTRY="123456789012.dkr.ecr.us-west-2.amazonaws.com"
REPOSITORY_NAME="my-sagemaker-image"

# Authenticate for image pull and push
AWS_REGION=us-west-2
REGISTRY_USER=AWS
REGISTRY_PASSWORD=$(/usr/local/bin/aws ecr get-login-password --region $AWS_REGION)
echo $REGISTRY_PASSWORD | sudo nerdctl login -u $REGISTRY_USER --password-stdin $REGISTRY

# Pull the original image
sudo nerdctl pull $REGISTRY/$REPOSITORY_NAME:original-image

# Create SOCI index using the convert subcommand
sudo nerdctl image convert --soci $REGISTRY/$REPOSITORY_NAME:original-image $REGISTRY/$REPOSITORY_NAME:soci-image

# Push the SOCI v2 indexed image
sudo nerdctl push --platform linux/amd64 $REGISTRY/$REPOSITORY_NAME:soci-image

This process creates two artifacts for the original container image in your ECR repository:

SOCI index – Metadata enabling lazy loading.
Image index manifest – OCI-compliant manifest linking them together.

To use SOCI-indexed images in SageMaker Studio, you must reference the image index URI rather than the original container image URI when creating SageMaker Image and SageMaker Image Version resources. The image index URI corresponds to the tag you specified during the SOCI conversion process (for example, soci-image in the previous example).

#!/bin/bash
# Use the SOCI v2 image index URI
IMAGE_INDEX_URI="123456789012.dkr.ecr.us-west-2.amazonaws.com/my-sagemaker-image:soci-image"

# Create SageMaker Image
aws sagemaker create-image \
  --image-name "my-sagemaker-image" \
  --role-arn "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Create SageMaker Image Version with SOCI index
aws sagemaker create-image-version \
  --image-name "my-sagemaker-image" \
  --base-image "$IMAGE_INDEX_URI"

# Create App Image Config for JupyterLab
aws sagemaker create-app-image-config \
  --app-image-config-name "my-sagemaker-image-config" \
  --jupyter-lab-app-image-config '{ "FileSystemConfig": { "MountPath": "/home/sagemaker-user", "DefaultUid": 1000, "DefaultGid": 100 } }'

# Update domain to include the custom image (required step)
aws sagemaker update-domain \
  --domain-id "d-xxxxxxxxxxxx" \
  --default-user-settings '{
    "JupyterLabAppSettings": {
      "CustomImages": [{
        "ImageName": "my-sagemaker-image",
        "AppImageConfigName": "my-sagemaker-image-config"
      }]
    }
  }'

The image index URI contains references to both the container image and its associated SOCI index through the OCI Image Index manifest. When SageMaker Studio launches applications using this URI, it automatically detects the SOCI index and enables lazy loading capabilities.
SOCI indexing is supported for all ML environments (JupyterLab, CodeEditor, etc.) for both SageMaker Unified Studio and SageMaker AI. For additional information on setting up your custom image, please reference the SageMaker Bring Your Own Image documentation.
Benchmarking SOCI impact on SageMaker Studio JupyterLab startup
The primary objective of this new feature in SageMaker Studio is to streamline the end user experience by reducing the startup durations for SageMaker Studio applications launched with custom images. To measure the effectiveness of lazy loading custom container images in SageMaker Studio using SOCI, we will empirically quantify and contrast startup durations for a given custom image both with and without SOCI. Further, we’ll conduct this test for a variety of custom images representing diverse sets of dependencies, files, and data, to evaluate how effectiveness may vary for end users with different custom image needs.
To empirically quantify the startup durations for custom image app launches, we will programmatically launch JupyterLab and CodeEditor apps with the SageMaker CreateApp API—specifying the candidate sageMakerImageArn and sageMakerImageVersionAlias along with an appropriate instanceType—and record the request’s eventTime for analysis. We will then poll the SageMaker ListApps API every second to monitor the app startup, recording the eventTime of the first response where Status is reported as InService. The delta between these two times for a particular app is the startup duration.
For this analysis, we have created two sets of private ECR repositories, each with the same SageMaker custom container images but with only one set implementing SOCI indices. When comparing the equivalent images in ECR, we can see the SOCI artifacts present in only one repo. We will be deploying the apps into a single SageMaker AI domain. All custom images are attached to that domain so that its SageMaker Studio users can choose those custom images when invoking startup of a JupyterLab space.
To run the tests, for each custom image, we invoke a series of ten CreateApp API calls:

"requestParameters": {
    "domainId": "<>",
    "spaceName": "<>",
    "appType": "JupyterLab",
    "appName": "default",
    "tags": [],
    "resourceSpec": {
        "sageMakerImageArn": "<>",
        "sageMakerImageVersionAlias": "<>",
        "instanceType": "<>"
    },
    "recoveryMode": false
}
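For readers who want to reproduce the measurement, here is a hedged boto3 sketch of that timing loop; the domain, space, and image identifiers are placeholders, and the polling interval and error handling are simplified:

# Hedged boto3 sketch of the timing methodology: create a JupyterLab app, then poll
# ListApps until it reports InService and record the elapsed time. Identifiers are placeholders.
import time
import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

def measure_startup(domain_id, space_name, image_arn, version_alias, instance_type="ml.t3.medium"):
    start = time.time()
    sm.create_app(
        DomainId=domain_id,
        SpaceName=space_name,
        AppType="JupyterLab",
        AppName="default",
        ResourceSpec={
            "SageMakerImageArn": image_arn,
            "SageMakerImageVersionAlias": version_alias,
            "InstanceType": instance_type,
        },
    )
    while True:
        apps = sm.list_apps(DomainIdEquals=domain_id, SpaceNameEquals=space_name)["Apps"]
        app = next(a for a in apps if a["AppType"] == "JupyterLab" and a["AppName"] == "default")
        if app["Status"] == "InService":
            return time.time() - start   # startup duration in seconds
        time.sleep(1)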

The following table captures the startup acceleration with SOCI index enabled for Amazon SageMaker distribution images:

| App type | Instance type | Image | Regular image startup (sec) | SOCI image startup (sec) | % reduction in app startup duration |
|----------|---------------|-------|-----------------------------|--------------------------|-------------------------------------|
| SMAI JupyterLab | t3.medium | SMD 3.4.2 | 231 | 150 | 35.06% |
| SMAI JupyterLab | t3.medium | SMD 3.4.2 | 350 | 191 | 45.43% |
| SMAI JupyterLab | c7i.large | SMD 3.4.2 | 331 | 141 | 57.40% |
| SMAI CodeEditor | t3.medium | SMD 3.4.2 | 202 | 110 | 45.54% |
| SMAI CodeEditor | t3.medium | SMD 3.4.2 | 213 | 78 | 63.38% |
| SMAI CodeEditor | c7i.large | SMD 3.4.2 | 279 | 91 | 67.38% |

Note: App startup latency and the corresponding improvement may vary depending on the availability of SageMaker ML instances.
Based on these findings, we see that running SageMaker Studio custom images with SOCI indexes allows SageMaker Studio users to launch their apps faster compared to images without SOCI indexes. Specifically, we see ~35-70% faster container start-up times.
Conclusion
In this post, we showed you how the introduction of SOCI indexing to SageMaker Studio improves the developer experience for machine learning practitioners. By optimizing container startup times through lazy loading—reducing wait times from several minutes to under a minute—AWS helps data scientists, ML engineers, and developers spend less time waiting and more time innovating. This improvement addresses one of the most common friction points in iterative ML development, where frequent environment switches and restarts impact productivity. With SOCI, teams can maintain their development velocity, experiment with different frameworks and configurations, and accelerate their path from experimentation to production deployment.

About the authors
Pranav Murthy is a Senior Generative AI Data Scientist at AWS, specializing in helping organizations innovate with Generative AI, Deep Learning, and Machine Learning on Amazon SageMaker AI. Over the past 10+ years, he has developed and scaled advanced computer vision (CV) and natural language processing (NLP) models to tackle high-impact problems—from optimizing global supply chains to enabling real-time video analytics and multilingual search. When he’s not building AI solutions, Pranav enjoys playing strategic games like chess, traveling to discover new cultures, and mentoring aspiring AI practitioners. You can find Pranav on LinkedIn.
Raj Bagwe is a Senior Solutions Architect at Amazon Web Services, based in San Francisco, California. With over 6 years at AWS, he helps customers navigate complex technological challenges and specializes in Cloud Architecture, Security and Migrations. In his spare time, he coaches a robotics team and plays volleyball. You can find Raj on LinkedIn.
Nikita Arbuzov is a Software Development Engineer at Amazon Web Services, working and maintaining SageMaker Studio platform and its applications, based in New York, NY. With over 3 years of experience in backend platform latency optimization, he works on improving customer experience and usability of SageMaker AI and SageMaker Unified Studio. In his spare time, Nikita performs different outdoor activities, like mountain biking, kayaking, and snowboarding, loves traveling around the US and enjoys making new friends. You can find Nikita on LinkedIn.

Build and deploy scalable AI agents with NVIDIA NeMo, Amazon Bedrock AgentCore, and Strands Agents

This post is co-written with Ranjit Rajan, Abdullahi Olaoye, and Abhishek Sawarkar from NVIDIA.
AI’s next frontier isn’t merely smarter chat-based assistants; it’s autonomous agents that reason, plan, and execute across entire systems. But to accomplish this, enterprise developers need to move from prototypes to production-ready AI agents that scale securely. This challenge grows as enterprise problems become more complex, requiring architectures where multiple specialized agents collaborate to accomplish sophisticated tasks.
Building AI agents in development differs fundamentally from deploying them at scale. Developers face a chasm between prototype and production, struggling with performance optimization, resource scaling, security implementation, and operational monitoring. Typical approaches leave teams juggling multiple disconnected tools and frameworks, making it difficult to maintain consistency from development through deployment with optimal performance. That’s where the powerful combination of Strands Agents, Amazon Bedrock AgentCore, and NVIDIA NeMo Agent Toolkit shine. You can use these tools together to design sophisticated multi-agent systems, orchestrate them, and scale them securely in production with built-in observability, agent evaluation, profiling, and performance optimization. This post demonstrates how to use this integrated solution to build, evaluate, optimize, and deploy AI agents on Amazon Web Services (AWS) from initial development through production deployment.
Foundation for enterprise-ready agents
The open source Strands Agents framework simplifies AI agent development through its model-driven approach. Developers create agents using three components:

Foundation models (FMs) such as Amazon Nova, Claude by Anthropic, and Meta’s Llama
Tools (over 20 built-in, plus custom tools using Python decorators)
Prompts that guide agent behavior.

The framework includes built-in integrations with AWS services such as Amazon Bedrock and Amazon Simple Storage Service (Amazon S3), local testing support, continuous integration and continuous development (CI/CD) workflows, multiple deployment options, and OpenTelemetry observability.
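A minimal sketch of those three components, following the Strands Agents SDK's documented pattern, looks like the following; the tool, prompt, and Bedrock model id are illustrative choices, not part of the original post:

# Minimal sketch of a Strands agent: a model, a custom tool defined with a Python
# decorator, and a prompt. The tool, prompt, and model id are illustrative assumptions.
from strands import Agent, tool

@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

agent = Agent(
    model="us.anthropic.claude-3-7-sonnet-20250219-v1:0",  # an Amazon Bedrock model id (assumption)
    tools=[word_count],
    system_prompt="You are a concise writing assistant.",
)

result = agent("How many words are in this sentence?")
print(result)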
Amazon Bedrock AgentCore is an agentic platform for building, deploying, and operating effective agents securely at scale. It has composable, fully managed services:

Runtime for secure, serverless agent deployment
Memory for short-term and long-term context retention
Gateway for secure tool access by transforming APIs and AWS Lambda functions into agent-compatible tools and connecting to existing Model Context Protocol (MCP) servers
Identity for secure agent identity and access management
Code Interpreter for secure code execution in sandbox environments
Browser for fast, secure web interactions
Observability for comprehensive operational insights to trace, debug, and monitor agent performance
Evaluations for continuously inspecting agent quality based on real-world behavior
Policy to keep agents within defined boundaries

These services, designed to work independently or together, abstract the complexity of building, deploying, and operating sophisticated agents while working with open source frameworks or models, delivering enterprise-grade security and reliability.
Agent evaluation, profiling, and optimization with NeMo Agent Toolkit
NVIDIA NeMo Agent Toolkit is an open source framework designed to help developers build, profile, and optimize AI agents regardless of their underlying framework. Its framework-agnostic approach means it works seamlessly with Strands Agents, LangChain, LlamaIndex, CrewAI, and custom enterprise frameworks. In addition, different frameworks can interoperate when they’re connected in the NeMo Agent Toolkit.
The toolkit’s profiler provides complete agent workflow analysis that tracks token usage, timing, workflow-specific latency, throughput, and run times for individual agents and tools, enabling targeted performance improvements. Built on the toolkit’s evaluation harness, it includes Retrieval Augmented Generation (RAG)-specific evaluators (such as answer accuracy, context relevance, response groundedness, and agent trajectory) and supports custom evaluators for specialized use cases, enabling targeted performance optimization. The automated hyperparameter optimizer profiles and systematically discovers optimal settings for parameters such as temperature, top_p, and max_tokens, maximizing accuracy, groundedness, and context relevance while minimizing token usage and latency, and it can optimize for other custom metrics as well. This automated approach profiles your complete agent workflows, identifies bottlenecks, and uncovers optimal parameter combinations that manual tuning might miss. The toolkit’s intelligent GPU sizing calculator alleviates guesswork by simulating agent latency and concurrency scenarios and predicting precise GPU infrastructure requirements for production deployment.
The toolkit’s observability integration connects with popular monitoring services including Arize Phoenix, Weights & Biases Weave, Langfuse, and OpenTelemetry supported systems, like Amazon Bedrock AgentCore Observability, creating a continuous feedback loop for ongoing optimization and maintenance.
Real-world implementation
This example demonstrates a knowledge-based agent that retrieves and synthesizes information from web URLs to answer user queries. Built using Strands Agents with integrated NeMo Agent Toolkit, the solution is containerized for quick deployment in Amazon Bedrock AgentCore Runtime and takes advantage of Bedrock AgentCore services, such as AgentCore Observability. Additionally, developers have the flexibility to integrate with fully managed models in Amazon Bedrock, models hosted in Amazon SageMaker AI, containerized models in Amazon Elastic Kubernetes Service (Amazon EKS) or other model API endpoints. The overall architecture is designed for a streamlined workflow, moving from agent definition and optimization to containerization and scalable deployment.
The following architecture diagram illustrates an agent built with Strands Agents integrating NeMo Agent Toolkit deployed in Amazon Bedrock AgentCore.

Agent development and evaluation
Start by defining your agent and workflows in Strands Agents, then wrap it with NeMo Agent Toolkit to configure components such as a large language model (LLM) for inference and tools. Refer to the Strands Agents and NeMo Agent Toolkit integration example in GitHub for a detailed setup guide. After configuring your environment, validate your agent logic by running a single workflow from the command line with an example prompt:

nat run --config_file examples/frameworks/strands_demo/configs/config.yml --input "How do I use the Strands Agents API?"

The following is the truncated terminal output:

Workflow Result:
['The Strands Agents API is a flexible system for managing prompts, including both
system prompts and user messages. System prompts provide high-level instructions to
the model about its role, capabilities, and constraints, while user messages are your
queries or requests to the agent. The API supports multiple techniques for prompting,
including text prompts, multi-modal prompts, and direct tool calls. For guidance on
how to write safe and responsible prompts, please refer to the Safety & Security –
Prompt Engineering documentation.']

Instead of executing a single workflow and exiting, to simulate a real-world scenario, you can spin up a long-running API server capable of handling concurrent requests with the serve command:

nat serve --config_file examples/frameworks/strands_demo/configs/config.yml

The following is the truncated terminal output:

INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)

The agent is now running locally on port 8000. To interact with the agent, open a new terminal and execute the following cURL command. This will generate output similar to the previous nat run step but the agent runs continuously as a persistent service rather than executing one time and exiting. This simulates the production environment where Amazon Bedrock AgentCore will run the agent as a containerized service:

curl -X 'POST' 'http://localhost:8000/generate' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"inputs" : "How do I use the Strands Agents API?"}'

The following is the truncated terminal output:

{"value":"The Strands Agents API provides a flexible system for managing prompts,
including both system prompts and user messages. System prompts provide high-level
instructions to the model about its role, capabilities, and constraints, while user
messages are your queries or requests to the agent. The SDK supports multiple techniques
for prompting, including text prompts, multi-modal prompts, and direct tool calls.
For guidance on how to write safe and responsible prompts, please refer to the
Safety & Security – Prompt Engineering documentation."}

Agent profiling and workflow performance monitoring
With the agent running, the next step is to establish a performance baseline. To illustrate the depth of insights available, in this example we use a self-managed Llama 3.3 70B Instruct NIM on an Amazon Elastic Compute Cloud (Amazon EC2) p4de.24xlarge instance powered by NVIDIA A100 Tensor Core GPUs (8x A100 80 GB) running on Amazon EKS. We use the nat eval command to evaluate the agent and generate the analysis:
nat eval --config_file examples/frameworks/strands_demo/configs/eval_config.yml
The following is the truncated terminal output:

Evaluating Trajectory: 100%|████████████████████████████████████████████████████████████████████| 10/10 [00:10<00:00, 1.00s/it]
2025-11-24 16:59:18 – INFO – nat.profiler.profile_runner:127 – Wrote combined data to: .tmp/nat/examples/frameworks/strands_demo/eval/all_requests_profiler_traces.json
2025-11-24 16:59:18 – INFO – nat.profiler.profile_runner:146 – Wrote merged standardized DataFrame to .tmp/nat/examples/frameworks/strands_demo/eval/standardized_data_all.csv
2025-11-24 16:59:18 – INFO – nat.profiler.profile_runner:200 – Wrote inference optimization results to: .tmp/nat/examples/frameworks/strands_demo/eval/inference_optimization.json
2025-11-24 16:59:28 – INFO – nat.profiler.profile_runner:224 – Nested stack analysis complete
2025-11-24 16:59:28 – INFO – nat.profiler.profile_runner:235 – Concurrency spike analysis complete
2025-11-24 16:59:28 – INFO – nat.profiler.profile_runner:264 – Wrote workflow profiling report to: .tmp/nat/examples/frameworks/strands_demo/eval/workflow_profiling_report.txt
2025-11-24 16:59:28 – INFO – nat.profiler.profile_runner:271 – Wrote workflow profiling metrics to: .tmp/nat/examples/frameworks/strands_demo/eval/workflow_profiling_metrics.json
2025-11-24 16:59:28 – INFO – nat.eval.evaluate:345 – Workflow output written to .tmp/nat/examples/frameworks/strands_demo/eval/workflow_output.json
2025-11-24 16:59:28 – INFO – nat.eval.evaluate:356 – Evaluation results written to .tmp/nat/examples/frameworks/strands_demo/eval/rag_relevance_output.json
2025-11-24 16:59:28 – INFO – nat.eval.evaluate:356 – Evaluation results written to .tmp/nat/examples/frameworks/strands_demo/eval/rag_groundedness_output.json
2025-11-24 16:59:28 – INFO – nat.eval.evaluate:356 – Evaluation results written to .tmp/nat/examples/frameworks/strands_demo/eval/rag_accuracy_output.json
2025-11-24 16:59:28 – INFO – nat.eval.evaluate:356 – Evaluation results written to .tmp/nat/examples/frameworks/strands_demo/eval/trajectory_accuracy_output.json
2025-11-24 16:59:28 – INFO – nat.eval.utils.output_uploader:62 – No S3 config provided; skipping upload.

The command generates detailed artifacts that include JSON files per evaluation metric (such as accuracy, groundedness, relevance, and trajectory accuracy) showing scores from 0–1, reasoning traces, retrieved contexts, and aggregated averages. The generated artifacts also include workflow outputs, standardized tables, profile traces, and compact summaries for latency and token efficiency. This multi-metric sweep provides a holistic view of agent quality and behavior. The evaluation highlights that while the agent achieved consistent groundedness scores—meaning answers were reliably supported by sources—there is still an opportunity to improve retrieval relevance. The profile trace output contains workflow-specific latency, throughput, and runtime at the 90th, 95th, and 99th percentiles. The command generates a Gantt chart of the agent flow and nested stack analysis to pinpoint exactly where bottlenecks exist, as seen in the following figure. It also reports concurrency spikes and token efficiency so you can understand precisely how scaling impacts prompt and completion usage.

During the profiling, nat spawns eight concurrent agent workflows (shown in orange bars in the chart), which is the default concurrency configuration during evaluation. The p90 latency for the workflow shown is approximately 58.9 seconds. Crucially, the data confirmed that response generation was the primary bottleneck, with the longest LLM segments taking roughly 61.4 seconds. Meanwhile, non-LLM overhead remained minimal. HTTP requests averaged only 0.7–1.2 seconds, and knowledge base access was negligible. Using this level of granularity, you can now identify and optimize specific bottlenecks in the agent workflows.
Agent performance optimization
After profiling, refine the agent’s parameters to balance quality, performance, and cost. Manual tuning of LLM settings like temperature and top_p is often a game of guesswork. The NeMo Agent Toolkit turns this into a data-driven science. You can use the built-in optimizer to perform a systematic sweep across your parameter search space:
nat optimize --config_file examples/frameworks/strands_demo/configs/optimizer_config.yml
The following is the truncated terminal output:

Evaluating Trajectory: 100%|██████████████████████████████████████████████████████████████| 10/10 [00:10<00:00, 1.00it/s]
2025-10-31 16:50:41 – INFO – nat.profiler.profile_runner:127 – Wrote combined data to: ./tmp/nat/strands_demo/eval/all_requests_profiler_traces.json
2025-10-31 16:50:41 – INFO – nat.profiler.profile_runner:146 – Wrote merged standardized DataFrame to: ./tmp/nat/strands_demo/eval/standardized_data_all.csv
2025-10-31 16:50:41 – INFO – nat.profiler.profile_runner:208 – Wrote inference optimization results to: ./tmp/nat/strands_demo/eval/inference_optimization.json
2025-10-31 16:50:41 – INFO – nat.eval.evaluate:337 – Workflow output written to ./tmp/nat/strands_demo/eval/workflow_output.json
2025-10-31 16:50:41 – INFO – nat.eval.evaluate:348 – Evaluation results written to ./tmp/nat/strands_demo/eval/token_efficiency_output.json
2025-10-31 16:50:41 – INFO – nat.eval.evaluate:348 – Evaluation results written to ./tmp/nat/strands_demo/eval/llm_latency_output.json
2025-10-31 16:50:41 – INFO – nat.eval.evaluate:348 – Evaluation results written to ./tmp/nat/strands_demo/eval/rag_relevance_output.json
2025-10-31 16:50:41 – INFO – nat.eval.evaluate:348 – Evaluation results written to ./tmp/nat/strands_demo/eval/rag_groundedness_output.json
2025-10-31 16:50:41 – INFO – nat.eval.evaluate:348 – Evaluation results written to ./tmp/nat/strands_demo/eval/rag_accuracy_output.json
2025-10-31 16:50:41 – INFO – nat.eval.evaluate:348 – Evaluation results written to ./tmp/nat/strands_demo/eval/trajectory_accuracy_output.json
2025-10-31 16:50:41 – INFO – nat.eval.utils.output_uploader:61 – No S3 config provided; skipping upload.
Evaluating Regex-Ex_Accuracy: 100%|████████████████████████████████████████████████████████| 10/10 [00:21<00:00, 2.15s/it]
2025-10-31 16:50:44 – INFO – nat.profiler.profile_runner:127 – Wrote combined data to: ./tmp/nat/strands_demo/eval/all_requests_profiler_traces.json
2025-10-31 16:50:44 – INFO – nat.profiler.profile_runner:146 – Wrote merged standardized DataFrame to: ./tmp/nat/strands_demo/eval/standardized_data_all.csv
2025-10-31 16:50:45 – INFO – nat.profiler.profile_runner:208 – Wrote inference optimization results to: ./tmp/nat/strands_demo/eval/inference_optimization.json
2025-10-31 16:50:46 – INFO – nat.eval.evaluate:337 – Workflow output written to ./tmp/nat/strands_demo/eval/workflow_output.json
2025-10-31 16:50:47 – INFO – nat.eval.evaluate:348 – Evaluation results written to ./tmp/nat/strands_demo/eval/token_efficiency_output.json
2025-10-31 16:50:48 – INFO – nat.eval.evaluate:348 – Evaluation results written to ./tmp/nat/strands_demo/eval/llm_latency_output.json
2025-10-31 16:50:49 – INFO – nat.eval.evaluate:348 – Evaluation results written to ./tmp/nat/strands_demo/eval/rag_relevance_output.json
2025-10-31 16:50:50 – INFO – nat.eval.evaluate:348 – Evaluation results written to ./tmp/nat/strands_demo/eval/rag_groundedness_output.json
2025-10-31 16:50:51 – INFO – nat.eval.evaluate:348 – Evaluation results written to ./tmp/nat/strands_demo/eval/trajectory_accuracy_output.json
2025-10-31 16:50:52 – INFO – nat.eval.evaluate:348 – Evaluation results written to ./tmp/nat/strands_demo/eval/rag_accuracy_output.json
2025-10-31 16:50:53 – INFO – nat.eval.utils.output_uploader:61 – No S3 config provided; skipping upload.
[I 2025-10-31 16:50:53,361] Trial 19 finished with values: [0.6616666666666667, 1.0, 0.38000000000000007, 0.26800000000000006, 2.1433333333333333, 2578.222222222222] and parameters: {'llm_sim_llm.top_p': 0.8999999999999999, 'llm_sim_llm.temperature': 0.38000000000000006, 'llm_sim_llm.max_tokens': 5632}.
2025-10-31 16:50:53 – INFO – nat.profiler.parameter_optimization.parameter_optimizer:120 – Numeric optimization finished
2025-10-31 16:50:53 – INFO – nat.profiler.parameter_optimization.parameter_optimizer:162 – Generating Pareto front visualizations…
2025-10-31 16:50:53 – INFO – nat.profiler.parameter_optimization.pareto_visualizer:320 – Creating Pareto front visualizations…
2025-10-31 16:50:53 – INFO – nat.profiler.parameter_optimization.pareto_visualizer:330 – Total trials: 20
2025-10-31 16:50:53 – INFO – nat.profiler.parameter_optimization.pareto_visualizer:331 – Pareto optimal trials: 14
2025-10-31 16:50:54 – INFO – nat.profiler.parameter_optimization.pareto_visualizer:345 – Parallel coordinates plot saved to: ./tmp/nat/strands_demo/optimizer/plots/pareto_parallel_coordinates.png
2025-10-31 16:50:56 – INFO – nat.profiler.parameter_optimization.pareto_visualizer:374 – Pairwise matrix plot saved to: ./tmp/nat/strands_demo/optimizer/plots/pareto_pairwise_matrix.png
2025-10-31 16:50:56 – INFO – nat.profiler.parameter_optimization.pareto_visualizer:387 – Visualization complete!
2025-10-31 16:50:56 – INFO – nat.profiler.parameter_optimization.pareto_visualizer:389 – Plots saved to: ./tmp/nat/strands_demo/optimizer/plots
2025-10-31 16:50:56 – INFO – nat.profiler.parameter_optimization.parameter_optimizer:171 – Pareto visualizations saved to: ./tmp/nat/strands_demo/optimizer/plots
2025-10-31 16:50:56 – INFO – nat.profiler.parameter_optimization.optimizer_runtime:88 – All optimization phases complete.

This command launches an automated sweep across key LLM parameters, such as temperature, top_p, and max_tokens, over the search space defined in the config file (optimizer_config.yml in this case). The optimizer runs 20 trials with three repetitions each, using weighted evaluation metrics to automatically discover optimal model settings. It might take up to 15–20 minutes for the optimizer to run 20 trials.
The toolkit evaluates each parameter set against a weighted multi-objective score, aiming to maximize quality (for example, accuracy, groundedness, or tool use) while minimizing token cost and latency. Upon completion, it generates detailed performance artifacts and summary tables so you can quickly identify and select the optimal configuration for production. The following is the hyperparameter optimizer configuration:

llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    temperature: 0.5
    top_p: 0.9
    max_tokens: 4096
    # Enable optimization for these parameters
    optimizable_params:
      - temperature
      - top_p
      - max_tokens
    # Define search spaces
    search_space:
      temperature:
        low: 0.1
        high: 0.7
        step: 0.2 # Tests: 0.1, 0.3, 0.5, 0.7
      top_p:
        low: 0.7
        high: 1.0
        step: 0.1 # Tests: 0.7, 0.8, 0.9, 1.0
      max_tokens:
        low: 4096
        high: 8192
        step: 512 # Tests: 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192

In this example, NeMo Agent Toolkit Optimize systematically evaluated parameter configurations and identified temperature ≈ 0.7, top_p ≈ 1.0, and max_tokens ≈ 6K (6144) as the optimal configuration, yielding the highest accuracy across 20 trials. This configuration delivered a 35% accuracy improvement over baseline while simultaneously achieving 20% token efficiency gains compared to the 8192 max_tokens setting—maximizing both performance and cost efficiency for production deployments.
The optimizer plots pairwise Pareto curves, as shown in the following pairwise matrix comparison charts, to analyze trade-offs between different parameters. The parallel coordinates plot, which follows the matrix comparison chart, shows optimal trials (red lines) achieving high quality scores (0.8–1.0) across accuracy, groundedness, and relevance while trading off some efficiency as token usage and latency drop to 0.6–0.8 on the normalized scale. The pairwise matrix confirms strong correlations between quality metrics and reveals actual token consumption clustered tightly around 2,500–3,100 tokens across all trials. These results indicate that further gains in accuracy and token efficiency might be possible through prompt engineering, something development teams can achieve using NeMo Agent Toolkit’s prompt optimization capabilities, helping reduce costs while maximizing performance.
The following image shows the pairwise matrix comparison:

The following image shows the parallel coordinates plot:

Right-sizing production GPU infrastructure
After your agent is optimized and you’ve finalized the runtime or inference configuration, you can shift your focus to assessing your model deployment infrastructure. If you’re self-managing your model deployment on a fleet of EC2 GPU-powered instances, then one of the most difficult aspects of moving agents to production is predicting exactly what compute resources are necessary to support a target use case and concurrent users without overrunning the budget or causing timeouts. The NeMo Agent Toolkit GPU sizing calculator addresses this challenge by using your agent’s actual performance profile to determine the optimal cluster size for specific service level objectives (SLOs), enabling right-sizing that alleviates the trade-off between performance and cost. To generate a sizing profile, you run the sizing calculator across a range of concurrency levels (for example, 1–32 simultaneous users):

nat sizing calc --config_file examples/frameworks/strands_demo/configs/sizing_config.yml --calc_output_dir /tmp/strands_demo/sizing_calc_run1/ --concurrencies 1,2,4,8,12,20,24,28,32 --num_passes 2

Executing this on our reference EC2 P4de.24xlarge instance powered by NVIDIA A100 Tensor Core GPUs running on Amazon EKS for a Llama 3.3 70B Instruct NIM produced the following capacity analysis:

Per concurrency results:
Alerts!: W = Workflow interrupted, L = LLM latency outlier, R = Workflow runtime outlier
| Alerts | Concurrency | p95 LLM Latency (s) | p95 WF Runtime (s) | Total Runtime (s) |
|--------|-------------|---------------------|--------------------|-------------------|
|        | 1           | 11.8317             | 21.3647            | 33.2416           |
|        | 2           | 19.3583             | 26.2694            | 36.931            |
|        | 4           | 25.728              | 32.4711            | 61.13             |
|        | 8           | 38.314              | 57.1838            | 89.8716           |
|        | 12          | 55.1766             | 72.0581            | 130.691           |
|        | 20          | 103.68              | 131.003            | 202.791           |
| !R     | 24          | 135.785             | 189.656            | 221.721           |
| !R     | 28          | 125.729             | 146.322            | 245.654           |
|        | 32          | 169.057             | 233.785            | 293.562           |

As shown in the following chart, p95 LLM latency and end-to-end workflow runtime scale almost linearly with concurrency, with near-perfect trend fits (R² ≈ 0.977 and 0.983, respectively). Each additional concurrent request introduces a predictable latency penalty, suggesting the system operates within a linear capacity zone where throughput can be optimized by adjusting latency tolerance.
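
As a quick sanity check on that linearity claim, the following is a minimal sketch that fits an ordinary least squares line to the p95 workflow runtime column from the table above using NumPy and reports the R² of the fit. The toolkit performs its own fitting and may exclude the outlier-flagged rows, so the exact numbers can differ slightly from the reported values.

import numpy as np

# p95 workflow runtime (seconds) per tested concurrency, from the table above
concurrency = np.array([1, 2, 4, 8, 12, 20, 24, 28, 32], dtype=float)
wf_runtime = np.array([21.3647, 26.2694, 32.4711, 57.1838, 72.0581,
                       131.003, 189.656, 146.322, 233.785])

# Ordinary least squares fit: runtime ≈ slope * concurrency + intercept
slope, intercept = np.polyfit(concurrency, wf_runtime, 1)

# Coefficient of determination (R^2) for the linear trend
predicted = slope * concurrency + intercept
ss_res = np.sum((wf_runtime - predicted) ** 2)
ss_tot = np.sum((wf_runtime - wf_runtime.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.2f} s/user, intercept={intercept:.2f} s, R^2={r_squared:.3f}")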

With the sizing metrics captured, you can estimate the GPU cluster size for a specific concurrency and latency. For example, to support 25 concurrent users with a target workflow runtime of 50 seconds, you can run the calculator:

nat sizing calc --offline_mode --calc_output_dir /tmp/strands_demo/sizing_calc_run1/ --test_gpu_count 8 --target_workflow_runtime 50 --target_users 25

This workflow analyzes current performance metrics and generates a resource recommendation. In our example scenario, the tool calculates that to meet strict latency requirements for 25 simultaneous users, approximately 30 GPUs are required based on the following formula:

gpu_estimate = (target_users / calculated_concurrency) * test_gpu_count
calculated_concurrency = (target_time_metric - intercept) / slope
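
As a rough worked example, the following sketch applies this formula to the workflow runtime data above. The slope and intercept come from a simple ordinary least squares fit over all rows, so the result lands close to, but not exactly on, the 30.5 GPUs reported by the tool, which applies its own fitting and outlier handling.

import numpy as np

# Workflow runtime (seconds) per tested concurrency, from the sizing run above
concurrency = np.array([1, 2, 4, 8, 12, 20, 24, 28, 32], dtype=float)
wf_runtime = np.array([21.3647, 26.2694, 32.4711, 57.1838, 72.0581,
                       131.003, 189.656, 146.322, 233.785])
slope, intercept = np.polyfit(concurrency, wf_runtime, 1)

target_workflow_runtime = 50.0  # seconds
target_users = 25
test_gpu_count = 8              # GPUs used during the sizing run

# Concurrency the test cluster can sustain at the target runtime
calculated_concurrency = (target_workflow_runtime - intercept) / slope

# Scale the test cluster up to the target number of users
gpu_estimate = (target_users / calculated_concurrency) * test_gpu_count
print(f"calculated_concurrency={calculated_concurrency:.2f}, gpu_estimate={gpu_estimate:.1f}")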

The following is the output from the sizing estimation:

Targets: LLM Latency ≤ 0.0s, Workflow Runtime ≤ 50.0s, Users = 25
Test parameters: GPUs = 8
Per concurrency results:
Alerts!: W = Workflow interrupted, L = LLM latency outlier, R = Workflow runtime outlier
| Alerts | Concurrency | p95 LLM Latency (s) | p95 WF Runtime (s) | Total Runtime (s) | GPUs (WF Runtime, Rough) |
|--------|-------------|---------------------|--------------------|-------------------|--------------------------|
|        | 1           | 11.8317             | 21.3647            | 33.2416           | 85.4587                  |
|        | 2           | 19.3583             | 26.2694            | 36.931            | 52.5388                  |
|        | 4           | 25.728              | 32.4711            | 61.13             | 32.4711                  |
|        | 8           | 38.314              | 57.1838            | 89.8716           |                          |
|        | 12          | 55.1766             | 72.0581            | 130.691           |                          |
|        | 20          | 103.68              | 131.003            | 202.791           |                          |
| !R     | 24          | 135.785             | 189.656            | 221.721           |                          |
| !R     | 28          | 125.729             | 146.322            | 245.654           |                          |
|        | 32          | 169.057             | 233.785            | 293.562           |                          |

=== GPU ESTIMATES ===
Estimated GPU count (Workflow Runtime): 30.5

Production agent deployment to Amazon Bedrock AgentCore
After evaluating, profiling, and optimizing your agent, you can deploy it to production. Although running the agent locally is sufficient for testing, enterprise deployment requires an agent runtime that helps provide security, scalability, and robust memory management without the overhead of managing infrastructure. This is where Amazon Bedrock AgentCore Runtime shines, providing an enterprise-grade serverless agent runtime. Refer to the step-by-step deployment guide in the NeMo Agent Toolkit repository. By packaging your optimized agent in a container and deploying it to the serverless Bedrock AgentCore Runtime, you elevate your prototype agent to a resilient application that can handle long-running tasks and concurrent user requests. After you deploy the agent, visibility becomes critical. This integration creates a unified observability experience, transforming opaque black-box execution into deep visibility: you gain exact traces, spans, and latency breakdowns for every interaction in production, integrated into Bedrock AgentCore Observability using OpenTelemetry.
The following screenshot shows the Amazon CloudWatch dashboard displaying Amazon Bedrock AgentCore traces and spans, visualizing the execution path and latency of the deployed Strands agent.

Amazon Bedrock AgentCore services extend well beyond agent runtime management and observability. Your deployed agents can seamlessly use additional Bedrock AgentCore services, including Amazon Bedrock AgentCore Identity for authentication and authorization, Amazon Bedrock AgentCore Gateway for tools access, Amazon Bedrock AgentCore Memory for context-awareness, Amazon Bedrock AgentCore Code Interpreter for secure code execution, and Amazon Bedrock AgentCore Browser for web interactions, to create enterprise-ready agents.
Conclusion
Production AI agents need performance visibility, optimization, and reliable infrastructure. For the example use case, this integration delivered on all three fronts: 20% token efficiency gains, 35% accuracy improvements, and performance-tuned GPU infrastructure calibrated for target concurrency. By combining Strands Agents for foundational agent development and orchestration, the NVIDIA NeMo Agent Toolkit for deep agent profiling, optimization, and right-sizing production GPU infrastructure, and Amazon Bedrock AgentCore for secure, scalable agent infrastructure, developers get an end-to-end solution that helps provide predictable outcomes. You can now build, evaluate, optimize, and deploy agents at scale on AWS with this integrated solution. To get started, check out the Strands Agents and NeMo Agent Toolkit integration example and deploying Strands Agents and NeMo Agent Toolkit to Amazon Bedrock AgentCore Runtime.

About the authors
Kosti Vasilakakis is a Principal PM at AWS on the Agentic AI team, where he has led the design and development of several Bedrock AgentCore services from the ground up, including Runtime, Browser, Code Interpreter, and Identity. He previously worked on Amazon SageMaker since its early days, launching AI/ML capabilities now used by thousands of companies worldwide. Earlier in his career, Kosti was a data scientist. Outside of work, he builds personal productivity automations, plays tennis, and enjoys life with his wife and kids.
Sagar Murthy is an agentic AI GTM leader at AWS, where he collaborates with frontier foundation model partners, agentic frameworks, startups, and enterprise customers to evangelize AI and data innovations, open-source solutions, and scale impactful partnerships. With collaboration experiences spanning data, cloud and AI, he brings a blend of technical solutions background and business outcomes focus to delight developers and customers.
Chris Smith is a Solutions Architect at AWS specializing in AI-powered automation and enterprise AI agent orchestration. With over a decade of experience architecting solutions at the intersection of generative AI, cloud computing, and systems integration, he helps organizations design and deploy agent systems that transform emerging technologies into measurable business outcomes. His work spans technical architecture, security-first implementation, and cross-functional team leadership.
Ranjit Rajan is a Senior Solutions Architect at NVIDIA, where he helps customers design and build solutions spanning generative AI, agentic AI, and accelerated multi-modal data processing pipelines for pre-training and fine-tuning foundation models.
Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open-source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.
Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on Agentic AI. He focuses on product strategy and roadmap of integrating Agentic AI library in partner platforms & enhancing user experience on accelerated computing for AI Agents.

Bi-directional streaming for real-time agent interactions now availabl …

Building natural voice conversations with AI agents requires complex infrastructure and lots of code from engineering teams. Text-based agent interactions follow a turn-based pattern: a user sends a complete request, waits for the agent to process it, and receives a full response before continuing. Bi-directional streaming removes this constraint by establishing a persistent connection that carries data in both directions simultaneously.
Amazon Bedrock AgentCore Runtime supports bi-directional streaming for real-time, two-way communication between users and AI agents. With this capability, agents can simultaneously listen to user input while generating responses, creating a more natural conversational flow. This is particularly well-suited for multimodal interactions, such as voice and vision agent conversations. The agent can begin responding while still receiving user input, handle mid-conversation interruptions, and adjust its responses based on real-time feedback.
A bi-directional voice chat agent can conduct spoken conversations with the fluidity of human dialogue so that users can interrupt, clarify, or change topics naturally. These agents process streaming audio input and output simultaneously while maintaining conversational state. Building this infrastructure requires managing persistent low-latency connections, handling concurrent audio streams, preserving context across exchanges, and scaling multiple conversations. Implementing these capabilities from scratch demands months of engineering effort and specialized real-time systems expertise. Amazon Bedrock AgentCore Runtime addresses these challenges by providing a secure, serverless, and purpose-built hosting environment for deploying and running AI agents, without requiring developers to build and maintain complex streaming infrastructure themselves.
In this post, you will learn about bi-directional streaming on AgentCore Runtime and the prerequisites to create a WebSocket implementation. You will also learn how to use Strands Agents to implement a bi-directional streaming solution for voice agents.
AgentCore Runtime bi-directional streaming
Bi-directional streaming uses the WebSocket protocol. WebSocket provides full-duplex communication over a single TCP connection, establishing a persistent channel where data flows continuously in both directions. This protocol has broad client support across browsers, mobile applications, and server environments, making it accessible for diverse implementation scenarios.
When a connection is established, the agent can receive user input as a stream while simultaneously sending response chunks back to the user. The AgentCore Runtime manages the underlying infrastructure that handles connections and message ordering, and it maintains conversational state across the bi-directional exchange. This alleviates the need for developers to build custom streaming infrastructure or manage the complexities of concurrent data flows.
Voice conversations differ from text-based interactions in their expectation of natural flow. When speaking with a voice agent, users expect the same conversational dynamics they experience with humans: the ability to interrupt when they need to correct themselves, to interject clarification mid-response, or to redirect the conversation without awkward pauses.
With bi-directional streaming, voice agents can process incoming audio while generating responses, detecting interruptions, and adjusting behavior in real time. The agent maintains conversational context throughout these interactions, preserving the thread of dialogue even as the conversation shifts direction. This capability helps transform voice agents from turn-based systems into responsive conversational partners.
Beyond voice conversations, bi-directional streaming has several interaction patterns. Interactive debugging sessions allow developers to guide agents through problem-solving in real-time, providing feedback as the agent explores solutions. Collaborative agents can work alongside users on shared tasks, receiving continuous input as the work progresses rather than waiting for complete instructions. Multi-modal agents can process streaming video or sensor data while simultaneously providing analysis and recommendations. Async long-running agent operations can process tasks over minutes or hours while streaming incremental results to clients.
WebSocket implementation
To create a WebSocket implementation in AgentCore Runtime, you should follow a few patterns. Firstly, your containers must implement WebSocket endpoints on port 8080 at the /ws path, which aligns with standard WebSocket server practices. This WebSocket endpoint will enable a single agent container to serve both the traditional InvokeAgentRuntime API and the new InvokeAgentRuntimeWithWebsocketStream API. Additionally, customers must provide a /ping endpoint for health checks.
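The following is a minimal sketch of a container that satisfies this contract; FastAPI and uvicorn are assumed here purely for illustration (any WebSocket-capable server works), and the echo loop in /ws is a placeholder for your agent's actual streaming logic.

# Minimal sketch of the AgentCore Runtime WebSocket contract: a /ws WebSocket
# endpoint plus a /ping health check, served on port 8080. FastAPI and uvicorn
# are assumed here for illustration; any WebSocket-capable server can be used.
from fastapi import FastAPI, WebSocket
import uvicorn

app = FastAPI()

@app.get("/ping")
async def ping():
    # Health check endpoint required alongside the WebSocket endpoint
    return {"status": "healthy"}

@app.websocket("/ws")
async def ws_endpoint(websocket: WebSocket):
    await websocket.accept()
    # Placeholder streaming loop: echo incoming messages back to the client.
    # A real agent would stream model output while continuing to receive input.
    while True:
        message = await websocket.receive_text()
        await websocket.send_text(message)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
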
Bi-directional streaming using WebSockets on AgentCore Runtime supports applications using a WebSocket language library. The client must connect to the service endpoint with a WebSocket protocol connection:

wss://bedrock-agentcore.<region>.amazonaws.com/runtimes/<agentRuntimeArn>/ws

You also need to use one of the supported authentication methods (SigV4 headers, SigV4 pre-signed URL, or OAuth 2.0) and to make sure that the agent application implements the WebSocket service contract as specified in HTTP protocol contract.
Strands bi-directional agent: Simplified voice agent development
Amazon Nova Sonic unifies speech understanding and generation into a single model, delivering human-like conversational AI with low latency, leading accuracy, and strong price performance. Its integrated architecture provides expressive speech generation and real-time transcription in one model, dynamically adapting responses based on input speech prosody, pace, and timbre.
With bi-directional streaming now available in AgentCore Runtime, you have several ways to host a voice agent: a direct implementation, where you manage WebSocket connections, parse protocol events, handle audio chunks, and orchestrate async tasks yourself; or the Strands bi-directional agent implementation, which abstracts this complexity and handles these steps for you.
Example Implementation
This post refers to the Amazon Bedrock AgentCore bi-directional sample code, which implements bi-directional communication with Amazon Bedrock AgentCore. The repository has two implementations: one that uses a native Amazon Nova Sonic Python implementation deployed directly to AgentCore Runtime, and a high-level framework implementation that uses the Strands bi-directional agent for simplified real-time audio conversations.

The following diagram shows the native Amazon Nova Sonic Python WebSocket server deployed directly to AgentCore Runtime. It provides full control over the Nova Sonic protocol with direct event handling for complete visibility into session management, audio streaming, and response generation.

The Strands bi-directional agent framework for real-time audio conversations with Amazon Nova Sonic provides a high-level abstraction that simplifies bi-directional streaming, automatic session management, and tool integration. The code snippet below is an example of this simplification.

from strands.experimental.bidi.agent import BidiAgent
from strands.experimental.bidi.models.nova_sonic import BidiNovaSonicModel
from strands_tools import calculator

# app (FastAPI), WebSocket, and receive_and_convert are defined elsewhere in
# the sample repository; only the agent wiring is shown here.
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket, model_name: str):
    # Define a Nova Sonic BidiModel
    model = BidiNovaSonicModel(
        region="us-east-1",
        model_id="amazon.nova-sonic-v1:0",
        provider_config={
            "audio": {
                "input_sample_rate": 16000,
                "output_sample_rate": 24000,
                "voice": "matthew",
            }
        }
    )
    # Create a Strands Agent with tools and system prompt
    agent = BidiAgent(
        model=model,
        tools=[calculator],
        system_prompt="You are a helpful assistant with access to a calculator tool.",
    )
    # Start streaming conversation
    await agent.run(inputs=[receive_and_convert], outputs=[websocket.send_json])

This implementation demonstrates the simplicity of Strands: instantiate a model, create an agent with tools and a system prompt, and run it with input/output streams. The framework handles protocol complexity internally.
The following is the agent declaration section in the code:

agent = BidiAgent(
    model=model,
    tools=[calculator, weather_api, database_query],
    system_prompt="You are a helpful assistant…"
)

Tools are passed directly to the agent's constructor, and Strands handles function calling orchestration automatically. In summary, a native WebSocket implementation of the same functionality requires approximately 150 lines of code, whereas the Strands implementation reduces this to approximately 20 lines focused on business logic. Developers can focus on defining agent behavior, integrating tools, and crafting system prompts rather than managing WebSocket connections, parsing events, handling audio chunks, or orchestrating async tasks. This makes bi-directional streaming accessible to developers without specialized real-time systems expertise while maintaining full access to the audio conversation capabilities of Nova Sonic.
The Strands bi-directional feature is currently supported only in the Python SDK. If you need maximum flexibility in how your voice agent is implemented, or you have multiple different patterns of communication between agent and model, the native Amazon Nova Sonic implementation lets you control every step of the process. The framework approach, in contrast, manages dependencies through the SDK and provides consistency across systems: the same Strands bi-directional agent code structure works with Nova Sonic, the OpenAI Realtime API, and Google Gemini Live, and developers simply swap the model implementation while keeping the rest of their code unchanged.
Conclusion
The bi-directional streaming capability of Amazon Bedrock AgentCore Runtime transforms how developers can build conversational AI agents. By providing WebSocket-based real-time communication infrastructure, AgentCore removes months of engineering effort required to implement streaming systems from scratch. The framework runtime enables developers to deploy multiple types of voice agents—from native protocol implementations using Amazon Nova Sonic to high-level frameworks like the Strands bi-directional agent—within the same secure, serverless environment.

About the authors
Lana Zhang is a Senior Specialist Solutions Architect for Generative AI at AWS within the Worldwide Specialist Organization. She specializes in AI/ML, with a focus on use cases such as AI voice assistants and multimodal understanding. She works closely with customers across diverse industries, including media and entertainment, gaming, sports, advertising, financial services, and healthcare, to help them transform their business solutions through AI.
Phelipe Fabres is a Senior Specialist Solutions Architect for Generative AI at AWS for Startups. He specializes in AI/ML with a focus on Agentic systems and the full process of training/inference. He has more than 10 years of working with software development, from monolith to event-driven architectures with a Ph.D. in Graph Theory.
Evandro Franco is a Sr. Data Scientist at Amazon Web Services. He is part of the Global GTM team that helps AWS customers overcome business challenges related to AI/ML on top of AWS, mainly on Amazon Bedrock AgentCore and Strands Agents. He has more than 18 years of experience working with technology, from software development, infrastructure, and serverless to machine learning. In his free time, Evandro enjoys playing with his son, mainly building some funny Lego bricks.

Meta AI Releases SAM Audio: A State-of-the-Art Unified Model that Uses …

Meta has released SAM Audio, a prompt driven audio separation model that targets a common editing bottleneck, isolating one sound from a real world mix without building a custom model per sound class. Meta released 3 main sizes, sam-audio-small, sam-audio-base, and sam-audio-large. The model is available to download and to try in the Segment Anything Playground.

Architecture

SAM Audio uses separate encoders for each conditioning signal, an audio encoder for the mixture, a text encoder for the natural language description, a span encoder for time anchors, and a visual encoder that consumes a visual prompt derived from video plus an object mask. The encoded streams are concatenated into time aligned features, then processed by a diffusion transformer that applies self attention over the time aligned representation and cross attention to the textual feature, then a DACVAE decoder reconstructs waveforms and emits 2 outputs, target audio and residual audio.

https://ai.meta.com/blog/sam-audio/

What SAM Audio does, and what ‘segment’ means here?

SAM Audio takes an input recording that contains multiple overlapping sources, for example speech plus traffic plus music, and separates out a target source based on a prompt. In the public inference API, the model produces 2 outputs, result.target and result.residual. The research team describes target as the isolated sound, and residual as everything else.

That target plus residual interface maps directly to editor operations. If you want to remove a dog bark across a podcast track, you can treat the bark as the target, then subtract it by keeping only residual. If you want to extract a guitar part from a concert clip, you keep the target waveform instead. Meta uses these exact kinds of examples to explain what the model is meant to enable.
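
As a minimal sketch of those editor operations, the snippet below assumes a separation result has already been obtained from a text prompt such as "dog barking" (the separation call itself is omitted), that result.target and result.residual are waveform arrays, and that the sample rate sr is known. The soundfile library is used here only to write out the edited audio; the helper names are illustrative, not part of the SAM Audio API.

import soundfile as sf

def remove_target(result, sr, out_path="podcast_no_bark.wav"):
    # "Remove the dog bark": keep everything except the isolated target
    sf.write(out_path, result.residual, sr)

def extract_target(result, sr, out_path="guitar_stem.wav"):
    # "Extract the guitar part": keep only the isolated target waveform
    sf.write(out_path, result.target, sr)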

The 3 prompt types Meta is shipping

Meta positions SAM Audio as a single unified model that supports 3 prompt types, and it says these prompts can be used alone or combined.

Text prompting: You describe the sound in natural language, for example “dog barking” or “singing voice”, and the model separates that sound from the mixture. Meta lists text prompts as one of the core interaction modes, and the open source repo includes an end to end example using SAMAudioProcessor and model.separate.

Visual prompting: You click the person or object in a video and ask the model to isolate the audio associated with that visual object. Meta team describes visual prompting as selecting the sounding object in the video. In the released code path, visual prompting is implemented by passing video frames plus masks into the processor via masked_videos.

Span prompting: Meta team calls span prompting an industry first. You mark time segments where the target sound occurs, then the model uses those spans to guide separation. This matters for ambiguous cases, for example when the same instrument appears in multiple passages, or when a sound is present only briefly and you want to prevent the model from over separating.

https://ai.meta.com/blog/sam-audio/

Results

Meta team positions SAM Audio as achieving cutting edge performance across diverse, real world scenarios, and frames it as a unified alternative to single purpose audio tools. The team publishes a subjective evaluation table across categories, General, SFX, Speech, Speaker, Music, Instr(wild), Instr(pro), with General scores of 3.62 for sam audio small, 3.28 for sam audio base, and 3.50 for sam audio large, and Instr(pro) scores reaching 4.49 for sam audio large.

Key Takeaways

SAM Audio is a unified audio separation model, it segments sound from complex mixtures using text prompts, visual prompts, and time span prompts.

The core API produces two waveforms per request, target for the isolated sound and residual for everything else, which maps cleanly to common edit operations like remove noise, extract stem, or keep ambience.

Meta released multiple checkpoints and variants, including sam-audio-small, sam-audio-base, and sam-audio-large, plus tv variants that the repo says perform better for visual prompting; the repo also publishes a subjective evaluation table by category.

The release includes tooling beyond inference, Meta provides a sam-audio-judge model that scores separation results against a text description with overall quality, recall, precision, and faithfulness.

Check out the Technical details and GitHub Page.
The post Meta AI Releases SAM Audio: A State-of-the-Art Unified Model that Uses Intuitive and Multimodal Prompts for Audio Separation appeared first on MarkTechPost.

How to Orchestrate a Fully Autonomous Multi-Agent Research and Writing …

In this tutorial, we build a small but powerful two-agent CrewAI system that collaborates using the Gemini Flash model. We set up our environment, authenticate securely, define specialized agents, and orchestrate tasks that flow from research to structured writing. As we run the crew, we observe how each component works together in real time, giving us a hands-on understanding of modern agentic workflows powered by LLMs. With these steps, we clearly see how multi-agent pipelines become practical, modular, and developer-friendly. Check out the FULL CODES HERE.

import os
import sys
import getpass
from textwrap import dedent

print("Installing CrewAI and tools... (this may take 1-2 mins)")
!pip install -q crewai crewai-tools

from crewai import Agent, Task, Crew, Process, LLM

We set up our environment and installed the required CrewAI packages so we can run everything smoothly in Colab. We import the necessary modules and lay the foundation for our multi-agent workflow. This step ensures that our runtime is clean and ready for the agents we create next. Check out the FULL CODES HERE.

print("\n--- API Authentication ---")
api_key = None

try:
    from google.colab import userdata
    api_key = userdata.get('GEMINI_API_KEY')
    print("Found GEMINI_API_KEY in Colab Secrets.")
except Exception:
    pass

if not api_key:
    print("Key not found in Secrets.")
    api_key = getpass.getpass("Enter your Google Gemini API Key: ")

os.environ["GEMINI_API_KEY"] = api_key

if not api_key:
    sys.exit("Error: No API Key provided. Please restart and enter a key.")

We authenticate ourselves securely by retrieving or entering the Gemini API key. We ensure the key is securely stored in the environment so the model can operate without interruption. This step gives us confidence that our agent framework can communicate reliably with the LLM. Check out the FULL CODES HERE.

gemini_flash = LLM(
    model="gemini/gemini-2.0-flash",
    temperature=0.7
)

We configure the Gemini Flash model that our agents rely on for reasoning and generation. We choose the temperature and model variant to balance creativity and precision. This configuration becomes the shared intelligence that drives all agent tasks ahead. Check out the FULL CODES HERE.

researcher = Agent(
    role='Tech Researcher',
    goal='Uncover cutting-edge developments in AI Agents',
    backstory=dedent("""You are a veteran tech analyst with a knack for finding emerging trends before they become mainstream. You specialize in Autonomous AI Agents and Large Language Models."""),
    verbose=True,
    allow_delegation=False,
    llm=gemini_flash
)

writer = Agent(
    role='Technical Writer',
    goal="Write a concise, engaging blog post about the researcher's findings",
    backstory=dedent("""You transform complex technical concepts into compelling narratives. You write for a developer audience who wants practical insights without fluff."""),
    verbose=True,
    allow_delegation=False,
    llm=gemini_flash
)

We define two specialized agents, a researcher and a writer, each with a clear role and backstory. We design them so they complement one another, allowing one to discover insights while the other transforms them into polished writing. Here, we begin to see how multi-agent collaboration takes shape. Check out the FULL CODES HERE.

research_task = Task(
    description=dedent("""Conduct a simulated research analysis on 'The Future of Agentic AI in 2025'. Identify three key trends: 1. Multi-Agent Orchestration 2. Neuro-symbolic AI 3. On-device Agent execution Provide a summary for each based on your 'expert knowledge'."""),
    expected_output="A structured list of 3 key AI trends with brief descriptions.",
    agent=researcher
)

write_task = Task(
    description=dedent("""Using the researcher's findings, write a short blog post (approx 200 words). The post should have: - A catchy title - An intro - The three bullet points - A conclusion on why developers should care."""),
    expected_output="A markdown-formatted blog post.",
    agent=writer,
    context=[research_task]
)

We create two tasks that assign specific responsibilities to our agents. We let the researcher generate structured insights and then pass the output to the writer to create a complete blog post. This step shows how we orchestrate sequential task dependencies cleanly within CrewAI. Check out the FULL CODES HERE.

tech_crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential,
    verbose=True
)

print("\n--- Starting the Crew ---")
result = tech_crew.kickoff()

from IPython.display import Markdown, display
print("\n\n########################")
print("## FINAL OUTPUT ##")
print("########################\n")
display(Markdown(str(result)))

We assemble the agents and tasks into a crew and run the entire multi-agent workflow. We watch how the system executes step by step, producing the final markdown output. This is where everything comes together, and we see our agents collaborating in real time.

In conclusion, we appreciate how seamlessly CrewAI allows us to create coordinated agent systems that think, research, and write together. We experience firsthand how defining roles, tasks, and process flows lets us modularize complex work and achieve coherent outputs with minimal code. This framework empowers us to build richer, more autonomous agentic applications, and we walk away confident in extending this foundation into larger multi-agent systems, production pipelines, or more creative AI collaborations.

Check out the FULL CODES HERE.
The post How to Orchestrate a Fully Autonomous Multi-Agent Research and Writing Pipeline Using CrewAI and Gemini for Real-Time Intelligent Collaboration appeared first on MarkTechPost.

Thinking Machines Lab Makes Tinker Generally Available: Adds Kimi K2 T …

Thinking Machines Lab has moved its Tinker training API into general availability and added 3 major capabilities: support for the Kimi K2 Thinking reasoning model, OpenAI compatible sampling, and image input through Qwen3-VL vision language models. For AI engineers, this turns Tinker into a practical way to fine tune frontier models without building distributed training infrastructure.

What Tinker Actually Does?

Tinker is a training API that focuses on large language model fine tuning and hides the heavy lifting of distributed training. You write a simple Python loop that runs on a CPU only machine. You define the data or RL environment, the loss, and the training logic. The Tinker service maps that loop onto a cluster of GPUs and executes the exact computation you specify.

The API exposes a small set of primitives, such as forward_backward to compute gradients, optim_step to update weights, sample to generate outputs, and functions for saving and loading state. This keeps the training logic explicit for people who want to implement supervised learning, reinforcement learning, or preference optimization, but do not want to manage GPU failures and scheduling.
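
As a rough illustration of how these primitives compose, the following hedged sketch outlines a supervised fine-tuning loop. The four primitive names come from the list above, but the client construction, argument names, and batch format are assumptions rather than the documented Tinker API.

# Illustrative pseudocode: composing Tinker's primitives into a training loop.
# Client setup, argument names, and batch format are assumptions for clarity.
def train(training_client, batches, num_steps, save_every=100):
    for step in range(num_steps):
        batch = next(batches)                    # your tokenized supervised data
        training_client.forward_backward(batch)  # compute gradients on the service
        training_client.optim_step()             # apply the optimizer update
        if step % save_every == 0:
            training_client.save_state()         # checkpoint the LoRA adapter weights
    # Spot-check generations from the trained adapter
    return training_client.sample(prompt="...", max_tokens=64)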

Tinker uses low rank adaptation, LoRA, rather than full fine tuning for all supported models. LoRA trains small adapter matrices on top of frozen base weights, which reduces memory and makes it practical to run repeated experiments on large mixture of experts models in the same cluster.

General Availability and Kimi K2 Thinking

The flagship change in the December 2025 update is that Tinker no longer has a waitlist. Anyone can sign up, see the current model lineup and pricing, and run cookbook examples directly.

On the model side, users can now fine tune moonshotai/Kimi-K2-Thinking on Tinker. Kimi K2 Thinking is a reasoning model with about 1 trillion total parameters in a mixture of experts architecture. It is designed for long chains of thought and heavy tool use, and it is currently the largest model in the Tinker catalog.

In the Tinker model lineup, Kimi K2 Thinking appears as a Reasoning MoE model, alongside Qwen3 dense and mixture of experts variants, Llama-3 generation models, and DeepSeek-V3.1. Reasoning models always produce internal chains of thought before the visible answer, while instruction models focus on latency and direct responses.

OpenAI Compatible Sampling While Training

Tinker already had a native sampling interface through its SamplingClient. The typical inference pattern builds a ModelInput from token ids, passes SamplingParams, and calls sample to get a future that resolves to outputs.
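
For comparison, the following is a hedged sketch of that native pattern; ModelInput, EncodedTextChunk, and SamplingParams are the names described in the release, but the exact SamplingParams fields and the sampling_client construction shown here are assumptions.

# Hedged sketch of the native sampling path; field names and client setup are
# assumptions based on the pattern described above, not verified signatures.
prompt = tinker.ModelInput(chunks=[
    tinker.types.EncodedTextChunk(tokens=tokenizer.encode("The capital of France is")),
])
params = tinker.types.SamplingParams(max_tokens=20, temperature=0.0)
future = sampling_client.sample(prompt=prompt, sampling_params=params, num_samples=1)
result = future.result()  # the future resolves to the generated outputs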

The new release adds a second path that mirrors the OpenAI completions interface. A model checkpoint on Tinker can be referenced through a URI like:

# openai_client is an OpenAI-style client configured to reach Tinker's
# OpenAI-compatible endpoint (client setup not shown here).
response = openai_client.completions.create(
    model="tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080",
    prompt="The capital of France is",
    max_tokens=20,
    temperature=0.0,
    stop=["\n"],
)

Vision Input With Qwen3-VL On Tinker

The second major capability is image input. Tinker now exposes 2 Qwen3-VL vision language models, Qwen/Qwen3-VL-30B-A3B-Instruct and Qwen/Qwen3-VL-235B-A22B-Instruct. They are listed in the Tinker model lineup as Vision MoE models and are available for training and sampling through the same API surface.

To send an image into a model, you construct a ModelInput that interleaves an ImageChunk with text chunks. The research blog uses the following minimal example:

model_input = tinker.ModelInput(chunks=[
    tinker.types.ImageChunk(data=image_data, format="png"),
    tinker.types.EncodedTextChunk(tokens=tokenizer.encode("What is this?")),
])

Here image_data is raw bytes and format identifies the encoding, for example png or jpeg. You can use the same representation for supervised learning and for RL fine tuning, which keeps multimodal pipelines consistent at the API level. Vision inputs are fully supported in Tinker’s LoRA training setup.

https://thinkingmachines.ai/blog/tinker-general-availability/

Qwen3-VL Versus DINOv2 On Image Classification

To show what the new vision path can do, the Tinker team fine tuned Qwen3-VL-235B-A22B-Instruct as an image classifier. They used 4 standard datasets:

Caltech 101

Stanford Cars

Oxford Flowers

Oxford Pets

Because Qwen3-VL is a language model with visual input, classification is framed as text generation. The model receives an image and generates the class name as a text sequence.

As a baseline, they fine tuned a DINOv2 base model. DINOv2 is a self supervised vision transformer that encodes images into embeddings and is often used as a backbone for vision tasks. For this experiment, a classification head is attached on top of DINOv2 to predict a distribution over the N labels in each dataset.
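
For readers who want a feel for that baseline, the following is a standalone PyTorch sketch of a frozen DINOv2 base backbone with a linear classification head. It illustrates the idea described above but is not the experiment's actual code, which trained the baseline with LoRA adapters inside Tinker; the hub entrypoint and 768-dimensional embedding are the standard public ones for the ViT-B/14 model.

import torch
import torch.nn as nn

class DinoV2Classifier(nn.Module):
    """Frozen DINOv2 base backbone + linear head over N labels (illustrative)."""
    def __init__(self, num_classes: int):
        super().__init__()
        # Public DINOv2 ViT-B/14 backbone; returns 768-dim CLS embeddings
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        for p in self.backbone.parameters():
            p.requires_grad = False           # keep the backbone frozen
        self.head = nn.Linear(768, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            features = self.backbone(images)  # (batch, 768) embeddings
        return self.head(features)            # (batch, num_classes) logits

# Example: logits for a batch of 224x224 RGB images over 102 flower classes
model = DinoV2Classifier(num_classes=102)
logits = model(torch.randn(2, 3, 224, 224))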

Both Qwen3-VL-235B-A22B-Instruct and DINOv2 base are trained using LoRA adapters within Tinker. The focus is data efficiency. The experiment sweeps the number of labeled examples per class, starting from only 1 sample per class and increasing. For each setting, the team measures classification accuracy.

Key Takeaways

Tinker is now generally available, so anyone can sign up and fine tune open weight LLMs through a Python training loop while Tinker handles the distributed training backend.

The platform supports Kimi K2 Thinking, a 1 trillion parameter mixture of experts reasoning model from Moonshot AI, and exposes it as a fine tunable reasoning model in the Tinker lineup.

Tinker adds an OpenAI compatible inference interface, which lets you sample from in training checkpoints using a tinker://… model URI through standard OpenAI style clients and tooling.

Vision input is enabled through Qwen3-VL models, Qwen3-VL 30B and Qwen3-VL 235B, so developers can build multimodal training pipelines that combine ImageChunk inputs with text using the same LoRA based API.

Thinking Machines demonstrates that Qwen3-VL 235B, fine tuned on Tinker, achieves stronger few shot image classification performance than a DINOv2 base baseline on datasets such as Caltech 101, Stanford Cars, Oxford Flowers, and Oxford Pets, highlighting the data efficiency of large vision language models.

The post Thinking Machines Lab Makes Tinker Generally Available: Adds Kimi K2 Thinking And Qwen3-VL Vision Input appeared first on MarkTechPost.

Tracking and managing assets used in AI development with Amazon SageM …

Building custom foundation models requires coordinating multiple assets across the development lifecycle such as data assets, compute infrastructure, model architecture and frameworks, lineage, and production deployments. Data scientists create and refine training datasets, develop custom evaluators to assess model quality and safety, and iterate through fine-tuning configurations to optimize performance. As these workflows scale across teams and environments, tracking which specific dataset versions, evaluator configurations, and hyperparameters produced each model becomes challenging. Teams often rely on manual documentation in notebooks or spreadsheets, making it difficult to reproduce successful experiments or understand the lineage of production models.
This challenge intensifies in enterprise environments with multiple AWS accounts for development, staging, and production. As models move through deployment pipelines, maintaining visibility into their training data, evaluation criteria, and configurations requires significant coordination. Without automated tracking, teams lose the ability to trace deployed models back to their origins or share assets consistently across experiments. Amazon SageMaker AI supports tracking and managing assets used in generative AI development. With Amazon SageMaker AI, you can register and version models, datasets, and custom evaluators, and automatically capture relationships and lineage as you fine-tune, evaluate, and deploy generative AI models. This reduces manual tracking overhead and provides complete visibility into how models were created, from base foundation model through production deployment.
In this post, we'll explore the new capabilities and core concepts that help organizations track and manage model development and deployment lifecycles. We will show you how to configure these features to train models with automatic end-to-end lineage, from dataset upload and versioning to model fine-tuning, evaluation, and seamless endpoint deployment.
Managing dataset versions across experiments
As you refine training data for model customization, you typically create multiple versions of datasets. You can register datasets and create new versions as your data evolves, with each version tracked independently. When you register a dataset in SageMaker AI, you provide the S3 location and metadata describing the dataset. As you refine your data—whether adding more examples, improving quality, or adjusting for specific use cases—you can create new versions of the same dataset. Each version, as shown in the following image, maintains its own metadata and S3 location so you can track the evolution of your training data over time.

When you use a dataset for fine-tuning, Amazon SageMaker AI automatically links the specific dataset version to the resulting model. This supports the comparison between models trained with different dataset versions and helps you understand which data refinements led to better performance. You can also reuse the same dataset version across multiple experiments for consistency when testing different hyperparameters or fine-tuning techniques.
Creating reusable custom evaluators
Evaluating custom models often requires domain-specific quality, safety, or performance criteria. A custom evaluator consists of Lambda function code that receives input data and returns evaluation results including scores and validation status. You can define evaluators for various purposes—checking response quality, assessing safety and toxicity, validating output format, or measuring task-specific accuracy. You can track custom evaluators using AWS Lambda functions that implement your evaluation logic, then version and reuse these evaluators across models and datasets, as shown in the following image.
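The following is a minimal Lambda handler sketch for a custom evaluator; the event and response field names used here (prediction, reference, score, passed) are illustrative assumptions, not the exact schema SageMaker AI expects.

import json

# Hedged sketch of a custom evaluator Lambda: scores a model response against
# a reference and returns a score plus a pass/fail validation status.
def lambda_handler(event, context):
    prediction = event.get("prediction", "")
    reference = event.get("reference", "")

    # Toy quality metric: token overlap between the prediction and the reference
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    score = len(pred_tokens & ref_tokens) / max(len(ref_tokens), 1)

    return {
        "statusCode": 200,
        "body": json.dumps({
            "score": round(score, 3),
            "passed": score >= 0.5,  # validation status against a threshold
        }),
    }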

Automatic lineage tracking throughout the development lifecycle
SageMaker AI lineage tracking capability automatically captures relationships between assets as you build and evaluate models. When you create a fine-tuning job, Amazon SageMaker AI links the training job to input datasets, base foundation models, and output models. When you run evaluation jobs, it connects evaluations to the models being assessed and the evaluators used. This automatic lineage capture means you don’t need to manually document which assets were used for each experiment. You can view the complete lineage for a model, showing its base foundation model, training datasets with specific versions, hyperparameters, evaluation results, and deployment locations, as shown in the image below.

With the lineage view, you can trace any deployed models back to their origins. For example, if you need to understand why a production model behaves in a certain way, you can see exactly which training data, fine-tuning configuration, and evaluation criteria were used. This is particularly valuable for governance, reproducibility, and debugging purposes. You can also use lineage information to reproduce experiments. By identifying the exact dataset version, evaluator version, and configuration used for a successful model, you can recreate the training process with confidence that you’re using identical inputs.
Integrating with MLflow for experiment tracking
The model customization capabilities of Amazon SageMaker AI are integrated by default with SageMaker AI MLflow Apps, providing automatic linking between model training jobs and MLflow experiments. When you run model customization jobs, the necessary MLflow actions are performed automatically: the default SageMaker AI MLflow App is used, an MLflow experiment is selected for you, and all the metrics, parameters, and artifacts are logged. From the SageMaker AI Studio model page, you can see metrics sourced from MLflow (as shown in the following image) and view the full metrics within the associated MLflow experiment.

With MLflow integration it is straightforward to compare multiple model candidates. You can use MLflow to visualize performance metrics across experiments, identify the best-performing model, then use the lineage to understand which specific datasets and evaluators produced that result. This helps you make informed decisions about which models to promote to production based on both quantitative metrics and asset provenance.
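The following is a minimal sketch of comparing candidate runs with the standard MLflow client; the tracking server ARN, experiment name, and metric and parameter names are placeholders to replace with your own values.

import mlflow

# Point the client at the SageMaker AI managed MLflow tracking server (placeholder ARN)
mlflow.set_tracking_uri("arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<name>")

# Rank candidate runs by a quality metric and inspect key parameters
runs = mlflow.search_runs(
    experiment_names=["my-customization-experiment"],  # placeholder experiment name
    order_by=["metrics.eval_accuracy DESC"],           # placeholder metric name
    max_results=5,
)
print(runs[["run_id", "metrics.eval_accuracy", "params.learning_rate"]])
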
Getting started with tracking and managing generative AI assets
By bringing together these model customization assets and processes (dataset versioning, evaluator tracking, model performance, and model deployment), you can turn scattered model assets into a traceable, reproducible, and production-ready workflow with automatic end-to-end lineage. This capability is now available in supported AWS Regions, and you can access it through Amazon SageMaker AI Studio and the SageMaker Python SDK.
To get started:

Open Amazon SageMaker AI Studio and navigate to the Models section.
Customize the JumpStart base models to create a model.
Navigate to the Assets section to manage datasets and evaluators.
Register your first dataset by providing an S3 location and metadata.
Create a custom evaluator using an existing Lambda function or create a new one.
Use registered datasets in your fine-tuning jobs—lineage is captured automatically.
View lineage for the model to see complete relationships.

For more information, visit the Amazon SageMaker AI documentation.

About the authors
Amit Modi is the product leader for SageMaker AI MLOps, ML Governance, and Inference at AWS. With over a decade of B2B experience, he builds scalable products and teams that drive innovation and deliver value to customers globally.
Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customer through their AIOps journey across model training, GenAI applications like Agents, and scaling GenAI use-cases. He also focuses on go-to-market strategies helping AWS build and align products to solve industry challenges in the generative AI space. You can connect with Sandeep on LinkedIn to learn about GenAI solutions.

Track machine learning experiments with MLflow on Amazon SageMaker usi …

A user can conduct machine learning (ML) data experiments in data environments, such as Snowflake, using the Snowpark library. However, tracking these experiments across diverse environments can be challenging due to the difficulty in maintaining a central repository to monitor experiment metadata, parameters, hyperparameters, models, results, and other pertinent information. In this post, we demonstrate how to integrate Amazon SageMaker managed MLflow as a central repository to log these experiments and provide a unified system for monitoring their progress.
Amazon SageMaker managed MLflow offers fully managed services for experiment tracking, model packaging, and model registry. The SageMaker Model Registry streamlines model versioning and deployment, facilitating seamless transitions from development to production. Additionally, integration with Amazon S3, AWS Glue, and SageMaker Feature Store enhances data management and model traceability. The key benefits of using MLflow with SageMaker are that it allows organizations to standardize ML workflows, improve collaboration, and accelerate artificial intelligence (AI)/ML adoption with a more secure and scalable infrastructure. In this post, we show how to integrate Amazon SageMaker managed MLflow with Snowflake.
Snowpark allows you to use Python, Scala, or Java to create custom data pipelines for efficient data manipulation and preparation when training data is stored in Snowflake. Users can conduct experiments in Snowpark and track them in Amazon SageMaker managed MLflow. This integration allows data scientists to run transformations and feature engineering in Snowflake and utilize the managed infrastructure within SageMaker for training and deployment, facilitating more seamless workflow orchestration and more secure data handling.
Solution overview
The integration leverages Snowpark for Python, a client-side library that allows Python code to interact with Snowflake from Python kernels, such as SageMaker’s Jupyter notebooks. One workflow could include data preparation in Snowflake, along with feature engineering and model training within Snowpark. Amazon SageMaker managed MLflow can then be used for experiment tracking and model registry integrated with the capabilities of SageMaker.

Figure 1: Architecture diagram

Capture key details with MLflow Tracking
MLflow Tracking is important in the integration between SageMaker, Snowpark, and Snowflake by providing a centralized environment for logging and managing the entire machine learning lifecycle. As Snowpark processes data from Snowflake and trains models, MLflow Tracking can be used to capture key details including model parameters, hyperparameters, metrics, and artifacts. This allows data scientists to monitor experiments, compare different model versions, and verify reproducibility. With MLflow’s versioning and logging capabilities, teams can seamlessly trace the results back to the specific dataset and transformations used, making it simpler to track the performance of models over time and maintain a transparent and efficient ML workflow.
This approach offers several benefits. It allows for scalable and managed MLflow tracker in SageMaker, while utilizing the processing capabilities of Snowpark for model inference within the Snowflake environment, creating a unified data system. The workflow remains within the Snowflake environment, which enhances data security and governance. Additionally, this setup helps to reduce cost by utilizing the elastic compute power of Snowflake for inference without maintaining a separate infrastructure for model serving.
Prerequisites
Create or configure the following resources and confirm access to them before setting up Amazon SageMaker MLflow:

A Snowflake account
An S3 bucket to track experiments in MLflow
An Amazon SageMaker Studio account
An AWS Identity and Access Management (IAM) role that is an Amazon SageMaker Domain Execution Role in the AWS account.
A new user with permission to access the S3 bucket created above; follow these steps.

Confirm access to an AWS account through the AWS Management Console and AWS Command Line Interface (AWS CLI). The AWS Identity and Access Management (IAM) user must have permissions to make the necessary AWS service calls and manage AWS resources mentioned in this post. While providing permissions to the IAM user, follow the principle of least-privilege.

Configure access to the Amazon S3 bucket created above following these steps.
Follow these steps to set up external access for Snowflake Notebooks.

Steps to call SageMaker’s MLflow Tracking Server from Snowflake
We now establish the Snowflake environment and connect it to the Amazon SageMaker MLflow Tracking Server that we previously set up.

Follow these steps to create an Amazon SageMaker Managed MLflow Tracking Server in Amazon SageMaker Studio.
Log in to Snowflake as an admin user.
Create a new Notebook in Snowflake

Projects > Notebooks > +Notebook
Change role to a non-admin role
Give a name, select a database (DB), schema, warehouse, and select ‘Run on container’
Notebook settings > External access> toggle on to allow all integration

Install libraries

!pip install sagemaker-mlflow

Run the following MLflow code, replacing the role ARN and MLflow tracking server ARN values:

import mlflow
import boto3
import logging

sts = boto3.client("sts")
assumed = sts.assume_role(
    RoleArn="<AWS-ROLE-ARN>",
    RoleSessionName="sf-session"
)
# Temporary credentials from the assumed role; these are typically exported as
# AWS environment variables so that subsequent AWS calls are signed with them.
creds = assumed["Credentials"]

arn = "<ml-flow-arn>"

try:
    mlflow.set_tracking_uri(arn)
    mlflow.set_experiment("Default")
    with mlflow.start_run():
        mlflow.log_param("test_size", 0.2)
        mlflow.log_param("random_state", 42)
        mlflow.log_param("model_type", "LinearRegression")
except Exception as e:
    logging.error(f"Failed to set tracking URI: {e}")

Figure 3: Install sagemaker-mlflow library

Figure 4: Configure MLflow and do experiments.

On a successful run, the experiment can be tracked on Amazon SageMaker:

Figure 5: Track experiments in SageMaker MLflow

To view the details of an experiment, choose the corresponding run name:

Figure 6: Experience detailed experiment insights

Clean up
Follow these steps to clean up the resources configured in this post and help avoid ongoing costs.

Delete the SageMaker Studio account by following these steps; this deletes the MLflow tracking server as well
Delete the S3 bucket with its contents
Drop the Snowflake notebook
Verify that the Amazon SageMaker account is deleted

Conclusion
In this post, we explored how Amazon SageMaker managed MLflow can provide a comprehensive solution for managing a machine learning lifecycle. The integration with Snowflake through Snowpark further enhances this solution, helping to enable seamless data processing and model deployment workflows.
To get started, follow the step-by-step instructions provided above to set up MLflow Tracking Server in Amazon SageMaker Studio and integrate it with Snowflake. Remember to follow AWS security best practices by implementing proper IAM roles and permissions and securing all credentials appropriately.
The code samples and instructions in this post serve as a starting point; they can be adapted to match specific use cases and requirements while maintaining security and scalability best practices.

About the authors
Ankit Mathur is a Solutions Architect at AWS focused on modern data platforms, AI-driven analytics, and AWS–Partner integrations. He helps customers and partners design secure, scalable architectures that deliver measurable business outcomes.
Mark Hoover is a Senior Solutions Architect at AWS where he is focused on helping customers build their ideas in the cloud. He has partnered with many enterprise clients to translate complex business strategies into innovative solutions that drive long-term growth.

Governance by design: The essential guide for successful AI scaling

Picture this: Your enterprise has just deployed its first generative AI application. The initial results are promising, but as you plan to scale across departments, critical questions emerge. How will you enforce consistent security, prevent model bias, and maintain control as AI applications multiply?
It turns out you’re not alone. A McKinsey survey spanning 750+ leaders across 38 countries reveals both challenges and opportunities when building a governance strategy. While organizations are committing significant resources—most planning to invest over $1 million in responsible AI—implementation hurdles persist. Knowledge gaps represent the primary barrier for over 50% of respondents, with 40% citing regulatory uncertainty.
Yet companies with established responsible AI programs report substantial benefits: 42% see improved business efficiency, while 34% experience increased consumer trust. These results point to why robust risk management is fundamental to realizing AI’s full potential.
Responsible AI: A non-negotiable from day one
At the AWS Generative AI Innovation Center, we’ve observed that organizations achieving the strongest results embed governance into their DNA from the start. This aligns with the AWS commitment to responsible AI development, evidenced by our recent launch of the AWS Well-Architected Responsible AI Lens, a comprehensive framework for implementing responsible practices throughout the development lifecycle.
The Innovation Center has consistently applied these principles by embracing a responsible by design philosophy, carefully scoping use cases, and following science-backed guidance. This approach led to our AI Risk Intelligence (AIRI) solution, which transforms these best practices into actionable, automated governance controls—making responsible AI implementation both attainable and scalable.
Four tips for responsible and secure generative AI deployments
Drawing from our experience helping more than one thousand organizations across industries and geographies, here are key strategies for integrating robust governance and security controls into the development, review, and deployment of AI applications through an automated and seamless process.
1 – Adopt a governance-by-design mindset
At the Innovation Center, we work daily with organizations at the forefront of generative and agentic AI adoption. We’ve observed a consistent pattern: while the promise of generative AI captivates business leaders, they often struggle to chart a path toward responsible and secure implementation. The organizations achieving the most impressive results establish a governance-by-design mindset from the start—treating AI risk management and responsible AI considerations as foundational elements rather than compliance checkboxes. This approach transforms governance from a perceived barrier into a strategic advantage for faster innovation while maintaining appropriate controls. By embedding governance into the development process itself, these organizations can scale their AI initiatives more confidently and securely.
2 – Align technology, business, and governance
The primary mission of the Innovation Center is helping customers develop and deploy AI solutions to meet business needs, while leveraging the most appropriate AWS services. However, technical exploration must go hand-in-hand with governance planning. Think of it like conducting an orchestra—you wouldn’t coordinate a symphony without understanding how each instrument works and how they harmonize together. Similarly, effective AI governance requires a deep understanding of the underlying technology before implementing controls. We help organizations establish clear connections between technology capabilities, business objectives, and governance requirements from the start, making sure these three elements work in concert.
3 – Embed security as the governance gateway
After establishing a governance-by-design mindset and aligning business, technology, and governance objectives, the next crucial step is implementation. We’ve found that security serves as the most effective entry point for operationalizing comprehensive AI governance. Security not only provides vital protection but also supports responsible innovation by building trust into the foundation of AI systems. The approach used by the Innovation Center emphasizes security-by-design throughout the implementation journey, from basic infrastructure protection to sophisticated threat detection in complex workflows.
To support this approach, we help customers leverage capabilities like the AWS Security Agent, which automates security validation across the development lifecycle. This frontier agent conducts customized security reviews and penetration testing based on centrally defined standards, helping organizations scale their security expertise to match development velocity.
This security-first approach anchors a broader set of governance controls. The AWS Responsible AI framework unites fairness, explainability, privacy and security, safety, controllability, veracity and robustness, governance, and transparency into a cohesive approach. As AI systems integrate deeper into business processes and autonomous decision-making, automating these controls while maintaining rigorous oversight becomes crucial for scaling successfully.
4 – Automate governance at enterprise scale
With the foundational elements in place—mindset, alignment, and security controls—organizations need a way to systematically scale their governance efforts. This is where the AIRI solution comes in. Rather than creating new processes, it operationalizes the principles and controls we’ve discussed through automation, in a phased approach.

The solution’s architecture integrates seamlessly with existing workflows through a three-step process: user input, automated assessment, and actionable insights. It analyzes everything from source code to system documentation, using advanced techniques like automated document processing and LLM-based evaluations to conduct comprehensive risk assessments. Most importantly, it performs dynamic testing of generative AI systems, checking for semantic consistency and potential vulnerabilities while adapting to each organization’s specific requirements and industry standards.

From theory to practice
The true measure of effective AI governance is how it evolves with an organization while maintaining rigorous standards at scale. When implemented successfully, automated governance enables teams to focus on innovation, confident that their AI systems operate within appropriate guardrails. A compelling example comes from our collaboration with Ryanair, Europe’s largest airline group. As they scale towards 300 million passengers by 2034, Ryanair needed responsible AI governance for their cabin crew application, which provides frontline staff with crucial operational information. Using Amazon Bedrock, the Innovation Center conducted an AI-powered evaluation. This established transparent, data-driven risk management where risks were previously difficult to quantify—creating a model for responsible AI governance that Ryanair can now expand across their AI portfolio.
This implementation demonstrates the broader impact of systematic AI governance. Organizations using this framework consistently report accelerated paths to production, reduced manual work, and enhanced risk management capabilities. Most importantly, they’ve achieved strong cross-functional alignment, from technology to legal to security teams—all working from clear, measurable objectives.
A foundation for innovation
Responsible AI governance isn’t a constraint—it’s a catalyst. By embedding governance into the fabric of AI development, organizations can innovate with confidence, knowing they have the controls to scale securely and responsibly. The example above demonstrates how automated governance transforms theoretical frameworks into practical solutions that drive business value while maintaining trust.
Learn more about the AWS Generative AI Innovation Center and how we’re helping organizations of different sizes implement responsible AI to complement their business objectives.

About the Authors
Segolene Dessertine-Panhard is the global tech lead for Responsible AI and AI governance initiatives at the AWS Generative AI Innovation Center. In this role, she supports AWS customers in scaling their generative AI strategies by implementing robust governance processes and effective AI and cybersecurity risk management systems, leveraging AWS capabilities and state-of-the-art scientific models. Prior to joining AWS in 2018, she was a full-time professor of Finance at New York University’s Tandon School of Engineering. She also served for several years as an independent consultant in financial disputes and regulatory investigations. She holds a Ph.D. from Paris Sorbonne University.
Sri Elaprolu serves as Director of the AWS Generative AI Innovation Center, where he leverages nearly three decades of technology leadership experience to drive artificial intelligence and machine learning innovation. In this role, he leads a global team of machine learning scientists and engineers who develop and deploy advanced generative and agentic AI solutions for enterprise and government organizations facing complex business challenges. Throughout his nearly 13-year tenure at AWS, Sri has held progressively senior positions, including leadership of ML science teams that partnered with high-profile organizations such as the NFL, Cerner, and NASA. These collaborations enabled AWS customers to harness AI and ML technologies for transformative business and operational outcomes. Prior to joining AWS, he spent 14 years at Northrop Grumman, where he successfully managed product development and software engineering teams. Sri holds a Master’s degree in Engineering Science and an MBA with a concentration in general management, providing him with both the technical depth and business acumen essential for his current leadership role.
Randi Larson connects AI innovation with executive strategy for the AWS Generative AI Innovation Center, shaping how organizations understand and translate technical breakthroughs into business value. She hosts the Innovation Center’s podcast series and combines strategic storytelling with data-driven insight through global keynotes and executive interviews on AI transformation. Before Amazon, Randi refined her analytical precision as a Bloomberg journalist and consultant to economic institutions, think tanks, and family offices on financial technology initiatives. Randi holds an MBA from Duke University’s Fuqua School of Business and a B.S. in Journalism and Spanish from Boston University.

How Tata Power CoE built a scalable AI-powered solar panel inspection …

This post is co-written with Vikram Bansal from Tata Power, and Gaurav Kankaria, Omkar Dhavalikar from Oneture.
The global adoption of solar energy is rapidly increasing as organizations and individuals transition to renewable energy sources. India is on the brink of a solar energy revolution, with a national goal to empower 10 million households with rooftop solar installations by 2027. However, as the number of installations surges into the millions, a critical need has emerged: ensuring each solar panel system is properly installed and maintained. Traditional manual inspection methods—which involve physical site visits, visual assessments, and paper-based documentation—have become a significant bottleneck. They’re prone to human error, inconsistent, and can create substantial time delays. To address these challenges, Tata Power Center of Technology Excellence (CoE) collaborated with Oneture Technologies as their AI analytics partner to develop an AI-powered solar panel installation inspection solution using Amazon SageMaker AI, Amazon Bedrock and other AWS services.

In this post, we explore how Tata Power CoE and Oneture Technologies use AWS services to automate the inspection process end-to-end.
Challenges
As Tata Power scales up their solar panel installations, several key challenges emerge with the current process:
Time-consuming manual inspection: Traditional inspection processes require engineers to visually inspect every panel and manually document their findings. This approach is time-consuming and susceptible to human error. Engineers must carefully examine multiple aspects of the installation, from panel alignment to wiring connections, making the process lengthy and mentally taxing.
Limited scalability: The current manual inspection process cannot keep pace with the rapidly increasing volume of installations, creating a widening gap between inspection capacity and demand. As Tata Power aims to handle millions of new installations, the limitations of manual processes become increasingly apparent, potentially creating bottlenecks in installations.
Inconsistent quality standards: Deploying multiple inspection teams across various locations makes it harder to maintain uniform quality standards. Different teams might interpret and apply quality guidelines differently, resulting in variations in how assessments are conducted and documented. This lack of standardization makes it difficult to achieve consistent quality across all installations.
Increasing customer escalations: Inconsistent installation quality and delays in completion result in a growing number of customer complaints and escalations. These issues directly affect the customer experience, with customers expressing dissatisfaction over varying quality standards and extended waiting periods.
Solution overview
Implementing an AI-powered inspection system to perform more than 22 distinct checks across six different solar installation components required complex technical solutions. The inspection criteria ranged from simple visual verifications to sophisticated quality assessments requiring specialized approaches for detecting tiny defects, verifying placement accuracy, and evaluating installation completeness. The absence of a standard operating procedure (SOP) for capturing images resulted in variation in angles, lighting, object distance, and background clutter across the dataset, which further complicated the process. Some criteria had abundant training data, while others had limited and imbalanced datasets, making model generalization difficult. Certain installation criteria demanded accurate distance measurements, such as verifying whether components were installed at the correct height or maintaining proper spacing between elements. Traditional computer vision models proved inadequate for these metric-based evaluations without the support of specialized sensors or depth estimation capabilities. The diversity of inspection requirements demanded a sophisticated multi-model approach, because no single computer vision model could adequately address all inspection criteria. An essential step was carefully mapping each inspection criterion to its most appropriate AI model type, ranging from object detection for component presence verification to semantic segmentation for detailed analysis, and incorporating generative AI-based reasoning for complex interpretative tasks.
To address these challenges, Tata Power CoE collaborated with Oneture to create a secure, scalable, and intelligent inspection platform using AWS services. Before technical development, the team conducted extensive field research to understand real-world installation conditions. This approach revealed key operational realities: installations occurred in tight spaces with poor lighting conditions, equipment varied significantly across sites, and image quality was often compromised by environmental factors (demonstrated in the following image). One crucial insight emerged during these field observations: certain inspection requirements, particularly measurements like the gap between inverters and walls, demanded sophisticated spatial analysis capabilities that went beyond basic object detection.

Figure 1: Example image of solar panel components

The solution includes SageMaker AI for training and inference at scale, Amazon SageMaker Ground Truth for data labeling, Amazon Bedrock for image understanding and recommendations, Amazon Rekognition for OCR, and additional AWS services. The following diagram illustrates the solution architecture.

Figure 2: Solution Architecture

Data labeling with Amazon SageMaker Ground Truth
The foundation of accurate AI-powered inspections lies in high-quality training data. To help achieve comprehensive model coverage, the team collected more than 20,000 images, capturing a wide range of real-world scenarios including varying lighting conditions and different hardware conditions. They chose SageMaker Ground Truth as their data labeling solution, using its capabilities to create custom annotation workflows and manage the labeling process efficiently. SageMaker Ground Truth proved instrumental in maintaining data quality through its human-in-the-loop workflow features. Its built-in validation mechanisms, including stratified and random sampling, helped achieve dataset robustness. Tata Power’s quality assurance experts conducted direct reviews of labeled data through the SageMaker Ground Truth interface, providing an additional layer of validation. This meticulous attention to data quality was crucial, because even minor visual misclassifications could potentially trigger incorrect warranty claims or installation rejections.
Model training with Amazon SageMaker AI
To select and train the right model, the team used the comprehensive ML capabilities of SageMaker AI to streamline both experimentation and production deployment. SageMaker AI provided an ideal environment for rapid prototyping—the team could quickly spin up Jupyter Notebook instances, which they used to evaluate various architectures for object detection, pattern classification, OCR, and spatial estimation tasks. Through this experimentation, they selected YOLOv5x6 as their primary model, which proved particularly effective at identifying small solar panel components within high-resolution installation images. The training process, initially spanning 1.5 months, was optimized through parallel experimentation and automated workflows, resulting in streamlined, 2-day iteration cycles. Through more than 100 training jobs, the team uncovered crucial insights that significantly improved model performance. They found that increasing input image resolution enhanced small object detection accuracy, while implementing pre-processing checks for image quality factors like brightness and blurriness helped maintain consistent results. Edge cases were strategically handled by generative AI models, allowing the computer vision models to focus on mainstream scenarios. By analyzing inspection criteria overlap, the team successfully consolidated the original 22 inspection points into 10 efficient models, optimizing both processing time and costs.
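To make the training setup concrete, the following minimal sketch shows how a custom YOLO-style training script could be launched as a SageMaker training job with the SageMaker Python SDK. It is an illustrative example rather than the team's actual configuration; the entry point, role ARN, S3 paths, instance type, and hyperparameters are placeholders.

# Minimal sketch (not the team's actual training setup): launching a custom
# YOLO-style training script as a SageMaker training job.
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # hypothetical role ARN

estimator = PyTorch(
    entry_point="train.py",          # hypothetical script wrapping YOLOv5 training
    source_dir="src",
    role=role,
    instance_type="ml.g5.2xlarge",   # example GPU instance
    instance_count=1,
    framework_version="2.1",
    py_version="py310",
    hyperparameters={"epochs": 50, "img-size": 1280, "batch-size": 16},
    sagemaker_session=session,
)

# Train on labeled images prepared with SageMaker Ground Truth
estimator.fit({"train": "s3://example-bucket/solar-inspection/train"})

Running several such jobs in parallel with different hyperparameters is the kind of parallel experimentation that shortened the iteration cycles described above.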
Amazon SageMaker Pipelines enabled rapid feedback loops from field performance data and seamless incorporation of learnings through a federated learning approach. The team could quickly adjust hyperparameters, fine-tune confidence thresholds, and evaluate model performance using metrics like F1-score and Intersection over Union (IoU), all while maintaining advanced accuracy standards. This streamlined approach transformed a complex, multi-faceted training process into an agile, production-ready solution capable of meeting stringent quality requirements at scale.

Figure 3: F1-Confidence Curve

Model inference at scale with Amazon SageMaker AI
Deploying the model presented unique requirements for Tata Power, particularly when handling high-resolution images captured in remote locations with unreliable network connectivity. While SageMaker AI real-time inference is powerful, it comes with specific limitations that didn’t align with Tata Power’s requirements, such as a 60-second timeout for endpoint invocation and a 6 MB maximum payload size. These constraints could potentially impact the processing of high-resolution inspection images and complex inference logic.
To address these operational constraints, the team implemented SageMaker AI asynchronous inference, which proved to be an ideal solution for their distributed inspection workflow. The ability of asynchronous inference to handle large payload sizes accommodated the high-resolution inspection images without compression, helping to ensure that no detail was lost in the analysis process. The endpoints automatically scaled based on incoming request volume, optimizing both performance and cost efficiency.
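As a rough illustration of this pattern, here is a minimal sketch of deploying a model behind an asynchronous inference endpoint and invoking it with an image that is already in Amazon S3. The model artifact, inference script, bucket names, and instance type are assumptions for the example, not Tata Power's actual configuration.

# Minimal sketch (assumed setup): a SageMaker asynchronous inference endpoint
# invoked with an image already uploaded to Amazon S3.
from sagemaker.pytorch import PyTorchModel
from sagemaker.async_inference import AsyncInferenceConfig

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # hypothetical role ARN

model = PyTorchModel(
    model_data="s3://example-bucket/models/yolo/model.tar.gz",  # hypothetical artifact
    role=role,
    entry_point="inference.py",     # hypothetical inference handler
    framework_version="2.1",
    py_version="py310",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://example-bucket/async-results/"  # results are written back to S3
    ),
)

# The request payload is read from S3, so large high-resolution images are not
# constrained by the real-time 6 MB payload limit.
response = predictor.predict_async(input_path="s3://example-bucket/uploads/site-123/panel.jpg")
print(response.output_path)  # location where the inference result will appear

Because both the request and the response live in S3, this setup sidesteps the real-time payload and timeout limits mentioned above and tolerates unreliable connectivity from remote sites.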
Maintaining model accuracy with SageMaker Pipelines
To help ensure sustained model performance in production, the team implemented an automated retraining system using SageMaker AI. This system continuously monitored model predictions, automatically triggering data collection when confidence scores fell below defined thresholds. This approach to model maintenance helped combat model drift and ensure that the system remained accurate as field conditions evolved. The retraining pipeline, built on SageMaker Pipelines, automated the entire process from data collection to production deployment. When new training data was collected, the pipeline orchestrated a sequence of steps: data validation, model retraining, performance evaluation in a staging environment, and finally, controlled deployment to production through continuous integration and delivery (CI/CD) integration.
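The following is a minimal sketch of what such a retraining pipeline could look like with the SageMaker Python SDK, reduced to a single training step for brevity; the real pipeline described above also includes data validation, evaluation, and controlled deployment steps. The pipeline name, parameter, and S3 path are illustrative assumptions.

# Minimal sketch (assumed structure): a SageMaker Pipeline that retrains the
# detection model on newly collected low-confidence images.
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

train_data = ParameterString(
    name="TrainDataS3Uri",
    default_value="s3://example-bucket/retraining/train",
)

# `estimator` and `role` are the objects defined in the training sketch above
train_step = TrainingStep(
    name="RetrainDetector",
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data=train_data)},
)

pipeline = Pipeline(
    name="solar-inspection-retraining",  # hypothetical pipeline name
    parameters=[train_data],
    steps=[train_step],
)

pipeline.upsert(role_arn=role)
execution = pipeline.start()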
OCR with Amazon Rekognition
While custom machine learning models powered much of Tata Power’s inspection platform, the CoE team recognized that some tasks could be solved more efficiently with Amazon Rekognition, for example reading ohm meter values during inspections, as shown in the following figure.

Figure 4: Ohm Meter

By integrating the OCR capabilities of Amazon Rekognition, the team avoided the time-consuming process of developing and training custom OCR models, while still achieving the advanced accuracy levels required for production use.
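A minimal sketch of this pattern with the Amazon Rekognition DetectText API is shown below; the bucket, object key, and confidence threshold are placeholders for illustration.

# Minimal sketch (assumed usage): reading a meter value from an inspection
# photo stored in S3 with the Amazon Rekognition DetectText API.
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_text(
    Image={"S3Object": {"Bucket": "example-bucket", "Name": "inspections/site-123/meter.jpg"}}
)

# Keep only full detected lines with reasonable confidence
lines = [
    d["DetectedText"]
    for d in response["TextDetections"]
    if d["Type"] == "LINE" and d["Confidence"] > 80
]
print(lines)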
Enhancing the inspection process with Amazon Bedrock
While computer vision models delivered advanced accuracy for most inspection points, they had limitations in specific scenarios involving extremely small objects in the image, variable camera angles, and partially obscured elements. To address these limitations, the team implemented Amazon Bedrock to enhance the inspection process, focusing on six critical criteria that required additional intelligence beyond traditional computer vision. Amazon Bedrock enabled a critical pre-check phase before initiating computer vision inference operations. This pre-inference system evaluates three key image quality parameters: visibility clarity, object obstruction status, and capture angle suitability. When images fail to meet these quality benchmarks, the system automatically triggers one of two response pathways—either flagging the image for immediate recapture or routing it through specialized generative AI reasoning processes. This intelligent pre-screening mechanism optimizes computational efficiency by preventing unnecessary inference cycles on suboptimal images, while helping to ensure high-quality input for accurate inspection results.
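As a simplified illustration of such a pre-check, the sketch below sends an inspection photo to a multimodal model on Amazon Bedrock through the Converse API and asks for a small JSON quality assessment. The model choice, prompt, and JSON schema are assumptions for the example, not the production implementation.

# Minimal sketch (assumed prompt, schema, and model choice): grading image
# quality with a multimodal model on Amazon Bedrock before running CV inference.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

with open("panel.jpg", "rb") as f:
    image_bytes = f.read()

prompt = (
    "Assess this solar installation photo. Return JSON with keys "
    "'visibility' (good|poor), 'obstructed' (true|false), and "
    "'angle_ok' (true|false)."
)

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
            {"text": prompt},
        ],
    }],
)

assessment = response["output"]["message"]["content"][0]["text"]
print(json.loads(assessment))

Depending on the returned flags, the calling code can forward the image to the computer vision models, ask the field engineer to recapture it, or route it to generative AI reasoning.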
To close the loop, Amazon Bedrock Knowledge Bases provides real-time, contextual guidance from internal guideline documents. This automated feedback loop accelerates the inspection cycle and improves installation quality by providing instant, actionable recommendations at the point of inspection.
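A minimal sketch of retrieving such guidance with Amazon Bedrock Knowledge Bases is shown below; the knowledge base ID and query are hypothetical.

# Minimal sketch (hypothetical knowledge base ID): pulling installation guidance
# from Amazon Bedrock Knowledge Bases to attach to a failed inspection check.
import boto3

kb_runtime = boto3.client("bedrock-agent-runtime")

response = kb_runtime.retrieve(
    knowledgeBaseId="EXAMPLEKBID",  # hypothetical knowledge base ID
    retrievalQuery={"text": "minimum clearance between inverter and wall"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 3}},
)

for result in response["retrievalResults"]:
    print(result["content"]["text"][:200])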
The mobile app
The mobile app provides an intuitive interface designed specifically for on-site use, so that engineers can efficiently complete installation inspections through a streamlined workflow. With this app, field engineers can capture installation photos, receive immediate analysis results, and validate AI findings all through a single interface.
Results and impact
The implementation of the AI-powered automated inspection tool delivered measurable improvements across Tata Power’s solar installation operations.

The solution achieves more than 90% AI/ML accuracy across most of the inspection points, with object detection precision of 95%, enabling near real-time feedback to channel partners instead of delayed offline reviews.
Automated quality checks now instantly verify most installations, significantly reducing manual inspection needs. AI model training continues to improve accuracy in detecting missing checkpoints.
Re-inspection rates have dropped by more than 80%. These efficiency gains led to faster site handovers, directly improving customer satisfaction metrics.
The automated system’s ability to provide immediate feedback enhanced channel partner productivity and satisfaction, creating a more streamlined installation process from initial setup to final customer handover.

Conclusion
In this post, we explained how Tata Power CoE, Oneture Technologies, and AWS transformed traditional manual inspection processes into efficient, AI-powered solutions. By using Amazon SageMaker AI, Amazon Bedrock, and Amazon Rekognition, the team successfully automated solar panel installation inspections, achieving more than 90% accuracy while cutting re-inspection rates by 80%. See the following resources to learn more:

Visit the AWS Community to discover how our builder communities are using Amazon SageMaker AI and Amazon Bedrock in their solutions.
Learn more about Amazon SageMaker AI
Learn more about Amazon Bedrock

About the authors

Vikram Bansal is a business-focused technology leader with over 20 years of experience in enterprise architecture and delivery. During the last two decades, he has led multiple strategic digital initiatives and large-scale transformation programs across telecom (OSS/BSS), media and entertainment, and the power and utility sector (energy distribution, renewables). His expertise spans enterprise application modernization, data and analytics platforms, and end-to-end digital transformation delivery.

Gaurav H Kankaria is a passionate technologist and ISB alumnus with nearly a decade of experience in data science, analytics, and the AWS Cloud. As an AWS Partner Ambassador and certified expert across multiple specialties, he is known for simplifying complex cloud concepts and driving impactful AI/ML solutions.

Omkar Dhavalikar is the AI/ML Lead at Oneture Technologies, where he helps enterprises design and implement cost-effective machine learning solutions on AWS. He specializes in crafting innovative, AI-driven approaches to solve complex business problems with speed, scalability, and impact.

Chetan Makvana is an Enterprise Solutions Architect at Amazon Web Services. He helps enterprise customers design scalable, resilient, secure, and cost effective enterprise-grade solutions using AWS services. He is a technology enthusiast and a builder with interests in generative AI, serverless, app modernization, and DevOps.

Unlocking video understanding with TwelveLabs Marengo on Amazon Bedroc …

Media and entertainment, advertising, education, and enterprise training content combines visual, audio, and motion elements to tell stories and convey information, making it far more complex than text where individual words have clear meanings. This creates unique challenges for AI systems that need to understand video content. Video content is multidimensional, combining visual elements (scenes, objects, actions), temporal dynamics (motion, transitions), audio components (dialogue, music, sound effects), and text overlays (subtitles, captions). This complexity creates significant business challenges as organizations struggle to search through video archives, locate specific scenes, categorize content automatically and extract insights from their media assets for effective decision-making.
The TwelveLabs Marengo model addresses this problem with a multi-vector architecture that creates separate embeddings for different content modalities. Instead of forcing all information into one vector, the model generates specialized representations. This approach preserves the rich, multifaceted nature of video data, enabling more accurate analysis across visual, temporal, and audio dimensions.
Amazon Bedrock has expanded its capabilities to support the TwelveLabs Marengo Embed 3.0 model with real-time text and image processing through synchronous inference. With this integration, businesses can implement faster video search functionality using natural language queries, while also supporting interactive product discovery through sophisticated image similarity matching.
In this post, we’ll show how the TwelveLabs Marengo embedding model, available on Amazon Bedrock, enhances video understanding through multimodal AI. We’ll build a video semantic search and analysis solution using embeddings from the Marengo model with Amazon OpenSearch Serverless as the vector database, for semantic search capabilities that go beyond simple metadata matching to deliver intelligent content discovery.
Understanding video embeddings
Embeddings are dense vector representations that capture the semantic meaning of data in a high-dimensional space. Think of them as numerical fingerprints that encode the essence of content in a way machines can understand and compare. For text, embeddings might capture that “king” and “queen” are related concepts, or that “Paris” and “France” have a geographical relationship. For images, embeddings can understand that a golden retriever and labrador are both dogs, even if they look different. The following heat map shows the semantic similarity scores between these sentence fragments: “two people having a conversation,” “a man and a woman talking,” and “cats and dogs are lovely animals.”
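To make this concrete, the following toy sketch compares made-up embedding vectors with cosine similarity, the standard way of scoring how close two embeddings are; real embeddings come from a model such as Marengo and have far more dimensions.

# Toy illustration of comparing embeddings with cosine similarity.
# The three vectors are made-up stand-ins for real sentence embeddings.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_conversation = np.array([0.8, 0.1, 0.3])   # "two people having a conversation"
emb_talking      = np.array([0.7, 0.2, 0.4])   # "a man and a woman talking"
emb_pets         = np.array([0.1, 0.9, 0.2])   # "cats and dogs are lovely animals"

print(cosine_similarity(emb_conversation, emb_talking))  # high: related meanings
print(cosine_similarity(emb_conversation, emb_pets))     # low: unrelated meanings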
Video embeddings challenges
Video presents unique challenges because it’s inherently multimodal:

Visual information: Objects, scenes, people, actions, and visual aesthetics
Audio information: Speech, music, sound effects, and ambient noise
Textual information: Captions, on-screen text, and transcribed speech

Traditional single-vector approaches compress all this rich information into one representation, often losing important nuances. TwelveLabs Marengo takes a different approach that addresses this challenge effectively.
TwelveLabs Marengo: A multimodal embedding model
The Marengo 3.0 model generates multiple specialized vectors, each capturing different aspects of the video content. A typical movie or TV show combines visual and auditory elements to create a unified storytelling experience. Marengo’s multi-vector architecture provides significant advantages for understanding this complex video content. Each vector captures a specific modality, avoiding information loss from compressing diverse data types into single representations. This enables flexible searches targeting specific content aspects—visual-only, audio-only, or combined queries. Specialized vectors deliver superior accuracy in complex multimodal scenarios while maintaining efficient scalability for large enterprise video datasets.
Solution overview: Marengo model capabilities
In the following section, we’ll demonstrate the power of Marengo’s embedding technology through code samples. The examples illustrate how Marengo processes different types of content and delivers exceptional search accuracy. The complete code sample can be found in this GitHub repository.
Prerequisites
Before we begin, verify you have:

An AWS account with appropriate permissions
Access to Amazon Bedrock (with the TwelveLabs Marengo model enabled)
Access to create an OpenSearch Serverless collection and index
Basic familiarity with vector databases and embeddings

Sample video
Netflix Open Content is open source content available under the Creative Commons Attribution 4.0 International license. We will use one of its videos, Meridian, to demonstrate the TwelveLabs Marengo model on Amazon Bedrock.

Create a video embedding
Amazon Bedrock uses an asynchronous API for Marengo video embedding generation. The following Python code snippet shows an example of invoking the API with a video stored in an S3 bucket. Refer to the documentation for the complete supported functionality.

import boto3

bedrock_client = boto3.client("bedrock-runtime")
model_id = "us.twelvelabs.marengo-embed-3-0-v1:0"
video_s3_uri = "<s3 bucket location for the video>"        # Replace with your S3 URI
aws_account_id = "<the AWS account owner for the bucket>"  # Replace with the bucket owner ID
s3_bucket_name = "<s3 bucket name>"                        # Replace with the output S3 bucket name
s3_output_prefix = "<output prefix>"                       # Replace with the output prefix

response = bedrock_client.start_async_invoke(
    modelId=model_id,
    modelInput={
        "inputType": "video",
        "video": {
            "mediaSource": {
                "s3Location": {
                    "uri": video_s3_uri,
                    "bucketOwner": aws_account_id
                }
            }
        }
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": f"s3://{s3_bucket_name}/{s3_output_prefix}"
        }
    }
)

The example above produces 280 individual embeddings from a single video, one for each segment, enabling precise temporal search and analysis. The multi-vector output for the video can contain the following types of embeddings:

[
  {"embedding": [0.053192138671875, ...], "embeddingOption": "visual", "embeddingScope": "clip", "startSec": 0.0, "endSec": 4.3},
  {"embedding": [0.053192138645645, ...], "embeddingOption": "transcription", "embeddingScope": "clip", "startSec": 3.9, "endSec": 6.5},
  {"embedding": [0.3235554443524, ...], "embeddingOption": "audio", "embeddingScope": "clip", "startSec": 4.9, "endSec": 7.5}
]

visual – visual embeddings of the video
transcription – embeddings of the transcribed text
audio – embeddings of the audio in the video

When processing audio or video content, you can set how long each clip segment should be for embedding creation. By default, video clips are automatically divided at natural scene changes (shot boundaries). Audio clips are split into even segments that are as close to 10 seconds as possible—for example, a 50-second audio file becomes 5 segments of 10 seconds each, while a 16-second file becomes 2 segments of 8 seconds each. By default, a single Marengo video embedding API call generates visual, audio, and transcription embeddings. You can also change the default setting to output only specific embedding types. Use the following code snippet to generate embeddings for a video with configurable options using the Amazon Bedrock API:

response = bedrock_client.start_async_invoke(
    modelId=model_id,
    modelInput={
        "inputType": "video",
        "video": {
            "mediaSource": {
                # Provide exactly one of base64String or s3Location
                # "base64String": "base64-encoded string",
                "s3Location": {
                    "uri": "s3://amzn-s3-demo-bucket/video/clip.mp4",
                    "bucketOwner": "123456789012"
                }
            },
            "startSec": 0,
            "endSec": 6,
            "segmentation": {
                # Choose exactly one method: "dynamic" or "fixed"
                "method": "dynamic",
                "dynamic": {
                    "minDurationSec": 4
                }
                # For fixed-length segments, use instead:
                # "method": "fixed",
                # "fixed": {"durationSec": 6}
            },
            "embeddingOption": [
                "visual",
                "audio",
                "transcription"
            ],  # optional, default = all
            "embeddingScope": [
                "clip",
                "asset"
            ]  # optional, one or both
        },
        "inferenceId": "some inference id"
    },
    # Output location for the async job, as in the earlier example
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": f"s3://{s3_bucket_name}/{s3_output_prefix}"
        }
    }
)
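As a quick aside, the even-split behavior for audio described above comes down to simple arithmetic. The helper below is our own sketch of that calculation, not part of the Marengo API.

# Our own sketch of the even-split arithmetic described above, not a Marengo API.
import math

def even_audio_segments(duration_sec: float, target_sec: float = 10.0):
    """Split a duration into equal segments as close to target_sec as possible."""
    n_segments = max(1, math.ceil(duration_sec / target_sec))
    return n_segments, duration_sec / n_segments

print(even_audio_segments(50))  # (5, 10.0) -> five 10-second segments
print(even_audio_segments(16))  # (2, 8.0)  -> two 8-second segments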

Vector database: Amazon OpenSearch Serverless
In our example, we’ll use Amazon OpenSearch Serverless as the vector database for storing the text, image, audio, and video embeddings generated from the given video via the Marengo model. As a vector database, OpenSearch Serverless allows you to quickly find similar content using semantic search without worrying about managing servers or infrastructure. The following code snippet demonstrates how to create an Amazon OpenSearch Serverless collection:

import pprint as pp

collection_name = "video-embeddings"  # example collection name
aoss_client = boto3.Session().client("opensearchserverless")

try:
    collection = aoss_client.create_collection(
        name=collection_name, type="VECTORSEARCH"
    )
    collection_id = collection["createCollectionDetail"]["id"]
    collection_arn = collection["createCollectionDetail"]["arn"]
except aoss_client.exceptions.ConflictException:
    # The collection already exists, so look it up instead
    collection = aoss_client.batch_get_collection(
        names=[collection_name]
    )["collectionDetails"][0]
    pp.pprint(collection)
    collection_id = collection["id"]
    collection_arn = collection["arn"]

Once the OpenSearch Serverless collection is created, we’ll create an index that contains properties, including a vector field:

from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

index_mapping = {
    "settings": {
        "index.knn": True  # enable k-NN search on this index
    },
    "mappings": {
        "properties": {
            "video_id": {"type": "keyword"},
            "segment_id": {"type": "integer"},
            "start_time": {"type": "float"},
            "end_time": {"type": "float"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "cosinesimil"
                }
            },
            # Distinguishes visual, audio, and transcription embeddings
            "embedding_option": {"type": "keyword"},
            "metadata": {"type": "object"}
        }
    }
}

region_name = "us-east-1"                               # Region of the collection
host = "<collection-id>.us-east-1.aoss.amazonaws.com"   # collection endpoint without https://
index_name = "video-embeddings-index"

credentials = boto3.Session().get_credentials()
awsauth = AWSV4SignerAuth(credentials, region_name, "aoss")
oss_client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)
response = oss_client.indices.create(index=index_name, body=index_mapping)

Index Marengo embeddings
The following code snippet demonstrates how to ingest the embedding output from the Marengo model into the OpenSearch index:

import json

documents = []
for i, segment in enumerate(video_embeddings):
    document = {
        "embedding": segment["embedding"],
        "start_time": segment["startSec"],
        "end_time": segment["endSec"],
        "video_id": video_id,
        "segment_id": i,
        "embedding_option": segment.get("embeddingOption", "visual")
    }
    documents.append(document)

# Build the request body in the OpenSearch bulk format
bulk_data = []
for doc in documents:
    bulk_data.append({"index": {"_index": index_name}})
    bulk_data.append(doc)

bulk_body = "\n".join(json.dumps(item) for item in bulk_data) + "\n"
response = oss_client.bulk(body=bulk_body, index=index_name)

Cross-modal semantic search
With Marengo’s multi-vector design you can search across different modalities in ways that are impossible with single-vector models. By creating separate but aligned embeddings for visual, audio, motion, and contextual elements, you can search videos using an input type of your choice. For example, “jazz music playing” returns video clips of musicians performing, jazz audio tracks, and concert hall scenes from one text query.
The following examples showcase Marengo’s exceptional search capabilities across different modalities:
Text search
Here’s a code snippet that demonstrates the cross-modal semantic search capability using text:

text_query = "a person smoking in a room"
modelInput = {
    "inputType": "text",
    "text": {
        "inputText": text_query
    }
}
# Text and image embeddings are generated synchronously with invoke_model
response = bedrock_client.invoke_model(
    modelId="us.twelvelabs.marengo-embed-3-0-v1:0",
    body=json.dumps(modelInput)
)

result = json.loads(response["body"].read())
query_embedding = result["data"][0]["embedding"]

# Search the OpenSearch index
top_k = 5  # number of matching segments to return
search_body = {
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": top_k
            }
        }
    },
    "size": top_k,
    "_source": ["start_time", "end_time", "video_id", "segment_id"]
}

response = oss_client.search(index=index_name, body=search_body)

print(f"\n✅ Found {len(response['hits']['hits'])} matching segments:")

results = []
for hit in response["hits"]["hits"]:
    result = {
        "score": hit["_score"],
        "video_id": hit["_source"]["video_id"],
        "segment_id": hit["_source"]["segment_id"],
        "start_time": hit["_source"]["start_time"],
        "end_time": hit["_source"]["end_time"]
    }
    results.append(result)

The top search result from the text query: “a person smoking in a room” yields the following video clip:

Image search
The following code snippet demonstrates the cross-modal semantic search capability for a given image:

import uuid

s3_image_uri = f"s3://{s3_bucket_name}/{s3_images_path}/{image_path_basename}"
s3_output_prefix = f"{s3_embeddings_path}/{s3_images_path}/{uuid.uuid4()}"

modelInput = {
    "inputType": "image",
    "image": {
        "mediaSource": {
            "s3Location": {
                "uri": s3_image_uri,
                "bucketOwner": aws_account_id
            }
        }
    }
}
response = bedrock_client.invoke_model(
    modelId=cris_model_id,  # cross-Region inference profile ID for Marengo
    body=json.dumps(modelInput),
)

result = json.loads(response["body"].read())
query_embedding = result["data"][0]["embedding"]

# Search the OpenSearch index
search_body = {
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": top_k
            }
        }
    },
    "size": top_k,
    "_source": ["start_time", "end_time", "video_id", "segment_id"]
}

response = oss_client.search(index=index_name, body=search_body)

print(f"\n✅ Found {len(response['hits']['hits'])} matching segments:")
results = []

for hit in response["hits"]["hits"]:
    result = {
        "score": hit["_score"],
        "video_id": hit["_source"]["video_id"],
        "segment_id": hit["_source"]["segment_id"],
        "start_time": hit["_source"]["start_time"],
        "end_time": hit["_source"]["end_time"]
    }
    results.append(result)

The top search result from the image above yields the following video clip:

In addition to semantic searching over the video using text and images, the Marengo model can also search videos using audio embeddings that focus on dialogue and speech. The audio search capabilities help users find videos based on specific speakers, dialogue content, or spoken topics. This creates a comprehensive video search experience that combines text, image, and audio for video understanding.
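One simple way to target the audio side of the index built earlier is to restrict the k-NN search to segments whose embedding_option is audio or transcription. The sketch below assumes the field mapping shown earlier and OpenSearch’s k-NN filter support.

# Sketch: restrict the k-NN search to audio and transcription segments using
# the embedding_option field indexed earlier (assumes k-NN filtering support).
search_body = {
    "size": top_k,
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,  # e.g., embedding of a text query describing dialogue
                "k": top_k,
                "filter": {
                    "terms": {"embedding_option": ["audio", "transcription"]}
                }
            }
        }
    },
    "_source": ["start_time", "end_time", "video_id", "segment_id", "embedding_option"]
}

response = oss_client.search(index=index_name, body=search_body)

The query embedding itself can come from a text query, as in the text search example, so a phrase describing dialogue is matched against what is spoken rather than what is shown.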
Conclusion
The combination of TwelveLabs Marengo and Amazon Bedrock opens up exciting new possibilities for video understanding through its multi-vector, multimodal approach. Throughout this post, we’ve explored practical examples like image-to-video search with temporal precision and detailed text-to-video matching. With just a single Bedrock API call, we transformed one video file into 336 searchable segments that respond to text, visual, and audio queries. These capabilities create opportunities for natural language content discovery, streamlined media asset management, and other applications that can help organizations better understand and utilize their video content at scale.
As video continues to dominate digital experiences, models like Marengo provide a solid foundation for building more intelligent video analysis systems. Check out the sample code and discover how multimodal video understanding can transform your applications.

About the authors
Wei Teh is an Machine Learning Solutions Architect at AWS. He is passionate about helping customers achieve their business objectives using cutting-edge machine learning solutions. Outside of work, he enjoys outdoor activities like camping, fishing, and hiking with his family.
Lana Zhang is a Senior Specialist Solutions Architect for Generative AI at AWS within the Worldwide Specialist Organization. She specializes in AI/ML, with a focus on use cases such as AI voice assistants and multimodal understanding. She works closely with customers across diverse industries, including media and entertainment, gaming, sports, advertising, financial services, and healthcare, to help them transform their business solutions through AI.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

How to Design a Gemini-Powered Self-Correcting Multi-Agent AI System w …

In this tutorial, we explore how we design and run a full agentic AI orchestration pipeline powered by semantic routing, symbolic guardrails, and self-correction loops using Gemini. We walk through how we structure agents, dispatch tasks, enforce constraints, and refine outputs using a clean, modular architecture. As we progress through each snippet, we see how the system intelligently chooses the right agent, validates its output, and improves itself through iterative reflection. Check out the Full Codes here.

import os
import json
import time
import typing
from dataclasses import dataclass, asdict, field
from google import genai
from google.genai import types

API_KEY = os.environ.get("GEMINI_API_KEY", "API Key")
client = genai.Client(api_key=API_KEY)

@dataclass
class AgentMessage:
    source: str
    target: str
    content: str
    metadata: dict
    # default_factory so each message gets its own creation time
    timestamp: float = field(default_factory=time.time)

We set up our core environment by importing essential libraries, defining the API key, and initializing the Gemini client. We also establish the AgentMessage structure, which acts as the shared communication format between agents. Check out the Full Codes here.

class CognitiveEngine:
    @staticmethod
    def generate(prompt: str, system_instruction: str, json_mode: bool = False) -> str:
        config = types.GenerateContentConfig(
            system_instruction=system_instruction,  # pass the role instruction through to Gemini
            temperature=0.1,
            response_mime_type="application/json" if json_mode else "text/plain"
        )
        try:
            response = client.models.generate_content(
                model="gemini-2.0-flash",
                contents=prompt,
                config=config
            )
            return response.text
        except Exception as e:
            raise ConnectionError(f"Gemini API Error: {e}")

class SemanticRouter:
    def __init__(self, agents_registry: dict):
        self.registry = agents_registry

    def route(self, user_query: str) -> str:
        prompt = f"""
        You are a Master Dispatcher. Analyze the user request and map it to the ONE best agent.
        AVAILABLE AGENTS:
        {json.dumps(self.registry, indent=2)}
        USER REQUEST: "{user_query}"
        Return ONLY a JSON object: {{"selected_agent": "agent_name", "reasoning": "brief reason"}}
        """
        response_text = CognitiveEngine.generate(prompt, "You are a routing system.", json_mode=True)
        try:
            decision = json.loads(response_text)
            print(f" [Router] Selected: {decision['selected_agent']} (Reason: {decision['reasoning']})")
            return decision["selected_agent"]
        except (json.JSONDecodeError, KeyError):
            return "general_agent"

We build the cognitive layer using Gemini, allowing us to generate both text and JSON outputs depending on the instruction. We also implement the semantic router, which analyzes queries and selects the most suitable agent. Check out the Full Codes here.

class Agent:
    def __init__(self, name: str, instruction: str):
        self.name = name
        self.instruction = instruction

    def execute(self, message: AgentMessage) -> str:
        return CognitiveEngine.generate(
            prompt=f"Input: {message.content}",
            system_instruction=self.instruction
        )

class Orchestrator:
    def __init__(self):
        self.agents_info = {
            "analyst_bot": "Analyzes data, logic, and math. Returns structured JSON summaries.",
            "creative_bot": "Writes poems, stories, and creative text. Returns plain text.",
            "coder_bot": "Writes Python code snippets."
        }
        self.workers = {
            "analyst_bot": Agent("analyst_bot", "You are a Data Analyst. Output strict JSON."),
            "creative_bot": Agent("creative_bot", "You are a Creative Writer."),
            "coder_bot": Agent("coder_bot", "You are a Python Expert. Return only code.")
        }
        self.router = SemanticRouter(self.agents_info)

We construct the worker agents and the central orchestrator. Each agent receives a clear role, analyst, creative, or coder, and we configure the orchestrator to manage them. As we review this section, we see how we define the agent ecosystem and prepare it for intelligent task delegation. Check out the Full Codes here.

    # The following methods belong to the Orchestrator class defined above
    def validate_constraint(self, content: str, constraint_type: str) -> tuple[bool, str]:
        if constraint_type == "json_only":
            try:
                json.loads(content)
                return True, "Valid JSON"
            except json.JSONDecodeError:
                return False, "Output was not valid JSON."
        if constraint_type == "no_markdown":
            if "```" in content:
                return False, "Output contains Markdown code blocks, which are forbidden."
            return True, "Valid Text"
        return True, "Pass"

    def run_task(self, user_input: str, constraint: str = None, max_retries: int = 2):
        print(f"\n— New Task: {user_input} —")
        target_name = self.router.route(user_input)
        worker = self.workers.get(target_name)
        current_input = user_input
        history = []
        for attempt in range(max_retries + 1):
            try:
                msg = AgentMessage(source="User", target=target_name, content=current_input, metadata={})
                print(f" [Exec] {worker.name} working... (Attempt {attempt + 1})")
                result = worker.execute(msg)
                if constraint:
                    is_valid, error_msg = self.validate_constraint(result, constraint)
                    if not is_valid:
                        print(f" [Guardrail] VIOLATION: {error_msg}")
                        current_input = f"Your previous answer failed a check.\nOriginal Request: {user_input}\nYour Answer: {result}\nError: {error_msg}\nFIX IT immediately."
                        continue
                print(f" [Success] Final Output:\n{result[:100]}...")
                return result
            except Exception as e:
                print(f" [System Error] {e}")
                time.sleep(1)
        print(" [Failed] Max retries reached or self-correction failed.")
        return None

We implement symbolic guardrails and a self-correction loop to enforce constraints like strict JSON or no Markdown. We run iterative refinement whenever outputs violate requirements, allowing our agents to fix their own mistakes. Check out the Full Codes here.

if __name__ == "__main__":
    orchestrator = Orchestrator()
    orchestrator.run_task(
        "Compare the GDP of France and Germany in 2023.",
        constraint="json_only"
    )
    orchestrator.run_task(
        "Write a Python function for Fibonacci numbers.",
        constraint="no_markdown"
    )

We execute two complete scenarios, showcasing routing, agent execution, and constraint validation in action. We run a JSON-enforced analytical task and a coding task with Markdown restrictions to observe the reflexive behavior. 

In conclusion, we now see how multiple components (routing, worker agents, guardrails, and self-correction) come together to create a reliable and intelligent agentic system. We witness how each part contributes to robust task execution, ensuring that outputs remain accurate, aligned, and constraint-aware. As we reflect on the architecture, we recognize how easily we can expand it with new agents, richer constraints, or more advanced reasoning strategies.

Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks.
The post How to Design a Gemini-Powered Self-Correcting Multi-Agent AI System with Semantic Routing, Symbolic Guardrails, and Reflexive Orchestration appeared first on MarkTechPost.