Darwin Gödel Machine: A Self-Improving AI Agent That Evolves Code Using Foundation Models and Real-World Benchmarks

Introduction: The Limits of Traditional AI Systems

Conventional artificial intelligence systems are limited by their static architectures. These models operate within fixed, human-engineered frameworks and cannot autonomously improve after deployment. In contrast, human scientific progress is iterative and cumulative—each advancement builds upon prior insights. Taking inspiration from this model of continuous refinement, AI researchers are now exploring evolutionary and self-reflective techniques that allow machines to improve through code modification and performance feedback.

Darwin Gödel Machine: A Practical Framework for Self-Improving AI

Researchers from Sakana AI, the University of British Columbia, and the Vector Institute have introduced the Darwin Gödel Machine (DGM), a novel self-modifying AI system designed to evolve autonomously. Unlike theoretical constructs like the Gödel Machine, which rely on provable modifications, DGM embraces empirical learning. The system evolves by continuously editing its own code, guided by performance metrics from real-world coding benchmarks such as SWE-bench and Polyglot.

Foundation Models and Evolutionary AI Design

To drive this self-improvement loop, DGM uses frozen foundation models that facilitate code execution and generation. It begins with a base coding agent capable of self-editing, then iteratively modifies it to produce new agent variants. These variants are evaluated and retained in an archive if they demonstrate successful compilation and self-improvement. This open-ended search process mimics biological evolution—preserving diversity and enabling previously suboptimal designs to become the basis for future breakthroughs.
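To make the loop concrete, the following is a toy, self-contained sketch of an archive-based search of this kind; the mutation and scoring functions are placeholders standing in for foundation-model-driven code edits and benchmark evaluation, not DGM's actual implementation.

import random

def self_modify(agent_code: str) -> str:
    # Placeholder: in DGM, a frozen foundation model proposes edits to the agent's own code
    return agent_code + f"\n# variant {random.randint(0, 10**6)}"

def evaluate(agent_code: str) -> float:
    # Placeholder: in DGM, this is the score on coding benchmarks such as SWE-bench
    return random.random()

# The archive starts with a base self-editing coding agent
archive = [{"code": "# base self-editing coding agent", "score": evaluate("# base")}]

for generation in range(20):
    parent = random.choice(archive)             # preserve diversity: any archived agent can be a parent
    child_code = self_modify(parent["code"])
    try:
        compile(child_code, "<agent>", "exec")  # retain only variants that still compile
    except SyntaxError:
        continue
    archive.append({"code": child_code, "score": evaluate(child_code)})

print(max(agent["score"] for agent in archive))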

Benchmark Results: Validating Progress on SWE-bench and Polyglot

DGM was tested on two well-known coding benchmarks:

SWE-bench: Performance improved from 20.0% to 50.0%

Polyglot: Accuracy increased from 14.2% to 30.7%

These results highlight DGM’s ability to evolve its architecture and reasoning strategies without human intervention. The study also compared DGM with simplified variants that lacked self-modification or exploration capabilities, confirming that both elements are critical for sustained performance improvements. Notably, DGM even outperformed hand-tuned systems like Aider in multiple scenarios.

Technical Significance and Limitations

DGM represents a practical reinterpretation of the Gödel Machine by shifting from logical proof to evidence-driven iteration. It treats AI improvement as a search problem—exploring agent architectures through trial and error. While still computationally intensive and not yet on par with expert-tuned closed systems, the framework offers a scalable path toward open-ended AI evolution in software engineering and beyond.

Conclusion: Toward General, Self-Evolving AI Architectures

The Darwin Gödel Machine shows that AI systems can autonomously refine themselves through a cycle of code modification, evaluation, and selection. By integrating foundation models, real-world benchmarks, and evolutionary search principles, DGM demonstrates meaningful performance gains and lays the groundwork for more adaptable AI. While current applications are limited to code generation, future versions could expand to broader domains—moving closer to general-purpose, self-improving AI systems aligned with human goals.

TL;DR

DGM is a self-improving AI framework that evolves coding agents through code modifications and benchmark validation.

It improves performance using frozen foundation models and evolution-inspired techniques.

Outperforms traditional baselines on SWE-bench (50%) and Polyglot (30.7%).

Check out the Paper and GitHub Page for more details.


Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series – Redefining Multilingual Embedding and Ranking Standards

Text embedding and reranking are foundational to modern information retrieval systems, powering applications such as semantic search, recommendation systems, and retrieval-augmented generation (RAG). However, current approaches often face key challenges—particularly in achieving both high multilingual fidelity and task adaptability without relying on proprietary APIs. Existing models frequently fall short in scenarios requiring nuanced semantic understanding across multiple languages or domain-specific tasks like code retrieval and instruction following. Moreover, most open-source models either lack scale or flexibility, while commercial APIs remain costly and closed.

Qwen3-Embedding and Qwen3-Reranker: A New Standard for Open-Source Embedding

Alibaba’s Qwen Team has unveiled the Qwen3-Embedding and Qwen3-Reranker Series—models that set a new benchmark in multilingual text embedding and relevance ranking. Built on the Qwen3 foundation models, the series includes variants in 0.6B, 4B, and 8B parameter sizes and supports a wide range of languages (119 in total), making it one of the most versatile and performant open-source offerings to date. These models are now open-sourced under the Apache 2.0 license on Hugging Face, GitHub, and ModelScope, and are also accessible via Alibaba Cloud APIs.

These models are optimized for use cases such as semantic retrieval, classification, RAG, sentiment analysis, and code search—providing a strong alternative to existing solutions like Gemini Embedding and OpenAI’s embedding APIs.

Technical Architecture

Qwen3-Embedding models adopt a dense transformer-based architecture with causal attention, producing embeddings by extracting the hidden state corresponding to the [EOS] token. Instruction-awareness is a key feature: input queries are formatted as {instruction} {query}<|endoftext|>, enabling task-conditioned embeddings. The reranker models are trained with a binary classification format, judging document-query relevance in an instruction-guided manner using a token likelihood-based scoring function.
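As an illustration of last-token ([EOS]) pooling with instruction-formatted queries, the following is a minimal sketch using Hugging Face transformers; the checkpoint name and usage details are assumptions, so consult the official model cards for the recommended inference code.

import torch
from transformers import AutoTokenizer, AutoModel

model_name = "Qwen/Qwen3-Embedding-0.6B"  # assumed Hugging Face checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed(texts):
    # With left padding, the last position of every sequence is its final real token
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # [batch, seq_len, dim]
    emb = hidden[:, -1]                             # hidden state at the last ([EOS]) token
    return torch.nn.functional.normalize(emb, p=2, dim=1)

# Task-conditioned query formatted as "{instruction} {query}<|endoftext|>"
instruction = "Given a web search query, retrieve relevant passages."
query = f"{instruction} What is text reranking?<|endoftext|>"
passage = "Reranking reorders a list of retrieved documents by their relevance to the query.<|endoftext|>"
q_emb, p_emb = embed([query, passage])
print(float(q_emb @ p_emb))  # cosine similarity, since both vectors are L2-normalized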

The models are trained using a robust multi-stage training pipeline:

Large-scale weak supervision: 150M synthetic training pairs generated using Qwen3-32B, covering retrieval, classification, STS, and bitext mining across languages and tasks.

Supervised fine-tuning: 12M high-quality data pairs are selected using cosine similarity (>0.7) and used to fine-tune performance in downstream applications.

Model merging: Spherical linear interpolation (SLERP) of multiple fine-tuned checkpoints ensures robustness and generalization.

This synthetic data generation pipeline enables control over data quality, language diversity, task difficulty, and more—resulting in a high degree of coverage and relevance in low-resource settings.
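For intuition on the model-merging stage, the following toy sketch shows spherical linear interpolation between two flattened weight vectors; real merges operate checkpoint-wide, tensor by tensor, and this is not the team's actual merging code.

import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    # Angle between the two (normalized) weight vectors
    a = w_a / (w_a.norm() + eps)
    b = w_b / (w_b.norm() + eps)
    omega = torch.acos(torch.clamp(torch.dot(a, b), -1.0, 1.0))
    if omega.abs() < eps:
        return (1 - t) * w_a + t * w_b  # fall back to linear interpolation for near-parallel weights
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * w_a + (torch.sin(t * omega) / so) * w_b

# Toy usage: merge two random "checkpoints" halfway along the spherical path
merged = slerp(torch.randn(10), torch.randn(10), t=0.5)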

Performance Benchmarks and Insights

The Qwen3-Embedding and Qwen3-Reranker series demonstrate strong empirical performance across several multilingual benchmarks.

On MMTEB (216 tasks across 250+ languages), Qwen3-Embedding-8B achieves a mean task score of 70.58, surpassing Gemini and GTE-Qwen2 series.

On MTEB (English v2): Qwen3-Embedding-8B reaches 75.22, outperforming other open models including NV-Embed-v2 and GritLM-7B.

On MTEB-Code: Qwen3-Embedding-8B leads with 80.68, excelling in applications like code retrieval and Stack Overflow QA.

For reranking:

Qwen3-Reranker-0.6B already outperforms Jina and BGE rerankers.

Qwen3-Reranker-8B achieves 81.22 on MTEB-Code and 72.94 on MMTEB-R, marking state-of-the-art performance.

Ablation studies confirm the necessity of each training stage. Removing synthetic pretraining or model merging led to significant performance drops (up to 6 points on MMTEB), emphasizing their contributions.

Conclusion

Alibaba’s Qwen3-Embedding and Qwen3-Reranker Series present a robust, open, and scalable solution to multilingual and instruction-aware semantic representation. With strong empirical results across MTEB, MMTEB, and MTEB-Code, these models bridge the gap between proprietary APIs and open-source accessibility. Their thoughtful training design—leveraging high-quality synthetic data, instruction-tuning, and model merging—positions them as ideal candidates for enterprise applications in search, retrieval, and RAG pipelines. By open-sourcing these models, the Qwen team not only pushes the boundaries of language understanding but also empowers the broader community to innovate on top of a solid foundation.

Check out the Paper, Technical details, Qwen3-Embedding, and Qwen3-Reranker for more details.

Build a serverless audio summarization solution with Amazon Bedrock and Whisper

Recordings of business meetings, interviews, and customer interactions have become essential for preserving important information. However, transcribing and summarizing these recordings manually is often time-consuming and labor-intensive. With the progress in generative AI and automatic speech recognition (ASR), automated solutions have emerged to make this process faster and more efficient.
Protecting personally identifiable information (PII) is a vital aspect of data security, driven by both ethical responsibilities and legal requirements. In this post, we demonstrate how to use the OpenAI Whisper foundation model (FM) Whisper Large V3 Turbo, available in Amazon Bedrock Marketplace (which offers access to over 140 models through a dedicated offering), to produce near real-time transcription. These transcriptions are then processed by Amazon Bedrock for summarization and redaction of sensitive information.
Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, poolside (coming soon), Stability AI, and Amazon Nova through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Additionally, you can use Amazon Bedrock Guardrails to automatically redact sensitive information, including PII, from the transcription summaries to support compliance and data protection needs.
In this post, we walk through an end-to-end architecture that combines a React-based frontend with Amazon Bedrock, AWS Lambda, and AWS Step Functions to orchestrate the workflow, facilitating seamless integration and processing.
Solution overview
The solution highlights the power of integrating serverless technologies with generative AI to automate and scale content processing workflows. The user journey begins with uploading a recording through a React frontend application, hosted on Amazon CloudFront and backed by Amazon Simple Storage Service (Amazon S3) and Amazon API Gateway. When the file is uploaded, it triggers a Step Functions state machine that orchestrates the core processing steps, using AI models and Lambda functions for seamless data flow and transformation. The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

The React application is hosted in an S3 bucket and served to users through CloudFront for fast, global access. API Gateway handles interactions between the frontend and backend services.
Users upload audio or video files directly from the app. These recordings are stored in a designated S3 bucket for processing.
An Amazon EventBridge rule detects the S3 upload event and triggers the Step Functions state machine, initiating the AI-powered processing pipeline.
The state machine performs audio transcription, summarization, and redaction by orchestrating multiple Amazon Bedrock models in sequence. It uses Whisper for transcription, Claude for summarization, and Guardrails to redact sensitive data.
The redacted summary is returned to the frontend application and displayed to the user.

The following diagram illustrates the state machine workflow.

The Step Functions state machine orchestrates a series of tasks to transcribe, summarize, and redact sensitive information from uploaded audio/video recordings:

A Lambda function is triggered to gather input details (for example, Amazon S3 object path, metadata) and prepare the payload for transcription.
The payload is sent to the OpenAI Whisper Large V3 Turbo model through the Amazon Bedrock Marketplace to generate a near real-time transcription of the recording.
The raw transcript is passed to Anthropic’s Claude 3.5 Sonnet through Amazon Bedrock, which produces a concise and coherent summary of the conversation or content.
A second Lambda function validates and forwards the summary to the redaction step.
The summary is processed through Amazon Bedrock Guardrails, which automatically redacts PII and other sensitive data.
The redacted summary is stored or returned to the frontend application through an API, where it is displayed to the user.

Prerequisites
Before you start, make sure that you have the following prerequisites in place:

Before using Amazon Bedrock models, you must request access—a one-time setup step. For this solution, verify that access to Anthropic’s Claude 3.5 Sonnet model is enabled in your Amazon Bedrock account. For instructions, see Access Amazon Bedrock foundation models.
Set up a guardrail to enable PII redaction by configuring filters that block or mask sensitive information. For guidance on configuring filters for additional use cases, see Remove PII from conversations by using sensitive information filters.
Deploy the Whisper Large V3 Turbo model within the Amazon Bedrock Marketplace. This post also offers step-by-step guidance for the deployment.
The AWS Command Line Interface (AWS CLI) should be installed and configured. For instructions, see Installing or updating to the latest version of the AWS CLI.
Node.js 14.x or later should be installed.
The AWS CDK CLI should be installed.
You should have Python 3.8+.

Create a guardrail in the Amazon Bedrock console
For instructions for creating guardrails in Amazon Bedrock, refer to Create a guardrail. For details on detecting and redacting PII, see Remove PII from conversations by using sensitive information filters. Configure your guardrail with the following key settings:

Enable PII detection and handling
Set PII action to Redact
Add the relevant PII types, such as:

Names and identities
Phone numbers
Email addresses
Physical addresses
Financial information
Other sensitive personal information

After you deploy the guardrail, note its Amazon Resource Name (ARN); you will use it later during deployment.
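If you prefer to script the guardrail creation instead of using the console, the following is a hedged boto3 sketch; the guardrail name is illustrative, and the PII type identifiers should be checked against the current Amazon Bedrock Guardrails documentation.

import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_guardrail(
    name="audio-summary-pii-guardrail",  # assumed name
    blockedInputMessaging="Input blocked by guardrail.",
    blockedOutputsMessaging="Output blocked by guardrail.",
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            # ANONYMIZE masks the detected entity instead of blocking the whole response
            {"type": "NAME", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "ANONYMIZE"},
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "ADDRESS", "action": "ANONYMIZE"},
        ]
    },
)
print(response["guardrailArn"])  # note this ARN for the deployment step
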
Deploy the Whisper model
Complete the following steps to deploy the Whisper Large V3 Turbo model:

On the Amazon Bedrock console, choose Model catalog under Foundation models in the navigation pane.
Search for and choose Whisper Large V3 Turbo.
On the options menu (three dots), choose Deploy.

Modify the endpoint name, number of instances, and instance type to suit your specific use case. For this post, we use the default settings.
Modify the Advanced settings section to suit your use case. For this post, we use the default settings.
Choose Deploy.

This creates a new AWS Identity and Access Management (IAM) role and deploys the model.
You can choose Marketplace deployments in the navigation pane, and in the Managed deployments section, you can see the endpoint status as Creating. Wait for the endpoint to finish deployment and the status to change to In Service, then copy the endpoint name; you will use it when deploying the solution infrastructure.

Deploy the solution infrastructure
In the GitHub repo, follow the instructions in the README file to clone the repository, then deploy the frontend and backend infrastructure.
We use the AWS Cloud Development Kit (AWS CDK) to define and deploy the infrastructure. The AWS CDK code deploys the following resources:

React frontend application
Backend infrastructure
S3 buckets for storing uploads and processed results
Step Functions state machine with Lambda functions for audio processing and PII redaction
API Gateway endpoints for handling requests
IAM roles and policies for secure access
CloudFront distribution for hosting the frontend

Implementation deep dive
The backend is composed of a sequence of Lambda functions, each handling a specific stage of the audio processing pipeline:

Upload handler – Receives audio files and stores them in Amazon S3
Transcription with Whisper – Converts speech to text using the Whisper model
Speaker detection – Differentiates and labels individual speakers within the audio
Summarization using Amazon Bedrock – Extracts and summarizes key points from the transcript
PII redaction – Uses Amazon Bedrock Guardrails to remove sensitive information for privacy compliance

Let’s examine some of the key components:
The transcription Lambda function uses the Whisper model to convert audio files to text:
import json
import boto3

# SageMaker runtime client used to call the Whisper endpoint
sagemaker_runtime = boto3.client("sagemaker-runtime")

def transcribe_with_whisper(audio_chunk, endpoint_name):
    # Convert audio to hex string format
    hex_audio = audio_chunk.hex()

    # Create payload for Whisper model
    payload = {
        "audio_input": hex_audio,
        "language": "english",
        "task": "transcribe",
        "top_p": 0.9
    }

    # Invoke the SageMaker endpoint running Whisper
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload)
    )

    # Parse the transcription response
    response_body = json.loads(response["Body"].read().decode("utf-8"))
    transcription_text = response_body["text"]

    return transcription_text

We use Amazon Bedrock to generate concise summaries from the transcriptions:
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def generate_summary(transcription):
    # Format the prompt with the transcription
    prompt = f"{transcription}\n\nGive me the summary, speakers, key discussions, and action items with owners"

    # Call Bedrock for summarization (Claude 3.x models use the Messages API format)
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "temperature": 0.7,
            "top_p": 0.9,
            "messages": [{"role": "user", "content": prompt}]
        })
    )

    # Extract and return the summary text
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]
A critical component of our solution is the automatic redaction of PII. We implemented this using Amazon Bedrock Guardrails to support compliance with privacy regulations:
def apply_guardrail(bedrock_runtime, content, guardrail_id):
    # Format content according to API requirements
    formatted_content = [{"text": {"text": content}}]

    # Call the guardrail API
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion="DRAFT",
        source="OUTPUT",  # Using OUTPUT parameter for proper flow
        content=formatted_content
    )

    # Extract redacted text from response
    if 'action' in response and response['action'] == 'GUARDRAIL_INTERVENED':
        if len(response['outputs']) > 0:
            output = response['outputs'][0]
            if 'text' in output and isinstance(output['text'], str):
                return output['text']

    # Return original content if redaction fails
    return content
When PII is detected, it’s replaced with type indicators (for example, {PHONE} or {EMAIL}), making sure that summaries remain informative while protecting sensitive data.
To manage the complex processing pipeline, we use Step Functions to orchestrate the Lambda functions:
{
  "Comment": "Audio Summarization Workflow",
  "StartAt": "TranscribeAudio",
  "States": {
    "TranscribeAudio": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "WhisperTranscriptionFunction",
        "Payload": {
          "bucket.$": "$.bucket",
          "key.$": "$.key"
        }
      },
      "Next": "IdentifySpeakers"
    },
    "IdentifySpeakers": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "SpeakerIdentificationFunction",
        "Payload": {
          "Transcription.$": "$.Payload"
        }
      },
      "Next": "GenerateSummary"
    },
    "GenerateSummary": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "BedrockSummaryFunction",
        "Payload": {
          "SpeakerIdentification.$": "$.Payload"
        }
      },
      "End": true
    }
  }
}
This workflow makes sure each step completes successfully before proceeding to the next, with automatic error handling and retry logic built in.
Test the solution
After you have successfully completed the deployment, you can use the CloudFront URL to test the solution functionality.

Security considerations
Security is a critical aspect of this solution, and we’ve implemented several best practices to support data protection and compliance:

Sensitive data redaction – Automatically redact PII to protect user privacy.
Fine-grained IAM permissions – Apply the principle of least privilege across AWS services and resources.
Amazon S3 access controls – Use strict bucket policies to limit access to authorized users and roles.
API security – Secure API endpoints using Amazon Cognito for user authentication (optional but recommended).
CloudFront protection – Enforce HTTPS and apply modern TLS protocols to facilitate secure content delivery.
Amazon Bedrock data security – Amazon Bedrock (including Amazon Bedrock Marketplace) protects customer data and does not send data to providers or train using customer data. This makes sure your proprietary information remains secure when using AI capabilities.

Clean up
To prevent unnecessary charges, make sure to delete the resources provisioned for this solution when you’re done:

Delete the Amazon Bedrock guardrail:

On the Amazon Bedrock console, in the navigation menu, choose Guardrails.
Choose your guardrail, then choose Delete.

Delete the Whisper Large V3 Turbo model deployed through the Amazon Bedrock Marketplace:

On the Amazon Bedrock console, choose Marketplace deployments in the navigation pane.
In the Managed deployments section, select the deployed endpoint and choose Delete.

Delete the AWS CDK stack by running the command cdk destroy, which deletes the AWS infrastructure.

Conclusion
This serverless audio summarization solution demonstrates the benefits of combining AWS services to create a sophisticated, secure, and scalable application. By using Amazon Bedrock for AI capabilities, Lambda for serverless processing, and CloudFront for content delivery, we’ve built a solution that can handle large volumes of audio content efficiently while helping you align with security best practices.
The automatic PII redaction feature supports compliance with privacy regulations, making this solution well-suited for regulated industries such as healthcare, finance, and legal services where data security is paramount. To get started, deploy this architecture within your AWS environment to accelerate your audio processing workflows.

About the Authors
Kaiyin Hu is a Senior Solutions Architect for Strategic Accounts at Amazon Web Services, with years of experience across enterprises, startups, and professional services. Currently, she helps customers build cloud solutions and drives GenAI adoption to cloud. Previously, Kaiyin worked in the Smart Home domain, assisting customers in integrating voice and IoT technologies.
Sid Vantair is a Solutions Architect with AWS covering Strategic accounts.  He thrives on resolving complex technical issues to overcome customer hurdles. Outside of work, he cherishes spending time with his family and fostering inquisitiveness in his children.

Implement semantic video search using open source large vision models on Amazon SageMaker and Amazon OpenSearch Serverless

As companies and individual users deal with constantly growing amounts of video content, the ability to perform low-effort search to retrieve videos or video segments using natural language becomes increasingly valuable. Semantic video search offers a powerful solution to this problem, so users can search for relevant video content based on textual queries or descriptions. This approach can be used in a wide range of applications, from personal photo and video libraries to professional video editing, or enterprise-level content discovery and moderation, where it can significantly improve the way we interact with and manage video content.
Large-scale pre-training of computer vision models with self-supervision directly from natural language descriptions of images has made it possible to capture a wide set of visual concepts, while also bypassing the need for labor-intensive manual annotation of training data. After pre-training, natural language can be used to either reference the learned visual concepts or describe new ones, effectively enabling zero-shot transfer to a diverse set of computer vision tasks, such as image classification, retrieval, and semantic analysis.
In this post, we demonstrate how to use large vision models (LVMs) for semantic video search using natural language and image queries. We introduce some use case-specific methods, such as temporal frame smoothing and clustering, to enhance the video search performance. Furthermore, we demonstrate the end-to-end functionality of this approach by using both asynchronous and real-time hosting options on Amazon SageMaker AI to perform video, image, and text processing using publicly available LVMs on the Hugging Face Model Hub. Finally, we use Amazon OpenSearch Serverless with its vector engine for low-latency semantic video search.
About large vision models
In this post, we implement video search capabilities using multimodal LVMs, which integrate textual and visual modalities during the pre-training phase, using techniques such as contrastive multimodal representation learning, Transformer-based multimodal fusion, or multimodal prefix language modeling (for more details, see, Review of Large Vision Models and Visual Prompt Engineering by J. Wang et al.). Such LVMs have recently emerged as foundational building blocks for various computer vision tasks. Owing to their capability to learn a wide variety of visual concepts from massive datasets, these models can effectively solve diverse downstream computer vision tasks across different image distributions without the need for fine-tuning. In this section, we briefly introduce some of the most popular publicly available LVMs (which we also use in the accompanying code sample).
The CLIP (Contrastive Language-Image Pre-training) model, introduced in 2021, represents a significant milestone in the field of computer vision. Trained on a collection of 400 million image-text pairs harvested from the internet, CLIP showcased the remarkable potential of using large-scale natural language supervision for learning rich visual representations. Through extensive evaluations across over 30 computer vision benchmarks, CLIP demonstrated impressive zero-shot transfer capabilities, often matching or even surpassing the performance of fully supervised, task-specific models. For instance, a notable achievement of CLIP is its ability to match the top accuracy of a ResNet-50 model trained on the 1.28 million images from the ImageNet dataset, despite operating in a true zero-shot setting without a need for fine-tuning or other access to labeled examples.
Following the success of CLIP, the open-source initiative OpenCLIP further advanced the state-of-the-art by releasing an open implementation pre-trained on the massive LAION-2B dataset, comprised of 2.3 billion English image-text pairs. This substantial increase in the scale of training data enabled OpenCLIP to achieve even better zero-shot performance across a wide range of computer vision benchmarks, demonstrating further potential of scaling up natural language supervision for learning more expressive and generalizable visual representations.
Finally, the set of SigLIP (Sigmoid Loss for Language-Image Pre-training) models, including one trained on a 10 billion multilingual image-text dataset spanning over 100 languages, further pushed the boundaries of large-scale multimodal learning. The models propose an alternative loss function for the contrastive pre-training scheme employed in CLIP and have shown superior performance in language-image pre-training, outperforming both CLIP and OpenCLIP baselines on a variety of computer vision tasks.
Solution overview
Our approach uses a multimodal LVM to enable efficient video search and retrieval based on both textual and visual queries. The approach can be logically split into an indexing pipeline, which can be carried out offline, and an online video search logic. The following diagram illustrates the pipeline workflows.

The indexing pipeline is responsible for ingesting video files and preprocessing them to construct a searchable index. The process begins by extracting individual frames from the video files. These extracted frames are then passed through an embedding module, which uses the LVM to map each frame into a high-dimensional vector representation containing its semantic information. To account for temporal dynamics and motion information present in the video, a temporal smoothing technique is applied to the frame embeddings. This step makes sure the resulting representations capture the semantic continuity across multiple subsequent video frames, rather than treating each frame independently (also see the results discussed later in this post, or consult the following paper for more details). The temporally smoothed frame embeddings are then ingested into a vector index data structure, which is designed for efficient storage, retrieval, and similarity search operations. This indexed representation of the video frames serves as the foundation for the subsequent search pipeline.
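The following is a minimal sketch of such temporal smoothing, implemented as a moving average over per-frame embeddings; the kernel choice is an assumption, and the accompanying repository may use a different filter.

import numpy as np

def smooth_embeddings(frame_embeddings: np.ndarray, kernel_size: int = 11) -> np.ndarray:
    """frame_embeddings: [num_frames, dim] array of L2-normalized frame embeddings."""
    pad = kernel_size // 2
    # Pad at the edges so the output keeps the same number of frames
    padded = np.pad(frame_embeddings, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(kernel_size) / kernel_size
    smoothed = np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid") for d in range(frame_embeddings.shape[1])],
        axis=1,
    )
    # Re-normalize so cosine similarity scores stay comparable after smoothing
    return smoothed / np.linalg.norm(smoothed, axis=1, keepdims=True)
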
The search pipeline facilitates content-based video retrieval by accepting textual queries or visual queries (images) from users. Textual queries are first embedded into the shared multimodal representation space using the LVM’s text encoding capabilities. Similarly, visual queries (images) are processed through the LVM’s visual encoding branch to obtain their corresponding embeddings.
After the textual or visual queries are embedded, we can build a hybrid query to account for keywords or filter constraints provided by the user (for example, to search only across certain video categories, or to search within a particular video). This hybrid query is then used to retrieve the most relevant frame embeddings based on their conceptual similarity to the query, while adhering to any supplementary keyword constraints.
The retrieved frame embeddings are then subjected to temporal clustering (also see the results later in this post for more details), which aims to group contiguous frames into semantically coherent video segments, thereby returning an entire video sequence (rather than disjointed individual frames).
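A minimal sketch of this temporal clustering step is shown below; the gap threshold and the returned segment format are illustrative.

def cluster_hits(hits, gap=1.0):
    """hits: list of (timestamp_seconds, score) tuples; groups frames within `gap` seconds."""
    clusters = []
    for ts, score in sorted(hits):
        if clusters and ts - clusters[-1]["end"] <= gap:
            seg = clusters[-1]
            seg["end"] = ts
            if score > seg["best_score"]:
                seg["best_score"], seg["key_frame"] = score, ts  # keep the highest-scoring key frame
        else:
            clusters.append({"start": ts, "end": ts, "best_score": score, "key_frame": ts})
    return clusters

print(cluster_hits([(3.0, 0.31), (3.5, 0.35), (4.0, 0.33), (9.0, 0.30)]))
# -> two segments: 3.0-4.0 seconds (key frame at 3.5) and a single-frame segment at 9.0
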
Furthermore, maintaining search diversity and quality is crucial when retrieving content from videos. As mentioned previously, our approach incorporates various methods to enhance search results. For example, during the video indexing phase, the following techniques are employed to control the search results (the parameters of which might need to be tuned to get the best results):

Adjusting the sampling rate, which determines the number of frames embedded from each second of video. Less frequent frame sampling might make sense when working with longer videos, whereas more frequent frame sampling might be needed to catch fast-occurring events.
Modifying the temporal smoothing parameters to, for example, remove inconsistent search hits based on just a single frame hit, or merge repeated frame hits from the same scene.

During the semantic video search phase, you can use the following methods:

Applying temporal clustering as a post-filtering step on the retrieved timestamps to group contiguous frames into semantically coherent video clips (that can be, in principle, directly played back by the end-users). This makes sure the search results maintain temporal context and continuity, avoiding disjointed individual frames.
Setting the search size, which can be effectively combined with temporal clustering. Increasing the search size makes sure the relevant frames are included in the final results, albeit at the cost of higher computational load (see, for example, this guide for more details).

Our approach aims to strike a balance between retrieval quality, diversity, and computational efficiency by employing these techniques during both the indexing and search phases, ultimately enhancing the user experience in semantic video search.
The proposed solution architecture provides efficient semantic video search by using open source LVMs and AWS services. The architecture can be logically divided into two components: an asynchronous video indexing pipeline and online content search logic. The accompanying sample code on GitHub showcases how to build, experiment locally, as well as host and invoke both parts of the workflow using several open source LVMs available on the Hugging Face Model Hub (CLIP, OpenCLIP, and SigLIP). The following diagram illustrates this architecture.

The pipeline for asynchronous video indexing is comprised of the following steps:

The user uploads a video file to an Amazon Simple Storage Service (Amazon S3) bucket, which initiates the indexing process.
The video is sent to a SageMaker asynchronous endpoint for processing. The processing steps involve:

Decoding of frames from the uploaded video file.
Generation of frame embeddings by LVM.
Application of temporal smoothing, accounting for temporal dynamics and motion information present in the video.

The frame embeddings are ingested into an OpenSearch Serverless vector index, designed for efficient storage, retrieval, and similarity search operations.

SageMaker asynchronous inference endpoints are well-suited for handling requests with large payloads, extended processing times, and near real-time latency requirements. This SageMaker capability queues incoming requests and processes them asynchronously, accommodating large payloads and long processing times. Asynchronous inference enables cost optimization by automatically scaling the instance count to zero when there are no requests to process, so computational resources are used only when actively handling requests. This flexibility makes it an ideal choice for applications involving large data volumes, such as video processing, while maintaining responsiveness and efficient resource utilization.
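For reference, submitting a video to such an asynchronous endpoint looks roughly like the following sketch; the endpoint name, input location, and content type are illustrative assumptions rather than the repository's actual values.

import boto3

smr = boto3.client("sagemaker-runtime")

response = smr.invoke_endpoint_async(
    EndpointName="video-indexing-endpoint",                  # assumed endpoint name
    InputLocation="s3://my-bucket/videos/fashion-week.mp4",  # assumed S3 input location
    ContentType="video/mp4",                                 # depends on the inference handler
)
# Processing happens in the background; results are written to the returned S3 location
print(response["OutputLocation"])
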
OpenSearch Serverless is an on-demand serverless version for Amazon OpenSearch Service. We use OpenSearch Serverless as a vector database for storing embeddings generated by the LVM. The index created in the OpenSearch Serverless collection serves as the vector store, enabling efficient storage and rapid similarity-based retrieval of relevant video segments.
The online content search then can be broken down to the following steps:

The user provides a textual prompt or an image (or both) representing the desired content to be searched.
The user prompt is sent to a real-time SageMaker endpoint, which results in the following actions:

An embedding is generated for the text or image query.
The query with embeddings is sent to the OpenSearch vector index, which performs a k-nearest neighbors (k-NN) search to retrieve relevant frame embeddings.
The retrieved frame embeddings undergo temporal clustering.

The final search results, comprising relevant video segments, are returned to the user.

SageMaker real-time inference suits workloads needing real-time, interactive, low-latency responses. Deploying models to SageMaker hosting services provides fully managed inference endpoints with automatic scaling capabilities, providing optimal performance for real-time requirements.
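A real-time search call then looks roughly like the following sketch; the endpoint name and payload schema are assumptions, not the repository's actual request contract.

import json
import boto3

smr = boto3.client("sagemaker-runtime")

resp = smr.invoke_endpoint(
    EndpointName="video-search-endpoint",  # assumed endpoint name
    ContentType="application/json",
    Body=json.dumps({"text": "F1 crews change tyres", "k": 20}),  # illustrative payload
)
print(json.loads(resp["Body"].read()))  # relevant video segments with timestamps and scores
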
Code and environment
This post is accompanied by a sample code on GitHub that provides comprehensive annotations and code to set up the necessary AWS resources, experiment locally with sample video files, and then deploy and run the indexing and search pipelines. The code sample is designed to exemplify best practices when developing ML solutions on SageMaker, such as using configuration files to define flexible inference stack parameters and conducting local tests of the inference artifacts before deploying them to SageMaker endpoints. It also contains guided implementation steps with explanations and reference for configuration parameters. Additionally, the notebook automates the cleanup of all provisioned resources.
Prerequisites
The prerequisite to run the provided code is to have an active AWS account and set up Amazon SageMaker Studio. Refer to Use quick setup for Amazon SageMaker AI to set up SageMaker if you’re a first-time user and then follow the steps to open SageMaker Studio.
Deploy the solution
To start the implementation, clone the repository, open the notebook semantic_video_search_demo.ipynb, and follow the steps in the notebook.
In Section 2 of the notebook, install the required packages and dependencies, define global variables, set up Boto3 clients, and attach required permissions to the SageMaker AWS Identity and Access Management (IAM) role to interact with Amazon S3 and OpenSearch Service from the notebook.
In Section 3, create security components for OpenSearch Serverless (encryption policy, network policy, and data access policy) and then create an OpenSearch Serverless collection. For simplicity, in this proof of concept implementation, we allow public internet access to the OpenSearch Serverless collection resource. However, for production environments, we strongly suggest using private connections between your Virtual Private Cloud (VPC) and OpenSearch Serverless resources through a VPC endpoint. For more details, see Access Amazon OpenSearch Serverless using an interface endpoint (AWS PrivateLink).
In Section 4, import and inspect the config file, and choose an embeddings model for video indexing and corresponding embeddings dimension. In Section 5, create a vector index within the OpenSearch collection you created earlier.
To demonstrate the search results, we also provide references to a few sample videos that you can experiment with in Section 6. In Section 7, you can experiment with the proposed semantic video search approach locally in the notebook, before deploying the inference stacks.
In Sections 8, 9, and 10, we provide code to deploy two SageMaker endpoints: an asynchronous endpoint for video embedding and indexing, and a real-time inference endpoint for video search. After these steps, we also test our deployed semantic video search solution with a few example queries.
Finally, Section 11 contains the code to clean up the created resources to avoid recurring costs.
Results
The solution was evaluated across a diverse range of use cases, including the identification of key moments in sports games, specific outfit pieces or color patterns on fashion runways, and other tasks in full-length films on the fashion industry. Additionally, the solution was tested for detecting action-packed moments like explosions in action movies, identifying when individuals entered video surveillance areas, and extracting specific events such as sports award ceremonies.
For our demonstration, we created a video catalog consisting of the following videos: A Look Back at New York Fashion Week: Men’s, F1 Insights powered by AWS, Amazon Air’s newest aircraft, the A330, is here, and Now Go Build with Werner Vogels – Autonomous Trucking.
To demonstrate the search capability for identifying specific objects across this video catalog, we employed four text prompts and four images. The presented results were obtained using the google/siglip-so400m-patch14-384 model, with temporal clustering enabled and a timestamp filter set to 1 second. Additionally, smoothing was enabled with a kernel size of 11, and the search size was set to 20 (which were found to be good default values for shorter videos). The left column in the subsequent figures specifies the search type, either by image or text, along with the corresponding image name or text prompt used.
The following figure shows the text prompts we used and the corresponding results.

The following figure shows the images we used to perform reverse images search and corresponding search results for each image.

As mentioned, we implemented temporal clustering in the lookup code, allowing for the grouping of frames based on their ordered timestamps. The accompanying notebook with sample code showcases the temporal clustering functionality by displaying (a few frames from) the returned video clip and highlighting the key frame with the highest search score within each group, as illustrated in the following figure. This approach facilitates a convenient presentation of the search results, enabling users to return entire playable video clips (even if not all frames were actually indexed in a vector store).

To showcase the hybrid search capabilities with OpenSearch Service, we present results for the textual prompt “sky,” with all other search parameters set identically to the previous configurations. We demonstrate two distinct cases: an unconstrained semantic search across the entire indexed video catalog, and a search confined to a specific video. The following figure illustrates the results obtained from an unconstrained semantic search query.

We conducted the same search for “sky,” but now confined to trucking videos.

To illustrate the effects of temporal smoothing, we generated search signal score charts (based on cosine similarity) for the prompt F1 crews change tyres in the formulaone video, both with and without temporal smoothing. We set a threshold of 0.315 for illustration purposes and highlighted video segments with scores exceeding this threshold. Without temporal smoothing (see the following figure), we observed two adjacent episodes around t=35 seconds and two additional episodes after t=65 seconds. Notably, the third and fourth episodes were significantly shorter than the first two, despite exhibiting higher scores. However, we can do better, if our objective is to prioritize longer semantically cohesive video episodes in the search.

To address this, we apply temporal smoothing. As shown in the following figure, now the first two episodes appear to be merged into a single, extended episode with the highest score. The third episode experienced a slight score reduction, and the fourth episode became irrelevant due to its brevity. Temporal smoothing facilitated the prioritization of longer and more coherent video moments associated with the search query by consolidating adjacent high-scoring segments and suppressing isolated, transient occurrences.

Clean up
To clean up the resources created as part of this solution, refer to the cleanup section in the provided notebook and execute the cells in this section. This will delete the created IAM policies, OpenSearch Serverless resources, and SageMaker endpoints to avoid recurring charges.
Limitations
Throughout our work on this project, we also identified several potential limitations that could be addressed through future work:

Video quality and resolution might impact search performance, because blurred or low-resolution videos can make it challenging for the model to accurately identify objects and intricate details.
Small objects within videos, such as a hockey puck or a football, might be difficult for LVMs to consistently recognize due to their diminutive size and visibility constraints.
LVMs might struggle to comprehend scenes that represent a temporally prolonged contextual situation, such as detecting a point-winning shot in tennis or a car overtaking another vehicle.
Accurate automatic measurement of solution performance is hindered without the availability of manually labeled ground truth data for comparison and evaluation.

Summary
In this post, we demonstrated the advantages of the zero-shot approach to implementing semantic video search using either text prompts or images as input. This approach readily adapts to diverse use cases without the need for retraining or fine-tuning models specifically for video search tasks. Additionally, we introduced techniques such as temporal smoothing and temporal clustering, which significantly enhance the quality and coherence of video search results.
The proposed architecture is designed to facilitate a cost-effective production environment with minimal effort, eliminating the requirement for extensive expertise in machine learning. Furthermore, the current architecture seamlessly accommodates the integration of open source LVMs, enabling the implementation of custom preprocessing or postprocessing logic during both the indexing and search phases. This flexibility is made possible by using SageMaker asynchronous and real-time deployment options, providing a powerful and versatile solution.
You can implement semantic video search using different approaches or AWS services. For related content, refer to the following AWS blog posts as examples on semantic search using proprietary ML models: Implement serverless semantic search of image and live video with Amazon Titan Multimodal Embeddings or Build multimodal search with Amazon OpenSearch Service.

About the Authors
Dr. Alexander Arzhanov is an AI/ML Specialist Solutions Architect based in Frankfurt, Germany. He helps AWS customers design and deploy their ML solutions across the EMEA region. Prior to joining AWS, Alexander was researching origins of heavy elements in our universe and grew passionate about ML after using it in his large-scale scientific calculations.
Dr. Ivan Sosnovik is an Applied Scientist in the AWS Machine Learning Solutions Lab. He develops ML solutions to help customers to achieve their business goals.
Nikita Bubentsov is a Cloud Sales Representative based in Munich, Germany, and part of Technical Field Community (TFC) in computer vision and machine learning. He helps enterprise customers drive business value by adopting cloud solutions and supports AWS EMEA organizations in the computer vision area. Nikita is passionate about computer vision and the future potential that it holds.

Multi-account support for Amazon SageMaker HyperPod task governance

GPUs are a precious resource; they are both in short supply and much more costly than traditional CPUs. They are also highly adaptable to many different use cases. Organizations building or adopting generative AI use GPUs to run simulations, run inference (both for internal or external usage), build agentic workloads, and run data scientists’ experiments. The workloads range from ephemeral single-GPU experiments run by scientists to long multi-node continuous pre-training runs. Many organizations need to share a centralized, high-performance GPU computing infrastructure across different teams, business units, or accounts within their organization. With this infrastructure, they can maximize the utilization of expensive accelerated computing resources like GPUs, rather than having siloed infrastructure that might be underutilized. Organizations also use multiple AWS accounts for their users. Larger enterprises might want to separate different business units, teams, or environments (production, staging, development) into different AWS accounts. This provides more granular control and isolation between these different parts of the organization. It also makes it straightforward to track and allocate cloud costs to the appropriate teams or business units for better financial oversight.
The specific reasons and setup can vary depending on the size, structure, and requirements of the enterprise. But in general, a multi-account strategy provides greater flexibility, security, and manageability for large-scale cloud deployments. In this post, we discuss how an enterprise with multiple accounts can access a shared Amazon SageMaker HyperPod cluster for running their heterogenous workloads. We use SageMaker HyperPod task governance to enable this feature.
Solution overview
SageMaker HyperPod task governance streamlines resource allocation and provides cluster administrators the capability to set up policies to maximize compute utilization in a cluster. Task governance can be used to create distinct teams with their own unique namespace, compute quotas, and borrowing limits. In a multi-account setting, you can restrict which accounts have access to which team’s compute quota using role-based access control.
In this post, we describe the settings required to set up multi-account access for SageMaker HyperPod clusters orchestrated by Amazon Elastic Kubernetes Service (Amazon EKS) and how to use SageMaker HyperPod task governance to allocate accelerated compute to multiple teams in different accounts.
The following diagram illustrates the solution architecture.

In this architecture, one organization is splitting resources across a few accounts. Account A hosts the SageMaker HyperPod cluster. Account B is where the data scientists reside. Account C is where the data is prepared and stored for training usage. In the following sections, we demonstrate how to set up multi-account access so that data scientists in Account B can train a model on Account A’s SageMaker HyperPod and EKS cluster, using the preprocessed data stored in Account C. We break down this setup in two sections: cross-account access for data scientists and cross-account access for prepared data.
Cross-account access for data scientists
When you create a compute allocation with SageMaker HyperPod task governance, your EKS cluster creates a unique Kubernetes namespace per team. For this walkthrough, we create an AWS Identity and Access Management (IAM) role per team, called a cluster access role, which is scoped to access only that team’s task governance-generated namespace in the shared EKS cluster. Role-based access control is how we make sure the data science members of Team A will not be able to submit tasks on behalf of Team B.
To access Account A’s EKS cluster as a user in Account B, you will need to assume a cluster access role in Account A. The cluster access role will have only the needed permissions for data scientists to access the EKS cluster. For an example of IAM roles for data scientists using SageMaker HyperPod, see IAM users for scientists.
Next, you will need to assume the cluster access role from a role in Account B, referred to as the data scientist role. The cluster access role in Account A will then need a trust policy that allows the data scientist role in Account B to assume it. The following code is an example of the policy statement for the data scientist role so that it can assume the cluster access role in Account A:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::XXXXXXXXXXAAA:role/ClusterAccessRole"
    }
  ]
}

The following code is an example of the trust policy for the cluster access role so that it allows the data scientist role to assume it:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::XXXXXXXXXXBBB:role/DataScientistRole"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The final step is to create an access entry for the team’s cluster access role in the EKS cluster. This access entry should also have an access policy, such as EKSEditPolicy, that is scoped to the namespace of the team. This makes sure that Team A users in Account B can’t launch tasks outside of their assigned namespace. You can also optionally set up custom role-based access control; see Setting up Kubernetes role-based access control for more information.
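As a hedged illustration, the access entry and the namespace-scoped policy association can be created with boto3 as follows; the cluster, role, and namespace names are placeholders.

import boto3

eks = boto3.client("eks")

# Register the team's cluster access role as an access entry on the shared EKS cluster
eks.create_access_entry(
    clusterName="hyperpod-eks-cluster",  # assumed cluster name
    principalArn="arn:aws:iam::XXXXXXXXXXAAA:role/ClusterAccessRole",
)

# Scope an edit policy to the team's task governance-generated namespace only
eks.associate_access_policy(
    clusterName="hyperpod-eks-cluster",
    principalArn="arn:aws:iam::XXXXXXXXXXAAA:role/ClusterAccessRole",
    policyArn="arn:aws:eks::aws:cluster-access-policy/AmazonEKSEditPolicy",
    accessScope={"type": "namespace", "namespaces": ["hyperpod-ns-team-a"]},
)
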
For users in Account B, you can repeat the same setup for each team. You must create a unique cluster access role for each team to align the access role for the team with their associated namespace. To summarize, we use two different IAM roles:

Data scientist role – The role in Account B used to assume the cluster access role in Account A. This role just needs to be able to assume the cluster access role.
Cluster access role – The role in Account A used to give access to the EKS cluster. For an example, see IAM role for SageMaker HyperPod.

Cross-account access to prepared data
In this section, we demonstrate how to set up EKS Pod Identity and S3 Access Points so that pods running training tasks in Account A’s EKS cluster have access to data stored in Account C. EKS Pod Identity allows you to map an IAM role to a service account in a namespace. If a pod uses the service account that has this association, Amazon EKS sets the necessary credential environment variables in the containers of the pod.
S3 Access Points are named network endpoints that simplify data access for shared datasets in S3 buckets. They grant fine-grained access to specific users or applications accessing a shared dataset within an S3 bucket, without requiring those users or applications to have full access to the entire bucket. Permissions to the access point are granted through S3 access point policies. Each S3 Access Point is configured with an access policy specific to a use case or application. Because the HyperPod cluster in this blog post can be used by multiple teams, each team could have its own S3 access point and access point policy.
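Once permissions are in place, the access point ARN can be used wherever a bucket name is expected in the AWS SDK. The following is a minimal sketch of reading training data through the access point from a pod running under the pod-identity-mapped service account; the ARN and prefix are illustrative.

import boto3

s3 = boto3.client("s3")  # credentials are supplied at runtime by EKS Pod Identity
access_point_arn = "arn:aws:s3:us-east-1:<Account-C-ID>:accesspoint/<Access-Point-Name>"

# The access point ARN is passed in place of a bucket name
objects = s3.list_objects_v2(Bucket=access_point_arn, Prefix="training-data/")
for obj in objects.get("Contents", []):
    filename = obj["Key"].split("/")[-1]
    s3.download_file(access_point_arn, obj["Key"], f"/data/{filename}")
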
Before following these steps, ensure you have the EKS Pod Identity Add-on installed on your EKS cluster.

In Account A, create an IAM Role that contains S3 permissions (such as s3:ListBucket and s3:GetObject to the access point resource) and has a trust relationship with Pod Identity; this will be your Data Access Role. Below is an example of a trust policy.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession"
      ]
    }
  ]
}

In Account C, create an S3 access point by following the steps here.
Next, configure your S3 access point to allow access to the role created in step 1. The following is an example access point policy that gives the data access role in Account A permission to use the access point in Account C.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<Account-A-ID>:role/<Data-Access-Role-Name>"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:<Region>:<Account-C-ID>:accesspoint/<Access-Point-Name>",
        "arn:aws:s3:<Region>:<Account-C-ID>:accesspoint/<Access-Point-Name>/object/*"
      ]
    }
  ]
}

Ensure your S3 bucket policy is updated to allow Account A access. This is an example S3 bucket policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ],
      "Condition": {
        "StringEquals": {
          "s3:DataAccessPointAccount": "<Account-C-ID>"
        }
      }
    }
  ]
}

In Account A, create a pod identity association for your EKS cluster using the AWS CLI.

aws eks create-pod-identity-association \
  --cluster-name <EKS-Cluster-Name> \
  --role-arn arn:aws:iam::<Account-A-ID>:role/<Data-Access-Role-Name> \
  --namespace hyperpod-ns-eng \
  --service-account my-service-account

Pods accessing cross-account S3 buckets will need the service account name referenced in their pod specification.

You can test cross-account data access by spinning up a test pod and then executing into the pod to run Amazon S3 commands:

kubectl exec -it aws-test -n hyperpod-ns-team-a -- aws s3 ls s3://<access-point>

This example shows creating a single data access role for a single team. For multiple teams, use a namespace-specific ServiceAccount with its own data access role to help prevent overlapping resource access across teams. You can also configure cross-account Amazon S3 access for an Amazon FSx for Lustre file system in Account A, as described in Use Amazon FSx for Lustre to share Amazon S3 data across accounts. FSx for Lustre and Amazon S3 will need to be in the same AWS Region, and the FSx for Lustre file system will need to be in the same Availability Zone as your SageMaker HyperPod cluster.
Conclusion
In this post, we provided guidance on how to set up cross-account access to data scientists accessing a centralized SageMaker HyperPod cluster orchestrated by Amazon EKS. In addition, we covered how to provide Amazon S3 data access from one account to an EKS cluster in another account. With SageMaker HyperPod task governance, you can restrict access and compute allocation to specific teams. This architecture can be used at scale by organizations wanting to share a large compute cluster across accounts within their organization. To get started with SageMaker HyperPod task governance, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and SageMaker HyperPod task governance documentation.

About the Authors

Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
Kareem Syed-Mohammed is a Product Manager at AWS. He is focused on compute optimization and cost governance. Prior to this, at Amazon QuickSight, he led embedded analytics, and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads for Expedia, and management consultant at McKinsey.
Rajesh Ramchander is a Principal ML Engineer in Professional Services at AWS. He helps customers at various stages in their AI/ML and GenAI journey, from those that are just getting started all the way to those that are leading their business with an AI-first strategy.

A Step-by-Step Coding Guide to Building an Iterative AI Workflow Agent …

In this tutorial, we demonstrate how to build a multi-step, intelligent query-handling agent using LangGraph and Gemini 1.5 Flash. The core idea is to structure AI reasoning as a stateful workflow, where an incoming query is passed through a series of purposeful nodes: routing, analysis, research, response generation, and validation. Each node operates as a functional block with a well-defined role, making the agent not just reactive but analytically aware. Using LangGraph’s StateGraph, we orchestrate these nodes to create a looping system that can re-analyze and improve its output until the response is validated as complete or a max iteration threshold is reached.

!pip install langgraph langchain-google-genai python-dotenv

First, the command !pip install langgraph langchain-google-genai python-dotenv installs three Python packages essential for building intelligent agent workflows. langgraph enables graph-based orchestration of AI agents, langchain-google-genai provides integration with Google’s Gemini models, and python-dotenv allows secure loading of environment variables from .env files.

import os
from typing import Dict, Any, List
from dataclasses import dataclass
from langgraph.graph import Graph, StateGraph, END
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.schema import HumanMessage, SystemMessage
import json

os.environ["GOOGLE_API_KEY"] = "Use Your API Key Here"

We import essential modules and libraries for building agent workflows, including ChatGoogleGenerativeAI for interacting with Gemini models and StateGraph for managing conversational state. The line os.environ["GOOGLE_API_KEY"] = "Use Your API Key Here" assigns the API key to an environment variable, allowing the Gemini model to authenticate and generate responses.

@dataclass
class AgentState:
    """State shared across all nodes in the graph"""
    query: str = ""
    context: str = ""
    analysis: str = ""
    response: str = ""
    next_action: str = ""
    iteration: int = 0
    max_iterations: int = 3

Check out the Notebook here

This AgentState dataclass defines the shared state that persists across different nodes in a LangGraph workflow. It tracks key fields, including the user’s query, retrieved context, any analysis performed, the generated response, and the recommended next action. It also includes an iteration counter and a max_iterations limit to control how many times the workflow can loop, enabling iterative reasoning or decision-making by the agent.

class GraphAIAgent:
    def __init__(self, api_key: str = None):
        if api_key:
            os.environ["GOOGLE_API_KEY"] = api_key

        self.llm = ChatGoogleGenerativeAI(
            model="gemini-1.5-flash",
            temperature=0.7,
            convert_system_message_to_human=True
        )

        self.analyzer = ChatGoogleGenerativeAI(
            model="gemini-1.5-flash",
            temperature=0.3,
            convert_system_message_to_human=True
        )

        self.graph = self._build_graph()

    def _build_graph(self) -> StateGraph:
        """Build the LangGraph workflow"""
        workflow = StateGraph(AgentState)

        workflow.add_node("router", self._router_node)
        workflow.add_node("analyzer", self._analyzer_node)
        workflow.add_node("researcher", self._researcher_node)
        workflow.add_node("responder", self._responder_node)
        workflow.add_node("validator", self._validator_node)

        workflow.set_entry_point("router")
        workflow.add_edge("router", "analyzer")
        workflow.add_conditional_edges(
            "analyzer",
            self._decide_next_step,
            {
                "research": "researcher",
                "respond": "responder"
            }
        )
        workflow.add_edge("researcher", "responder")
        workflow.add_edge("responder", "validator")
        workflow.add_conditional_edges(
            "validator",
            self._should_continue,
            {
                "continue": "analyzer",
                "end": END
            }
        )

        return workflow.compile()

    def _router_node(self, state: AgentState) -> Dict[str, Any]:
        """Route and categorize the incoming query"""
        system_msg = """You are a query router. Analyze the user's query and provide context.
        Determine if this is a factual question, creative request, problem-solving task, or analysis."""

        messages = [
            SystemMessage(content=system_msg),
            HumanMessage(content=f"Query: {state.query}")
        ]

        response = self.llm.invoke(messages)

        return {
            "context": response.content,
            "iteration": state.iteration + 1
        }

    def _analyzer_node(self, state: AgentState) -> Dict[str, Any]:
        """Analyze the query and determine the approach"""
        system_msg = """Analyze the query and context. Determine if additional research is needed
        or if you can provide a direct response. Be thorough in your analysis."""

        messages = [
            SystemMessage(content=system_msg),
            HumanMessage(content=f"""
            Query: {state.query}
            Context: {state.context}
            Previous Analysis: {state.analysis}
            """)
        ]

        response = self.analyzer.invoke(messages)
        analysis = response.content

        if "research" in analysis.lower() or "more information" in analysis.lower():
            next_action = "research"
        else:
            next_action = "respond"

        return {
            "analysis": analysis,
            "next_action": next_action
        }

    def _researcher_node(self, state: AgentState) -> Dict[str, Any]:
        """Conduct additional research or information gathering"""
        system_msg = """You are a research assistant. Based on the analysis, gather relevant
        information and insights to help answer the query comprehensively."""

        messages = [
            SystemMessage(content=system_msg),
            HumanMessage(content=f"""
            Query: {state.query}
            Analysis: {state.analysis}
            Research focus: Provide detailed information relevant to the query.
            """)
        ]

        response = self.llm.invoke(messages)

        updated_context = f"{state.context}\n\nResearch: {response.content}"

        return {"context": updated_context}

    def _responder_node(self, state: AgentState) -> Dict[str, Any]:
        """Generate the final response"""
        system_msg = """You are a helpful AI assistant. Provide a comprehensive, accurate,
        and well-structured response based on the analysis and context provided."""

        messages = [
            SystemMessage(content=system_msg),
            HumanMessage(content=f"""
            Query: {state.query}
            Context: {state.context}
            Analysis: {state.analysis}

            Provide a complete and helpful response.
            """)
        ]

        response = self.llm.invoke(messages)

        return {"response": response.content}

    def _validator_node(self, state: AgentState) -> Dict[str, Any]:
        """Validate the response quality and completeness"""
        system_msg = """Evaluate if the response adequately answers the query.
        Return 'COMPLETE' if satisfactory, or 'NEEDS_IMPROVEMENT' if more work is needed."""

        messages = [
            SystemMessage(content=system_msg),
            HumanMessage(content=f"""
            Original Query: {state.query}
            Response: {state.response}

            Is this response complete and satisfactory?
            """)
        ]

        response = self.analyzer.invoke(messages)
        validation = response.content

        return {"context": f"{state.context}\n\nValidation: {validation}"}

    def _decide_next_step(self, state: AgentState) -> str:
        """Decide whether to research or respond directly"""
        return state.next_action

    def _should_continue(self, state: AgentState) -> str:
        """Decide whether to continue iterating or end"""
        if state.iteration >= state.max_iterations:
            return "end"
        if "COMPLETE" in state.context:
            return "end"
        if "NEEDS_IMPROVEMENT" in state.context:
            return "continue"
        return "end"

    def run(self, query: str) -> str:
        """Run the agent with a query"""
        initial_state = AgentState(query=query)
        result = self.graph.invoke(initial_state)
        return result["response"]

Check out the Notebook here

The GraphAIAgent class defines a LangGraph-based AI workflow using Gemini models to iteratively analyze, research, respond, and validate answers to user queries. It utilizes modular nodes, such as router, analyzer, researcher, responder, and validator, to reason through complex tasks, refining responses through controlled iterations.

def main():
    agent = GraphAIAgent("Use Your API Key Here")

    test_queries = [
        "Explain quantum computing and its applications",
        "What are the best practices for machine learning model deployment?",
        "Create a story about a robot learning to paint"
    ]

    print(" Graph AI Agent with LangGraph and Gemini")
    print("=" * 50)

    for i, query in enumerate(test_queries, 1):
        print(f"\n Query {i}: {query}")
        print("-" * 30)

        try:
            response = agent.run(query)
            print(f" Response: {response}")
        except Exception as e:
            print(f" Error: {str(e)}")

        print("\n" + "=" * 50)

if __name__ == "__main__":
    main()

Finally, the main() function initializes the GraphAIAgent with a Gemini API key and runs it on a set of test queries covering technical, strategic, and creative tasks. It prints each query and the AI-generated response, showcasing how the LangGraph-driven agent processes diverse types of input using Gemini’s reasoning and generation capabilities.

In conclusion, by combining LangGraph’s structured state machine with the power of Gemini’s conversational intelligence, this agent represents a new paradigm in AI workflow engineering, one that mirrors human reasoning cycles of inquiry, analysis, and validation. The tutorial provides a modular and extensible template for developing advanced AI agents that can autonomously handle various tasks, ranging from answering complex queries to generating creative content.

Check out the Notebook here. All credit for this research goes to the researchers of this project.

The post A Step-by-Step Coding Guide to Building an Iterative AI Workflow Agent Using LangGraph and Gemini appeared first on MarkTechPost.

From Clicking to Reasoning: WebChoreArena Benchmark Challenges Agents …

Web automation agents have become a growing focus in artificial intelligence, particularly due to their ability to execute human-like actions in digital environments. These agents interact with websites via Graphical User Interfaces (GUIs), mimicking human behaviors such as clicking, typing, and navigating across web pages. This approach bypasses the need for dedicated Application Programming Interfaces (APIs), which are often unavailable or limited in many web applications. Instead, these agents can operate universally across web domains, making them flexible tools for a broad range of tasks. The evolution of large language models (LLMs) has enabled these agents to not only interpret web content but also reason, plan, and act with increasing sophistication. As their abilities grow, so too does the need to evaluate them on more than just simple browsing tasks. Benchmarks that once sufficed for early models are no longer capable of measuring the full extent of modern agents’ capabilities.

As these web agents progress, a pressing issue arises: their competence in handling mundane, memory-intensive, and multi-step digital chores remains insufficiently measured. Many tasks that humans perform on websites, such as retrieving data from different pages, performing calculations based on previous inputs, or applying complex rules, require significant cognitive effort. These are not merely navigation challenges; they test memory, logic, and long-term planning. Yet most benchmarks focus on simplified scenarios, failing to reflect the types of digital chores people often prefer to avoid. Furthermore, the limitations in these benchmarks become more apparent as agents improve their performance. Ambiguities in task instructions or inconsistencies in expected outputs begin to skew evaluations. When agents generate reasonable but slightly divergent answers, they are penalized incorrectly due to vague task definitions. Such flaws make it difficult to distinguish between true model limitations and benchmark shortcomings.

Previous efforts to evaluate web agents have focused on benchmarks such as WebArena. WebArena gained widespread adoption due to its reproducibility and ability to simulate real-world websites, including Reddit, GitLab, and E-Commerce Platforms. It offered over 800 tasks designed to test an agent’s ability to complete web-based goals within these environments. However, these tasks mostly focused on general browsing and did not adequately challenge more advanced agents. Other benchmarks, such as Mind2Web, GAIA, and MMIn, contributed by exploring real web tasks or platform-specific environments like ServiceNow, but each came with trade-offs. Some lacked interactivity, others did not support reproducibility, and some were too narrowly scoped. These limitations created a gap in measuring agent progress in areas that require complex decision-making, long-term memory, and accurate data processing across multiple webpages.

Researchers from the University of Tokyo introduced WebChoreArena. This expanded framework builds upon the structure of WebArena but significantly increases task difficulty and complexity. WebChoreArena features a total of 532 newly curated tasks, distributed across the same four simulated websites. These tasks are designed to be more demanding, reflecting scenarios where agents must engage in tasks like data aggregation, memory recall, and multi-step reasoning. Importantly, the benchmark was constructed to ensure full reproducibility and standardization, enabling fair comparisons between agents and avoiding the ambiguities found in earlier tools. The inclusion of diverse task types and input modalities helps simulate realistic web usage and evaluates agents on a more practical and challenging scale.

WebChoreArena categorizes its tasks into four main types. One hundred seventeen tasks fall under Massive Memory, requiring agents to extract and remember large volumes of information, such as compiling all customer names linked to high-value transactions. Calculation tasks, which include 132 entries, involve arithmetic operations like identifying the highest spending months based on multiple data points. Long-Term Memory tasks number 127 and test the agent’s ability to connect information across various pages, such as retrieving pricing rules from one site and applying them on another. An additional 65 tasks are categorized as ‘Others’, including operations such as assigning labels in GitLab that do not fit traditional task formats. Each task specifies its input modality, with 451 tasks solvable with any observation type, 69 requiring only textual input, and 12 dependent exclusively on image inputs.

In evaluating the benchmark, the researchers used three prominent large language models: GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro. These were tested in conjunction with two advanced web agents, AgentOccam and BrowserGym. The results highlighted the increased difficulty of WebChoreArena compared to previous benchmarks. GPT-4o, which had achieved 42.8% accuracy on WebArena, managed only 6.8% on WebChoreArena. Claude 3.7 Sonnet and Gemini 2.5 Pro performed better, with Gemini reaching a peak accuracy of 44.9%. Despite being the top performer, this result still reflected significant gaps in capability when dealing with the more complex tasks of WebChoreArena. The benchmark also proved more sensitive in detecting performance differences between models, making it a valuable tool for benchmarking ongoing advances in web agent technologies.

Several Key Takeaways from the research include:

WebChoreArena includes 532 tasks: 117 Massive Memory, 132 Calculation, 127 Long-Term Memory, and 65 Others.

Tasks are distributed across Shopping (117), Shopping Admin (132), Reddit (91), GitLab (127), and 65 Cross-site scenarios.

Input types: 451 tasks are solvable with any input, 69 require textual input, and 12 need image input.

GPT-4o scored only 6.8% on WebChoreArena compared to 42.8% on WebArena.

Gemini 2.5 Pro achieved the highest score at 44.9%, indicating current limitations in handling complex tasks.

WebChoreArena provides a clearer performance gradient between models than WebArena, enhancing benchmarking value.

A total of 117 task templates were used to ensure diversity and reproducibility, with roughly 4.5 task instances per template.

The benchmark demanded over 300 hours of annotation and refinement, reflecting its rigorous construction.

Evaluations utilize string matching, URL matching, and HTML structure comparisons to assess accuracy.

In conclusion, this research highlights the disparity between general browsing proficiency and the higher-order cognitive abilities necessary for web-based tasks. The newly introduced WebChoreArena stands as a robust and detailed benchmark designed specifically to push web agents into territories where they must rely on reasoning, memory, and logic. It replaces ambiguity with standardization, and its tasks mimic the digital drudgery that agents must learn to handle if they are to become truly useful in automating real-world activities.

Check out the Paper, GitHub Page and Project Page. All credit for this research goes to the researchers of this project.

The post From Clicking to Reasoning: WebChoreArena Benchmark Challenges Agents with Memory-Heavy and Multi-Page Tasks appeared first on MarkTechPost.

Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterp …

AI agents powered by LLMs show great promise for handling complex business tasks, especially in areas like Customer Relationship Management (CRM). However, evaluating their real-world effectiveness is challenging due to the lack of publicly available, realistic business data. Existing benchmarks often focus on simple, one-turn interactions or narrow applications, such as customer service, missing out on broader domains, including sales, configure-price-quote (CPQ) processes, and B2B operations. They also fail to test how well agents manage sensitive information. These limitations make it challenging to fully comprehend how LLM agents perform across the diverse range of real-world business scenarios and communication styles.

Previous benchmarks have largely focused on customer service tasks in B2C scenarios, overlooking key business operations, such as sales and CPQ processes, as well as the unique challenges of B2B interactions, including longer sales cycles. Moreover, many benchmarks lack realism, often ignoring multi-turn dialogue or skipping expert validation of tasks and environments. Another critical gap is the absence of confidentiality evaluation, vital in workplace settings where AI agents routinely engage with sensitive business and customer data. Without assessing data awareness, these benchmarks fail to address serious practical concerns, such as privacy, legal risk, and trust. 

Researchers from Salesforce AI Research have introduced CRMArena-Pro, a benchmark designed to realistically evaluate LLM agents like Gemini 2.5 Pro in professional business environments. It features expert-validated tasks across customer service, sales, and CPQ, spanning both B2B and B2C contexts. The benchmark tests multi-turn conversations and assesses confidentiality awareness. Findings show that even top-performing models such as Gemini 2.5 Pro achieve only around 58% accuracy in single-turn tasks, with performance dropping to 35% in multi-turn settings. Workflow Execution is an exception, where Gemini 2.5 Pro exceeds 83%, but confidentiality handling remains a major challenge across all evaluated models. 

CRMArena-Pro is a new benchmark created to rigorously test LLM agents in realistic business settings, including customer service, sales, and CPQ scenarios. Built using synthetic yet structurally accurate enterprise data generated with GPT-4 and based on Salesforce schemas, the benchmark simulates business environments through sandboxed Salesforce Organizations. It features 19 tasks grouped under four key skills: database querying, textual reasoning, workflow execution, and policy compliance. CRMArena-Pro also includes multi-turn conversations with simulated users and tests confidentiality awareness. Expert evaluations confirmed the realism of the data and environment, ensuring a reliable testbed for LLM agent performance. 

The evaluation compared top LLM agents across 19 business tasks, focusing on task completion and awareness of confidentiality. Metrics varied by task type—exact match was used for structured outputs, and F1 score for generative responses. A GPT-4o-based LLM Judge assessed whether models appropriately refused to share sensitive information. Models like Gemini-2.5-Pro and o1, with advanced reasoning, clearly outperformed lighter or non-reasoning versions, especially in complex tasks. While performance was similar across B2B and B2C settings, nuanced trends emerged based on model strength. Confidentiality-aware prompts improved refusal rates but sometimes reduced task accuracy, highlighting a trade-off between privacy and performance. 

In conclusion, CRMArena-Pro is a new benchmark designed to test how well LLM agents handle real-world business tasks in customer relationship management. It includes 19 expert-reviewed tasks across both B2B and B2C scenarios, covering sales, service, and pricing operations. While top agents performed decently in single-turn tasks (about 58% success), their performance dropped sharply to around 35% in multi-turn conversations. Workflow execution was the easiest area, but most other skills proved challenging. Confidentiality awareness was low, and improving it through prompting often reduced task accuracy. These findings reveal a clear gap between the capabilities of LLMs and the needs of enterprises. 

Check out the Paper, GitHub Page, Hugging Face Page and Technical Blog. All credit for this research goes to the researchers of this project.

The post Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterprise-Grade Benchmark for LLM Agents appeared first on MarkTechPost.

Modernize and migrate on-premises fraud detection machine learning wor …

This post is co-written with Qing Chen and Mark Sinclair from Radial.
Radial is the largest 3PL fulfillment provider, also offering integrated payment, fraud detection, and omnichannel solutions to mid-market and enterprise brands. With over 30 years of industry expertise, Radial tailors its services and solutions to align strategically with each brand’s unique needs.
Radial supports brands in tackling common ecommerce challenges, from scalable, flexible fulfillment enabling delivery consistency to providing secure transactions. With a commitment to fulfilling promises from click to delivery, Radial empowers brands to navigate the dynamic digital landscape with the confidence and capability to deliver a seamless, secure, and superior ecommerce experience.
In this post, we share how Radial optimized the cost and performance of their fraud detection machine learning (ML) applications by modernizing their ML workflow using Amazon SageMaker.
The business need for fraud detection models
ML has proven to be an effective approach in fraud detection compared to traditional approaches. ML models can analyze vast amounts of transactional data, learn from historical fraud patterns, and detect anomalies that signal potential fraud in real time. By continuously learning and adapting to new fraud patterns, ML can make sure fraud detection systems stay resilient and robust against evolving threats, enhancing detection accuracy and reducing false positives over time. This post showcases how companies like Radial can modernize and migrate their on-premises fraud detection ML workflows to SageMaker. By using the AWS Experience-Based Acceleration (EBA) program, they can enhance efficiency, scalability, and maintainability through close collaboration.
Challenges of on-premises ML models
Although ML models are highly effective at combating evolving fraud trends, managing these models on premises presents significant scalability and maintenance challenges.
Scalability
On-premises systems are inherently limited by the physical hardware available. During peak shopping seasons, when transaction volumes surge, the infrastructure might struggle to keep up without substantial upfront investment. This can result in slower processing times or a reduced capacity to run multiple ML applications concurrently, potentially leading to missed fraud detections. Scaling an on-premises infrastructure is typically a slow and resource-intensive process, hindering a business’s ability to adapt quickly to increased demand. On the model training side, data scientists often face bottlenecks due to limited resources, forcing them to wait for infrastructure availability or reduce the scope of their experiments. This delays innovation and can lead to suboptimal model performance, putting businesses at a disadvantage in a rapidly changing fraud landscape.
Maintenance
Maintaining an on-premises infrastructure for fraud detection requires a dedicated IT team to manage servers, storage, networking, and backups. Maintaining uptime often involves implementing and maintaining redundant systems, because a failure could result in critical downtime and an increased risk of undetected fraud. Moreover, fraud detection models naturally degrade over time and require regular retraining, deployment, and monitoring. On-premises systems typically lack the built-in automation tools needed to manage the full ML lifecycle. As a result, IT teams must manually handle tasks such as updating models, monitoring for drift, and deploying new versions. This adds operational complexity, increases the likelihood of errors, and diverts valuable resources from other business-critical activities.
Common modernization challenges in ML cloud migration
Organizations face several significant challenges when modernizing their ML workloads through cloud migration. One major hurdle is the skill gap, where developers and data scientists might lack expertise in microservices architecture, advanced ML tools, and DevOps practices for cloud environments. This can lead to development delays, complex and costly architectures, and increased security vulnerabilities. Cross-functional barriers, characterized by limited communication and collaboration between teams, can also impede modernization efforts by hindering information sharing. Slow decision-making is another critical challenge: many organizations spend too long weighing options instead of taking action, which delays modernization and prevents them from using the cloud's ability to experiment and iterate quickly. In the fast-moving world of ML and cloud technology, slow decisions can put companies behind their competitors. Another significant obstacle is complex project management, because modernization initiatives often require coordinating work across multiple teams with conflicting priorities. This challenge is compounded by difficulties in aligning stakeholders on business outcomes, quantifying and tracking benefits to demonstrate value, and balancing long-term benefits with short-term goals. To address these challenges and streamline modernization efforts, AWS offers the EBA program. This methodology is designed to assist customers in aligning executives' vision, resolving roadblocks, accelerating their cloud journey, and achieving a successful migration and modernization of their ML workloads to the cloud.
EBA: AWS team collaboration
EBA is a 3-day interactive workshop that uses SageMaker to accelerate business outcomes. It guides participants through a prescriptive ML lifecycle, starting with identifying business goals and ML problem framing, and progressing through data processing, model development, production deployment, and monitoring.
We recognize that customers have different starting points. For those beginning from scratch, it’s often simpler to start with low code or no code solutions like Amazon SageMaker Canvas and Amazon SageMaker JumpStart, gradually transitioning to developing custom models on Amazon SageMaker Studio. However, because Radial has an existing on-premises ML infrastructure, we can begin directly by using SageMaker to address challenges in their current solution.
During the EBA, experienced AWS ML subject matter experts and the AWS Account Team worked closely with Radial’s cross-functional team. The AWS team offered tailored advice, tackled obstacles, and enhanced the organization’s capacity for ongoing ML integration. Instead of concentrating solely on data and ML technology, the emphasis is on addressing critical business challenges. This strategy helps organizations extract significant value from previously underutilized resources.
Modernizing ML workflows: From a legacy on-premises data center to SageMaker
Before modernization, Radial hosted its ML applications on premises within its data center. The legacy ML workflow presented several challenges, particularly in the time-intensive model development and deployment processes.
Legacy workflow: On-premises ML development and deployment
When the data science team needed to build a new fraud detection model, the development process typically took 2–4 weeks. During this phase, data scientists performed tasks such as the following:

Data cleaning and exploratory data analysis (EDA)
Feature engineering
Model prototyping and training experiments
Model evaluation to finalize the fraud detection model

These steps were carried out using on-premises servers, which limited the number of experiments that could be run concurrently due to hardware constraints. After the model was finalized, the data science team handed over the model artifacts and implementation code—along with detailed instructions—to the software developers and DevOps teams. This transition initiated the model deployment process, which involved:

Provisioning infrastructure – The software team set up the necessary infrastructure to host the ML API in a test environment.
API implementation and testing – Extensive testing and communication between the data science and software teams were required to make sure the model inference API behaved as expected. This phase typically added 2–3 weeks to the timeline.
Production deployment – The DevOps and system engineering teams provisioned and scaled on-premises hardware to deploy the ML API into production, a process that could take up to several weeks depending on resource availability.

Overall, the legacy workflow was prone to delays and inefficiencies, with significant communication overhead and a reliance on manual provisioning.
Modern workflow: SageMaker and MLOps
With the migration to SageMaker and the adoption of a machine learning operations (MLOps) architecture, Radial streamlined its entire ML lifecycle—from development to deployment. The new workflow consists of the following stages:

Model development – The data science team continues to perform tasks such as data cleaning, EDA, feature engineering, and model training within 2–4 weeks. However, with the scalable and on-demand compute resources of SageMaker, they can conduct more training experiments in the same timeframe, leading to improved model performance and faster iterations.
Seamless model deployment – When a model is ready, the data science team approves it in SageMaker and triggers the MLOps pipeline to deploy the model to the test (pre-production) environment. This eliminates the need for back-and-forth communication with the software team at this stage. Key improvements include:

The ML API inference code is preconfigured and wrapped by the data scientists during development, providing consistent behavior between development and deployment.
Deployment to test environments takes minutes, because the MLOps pipeline automates infrastructure provisioning and deployment.

Final integration and testing – The software team quickly integrates the API and performs necessary tests, such as integration and load testing. After the tests are successful, the team triggers the pipeline to deploy the ML models into production, which takes only minutes.

The MLOps pipeline not only automates the provisioning of cloud resources, but also provides consistency between pre-production and production environments, minimizing deployment risks.
Legacy vs. modern workflow comparison
The new workflow significantly reduces time and complexity:

Manual provisioning and communication overheads are reduced
Deployment times are reduced from weeks to minutes
Consistency between environments provides smoother transitions from development to production

This transformation enables Radial to respond more quickly to evolving fraud trends while maintaining high standards of efficiency and reliability. The following figure provides a visual comparison of the legacy and modern ML workflows.
Solution overview
When Radial migrated their fraud detection systems to the cloud, they collaborated with AWS Machine Learning Specialists and Solutions Architects to redesign how Radial manages the lifecycle of ML models. By using AWS and integrating continuous integration and delivery (CI/CD) pipelines with GitLab, Terraform, and AWS CloudFormation, Radial developed a scalable, efficient, and secure MLOps architecture. This new design accelerates model development and deployment, so Radial can respond faster to evolving fraud detection challenges.
The architecture incorporates best practices in MLOps, making sure that the different stages of the ML lifecycle—from data preparation to production deployment—are optimized for performance and reliability. Key components of the solution include:

SageMaker – Central to the architecture, SageMaker facilitates model training, evaluation, and deployment with built-in tools for monitoring and version control
GitLab CI/CD pipelines – These pipelines automate the workflows for testing, building, and deploying ML models, reducing manual overhead and providing consistent processes across environments
Terraform and AWS CloudFormation – These services enable infrastructure as code (IaC) to provision and manage AWS resources, providing a repeatable and scalable setup for ML applications

The overall solution architecture is illustrated in the following figure, showcasing how each component integrates seamlessly to support Radial’s fraud detection initiatives.

Account isolation for secure and scalable MLOps
To streamline operations and enforce security, the MLOps architecture is built on a multi-account strategy that isolates environments based on their purpose. This design enforces strict security boundaries, reduces risks, and promotes efficient collaboration across teams. The accounts are as follows:

Development account (model development workspace) – The development account is a dedicated workspace for data scientists to experiment and develop models. Secure data management is enforced by isolating datasets within Amazon Simple Storage Service (Amazon S3) buckets. Data scientists use SageMaker Studio for data exploration, feature engineering, and scalable model training. When the model build CI/CD pipeline in GitLab is triggered, Terraform and CloudFormation scripts automate the provisioning of infrastructure and AWS resources needed for SageMaker training pipelines. Trained models that meet predefined evaluation metrics are versioned and registered in the Amazon SageMaker Model Registry. With this setup, data scientists and ML engineers can perform multiple rounds of training experiments, review results, and finalize the best model for deployment testing.
Pre-production account (staging environment) – After a model is validated and approved in the development account, it’s moved to the pre-production account for staging. At this stage, the data science team triggers the model deploy CI/CD pipeline in GitLab to configure the endpoint in the pre-production environment. Model artifacts and inference images are synced from the development account to the pre-production environment. The latest approved model is deployed as an API in a SageMaker endpoint, where it undergoes thorough integration and load testing to validate performance and reliability.
Production account (live environment) – After passing the pre-production tests, the model is promoted to the production account for live deployment. This account mirrors the configurations of the pre-production environment to maintain consistency and reliability. The MLOps production team triggers the model deploy CI/CD pipeline to launch the production ML API. When it’s live, the model is continuously monitored using Amazon SageMaker Model Monitor and Amazon CloudWatch to make sure it performs as expected. In the event of deployment issues, automated rollback mechanisms revert to a stable model version, minimizing disruptions and maintaining business continuity.

With this multi-account architecture, data scientists can work independently while providing seamless transitions between development and production. The automation of CI/CD pipelines reduces deployment cycles, enhances scalability, and provides the security and performance necessary to maintain effective fraud detection systems.
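To make the model registration step in the development account more concrete, the following is a minimal sketch using boto3; the model package group name, container image URI, and model artifact location are placeholder assumptions, not Radial's actual configuration. After registration, the model version appears in the SageMaker Model Registry, where a data scientist can approve it before the deployment pipeline promotes it.

import boto3

sm = boto3.client("sagemaker")

# One-time setup: a model package group acts as a versioned collection for a model family
sm.create_model_package_group(
    ModelPackageGroupName="fraud-detection-models",  # hypothetical group name
    ModelPackageGroupDescription="Fraud detection model versions",
)

# Register a trained model version; it stays PendingManualApproval until explicitly approved
sm.create_model_package(
    ModelPackageGroupName="fraud-detection-models",
    ModelPackageDescription="Candidate model from the latest training run",
    InferenceSpecification={
        "Containers": [
            {
                "Image": "<inference-image-uri>",              # ECR image used for inference
                "ModelDataUrl": "s3://<bucket>/model.tar.gz",  # trained model artifacts
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
    ModelApprovalStatus="PendingManualApproval",
)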
Data privacy and compliance requirements
Radial prioritizes the protection and security of their customers' data. As a leader in ecommerce solutions, they are committed to meeting the high standards of data privacy and regulatory compliance such as CCPA and PCI. Radial fraud detection ML APIs process sensitive information such as transaction details and behavioral analytics. To meet strict compliance requirements, they use AWS Direct Connect, Amazon Virtual Private Cloud (Amazon VPC), and Amazon S3 with AWS Key Management Service (AWS KMS) encryption to build a secure and compliant architecture.
Protecting data in transit with Direct Connect
Data is never exposed to the public internet at any stage. To maintain the secure transfer of sensitive data between on-premises systems and AWS environments, Radial uses Direct Connect, which offers the following capabilities:

Dedicated network connection – Direct Connect establishes a private, high-speed connection between the data center and AWS, alleviating the risks associated with public internet traffic, such as interception or unauthorized access
Consistent and reliable performance – Direct Connect provides consistent bandwidth and low latency, making sure fraud detection APIs operate without delays, even during peak transaction volumes

Isolating workloads with Amazon VPC
When data reaches AWS, it’s processed in a VPC for maximum security. This offers the following benefits:

Private subnets for sensitive data – The components of the fraud detection ML API, including SageMaker endpoints and AWS Lambda functions, reside in private subnets, which are not accessible from the public internet
Controlled access with security groups – Strict access control is enforced through security groups and network access control lists (ACLs), allowing only authorized systems and users to interact with VPC resources
Data segregation by account – As mentioned previously regarding the multi-account strategy, workloads are isolated across development, staging, and production accounts, each with its own VPC, to limit cross-environment access and maintain compliance.

Securing data at rest with Amazon S3 and AWS KMS encryption
Data involved in the fraud detection workflows (for both model development and real-time inference) is securely stored in Amazon S3, with encryption powered by AWS KMS. This offers the following benefits:

AWS KMS encryption for sensitive data – Transaction logs, model artifacts, and prediction results are encrypted at rest using managed KMS keys
Encryption in transit – Interactions with Amazon S3, including uploads and downloads, are encrypted to make sure data remains secure during transfer
Data retention policies – Lifecycle policies enforce data retention limits, making sure sensitive data is stored only as long as necessary for compliance and business purposes before scheduled deletion
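As a simple illustration of the encryption-at-rest pattern described in this list, the following sketch uploads an object to Amazon S3 with SSE-KMS using boto3; the bucket name, object key, and KMS key ARN are placeholders. If the bucket has default encryption configured, the explicit parameters are optional, but they make the intent clear.

import boto3

s3 = boto3.client("s3")

# Upload a model artifact (or transaction log) encrypted at rest with a customer managed KMS key
with open("model.tar.gz", "rb") as f:
    s3.put_object(
        Bucket="<secure-ml-bucket>",                # placeholder bucket name
        Key="artifacts/fraud-model/model.tar.gz",   # placeholder object key
        Body=f,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="arn:aws:kms:<region>:<account-id>:key/<key-id>",
    )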

Data privacy by design
Data privacy is integrated into every step of the ML API workflow:

Secure inference – Incoming transaction data is processed within VPC-secured SageMaker endpoints, making sure predictions are made in a private environment
Minimal data retention – Real-time transaction data is anonymized where possible, and only aggregated results are stored for future analysis
Access control and governance – Resources are governed by AWS Identity and Access Management (IAM) policies, making sure only authorized personnel and services can access data and infrastructure

Benefits of the new ML workflow on AWS
To summarize, the implementation of the new ML workflow on AWS offers several key benefits:

Dynamic scalability – AWS enables Radial to scale their infrastructure dynamically to handle spikes in both model training and real-time inference traffic, providing optimal performance during peak periods.
Faster infrastructure provisioning – The new workflow accelerates the model deployment cycle, reducing the time to provision infrastructure and deploy new models by up to several weeks.
Consistency in model training and deployment – By streamlining the process, Radial achieves consistent model training and deployment across environments. This reduces communication overhead between the data science team and engineering/DevOps teams, simplifying the implementation of model deployment.
Infrastructure as code – With IaC, they benefit from version control and reusability, reducing manual configurations and minimizing the risk of errors during deployment.
Built-in model monitoring – The built-in capabilities of SageMaker, such as experiment tracking and data drift detection, help them maintain model performance and provide timely updates.
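To illustrate the built-in monitoring capability mentioned in the last item, here is a minimal sketch using the SageMaker Python SDK that attaches a data quality monitoring schedule to an endpoint; the endpoint name, IAM role, and S3 paths are placeholders rather than Radial's actual setup, and the endpoint is assumed to have data capture enabled.

from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Data quality monitor that compares live endpoint traffic against a training baseline
monitor = DefaultModelMonitor(
    role="<execution-role-arn>",   # placeholder IAM role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Profile the training data once to produce baseline statistics and constraints
monitor.suggest_baseline(
    baseline_dataset="s3://<bucket>/baseline/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://<bucket>/baseline/results",
)

# Schedule hourly drift checks against the endpoint's captured inference data
monitor.create_monitoring_schedule(
    monitor_schedule_name="fraud-model-data-quality",
    endpoint_input="<fraud-detection-endpoint-name>",
    output_s3_uri="s3://<bucket>/monitoring/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)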

Key takeaways and lessons learned from Radial’s ML model migration
To help modernize your MLOps workflow on AWS, the following are a few key takeaways and lessons learned from Radial’s experience:

Collaborate with AWS for customized solutions – Engage with AWS to discuss your specific use cases and identify templates that closely match your requirements. Although AWS offers a wide range of templates for common MLOps scenarios, they might need to be customized to fit your unique needs. Explore how to adapt these templates for migrating or revamping your ML workflows.
Iterative customization and support – As you customize your solution, work closely with both your internal team and AWS Support to address any issues. Plan for execution-based assessments and schedule workshops with AWS to resolve challenges at each stage. This might be an iterative process, but it makes sure your modules are optimized for your environment.
Use account isolation for security and collaboration – Use account isolation to separate model development, pre-production, and production environments. This setup promotes seamless collaboration between your data science team and DevOps/MLOps team, while also enforcing strong security boundaries between environments.
Maintain scalability with proper configuration – Radial’s fraud detection models successfully handled transaction spikes during peak seasons. To maintain scalability, configure instance quota limits correctly within AWS, and conduct thorough load testing before peak traffic periods to avoid any performance issues during high-demand times.
Secure model metadata sharing – Consider opting out of sharing model metadata when building your SageMaker pipeline to make sure your aggregate-level model information remains secure.
Prevent image conflicts with proper configuration – When using an AWS managed image for model inference, specify a hash digest within your SageMaker pipeline. Because the latest hash digest might change dynamically for the same image model version, this step helps avoid conflicts when retrieving inference images during model deployment.
Fine-tune scaling metrics through load testing – Fine-tune scaling metrics, such as instance type and automatic scaling thresholds, based on proper load testing. Simulate your business’s traffic patterns during both normal and peak periods to confirm your infrastructure scales effectively.
Applicability beyond fraud detection – Although the implementation described here is tailored to fraud detection, the MLOps architecture is adaptable to a wide range of ML use cases. Companies looking to modernize their MLOps workflows can apply the same principles to various ML projects.

Conclusion
This post demonstrated the high-level approach taken by Radial’s fraud team to successfully modernize their ML workflow by implementing an MLOps pipeline and migrating from on premises to the AWS Cloud. This was achieved through close collaboration with AWS during the EBA process. The EBA process begins with 4–6 weeks of preparation, culminating in a 3-day intensive workshop where a minimum viable MLOps pipeline is created using SageMaker, Amazon S3, GitLab, Terraform, and AWS CloudFormation. Following the EBA, teams typically spend an additional 2–6 weeks to refine the pipeline and fine-tune the models through feature engineering and hyperparameter optimization before production deployment. This approach enabled Radial to effectively select relevant AWS services and features, accelerating the training, deployment, and testing of ML models in a pre-production SageMaker environment. As a result, Radial successfully deployed multiple new ML models on AWS in their production environment around Q3 2024, achieving a more than 75% reduction in ML model deployment cycle and a 9% improvement in overall model performance.

“In the ecommerce retail space, mitigating fraudulent transactions and enhancing consumer experiences are top priorities for merchants. High-performing machine learning models have become invaluable tools in achieving these goals. By leveraging AWS services, we have successfully built a modernized machine learning workflow that enables rapid iterations in a stable and secure environment.”
– Lan Zhang, Head of Data Science and Advanced Analytics

To learn more about EBAs and how this approach can benefit your organization, reach out to your AWS Account Manager or Customer Solutions Manager. For additional information, refer to Using experience-based acceleration to achieve your transformation and Get to Know EBA.

About the Authors
Jake Wen is a Solutions Architect at AWS, driven by a passion for Machine Learning, Natural Language Processing, and Deep Learning. He assists Enterprise customers in achieving modernization and scalable deployment in the Cloud. Beyond the tech world, Jake finds delight in skateboarding, hiking, and piloting air drones.
Qing Chen is a senior data scientist at Radial, a full-stack solution provider for ecommerce merchants. In his role, he modernizes and manages the machine learning framework in the payment & fraud organization, driving a solid data-driven fraud decisioning flow to balance risk & customer friction for merchants.
Mark Sinclair is a senior cloud architect at Radial, a full-stack solution provider for ecommerce merchants. In his role, he designs, implements and manages the cloud infrastructure and DevOps for Radial engineering systems, driving a solid engineering architecture and workflow to provide highly scalable transactional services for Radial clients.

Contextual retrieval in Anthropic using Amazon Bedrock Knowledge Bases

For an AI model to perform effectively in specialized domains, it requires access to relevant background knowledge. A customer support chat assistant, for instance, needs detailed information about the business it serves, and a legal analysis tool must draw upon a comprehensive database of past cases.
To equip large language models (LLMs) with this knowledge, developers often use Retrieval Augmented Generation (RAG). This technique retrieves pertinent information from a knowledge base and incorporates it into the user’s prompt, significantly improving the model’s responses. However, a key limitation of traditional RAG systems is that they often lose contextual nuances when encoding data, leading to irrelevant or incomplete retrievals from the knowledge base.
Challenges in traditional RAG
In traditional RAG, documents are often divided into smaller chunks to optimize retrieval efficiency. Although this method performs well in many cases, it can introduce challenges when individual chunks lack the necessary context. For example, if a policy states that remote work requires “6 months of tenure” (chunk 1) and “HR approval for exceptions” (chunk 3), but omits the middle chunk linking exceptions to manager approval, a user asking about eligibility for a 3-month tenure employee might receive a misleading “No” instead of the correct “Only with HR approval.” This occurs because isolated chunks fail to preserve dependencies between clauses, highlighting a key limitation of basic chunking strategies in RAG systems.
Contextual retrieval enhances traditional RAG by adding chunk-specific explanatory context to each chunk before generating embeddings. This approach enriches the vector representation with relevant contextual information, enabling more accurate retrieval of semantically related content when responding to user queries. For instance, when asked about remote work eligibility, it fetches both the tenure requirement and the HR exception clause, enabling the LLM to provide an accurate response such as “Normally no, but HR may approve exceptions.” By intelligently stitching fragmented information, contextual retrieval mitigates the pitfalls of rigid chunking, delivering more reliable and nuanced answers.
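To make the idea concrete, here is a minimal sketch (not the exact code from this solution) of how a single chunk could be enriched with document-level context before embedding, using the Amazon Bedrock Converse API with Anthropic's Claude 3 Haiku; the prompt wording is an illustrative assumption.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def contextualize_chunk(document: str, chunk: str) -> str:
    """Ask Claude to situate a chunk within its source document before it is embedded."""
    prompt = (
        "Here is a document:\n<document>\n" + document + "\n</document>\n\n"
        "Here is a chunk from that document:\n<chunk>\n" + chunk + "\n</chunk>\n\n"
        "Write a short context (1-2 sentences) that situates this chunk within the document. "
        "Return only the context."
    )
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 200, "temperature": 0.0},
    )
    context = response["output"]["message"]["content"][0]["text"]
    # Prepend the generated context so the embedding captures both the chunk and its setting
    return context + "\n\n" + chunk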
In this post, we demonstrate how to use contextual retrieval with Anthropic and Amazon Bedrock Knowledge Bases.
Solution overview
This solution uses Amazon Bedrock Knowledge Bases, incorporating a custom Lambda function to transform data during the knowledge base ingestion process. This Lambda function processes documents from Amazon Simple Storage Service (Amazon S3), chunks them into smaller pieces, enriches each chunk with contextual information using Anthropic’s Claude in Amazon Bedrock, and then saves the results back to an intermediate S3 bucket. Here’s a step-by-step explanation:

Read input files from an S3 bucket specified in the event.
Chunk input data into smaller chunks.
Generate contextual information for each chunk using Anthropic’s Claude 3 Haiku.
Write the processed chunks with their metadata back to the intermediate S3 bucket.

The following diagram is the solution architecture.

Prerequisites
To implement the solution, complete the following prerequisite steps:

Have an active AWS account.
Create an AWS Identity and Access Management (IAM) role for the Lambda function to access Amazon Bedrock and documents from Amazon S3. For instructions, refer to Create a role to delegate permissions to an AWS service.
Add policy permissions to the IAM role.
Request access to Amazon Titan and Anthropic’s Claude 3 Haiku models in Amazon Bedrock.

Before you begin, you can deploy this solution by downloading the required files and following the instructions in its corresponding GitHub repository. This architecture is built around using the proposed chunking solution to implement contextual retrieval using Amazon Bedrock Knowledge Bases.
Implement contextual retrieval in Amazon Bedrock
In this section, we demonstrate how to use the proposed custom chunking solution to implement contextual retrieval using Amazon Bedrock Knowledge Bases. Developers can use custom chunking strategies in Amazon Bedrock to optimize how large documents or datasets are divided into smaller, more manageable pieces for processing by foundation models (FMs). This approach enables more efficient and effective handling of long-form content, improving the quality of responses. By tailoring the chunking method to the specific characteristics of the data and the requirements of the task at hand, developers can enhance the performance of natural language processing applications built on Amazon Bedrock. Custom chunking can involve techniques such as semantic segmentation, sliding windows with overlap, or using document structure to create logical divisions in the text.
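For example, a sliding-window strategy with overlap, one of the techniques mentioned above, can be sketched in a few lines of Python; the chunk size and overlap values are arbitrary illustrations, not the settings used in this solution.

def sliding_window_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    """Split text into overlapping chunks so context spanning a boundary is not lost."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping 'overlap' characters of context
    return chunks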
To implement contextual retrieval in Amazon Bedrock, complete the following steps, which can be found in the notebook in the GitHub repository.
To set up the environment, follow these steps:

Install the required dependencies:

%pip install --upgrade pip --quiet
%pip install -r requirements.txt --no-deps

Import the required libraries and set up AWS clients:

import os
import sys
import time
import boto3
import logging
import pprint
import json
from pathlib import Path

# AWS Clients Setup
s3_client = boto3.client('s3')
sts_client = boto3.client('sts')
session = boto3.session.Session()
region = session.region_name
account_id = sts_client.get_caller_identity()["Account"]
bedrock_agent_client = boto3.client('bedrock-agent')
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime')

# Configure logging
logging.basicConfig(
    format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s',
    level=logging.INFO
)
logger = logging.getLogger(__name__)

Define knowledge base parameters:

# Generate unique suffix for resource names
timestamp_str = time.strftime("%Y%m%d%H%M%S", time.localtime(time.time()))[-7:]
suffix = f"{timestamp_str}"

# Resource names
knowledge_base_name_standard = 'standard-kb'
knowledge_base_name_custom = 'custom-chunking-kb'
knowledge_base_description = "Knowledge Base containing complex PDF."
bucket_name = f'{knowledge_base_name_standard}-{suffix}'
intermediate_bucket_name = f'{knowledge_base_name_standard}-intermediate-{suffix}'
lambda_function_name = f'{knowledge_base_name_custom}-lambda-{suffix}'
foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"

# Define data sources
data_source = [{"type": "S3", "bucket_name": bucket_name}]

Create knowledge bases with different chunking strategies
To create knowledge bases with different chunking strategies, use the following code.

Standard fixed chunking:

# Create knowledge base with fixed chunking
knowledge_base_standard = BedrockKnowledgeBase(
    kb_name=f'{knowledge_base_name_standard}-{suffix}',
    kb_description=knowledge_base_description,
    data_sources=data_source,
    chunking_strategy="FIXED_SIZE",
    suffix=f'{suffix}-f'
)

# Upload data to S3, skipping repository metadata files
def upload_directory(path, bucket_name):
    for root, dirs, files in os.walk(path):
        for file in files:
            file_to_upload = os.path.join(root, file)
            if file not in ["LICENSE", "NOTICE", "README.md"]:
                print(f"uploading file {file_to_upload} to {bucket_name}")
                s3_client.upload_file(file_to_upload, bucket_name, file)
            else:
                print(f"Skipping file {file_to_upload}")

upload_directory("../synthetic_dataset", bucket_name)

# Start ingestion job
time.sleep(30)  # give the knowledge base time to become available
knowledge_base_standard.start_ingestion_job()
kb_id_standard = knowledge_base_standard.get_knowledge_base_id()

Custom chunking with Lambda function

import io
import zipfile

# Create Lambda function for custom chunking
# (assumes lambda_client = boto3.client('lambda') and lambda_role_arn are defined earlier in the notebook)
def create_lambda_function():
    # Package lambda_function.py into an in-memory zip archive,
    # because create_function expects a zip deployment package rather than raw source bytes
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, 'w') as zf:
        zf.write('lambda_function.py')
    zip_buffer.seek(0)

    response = lambda_client.create_function(
        FunctionName=lambda_function_name,
        Runtime='python3.9',
        Role=lambda_role_arn,
        Handler='lambda_function.lambda_handler',
        Code={'ZipFile': zip_buffer.read()},
        Timeout=900,
        MemorySize=256
    )
    return response['FunctionArn']

# Create knowledge base with custom chunking
knowledge_base_custom = BedrockKnowledgeBase(
    kb_name=f'{knowledge_base_name_custom}-{suffix}',
    kb_description=knowledge_base_description,
    data_sources=data_source,
    lambda_function_name=lambda_function_name,
    intermediate_bucket_name=intermediate_bucket_name,
    chunking_strategy="CUSTOM",
    suffix=f'{suffix}-c'
)

# Start ingestion job
time.sleep(30)
knowledge_base_custom.start_ingestion_job()
kb_id_custom = knowledge_base_custom.get_knowledge_base_id()

Evaluate performance using RAGAS framework
To evaluate performance using the RAGAS framework, follow these steps:

Set up RAGAS evaluation:

from ragas import SingleTurnSample, EvaluationDataset
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
    answer_correctness
)
from langchain_aws import ChatBedrock, BedrockEmbeddings

# Initialize Bedrock models for evaluation
TEXT_GENERATION_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"
EVALUATION_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

bedrock_client = boto3.client('bedrock-runtime')
llm_for_evaluation = ChatBedrock(model_id=EVALUATION_MODEL_ID, client=bedrock_client)
bedrock_embeddings = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v2:0",
    client=bedrock_client
)

Prepare evaluation dataset:

# Define test questions and ground truths
questions = [
    "What was the primary reason for the increase in net cash provided by operating activities for Octank Financial in 2021?",
    "In which year did Octank Financial have the highest net cash used in investing activities, and what was the primary reason for this?",
    # Add more questions…
]

ground_truths = [
    "The increase in net cash provided by operating activities was primarily due to an increase in net income and favorable changes in operating assets and liabilities.",
    "Octank Financial had the highest net cash used in investing activities in 2021, at $360 million…",
    # Add corresponding ground truths…
]

def prepare_eval_dataset(kb_id, questions, ground_truths):
    samples = []
    for question, ground_truth in zip(questions, ground_truths):
        # Get response and retrieved context from the knowledge base
        # (retrieve_and_generate is a helper defined in the notebook)
        response = retrieve_and_generate(question, kb_id)
        answer = response["output"]["text"]

        # Collect the context passages cited in the response
        contexts = []
        for citation in response["citations"]:
            context_texts = [
                ref["content"]["text"]
                for ref in citation["retrievedReferences"]
                if "content" in ref and "text" in ref["content"]
            ]
            contexts.extend(context_texts)

        # Create an evaluation sample
        sample = SingleTurnSample(
            user_input=question,
            retrieved_contexts=contexts,
            response=answer,
            reference=ground_truth
        )
        samples.append(sample)

    return EvaluationDataset(samples=samples)

Run evaluation and compare results:

import pandas as pd

# Evaluate both approaches
contextual_chunking_dataset = prepare_eval_dataset(kb_id_custom, questions, ground_truths)
default_chunking_dataset = prepare_eval_dataset(kb_id_standard, questions, ground_truths)

# Define metrics
metrics = [context_recall, context_precision, answer_correctness]

# Run evaluation
contextual_chunking_result = evaluate(
    dataset=contextual_chunking_dataset,
    metrics=metrics,
    llm=llm_for_evaluation,
    embeddings=bedrock_embeddings,
)

default_chunking_result = evaluate(
    dataset=default_chunking_dataset,
    metrics=metrics,
    llm=llm_for_evaluation,
    embeddings=bedrock_embeddings,
)

# Compare mean scores for each metric
comparison_df = pd.DataFrame({
    'Default Chunking': default_chunking_result.to_pandas().mean(numeric_only=True),
    'Contextual Chunking': contextual_chunking_result.to_pandas().mean(numeric_only=True)
})

# Visualize results, highlighting the better score per metric
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: #90EE90' if v else '' for v in is_max]

comparison_df.style.apply(
    highlight_max,
    axis=1,
    subset=['Default Chunking', 'Contextual Chunking']
)

Performance benchmarks
To evaluate the performance of the proposed contextual retrieval approach, we used the AWS Decision Guide: Choosing a generative AI service as the document for RAG testing. We set up two Amazon Bedrock knowledge bases for the evaluation:

One knowledge base with the default chunking strategy, which uses 300 tokens per chunk with a 20% overlap
Another knowledge base with the custom contextual retrieval chunking approach, which adds a contextual retrieval Lambda transformer on top of the same fixed chunking strategy of 300 tokens per chunk with a 20% overlap (see the configuration sketch after this list)
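For reference, the fixed-size settings above correspond to the vectorIngestionConfiguration accepted by the bedrock-agent CreateDataSource API, which the notebook's BedrockKnowledgeBase helper wraps. The following boto3 sketch uses placeholder identifiers and is illustrative only:

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Illustrative only: knowledge base ID and bucket ARN are placeholders
response = bedrock_agent.create_data_source(
    knowledgeBaseId="<knowledge-base-id>",
    name="standard-kb-datasource",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::<bucket-name>"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,         # 300 tokens per chunk
                "overlapPercentage": 20,  # 20% overlap between chunks
            },
        }
    },
)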

We used the RAGAS framework to assess the performance of these two approaches using small datasets. Specifically, we looked at the following metrics:

context_recall – Context recall measures how many of the relevant documents (or pieces of information) were successfully retrieved
context_precision – Context precision is a metric that measures the proportion of relevant chunks in the retrieved_contexts
answer_correctness – The assessment of answer correctness involves gauging the accuracy of the generated answer when compared to the ground truth

from ragas import SingleTurnSample, EvaluationDataset
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
    answer_correctness
)

#specify the metrics here
metrics = [
    context_recall,
    context_precision,
    answer_correctness
]

questions = [
    "What are the main AWS generative AI services covered in this guide?",
    "How does Amazon Bedrock differ from the other generative AI services?",
    "What are some key factors to consider when choosing a foundation model for your use case?",
    "What infrastructure services does AWS offer to support training and inference of large AI models?",
    "Where can I find more resources and information related to the AWS generative AI services?"
]
ground_truths = [
    "The main AWS generative AI services covered in this guide are Amazon Q Business, Amazon Q Developer, Amazon Bedrock, and Amazon SageMaker AI.",
    "Amazon Bedrock is a fully managed service that allows you to build custom generative AI applications with a choice of foundation models, including the ability to fine-tune and customize the models with your own data.",
    "Key factors to consider when choosing a foundation model include the modality (text, image, etc.), model size, inference latency, context window, pricing, fine-tuning capabilities, data quality and quantity, and overall quality of responses.",
    "AWS offers specialized hardware like AWS Trainium and AWS Inferentia to maximize the performance and cost-efficiency of training and inference for large AI models.",
    "You can find more resources like architecture diagrams, whitepapers, and solution guides on the AWS website. The document also provides links to relevant blog posts and documentation for the various AWS generative AI services."
]

The results obtained using the default chunking strategy are presented in the following table.

The results obtained using the contextual retrieval chunking strategy are presented in the following table. It demonstrates improved performance across the key metrics evaluated, including context recall, context precision, and answer correctness.

By aggregating the results, we can observe that the contextual chunking approach outperformed the default chunking strategy across the context_recall, context_precision, and answer_correctness metrics. This indicates the benefits of the more sophisticated contextual retrieval techniques implemented.

Implementation considerations
When implementing contextual retrieval using Amazon Bedrock, several factors need careful consideration. First, the custom chunking strategy must be optimized for both performance and accuracy, requiring thorough testing across different document types and sizes. The Lambda function’s memory allocation and timeout settings should be calibrated based on the expected document complexity and processing requirements, with initial recommendations of 1024 MB memory and 900-second timeout serving as baseline configurations. Organizations must also configure IAM roles with the principle of least privilege while maintaining sufficient permissions for Lambda to interact with Amazon S3 and Amazon Bedrock services. Additionally, the vectorization process and knowledge base configuration should be fine-tuned to balance between retrieval accuracy and computational efficiency, particularly when scaling to larger datasets.
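As a hedged illustration, the baseline Lambda settings mentioned above (1024 MB of memory and a 900-second timeout) could be applied to an already-deployed chunking function with a call like the following; the function name is a placeholder:

import boto3

lambda_client = boto3.client("lambda")

# Apply the suggested baseline configuration to the chunking Lambda function
# (the function name below is a placeholder for your deployment)
lambda_client.update_function_configuration(
    FunctionName="custom-chunking-kb-lambda",
    MemorySize=1024,  # MB; tune based on document size and complexity
    Timeout=900,      # seconds; the maximum allowed Lambda timeout
)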
Infrastructure scalability and monitoring considerations are equally crucial for successful implementation. Organizations should implement robust error-handling mechanisms within the Lambda function to manage various document formats and potential processing failures gracefully. Monitoring systems should be established to track key metrics such as chunking performance, retrieval accuracy, and system latency, enabling proactive optimization and maintenance.
Using Langfuse with Amazon Bedrock is a good option to introduce observability to this solution. The S3 bucket structure for both source and intermediate storage should be designed with clear lifecycle policies and access controls, and should account for Regional availability and data residency requirements. Furthermore, implementing a staged deployment approach, starting with a subset of data before scaling to full production workloads, can help identify and address potential bottlenecks or optimization opportunities early in the implementation process.
Cleanup
When you’re done experimenting with the solution, clean up the resources you created to avoid incurring future charges.
Conclusion
By combining Anthropic’s sophisticated language models with the robust infrastructure of Amazon Bedrock, organizations can now implement intelligent systems for information retrieval that deliver deeply contextualized, nuanced responses. The implementation steps outlined in this post provide a clear pathway for organizations to use contextual retrieval capabilities through Amazon Bedrock. By following the detailed configuration process, from setting up IAM permissions to deploying custom chunking strategies, developers and organizations can unlock the full potential of context-aware AI systems.
By leveraging Anthropic’s language models, organizations can deliver more accurate and meaningful results to their users while staying at the forefront of AI innovation. You can get started today with contextual retrieval using Anthropic’s language models through Amazon Bedrock and transform how your AI processes information with a small-scale proof of concept using your existing data. For personalized guidance on implementation, contact your AWS account team.

About the Authors
Suheel Farooq is a Principal Engineer in AWS Support Engineering, specializing in Generative AI, Artificial Intelligence, and Machine Learning. As a Subject Matter Expert in Amazon Bedrock and SageMaker, he helps enterprise customers design, build, modernize, and scale their AI/ML and Generative AI workloads on AWS. In his free time, Suheel enjoys working out and hiking.
Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.
Vinita is a Senior Serverless Specialist Solutions Architect at AWS. She combines AWS knowledge with strong business acumen to architect innovative solutions that drive quantifiable value for customers and has been exceptional at navigating complex challenges. Her technical expertise in application modernization, generative AI, and cloud computing, together with her ability to drive measurable business impact, makes her a strong partner in customers' journeys with AWS.
Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.
Venkata Moparthi is a Senior Solutions Architect who specializes in cloud migrations, generative AI, and secure architecture for financial services and other industries. He combines technical expertise with customer-focused strategies to accelerate digital transformation and drive business outcomes through optimized cloud solutions.

Run small language models cost-efficiently with AWS Graviton and Amazo …

As organizations look to incorporate AI capabilities into their applications, large language models (LLMs) have emerged as powerful tools for natural language processing tasks. Amazon SageMaker AI provides a fully managed service for deploying these machine learning (ML) models with multiple inference options, allowing organizations to optimize for cost, latency, and throughput. AWS has always provided customers with choice. That includes model choice, hardware choice, and tooling choice. In terms of hardware choice, in addition to NVIDIA GPUs and AWS custom AI chips, CPU-based instances represent (thanks to the latest innovations in CPU hardware) an additional choice for customers who want to run generative AI inference, like hosting small language models and asynchronous agents.
Traditional LLMs with billions of parameters require significant computational resources. For example, a 7-billion-parameter model like Meta Llama 7B at BFloat16 (2 bytes per parameter) typically needs around 14 GB of GPU memory to store the model weights—the total GPU memory requirement is usually 3–4 times bigger at long sequence lengths. However, recent developments in model quantization and knowledge distillation have made it possible to run smaller, efficient language models on CPU infrastructure. Although these models might not match the capabilities of the largest LLMs, they offer a practical alternative for many real-world applications where cost optimization is crucial.
In this post, we demonstrate how to deploy a small language model on SageMaker AI by extending our pre-built containers to be compatible with AWS Graviton instances. We first provide an overview of the solution, and then provide detailed implementation steps to help you get started. You can find the example notebook in the GitHub repo.
Solution overview
Our solution uses SageMaker AI with Graviton3 processors to run small language models cost-efficiently. The key components include:

SageMaker AI hosted endpoints for model serving
Graviton3 based instances (ml.c7g series) for computation
A container with llama.cpp installed for Graviton-optimized inference
Pre-quantized GGUF format models

Graviton processors, which are specifically designed for cloud workloads, provide an optimal platform for running these quantized models. Graviton3 based instances can deliver up to 50% better price-performance compared to traditional x86-based CPU instances for ML inference workloads.
We have used Llama.cpp as the inference framework. It supports quantized general matrix multiply-add (GEMM) kernels for faster inference and reduced memory use. The quantized GEMM kernels are optimized for Graviton processors using Arm Neon and SVE-based matrix multiply-accumulate (MMLA) instructions.
Llama.cpp uses GGUF, a special binary format for storing the model and metadata. The GGUF format is optimized for quick loading and saving of models, making it highly efficient for inference purposes. Existing models need to be converted to GGUF format before they can be used for inference. You can find most popular models in GGUF format in the following Hugging Face repo, or you can convert your model to GGUF format using the following tool.
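As an illustration (the repository and file names below are placeholders, not values from this post), a pre-quantized GGUF model can be pulled from the Hugging Face Hub with the huggingface_hub client before being uploaded to Amazon S3:

from huggingface_hub import hf_hub_download

# Download a pre-quantized GGUF model file from the Hugging Face Hub.
# Repository and file names are placeholders; substitute the model you intend to serve.
local_path = hf_hub_download(
    repo_id="<org>/<model>-GGUF",
    filename="<model>-q4_0.gguf",
)
print(f"Model downloaded to {local_path}")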
The following diagram illustrates the solution architecture.

To deploy your model on SageMaker with Graviton, you will need to complete the following steps:

Create a Docker container compatible with ARM64 architecture.
Prepare your model and inference code.
Create a SageMaker model and deploy to an endpoint with a Graviton instance.

We walk through these steps in the following sections.
Prerequisites
To implement this solution, you need an AWS account with the necessary permissions.
Create a Docker container compatible with ARM64 architecture
Let’s first review how SageMaker AI works with Docker containers. Basically, by packaging an algorithm in a container, you can bring almost any code to the SageMaker environment, regardless of programming language, environment, framework, or dependencies. For more information and an example of how to build your own Docker container for training and inference with SageMaker AI, see Build your own algorithm container.
You can also extend a pre-built container to accommodate your needs. By extending a pre-built image, you can use the included deep learning libraries and settings without having to create an image from scratch. You can extend the container to add libraries, modify settings, and install additional dependencies. For a list of available pre-built containers, refer to the following GitHub repo. In this example, we show how to package a pre-built PyTorch container that supports Graviton instances, extending the SageMaker PyTorch container, with a Python example that works with the DeepSeek distilled model.
Firstly, let’s review how SageMaker AI runs your Docker container. Typically, you specify a program (such as a script) as an ENTRYPOINT in the Dockerfile; that program runs at startup and decides what to do. The original ENTRYPOINT specified within the SageMaker PyTorch container is listed in the GitHub repo. To learn how to extend our pre-built container for model training, refer to Extend a Pre-built Container. In this example, we only use the inference container.
Running your container during hosting
Hosting has a very different model than training because hosting responds to inference requests that come in through HTTP. At the time of writing, the SageMaker PyTorch containers use TorchServe to provide robust and scalable serving of inference requests, as illustrated in the following diagram.

SageMaker uses two URLs in the container:

/ping receives GET requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.
/invocations is the endpoint that receives client inference POST requests. The format of the request and the response is up to the algorithm. If the client supplied ContentType and Accept headers, these are passed in as well. (A minimal handler sketch follows this list.)
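The pre-built PyTorch inference container already implements this contract through TorchServe, so you don't write these routes yourself. Purely for illustration, a fully custom container could satisfy the same contract with a minimal Flask app along these lines (model loading omitted; the port and echo behavior are assumptions for the sketch):

from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    # Health check: return 200 when the container is up and accepting requests
    return Response(status=200)

@app.route("/invocations", methods=["POST"])
def invocations():
    # Request/response format is up to the algorithm; here we simply echo JSON input
    payload = request.get_json(force=True)
    return {"echo": payload}

if __name__ == "__main__":
    # SageMaker routes traffic to port 8080 inside the container
    app.run(host="0.0.0.0", port=8080)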

The container has the model files in the same place that they were written to during training:
/opt/ml
└── model
    └── <model files>
Custom files available to build the container used in this example
The container directory has all the components you need to extend the SageMaker PyTorch container to use as a sample algorithm:
.
├── Dockerfile
├── build_and_push.sh
└── code
    ├── inference.py
    └── requirements.txt
Let’s discuss each of these in turn:

Dockerfile describes how to build your Docker container image for inference.
build_and_push.sh is a script that uses the Dockerfile to build your container image and then pushes it to Amazon Elastic Container Registry (Amazon ECR). We invoke the commands directly later in this notebook, but you can copy and run the script for your own algorithms. To build a Graviton-compatible Docker image, we launch an ARM64 architecture-based AWS CodeBuild environment, build the Docker image from the Dockerfile, and then push the Docker image to the ECR repo. Refer to the script for more details.
code is the directory that contains our user code to be invoked.

In this application, we install or update a few libraries for running Llama.cpp in Python. We put the following files in the container:

inference.py is the program that implements our inference code (used only for the inference container)
requirements.txt is the text file that contains additional Python packages that will be installed at deployment time

The Dockerfile describes the image that we want to build. We start from the SageMaker PyTorch image as the base inference image. The SageMaker PyTorch ECR image that supports Graviton in this case is:
FROM 763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-inference-arm64:2.5.1-cpu-py311-ubuntu22.04-sagemaker
Next, we install the required additional libraries and add the code that implements our specific algorithm to the container, and set up the right environment for it to run under. We recommend configuring the following optimizations for Graviton in the Dockerfile and the inference code for better performance:

In the Dockerfile, add compile flags like -mcpu=native -fopenmp when installing the llama.cpp Python package. The combination of these flags can lead to code optimized for the specific ARM architecture of Graviton and parallel execution that takes full advantage of the multi-core nature of Graviton processors.
Set n_threads to the number of vCPUs explicitly in the inference code to use all cores (vCPUs) on Graviton.
Use quantized q4_0 models, which minimize accuracy loss while aligning well with CPU architectures, improving CPU inference performance by reducing the memory footprint and enhancing cache utilization. For information on how to quantize models, refer to the llama.cpp README. (A short usage sketch follows this list.)
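For example, here is a hedged sketch of the second and third recommendations using the llama-cpp-python bindings; the model path, context window, and prompt are illustrative assumptions:

import os
from llama_cpp import Llama

# Load a q4_0 quantized GGUF model and use every vCPU on the Graviton instance.
# The model path is a placeholder for the file shipped in your model artifact.
llm = Llama(
    model_path="/opt/ml/model/model-q4_0.gguf",
    n_threads=os.cpu_count(),  # use all vCPUs for token generation
    n_ctx=4096,                # context window; tune for your workload
)

output = llm("Summarize AWS Graviton in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])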

The build_and_push.sh script describes how to automate the setup of a CodeBuild project specifically designed for building Docker images on ARM64 architecture. It sets up essential configuration variables; creates necessary AWS Identity and Access Management (IAM) roles with appropriate permissions for Amazon CloudWatch Logs, Amazon Simple Storage Service (Amazon S3), and Amazon ECR access; and establishes a CodeBuild project using an ARM-based container environment. The script includes functions to check for project existence and wait for project readiness, while configuring the build environment with required variables and permissions for building and pushing Docker images, particularly for the llama.cpp inference code.
Prepare your model and inference code
Given the use of a pre-built SageMaker PyTorch container, we can simply write an inference script that defines the following functions to handle input data deserialization, model loading, and prediction:

model_fn() loads the model from the model weights directory, passed to the function as the model_dir parameter (by default `/opt/ml/model`)
input_fn() formats the data received from a request made to the endpoint
predict_fn() runs inference with the model returned by model_fn() on the input returned by input_fn()
output_fn() optionally serializes predictions from predict_fn() into a format that can be returned in the HTTP response, such as JSON (a minimal inference.py sketch follows this list)
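A minimal inference.py sketch of this contract, assuming llama-cpp-python and the MODEL_FILE_GGUF environment variable that the deployment step later in this post sets, could look like the following (illustrative, not the exact script from the repo):

import json
import os
from llama_cpp import Llama

def model_fn(model_dir):
    # Load the GGUF model from the uncompressed artifact directory;
    # MODEL_FILE_GGUF is passed via the env parameter of PyTorchModel
    model_file = os.environ["MODEL_FILE_GGUF"]
    return Llama(model_path=os.path.join(model_dir, model_file), n_threads=os.cpu_count())

def input_fn(request_body, request_content_type):
    # Deserialize the JSON request body into a prompt dictionary
    return json.loads(request_body)

def predict_fn(data, model):
    # Run generation with the loaded model on the deserialized input
    return model(data["prompt"], max_tokens=data.get("max_tokens", 128))

def output_fn(prediction, accept):
    # Serialize predictions back to JSON for the HTTP response
    return json.dumps(prediction)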

Normally, you would compress model files into a TAR file; however, this can cause startup time to take longer due to having to download and untar large files. To improve startup times, SageMaker AI supports use of uncompressed files. This removes the need to untar large files. In this example, we upload all the files to an S3 prefix and then pass the location into the model with "CompressionType": "None".
Create a SageMaker model and deploy to an endpoint with a Graviton instance
Now we can use the PyTorchModel class provided by SageMaker Python SDK to create a PyTorch SageMaker model that can be deployed to a SageMaker endpoint:

pytorch_model = PyTorchModel(
    model_data={
        "S3DataSource": {
            "S3Uri": model_path,
            "S3DataType": "S3Prefix",
            "CompressionType": "None",
        }
    },
    role=role,
    env={
        'MODEL_FILE_GGUF': file_name
    },
    image_uri=f"{sagemaker_session.account_id()}.dkr.ecr.{region}.amazonaws.com/llama-cpp-python:latest",
    model_server_workers=2
)

predictor = pytorch_model.deploy(instance_type='ml.c7g.12xlarge', initial_instance_count=1)

TorchServe runs multiple workers in the container for inference, where each worker hosts a copy of the model. The model_server_workers parameter controls the number of workers that TorchServe will run by configuring the 'SAGEMAKER_MODEL_SERVER_WORKERS' environment variable. Because each worker loads its own copy of the model into memory, we recommend keeping the number of model server workers small.
Then we can invoke the endpoint using either the predictor object returned by the deploy function or the low-level Boto3 API, as follows:

client = boto3.client('sagemaker-runtime')

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(prompt)
)
print(response['Body'].read().decode("utf-8"))

Performance optimization discussion
When you’re happy with a specific model, or a quantized version of it, for your use case, you can start tuning your compute capacity to serve your users at scale. When running LLM inference, we look at two main metrics to evaluate performance: latency and throughput. Tools like LLMPerf enable measuring these metrics on SageMaker AI endpoints.

Latency – Represents the per-user experience by measuring the time needed to process a user request or prompt
Throughput – Represents the overall token throughput, measured in tokens per second, aggregated across user requests

When serving users in parallel, batching those parallel requests together can improve throughput and increase compute utilization by moving multiple inputs together with the model weights from host memory to the CPU to generate the output tokens. Model serving backends like vLLM and Llama.cpp support continuous batching, which automatically adds new requests to the existing batch, replacing old requests that have finished their token generation phase. However, configuring higher batch sizes comes at the expense of per-user latency, so you should tune the batch size for the best latency-throughput combination on the ML instance you're using on SageMaker AI. In addition to batching, using prompt or prefix caching to reuse the precomputed attention matrices in similar subsequent requests can further reduce latency.
When you find the optimal batch size for your use case, you can tune your endpoint's auto scaling policy to serve your users at scale using an endpoint that exposes multiple CPU-based ML instances and scales according to the application load. Let's say you can successfully serve 10 users in parallel with one ML instance. You can then scale out by increasing the instance count to match your target number of concurrent users; for example, you would need 10 instances to serve 100 users in parallel.
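As a hedged sketch (the endpoint and variant names are placeholders), such a policy can be registered with Application Auto Scaling using a target-tracking configuration on the built-in SageMakerVariantInvocationsPerInstance metric:

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/<endpoint-name>/variant/AllTraffic"  # placeholder names

# Allow the endpoint variant to scale between 1 and 10 instances
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=10,
)

# Track invocations per instance so new instances are added as load grows
autoscaling.put_scaling_policy(
    PolicyName="llm-invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 10.0,  # illustrative target invocations per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)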
Clean up
To avoid unwanted charges, clean up the resources you created as part of this solution if you no longer need it.
Conclusion
SageMaker AI with Graviton processors offers a compelling solution for organizations looking to deploy AI capabilities cost-effectively. By using CPU-based inference with quantized models, this approach delivers up to 50% cost savings compared to traditional deployments while maintaining robust performance for many real-world applications. The combination of simplified operations through the fully managed SageMaker infrastructure, flexible auto scaling with zero-cost downtime, and enhanced deployment speed with container caching technology makes it an ideal platform for production AI workloads.
To get started, explore our sample notebooks on GitHub and reference documentation to evaluate whether CPU-based inference suits your use case. You can also refer to the AWS Graviton Technical Guide, which provides the list of optimized libraries and best practices that can help you achieve cost benefits with Graviton instances across different workloads.

About the Authors
Vincent Wang is an Efficient Compute Specialist Solutions Architect at AWS based in Sydney, Australia. He helps customers optimize their cloud infrastructure by leveraging AWS’s silicon innovations, including AWS Graviton processors and AWS Neuron technology. Vincent’s expertise lies in developing AI/ML applications that harness the power of open-source software combined with AWS’s specialized AI chips, enabling organizations to achieve better performance and cost-efficiency in their cloud deployments.
Andrew Smith is a Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Oussama Maxime Kandakji is a Senior AI/ML Solutions Architect at AWS focusing on AI Inference and Agents. He works with companies of all sizes on solving business and performance challenges in AI and Machine Learning workloads. He enjoys contributing to open source and working with data.
Romain Legret is a Senior Efficient Compute Specialist Solutions Architect at AWS. Romain promotes the benefits of AWS Graviton, EC2 Spot, Karpenter, and Auto Scaling while helping French customers in their adoption journey. “Always try to achieve more with less” is his motto!

Mistral AI Introduces Mistral Code: A Customizable AI Coding Assistant …

Mistral AI announced the release of Mistral Code, an AI-powered coding assistant tailored for enterprise software development environments. This release signals Mistral’s move toward addressing long-standing requirements in professional development pipelines: control, security, and model adaptability.

Addressing Enterprise-Grade Requirements

Mistral Code targets several key limitations observed in traditional AI coding tools:

Data Sovereignty and Control: Organizations can maintain full control over their code and infrastructure. Mistral Code offers options for on-premises deployment, enabling compliance with internal data governance policies.

Customizability: Unlike off-the-shelf assistants, Mistral Code is fully tunable to an enterprise’s internal codebase, allowing the assistant to reflect project-specific conventions and logic structures.

Beyond Completion: The tool supports end-to-end workflows including debugging, test generation, and code transformation, moving beyond standard autocomplete functionality.

Unified Vendor Management: Mistral provides a single vendor solution with full visibility across the development stack, simplifying integration and support processes.

Initial deployments have been conducted with partners such as Capgemini, Abanca, and SNCF, suggesting the assistant’s applicability across both regulated and large-scale environments.

System Architecture and Capabilities

Mistral Code integrates four foundational models, each designed for a distinct set of development tasks:

Codestral: Specializes in code completion and in-filling, optimized for latency and multi-language support.

Codestral Embed: Powers semantic search and code retrieval tasks through dense vector embeddings.

Devstral: Designed for longer-horizon tasks, such as multi-step problem-solving and refactoring.

Mistral Medium: Enables conversational interactions and contextual Q&A inside the IDE.

The assistant supports over 80 programming languages and interfaces seamlessly with development artifacts like file structures, Git diffs, and terminal outputs. Developers can use natural language to initiate refactors, generate unit tests, or receive in-line explanations—all within their IDE.

Deployment Models

Mistral Code offers flexible deployment modes to meet diverse IT policies and performance needs:

Cloud: For teams working in managed cloud environments.

Reserved Cloud Capacity: Dedicated infrastructure to meet latency, throughput, or compliance requirements.

On-Premises: For enterprises with strict infrastructure control needs, especially in regulated sectors.

The assistant is currently in private beta for JetBrains IDEs and Visual Studio Code, with broader IDE support expected as adoption grows.

Administrative Features for IT Oversight

To align with enterprise security and operational practices, Mistral Code includes a comprehensive management layer:

Role-Based Access Control (RBAC): Configurable access policies to manage user permissions at scale.

Audit Logs: Full traceability of actions and interactions with the assistant for compliance auditing.

Usage Analytics: Detailed reporting dashboards to monitor adoption, performance, and optimization opportunities.

These features support internal security reviews, cost accountability, and usage governance.

Conclusion

Mistral Code introduces a modular and enterprise-aligned approach to AI-assisted development. By prioritizing adaptability, transparency, and data integrity, Mistral AI offers an alternative to generalized coding assistants that often fall short in production-grade environments. The tool’s architecture and deployment flexibility position it as a viable solution for organizations seeking to integrate AI without compromising on internal controls or development rigor.

Check out the Technical details and Try it here. All credit for this research goes to the researchers of this project.

LifelongAgentBench: A Benchmark for Evaluating Continuous Learning in …

Lifelong learning is crucial for intelligent agents navigating ever-changing environments, yet current LLM-based agents fall short—they lack memory and treat every task as a fresh start. While LLMs have transformed language tasks and inspired agent-based systems, these agents remain stateless and unable to learn from past experiences. True progress toward general intelligence requires agents that can retain, adapt, and reuse knowledge over time. Unfortunately, current benchmarks primarily focus on isolated tasks, overlooking the reuse of skills and knowledge retention. Without standardized evaluations for lifelong learning, it’s difficult to measure real progress, and issues like label errors and reproducibility further hinder practical development. 

Lifelong learning, also known as continual learning, aims to help AI systems build and retain knowledge across tasks while avoiding catastrophic forgetting. Most previous work in this area has focused on non-interactive tasks, such as image classification or sequential fine-tuning, where models process static inputs and outputs without needing to respond to changing environments. However, applying lifelong learning to LLM-based agents that operate in dynamic, interactive settings remains underexplored. Existing benchmarks, such as WebArena, AgentBench, and VisualWebArena, assess one-time task performance but don’t support learning over time. Even interactive studies involving games or tools lack standard frameworks for evaluating lifelong learning in agents. 

Researchers from the South China University of Technology, MBZUAI, the Chinese Academy of Sciences, and East China Normal University have introduced LifelongAgentBench, the first comprehensive benchmark for evaluating lifelong learning in LLM-based agents. It features interdependent, skill-driven tasks across three environments—Database, Operating System, and Knowledge Graph—with built-in label verification, reproducibility, and modular design. The study reveals that conventional experience replay is often ineffective due to the inclusion of irrelevant information and the limitation of context length. To address this, the team proposes a group self-consistency mechanism that clusters past experiences and applies voting strategies, significantly enhancing lifelong learning performance across various LLM architectures. 

LifelongAgentBench is a benchmark designed to test how effectively language model-based agents learn and adapt across a series of tasks over time. The setup treats learning as a sequential decision-making problem using goal-conditioned POMDPs within three environments: Databases, Operating Systems, and Knowledge Graphs. Tasks are structured around core skills and crafted to reflect real-world complexity, with attention to factors like task difficulty, overlapping skills, and environmental noise. Task generation combines both automated and manual validation to ensure quality and diversity. This benchmark helps assess whether agents can build on past knowledge and improve continuously in dynamic, skill-driven settings. 

LifelongAgentBench is a new evaluation framework designed to test how well LLM-based agents learn over time by tackling tasks in a strict sequence, unlike previous benchmarks that focus on isolated or parallel tasks. Its modular system includes components like an agent, environment, and controller, which can run independently and communicate via RPC. The framework prioritizes reproducibility and flexibility, supporting diverse environments and models. Through experiments, it has been shown that experience replay—feeding agents successful past trajectories—can significantly boost performance, especially on complex tasks. However, larger replays can lead to memory issues, underscoring the need for more efficient replay and memory management strategies. 

In conclusion, LifelongAgentBench is a pioneering benchmark designed to evaluate the ability of LLM-based agents to learn continuously over time. Unlike earlier benchmarks that treat agents as static, this framework tests their ability to build, retain, and apply knowledge across interconnected tasks in dynamic environments, such as databases, operating systems, and knowledge graphs. It offers modular design, reproducibility, and automated evaluation. While experience replay and group self-consistency show promise in boosting learning, issues such as memory overload and inconsistent gains across models persist. This work lays the foundation for developing more adaptable, memory-efficient agents, with future directions focusing on smarter memory use and real-world multimodal tasks. 

Check out the Paper. All credit for this research goes to the researchers of this project.

NVIDIA AI Releases Llama Nemotron Nano VL: A Compact Vision-Language M …

NVIDIA has introduced Llama Nemotron Nano VL, a vision-language model (VLM) designed to address document-level understanding tasks with efficiency and precision. Built on the Llama 3.1 architecture and coupled with a lightweight vision encoder, this release targets applications requiring accurate parsing of complex document structures such as scanned forms, financial reports, and technical diagrams.

Model Overview and Architecture

Llama Nemotron Nano VL integrates the CRadioV2-H vision encoder with a Llama 3.1 8B Instruct-tuned language model, forming a pipeline capable of jointly processing multimodal inputs — including multi-page documents with both visual and textual elements.

The architecture is optimized for token-efficient inference, supporting up to 16K context length across image and text sequences. The model can process multiple images alongside textual input, making it suitable for long-form multimodal tasks. Vision-text alignment is achieved via projection layers and rotary positional encoding tailored for image patch embeddings.

Training was conducted in three phases:

Stage 1: Interleaved image-text pretraining on commercial image and video datasets.

Stage 2: Multimodal instruction tuning to enable interactive prompting.

Stage 3: Text-only instruction data re-blending, improving performance on standard LLM benchmarks.

All training was performed using NVIDIA’s Megatron-LLM framework with Energon dataloader, distributed over clusters with A100 and H100 GPUs.

Benchmark Results and Evaluation

Llama Nemotron Nano VL was evaluated on OCRBench v2, a benchmark designed to assess document-level vision-language understanding across OCR, table parsing, and diagram reasoning tasks. OCRBench includes 10,000+ human-verified QA pairs spanning documents from domains such as finance, healthcare, legal, and scientific publishing.

Results indicate that the model achieves state-of-the-art accuracy among compact VLMs on this benchmark. Notably, its performance is competitive with larger, less efficient models, particularly in extracting structured data (e.g., tables and key-value pairs) and answering layout-dependent queries.

(Benchmark results updated as of June 3, 2025.)

The model also generalizes across non-English documents and degraded scan quality, reflecting its robustness under real-world conditions.

Deployment, Quantization, and Efficiency

Designed for flexible deployment, Nemotron Nano VL supports both server and edge inference scenarios. NVIDIA provides a quantized 4-bit version (AWQ) for efficient inference using TinyChat and TensorRT-LLM, with compatibility for Jetson Orin and other constrained environments.

Key technical features include:

Modular NIM (NVIDIA Inference Microservice) support, simplifying API integration

ONNX and TensorRT export support, ensuring hardware acceleration compatibility

Precomputed vision embeddings option, enabling reduced latency for static image documents

Conclusion

Llama Nemotron Nano VL represents a well-engineered tradeoff between performance, context length, and deployment efficiency in the domain of document understanding. Its architecture—anchored in Llama 3.1 and enhanced with a compact vision encoder—offers a practical solution for enterprise applications that require multimodal comprehension under strict latency or hardware constraints.

By topping OCRBench v2 while maintaining a deployable footprint, Nemotron Nano VL positions itself as a viable model for tasks such as automated document QA, intelligent OCR, and information extraction pipelines.

Check out the Technical details and Model on Hugging Face. All credit for this research goes to the researchers of this project.

Impel enhances automotive dealership customer experience with fine-tun …

This post is co-written with Tatia Tsmindashvili, Ana Kolkhidashvili, Guram Dentoshvili, Dachi Choladze from Impel.
Impel transforms automotive retail through an AI-powered customer lifecycle management solution that drives dealership operations and customer interactions. Their core product, Sales AI, provides all-day personalized customer engagement, handling vehicle-specific questions and automotive trade-in and financing inquiries. By replacing their existing third-party large language model (LLM) with a fine-tuned Meta Llama model deployed on Amazon SageMaker AI, Impel achieved 20% improved accuracy and greater cost control. The implementation used the comprehensive feature set of Amazon SageMaker, including model training, Activation-Aware Weight Quantization (AWQ), and Large Model Inference (LMI) containers. This domain-specific approach not only improved output quality but also enhanced security and reduced operational overhead compared to general-purpose LLMs.
In this post, we share how Impel enhances the automotive dealership customer experience with fine-tuned LLMs on SageMaker.
Impel’s Sales AI
Impel optimizes how automotive retailers connect with customers by delivering personalized experiences at every touchpoint—from initial research to purchase, service, and repeat business, acting as a digital concierge for vehicle owners, while giving retailers personalization capabilities for customer interactions. Sales AI uses generative AI to provide instant responses around the clock to prospective customers through email and text. This maintained engagement during the early stages of a customer’s car buying journey leads to showroom appointments or direct connections with sales teams. Sales AI has three core features to provide this consistent customer engagement:

Summarization – Summarizes past customer engagements to derive customer intent
Follow-up generation – Provides consistent follow-up to engaged customers to help prevent stalled customer purchasing journeys
Response personalization – Personalizes responses to align with retailer messaging and customer’s purchasing specifications

Two key factors drove Impel to transition from their existing LLM provider: the need for model customization and cost optimization at scale. Their previous solution’s per-token pricing model became cost-prohibitive as transaction volumes grew, and limitations on fine-tuning prevented them from fully using their proprietary data for model improvement. By deploying a fine-tuned Meta Llama model on SageMaker, Impel achieved the following:

Cost predictability through hosted pricing, mitigating per-token charges
Greater control of model training and customization, leading to 20% improvement across core features
Secure processing of proprietary data within their AWS account
Automatic scaling to meet the spike in inference demand

Solution overview
Impel chose SageMaker AI, a fully managed cloud service that builds, trains, and deploys machine learning (ML) models using AWS infrastructure, tools, and workflows to fine-tune a Meta Llama model for Sales AI. Meta Llama is a powerful model, well-suited for industry-specific tasks due to its strong instruction-following capabilities, support for extended context windows, and efficient handling of domain knowledge.
Impel used SageMaker LMI containers to deploy LLM inference on SageMaker endpoints. These purpose-built Docker containers offer optimized performance for models like Meta Llama with support for LoRA fine-tuned models and AWQ. Impel used LoRA fine-tuning, an efficient and cost-effective technique to adapt LLMs for specialized applications, through Amazon SageMaker Studio notebooks running on ml.p4de.24xlarge instances. This managed environment simplified the development process, enabling Impel’s team to seamlessly integrate popular open source tools like PyTorch and torchtune for model training. For model optimization, Impel applied AWQ techniques to reduce model size and improve inference performance.
In production, Impel deployed inference endpoints on ml.g6e.12xlarge instances, powered by four NVIDIA GPUs and high memory capacity, suitable for serving large models like Meta Llama efficiently. Impel used the SageMaker built-in automatic scaling feature to automatically scale serving containers based on concurrent requests, which helped meet variable production traffic demands while optimizing for cost.
The following diagram illustrates the solution architecture, showcasing model fine-tuning and customer inference.

Impel’s Sales AI reference architecture.

Impel’s R&D team partnered closely with various AWS teams, including its Account team, GenAI strategy team, and SageMaker service team. This virtual team collaborated over multiple sprints leading up to the fine-tuned Sales AI launch date to review model evaluations, benchmark SageMaker performance, optimize scaling strategies, and identify the optimal SageMaker instances. This partnership encompassed technical sessions, strategic alignment meetings, and cost and operational discussions for post-implementation. The tight collaboration between Impel and AWS was instrumental in realizing the full potential of Impel’s fine-tuned model hosted on SageMaker AI.
Fine-tuned model evaluation process
Impel’s transition to its fine-tuned Meta Llama model delivered improvements across key performance metrics with noticeable improvements in understanding automotive-specific terminology and generating personalized responses. Structured human evaluations revealed enhancements in critical customer interaction areas: personalized replies improved from 73% to 86% accuracy, conversation summarization increased from 70% to 83%, and follow-up message generation showed the most significant gain, jumping from 59% to 92% accuracy. The following screenshot shows how customers interact with Sales AI. The model evaluation process included Impel’s R&D team grading various use cases served by the incumbent LLM provider and Impel’s fine-tuned models.

Example of a customer interaction with Sales AI.

In addition to output quality, Impel measured latency and throughput to validate the model’s production readiness. Using awscurl for SigV4-signed HTTP requests, the team confirmed these improvements in real-world performance metrics, ensuring optimal customer experience in production environments.
Using domain-specific models for better performance
Impel’s evolution of Sales AI progressed from a general-purpose LLM to a domain-specific, fine-tuned model. Using anonymized customer interaction data, Impel fine-tuned a publicly available foundation model, resulting in several key improvements. The new model exhibited a 20% increase in accuracy across core features, showcasing enhanced automotive industry comprehension and more efficient context window utilization. By transitioning to this approach, Impel achieved three primary benefits:

Enhanced data security through in-house processing within their AWS accounts
Reduced reliance on external APIs and third-party providers
Greater operational control for scaling and customization

These advancements, coupled with the significant output quality improvement, validated Impel’s strategic shift towards a domain-specific AI model for Sales AI.
Expanding AI innovation in automotive retail
Impel’s success deploying fine-tuned models on SageMaker has established a foundation for extending its AI capabilities to support a broader range of use cases tailored to the automotive industry. Impel is planning to transition to in-house, domain-specific models to extend the benefits of improved accuracy and performance throughout their Customer Engagement Product suite. Looking ahead, Impel’s R&D team is advancing their AI capabilities by incorporating Retrieval Augmented Generation (RAG) workflows, advanced function calling, and agentic workflows. These innovations can help deliver adaptive, context-aware systems designed to interact, reason, and act across complex automotive retail tasks.
Conclusion
In this post, we discussed how Impel has enhanced the automotive dealership customer experience with fine-tuned LLMs on SageMaker.
For organizations considering similar transitions to fine-tuned models, Impel’s experience demonstrates how working with AWS can help achieve both accuracy improvements and model customization opportunities while building long-term AI capabilities tailored to specific industry needs. Connect with your account team or visit Amazon SageMaker AI to learn how SageMaker can help you deploy and manage fine-tuned models.

About the Authors
Nicholas Scozzafava is a Senior Solutions Architect at AWS, focused on startup customers. Prior to his current role, he helped enterprise customers navigate their cloud journeys. He is passionate about cloud infrastructure, automation, DevOps, and helping customers build and scale on AWS.
Sam Sudakoff is a Senior Account Manager at AWS, focused on strategic startup ISVs. Sam specializes in technology landscapes, AI/ML, and AWS solutions. Sam’s passion lies in scaling startups and driving SaaS and AI transformations. Notably, his work with AWS’s top startup ISVs has focused on building strategic partnerships and implementing go-to-market initiatives that bridge enterprise technology with innovative startup solutions, while maintaining strict adherence with data security and privacy requirements.
Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Dmitry Soldatkin is a Senior AI/ML Solutions Architect at AWS, helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. Prior to joining AWS, Dmitry was an architect, developer, and technology leader in data analytics and machine learning fields in the financial services industry.
Tatia Tsmindashvili is a Senior Deep Learning Researcher at Impel with an MSc in Biomedical Engineering and Medical Informatics. She has over 5 years of experience in AI, with interests spanning LLM agents, simulations, and neuroscience. You can find her on LinkedIn.
Ana Kolkhidashvili is the Director of R&D at Impel, where she leads AI initiatives focused on large language models and automated conversation systems. She has over 8 years of experience in AI, specializing in large language models, automated conversation systems, and NLP. You can find her on LinkedIn.
Guram Dentoshvili is the Director of Engineering and R&D at Impel, where he leads the development of scalable AI solutions and drives innovation across the company’s conversational AI products. He began his career at Pulsar AI as a Machine Learning Engineer and played a key role in building AI technologies tailored to the automotive industry. You can find him on LinkedIn.
Dachi Choladze is the Chief Innovation Officer at Impel, where he leads initiatives in AI strategy, innovation, and product development. He has over 10 years of experience in technology entrepreneurship and artificial intelligence. Dachi is the co-founder of Pulsar AI, Georgia’s first globally successful AI startup, which later merged with Impel. You can find him on LinkedIn.
Deepam Mishra is a Sr Advisor to Startups at AWS and advises startups on ML, Generative AI, and AI Safety and Responsibility. Before joining AWS, Deepam co-founded and led an AI business at Microsoft Corporation and Wipro Technologies. Deepam has been a serial entrepreneur and investor, having founded 4 AI/ML startups. Deepam is based in the NYC metro area and enjoys meeting AI founders.