Manage multi-tenant Amazon Bedrock costs using application inference profiles

Successful generative AI software as a service (SaaS) systems require a balance between service scalability and cost management. This becomes critical when building a multi-tenant generative AI service designed to serve a large, diverse customer base while maintaining rigorous cost controls and comprehensive usage monitoring.
Traditional cost management approaches for such systems often reveal limitations. Operations teams encounter challenges in accurately attributing costs across individual tenants, particularly when usage patterns demonstrate extreme variability. Enterprise clients might have different consumption behaviors—some experiencing sudden usage spikes during peak periods, whereas others maintain consistent resource consumption patterns.
A robust solution requires a context-driven, multi-tiered alerting system that exceeds conventional monitoring standards. By implementing graduated alert levels—from green (normal operations) to red (critical interventions)—systems can develop intelligent, automated responses that dynamically adapt to evolving usage patterns. This approach enables proactive resource management, precise cost allocation, and rapid, targeted interventions that help prevent potential financial overruns.
The breaking point often comes when you experience significant cost overruns. These overruns aren’t due to a single factor but rather a combination of multiple enterprise tenants increasing their usage while your monitoring systems fail to catch the trend early enough. Your existing alerting system might only provide binary notifications—either everything is fine or there’s a problem—that lack the nuanced, multi-level approach needed for proactive cost management. The situation is further complicated by a tiered pricing model, where different customers have varying SLA commitments and usage quotas. Without a sophisticated alerting system that can differentiate between normal usage spikes and genuine problems, your operations team might find itself constantly taking reactive measures rather than proactive ones.
This post explores how to implement a robust monitoring solution for multi-tenant AI deployments using a feature of Amazon Bedrock called application inference profiles. We demonstrate how to create a system that enables granular usage tracking, accurate cost allocation, and dynamic resource management across complex multi-tenant environments.
What are application inference profiles?
Application inference profiles in Amazon Bedrock enable granular cost tracking across your deployments. You can associate metadata with each inference request, creating a logical separation between different applications, teams, or customers accessing your foundation models (FMs). By implementing a consistent tagging strategy with application inference profiles, you can systematically track which tenant is responsible for each API call and the corresponding consumption.
For example, you can define key-value pair tags such as TenantID, business-unit, or ApplicationID and send these tags with each request to partition your usage data. You can also send the application inference profile ID with your request. When combined with AWS resource tagging, these tag-enabled profiles provide visibility into the utilization of Amazon Bedrock models. This tagging approach introduces accurate chargeback mechanisms to help you allocate costs proportionally based on actual usage rather than arbitrary distribution approaches. To attach tags to the inference profile, see Tagging Amazon Bedrock resources and Organizing and tracking costs using AWS cost allocation tags. Furthermore, you can use application inference profiles to identify optimization opportunities specific to each tenant, helping you implement targeted improvements for the greatest impact to both performance and cost-efficiency.
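As an illustration, the following boto3 sketch creates a tagged application inference profile for one tenant and then invokes a model through it so that usage rolls up under the tenant's tags. The profile name, tag values, and model ARN are placeholders, not values from this solution:

import boto3

bedrock = boto3.client("bedrock")
bedrock_runtime = boto3.client("bedrock-runtime")

# Create an application inference profile for one tenant, copying from a
# foundation model ARN (placeholder below) and attaching cost allocation tags.
profile = bedrock.create_inference_profile(
    inferenceProfileName="tenant-a-claude-profile",
    modelSource={"copyFrom": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"},
    tags=[
        {"key": "TenantID", "value": "tenant-a"},
        {"key": "ApplicationID", "value": "support-chat"},
    ],
)
profile_arn = profile["inferenceProfileArn"]

# Invoke the model through the tagged profile so usage is attributed to the tenant.
response = bedrock_runtime.converse(
    modelId=profile_arn,
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)
print(response["usage"])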
Solution overview
Imagine a scenario where an organization has multiple tenants, each with their respective generative AI applications using Amazon Bedrock models. To demonstrate multi-tenant cost management, we provide a sample, ready-to-deploy solution on GitHub. It deploys two tenants with two applications, each within a single AWS Region. The solution uses application inference profiles for cost tracking, Amazon Simple Notification Service (Amazon SNS) for notifications, and Amazon CloudWatch to produce tenant-specific dashboards. You can modify the source code of the solution to suit your needs.
The following diagram illustrates the solution architecture.

The solution handles the complexities of collecting and aggregating usage data across tenants, storing historical metrics for trend analysis, and presenting actionable insights through intuitive dashboards. This solution provides the visibility and control needed to manage your Amazon Bedrock costs while maintaining the flexibility to customize components to match your specific organizational requirements.
In the following sections, we walk through the steps to deploy the solution.
Prerequisites
Before setting up the project, you must have the following prerequisites:

AWS account – An active AWS account with permissions to create and manage resources such as Lambda functions, API Gateway endpoints, CloudWatch dashboards, and SNS alerts
Python environment – Python 3.12 or higher installed on your local machine
Virtual environment – It’s recommended to use a virtual environment to manage project dependencies

Create the virtual environment
The first step is to clone the GitHub repo or copy the code into a new project to create the virtual environment.

Update models.json
Review and update the models.json file to reflect the correct input and output token pricing based on your organization’s contract, or use the default settings. Verifying you have the right data at this stage is critical for accurate cost tracking.
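The exact schema ships with the repository; purely as an illustration (the key names below are hypothetical), a pricing entry can be loaded and used to estimate per-call cost like this:

import json

# Hypothetical models.json entry (illustrative key names; use the schema in the repository):
# {"anthropic.claude-3-haiku-20240307-v1:0":
#     {"input_price_per_1k_tokens": 0.00025, "output_price_per_1k_tokens": 0.00125}}
with open("models.json") as f:
    pricing = json.load(f)

def estimate_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single invocation from token counts and contracted prices."""
    p = pricing[model_id]
    return (input_tokens / 1000) * p["input_price_per_1k_tokens"] + (
        output_tokens / 1000
    ) * p["output_price_per_1k_tokens"]

print(estimate_cost("anthropic.claude-3-haiku-20240307-v1:0", 1200, 350))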

Update config.json
Modify config.json to define the profiles you want to set up for cost tracking. Each profile can have multiple key-value pairs for tags. For every profile, each tag key must be unique, and each tag key can have only one value. Each incoming request should contain these tags or the profile name as HTTP headers at runtime.
As part of the solution, you also configure a unique Amazon Simple Storage Service (Amazon S3) bucket for saving configuration artifacts and an admin email alias that will receive alerts when a particular threshold is breached.
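At runtime, tenant applications call the API Gateway endpoint rather than Amazon Bedrock directly. The following sketch shows one such request; the endpoint URL, payload shape, and header names are placeholders, so match them to the tag keys or profile name defined in your config.json:

import json

import requests

API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod/invoke"  # placeholder

payload = {"prompt": "Summarize our Q3 support tickets."}
headers = {
    "Content-Type": "application/json",
    # Hypothetical header names: align them with the tags or profile name in config.json.
    "TenantID": "tenant-a",
    "ApplicationID": "support-chat",
}

response = requests.post(API_URL, data=json.dumps(payload), headers=headers, timeout=30)
print(response.json())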

Create user roles and deploy solution resources
After you modify config.json and models.json, run the following command in the terminal to create the assets, including the user roles:
python setup.py --create-user-roles
Alternatively, you can create the assets without creating user roles by running the following command:
python setup.py
Make sure that you are executing this command from the project directory. Note that full access policies are not advised for production use cases.
The setup command creates the inference profiles, builds a CloudWatch dashboard to capture the metrics for each profile, deploys the inference Lambda function that calls the Amazon Bedrock Converse API and extracts the inference metadata and metrics related to the inference profile, sets up the SNS alerts, and finally creates the API Gateway endpoint that invokes the Lambda function.
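Conceptually, the inference Lambda function does something like the following simplified sketch: it invokes the Converse API through the tenant's application inference profile and publishes token metrics that the dashboard and alarms consume. The metric namespace, dimensions, and event shape are illustrative; the deployed function in the repository also handles request headers, pricing lookups, and error handling.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    # Illustrative only: the real function resolves the profile from the request headers.
    profile_arn = event["inference_profile_arn"]
    prompt = event["prompt"]

    response = bedrock_runtime.converse(
        modelId=profile_arn,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    usage = response["usage"]

    # Publish per-profile token metrics for the dashboard and alarms.
    cloudwatch.put_metric_data(
        Namespace="BedrockTenantUsage",  # illustrative namespace
        MetricData=[
            {
                "MetricName": "InputTokens",
                "Dimensions": [{"Name": "InferenceProfile", "Value": profile_arn}],
                "Value": usage["inputTokens"],
            },
            {
                "MetricName": "OutputTokens",
                "Dimensions": [{"Name": "InferenceProfile", "Value": profile_arn}],
                "Value": usage["outputTokens"],
            },
        ],
    )
    return {"output": response["output"]["message"]["content"][0]["text"], "usage": usage}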

When the setup is complete, you will see the inference profile IDs and API Gateway ID listed in the config.json file. (The API Gateway ID is also listed in the final part of the terminal output.)

When the API is live and inferences are invoked from it, the CloudWatch dashboard will show cost tracking. If you experience significant traffic, the alarms will trigger an SNS alert email.

For a video version of this walkthrough, refer to Track, Allocate, and Manage your Generative AI cost & usage with Amazon Bedrock.
You are now ready to use Amazon Bedrock models with this cost management solution. Make sure that you are using the API Gateway endpoint to consume these models and send the requests with the tags or application inference profile IDs as headers, which you provided in the config.json file. This solution will automatically log the invocations and track costs for your application on a per-tenant basis.
Alarms and dashboards
The solution creates the following alarms and dashboards:

BedrockTokenCostAlarm-{profile_name} – Alert when total token cost for {profile_name} exceeds {cost_threshold} in 5 minutes
BedrockTokensPerMinuteAlarm-{profile_name} – Alert when tokens per minute for {profile_name} exceed {tokens_per_min_threshold}
BedrockRequestsPerMinuteAlarm-{profile_name} – Alert when requests per minute for {profile_name} exceed {requests_per_min_threshold}

You can monitor and receive alerts about your AWS resources and applications across multiple Regions.
A metric alarm has the following possible states:

OK – The metric or expression is within the defined threshold
ALARM – The metric or expression is outside of the defined threshold
INSUFFICIENT_DATA – The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state

After you add an alarm to a dashboard, the alarm turns gray when it’s in the INSUFFICIENT_DATA state and red when it’s in the ALARM state. The alarm is shown with no color when it’s in the OK state.
An alarm invokes actions only when the alarm changes state from OK to ALARM. In this solution, an email is sent through your SNS subscription to the admin specified in your config.json file. You can specify additional actions when the alarm changes state between OK, ALARM, and INSUFFICIENT_DATA.
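The alarms created by the solution follow the standard CloudWatch pattern of publishing to an SNS topic when a threshold is crossed. The following is a minimal sketch of one such alarm; the namespace, metric name, threshold, and topic ARN are illustrative placeholders rather than the exact values the setup script uses:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="BedrockTokenCostAlarm-tenant-a-app1",
    Namespace="BedrockTenantUsage",            # must match the metrics the Lambda publishes
    MetricName="TokenCost",
    Dimensions=[{"Name": "InferenceProfile", "Value": "tenant-a-app1"}],
    Statistic="Sum",
    Period=300,                                # 5-minute window
    EvaluationPeriods=1,
    Threshold=5.0,                             # example cost threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:bedrock-cost-alerts"],  # placeholder topic ARN
)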
Considerations
Because the API Gateway maximum integration timeout (30 seconds) is lower than the Lambda timeout (15 minutes), long-running model inference calls might be cut off by API Gateway. Lambda and Amazon Bedrock also enforce strict payload and token size limits, so make sure your requests and responses fit within these boundaries. For example, the maximum payload size is 6 MB for synchronous Lambda invocations, and the combined request line and header values can’t exceed 10,240 bytes for API Gateway payloads. If your workload fits within these limits, you can use this solution.
Clean up
To delete your assets, run the following command:
python unsetup.py
Conclusion
In this post, we demonstrated how to implement effective cost monitoring for multi-tenant Amazon Bedrock deployments using application inference profiles, CloudWatch metrics, and custom CloudWatch dashboards. With this solution, you can track model usage, allocate costs accurately, and optimize resource consumption across different tenants. You can customize the solution according to your organization’s specific needs.
This solution provides the framework for building an intelligent system that can understand context—distinguishing between a gradual increase in usage that might indicate healthy business growth and sudden spikes that could signal potential issues. An effective alerting system needs to be sophisticated enough to consider historical patterns, time of day, and customer tier when determining alert levels. Furthermore, these alerts can trigger different types of automated responses based on the alert level: from simple notifications, to automatic customer communications, to immediate rate-limiting actions.
Try out the solution for your own use case, and share your feedback and questions in the comments.

About the authors
Claudio Mazzoni is a Sr Specialist Solutions Architect on the Amazon Bedrock GTM team. Claudio excels at guiding customers through their generative AI journey. Outside of work, Claudio enjoys spending time with family, working in his garden, and cooking Uruguayan food.
Fahad Ahmed is a Senior Solutions Architect at AWS and assists financial services customers. He has over 17 years of experience building and designing software applications. He recently found a new passion for making AI services accessible to the masses.

Manish Yeladandi is a Solutions Architect at AWS, specializing in AI/ML, containers, and security. Combining deep cloud expertise with business acumen, Manish architects secure, scalable solutions that help organizations optimize their technology investments and achieve transformative business outcomes.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures and experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.
Abhi Shivaditya is a Senior Solutions Architect at AWS, working with strategic global enterprise organizations to facilitate the adoption of AWS services in areas such as Artificial Intelligence, distributed computing, networking, and storage. His expertise lies in Deep Learning in the domains of Natural Language Processing (NLP) and Computer Vision. Abhi assists customers in deploying high-performance machine learning models efficiently within the AWS ecosystem.

GLM-4.1V-Thinking: Advancing General-Purpose Multimodal Understanding and Reasoning

Vision-language models (VLMs) play a crucial role in today’s intelligent systems by enabling a detailed understanding of visual content. The complexity of multimodal intelligence tasks has grown, ranging from scientific problem-solving to the development of autonomous agents. Current demands on VLMs have far exceeded simple visual content perception, with increasing attention on advanced reasoning. While recent works show that long-form reasoning and scalable RL significantly enhance LLMs’ problem-solving abilities, current efforts mainly focus on specific domains to improve VLM reasoning. The open-source community currently lacks a multimodal reasoning model that outperforms traditional non-thinking models of comparable parameter scale across diverse tasks.

Researchers from Zhipu AI and Tsinghua University have proposed GLM-4.1V-Thinking, a VLM designed to advance general-purpose multimodal understanding and reasoning. The training approach introduces Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the model’s full potential, enabling improvements across STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. The researchers open-sourced GLM-4.1V-9B-Thinking, which sets a new benchmark among similarly sized models. It also delivers competitive, and in some cases superior, performance compared to proprietary models like GPT-4o on challenging tasks such as long document understanding and STEM reasoning.

GLM-4.1V-Thinking contains three core components: a vision encoder, an MLP adapter, and an LLM decoder. It uses AIMv2-Huge as the vision encoder and GLM as the LLM, replacing the original 2D convolutions with 3D convolutions for temporal downsampling. The model integrates 2D-RoPE to support arbitrary image resolutions and aspect ratios, handling extreme aspect ratios over 200:1 and resolutions beyond 4K. The researchers extend RoPE to 3D-RoPE in the LLM to improve spatial understanding in multimodal contexts. For temporal modeling in videos, time index tokens are added after each frame token, with timestamps encoded as strings to help the model understand real-world temporal gaps between frames.

During pre-training, the researchers use a variety of datasets, combining large academic corpora with interleaved image-text data rich in knowledge. Including pure text data preserves the model’s core language capabilities, resulting in better pass@k performance than other state-of-the-art pre-trained base models of similar size. The supervised fine-tuning stage transforms the base VLM into one capable of long CoT inference, using a curated long-CoT corpus spanning verifiable tasks, such as STEM problems, and non-verifiable tasks, such as instruction following. Finally, the RL phase employs a combination of RLVR and RLHF to conduct large-scale training across all multimodal domains, including STEM problem solving, grounding, optical character recognition, GUI agents, and more.

GLM-4.1V-9B-Thinking outperforms all competing open-source models under 10B parameters in General VQA tasks covering both single-image and multi-image settings. It achieves the highest performance on challenging STEM benchmarks, including MMMU_Val, MMMU_Pro, VideoMMMU, and AI2D. In the OCR and Chart domains, the model sets new state-of-the-art scores on ChartQAPro and ChartMuseum. For Long Document Understanding, GLM-4.1V-9B-Thinking outperforms all other models on MMLongBench, while establishing new state-of-the-art results in GUI Agents and multimodal Coding tasks. Lastly, the model shows robust Video Understanding performance, outperforming competing models on the VideoMME, MMVU, and MotionBench benchmarks.

In conclusion, the researchers introduced GLM-4.1V-Thinking, which represents a step toward general-purpose multimodal reasoning. The 9B-parameter model outperforms some substantially larger models, including ones exceeding 70B parameters. However, several limitations remain, such as inconsistent improvements in reasoning quality through RL, instability during training, and difficulties with complex cases. Future developments should focus on improving supervision and evaluation of model reasoning, with reward models evaluating intermediate reasoning steps while detecting hallucinations and logical inconsistencies. Moreover, exploring strategies to prevent reward hacking in subjective evaluation tasks is crucial to achieving general-purpose intelligence.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


Mirage: Multimodal Reasoning in VLMs Without Rendering Images

While VLMs are strong at understanding both text and images, they often rely solely on text when reasoning, limiting their ability to solve tasks that require visual thinking, such as spatial puzzles. People naturally visualize solutions rather than describing every detail, but VLMs struggle to do the same. Although some recent models can generate both text and images, training them for image generation often weakens their ability to reason. Producing images also doesn’t support step-by-step visual reasoning. As a result, unlocking the full potential of VLMs for complex, visually grounded thinking remains a key challenge in the field. 

CoT prompting encourages models to reason through problems step by step using examples with intermediate explanations. This idea has been extended to multimodal tasks, where visual information is integrated into the reasoning flow. Methods like ICoT embed image regions within text sequences, whereas Visual CoT utilizes visual annotations to train models for improved spatial understanding. Some recent models can generate both text and images simultaneously; however, they require heavy supervision and incur high computational costs. Separately, researchers are exploring ways to embed reasoning internally within models by guiding their hidden states, using special tokens or latent representations instead of explicit reasoning steps. 

Researchers from the University of Massachusetts Amherst and MIT propose an approach inspired by how humans use mental imagery, which involves forming simple, task-relevant visuals internally while thinking. They introduce Mirage, a framework that enables VLMs to interleave visual reasoning directly into their text outputs without generating full images. Instead, the model inserts compact visual cues derived from its hidden states. It’s trained in two phases: first with both text and visual supervision, then with text-only guidance. Reinforcement learning further refines its reasoning skills. Mirage enables VLMs to think more like humans, thereby improving their performance on complex, multimodal tasks. 

Mirage is a framework inspired by human mental imagery that enables VLMs to reason using compact visual cues instead of generating full images. It employs two training stages: first, it grounds compressed visual features, known as latent tokens, within the reasoning process using helper images and joint supervision. Then, it relaxes this constraint, allowing the model to generate its latent tokens and use them to guide reasoning. This setup enables interleaved multimodal reasoning. A final reinforcement learning stage further fine-tunes the model using accuracy and formatting rewards, encouraging both correct answers and structured thought processes. 

The study evaluates the model on four spatial reasoning tasks, such as visual puzzles and geometry problems, using a small dataset of 1,000 training samples. To support reasoning, it generates synthetic helper images and thought steps, mimicking how humans use sketches and cues to facilitate thought processes. The model consistently outperforms both text-only and multimodal baselines, even in tasks that require extensive planning, such as maze solving. A smaller version of the model also yields strong results, demonstrating that the method is robust. Ablation studies confirm that grounding latent visual tokens first, followed by flexible training, is key. Overall, interleaving visual and text reasoning without real images boosts both understanding and accuracy. 

In conclusion, inspired by how humans use mental imagery to reason, the study introduces a lightweight approach that lets VLMs think visually, without ever generating actual images. By interleaving compact visual cues with text during decoding, the model learns to reason multimodally through a two-phase training process: first, anchoring these cues to real image features, then allowing them to evolve freely to support reasoning. A final reinforcement learning step sharpens performance. Tested on spatial reasoning tasks, the method consistently outperforms traditional text-only models. However, challenges remain in scaling to other tasks and improving the quality of the synthetic training data. 

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.



NVIDIA AI Releases Canary-Qwen-2.5B: A State-of-the-Art ASR-LLM Hybrid Model with SoTA Performance on OpenASR Leaderboard

NVIDIA has just released Canary-Qwen-2.5B, a groundbreaking automatic speech recognition (ASR) and language model (LLM) hybrid, which now tops the Hugging Face OpenASR leaderboard with a record-setting Word Error Rate (WER) of 5.63%. Licensed under CC-BY, this model is both commercially permissive and open-source, pushing forward enterprise-ready speech AI without usage restrictions. This release marks a significant technical milestone by unifying transcription and language understanding into a single model architecture, enabling downstream tasks like summarization and question answering directly from audio.

Key Highlights

5.63% WER – lowest on Hugging Face OpenASR leaderboard

RTFx of 418 – high inference speed on 2.5B parameters

Supports both ASR and LLM modes – enabling transcribe-then-analyze workflows

Commercial license (CC-BY) – ready for enterprise deployment

Open-source via NeMo – customizable and extensible for research and production

Model Architecture: Bridging ASR and LLM

The core innovation behind Canary-Qwen-2.5B lies in its hybrid architecture. Unlike traditional ASR pipelines that treat transcription and post-processing (summarization, Q&A) as separate stages, this model unifies both capabilities through:

FastConformer encoder: A high-speed speech encoder specialized for low-latency and high-accuracy transcription.

Qwen3-1.7B LLM decoder: An unmodified pretrained large language model (LLM) that receives audio-transcribed tokens via adapters.

The use of adapters ensures modularity, allowing the Canary encoder to be detached and Qwen3-1.7B to operate as a standalone LLM for text-based tasks. This architectural decision promotes multi-modal flexibility — a single deployment can handle both spoken and written inputs for downstream language tasks.

Performance Benchmarks

Canary-Qwen-2.5B achieves a record WER of 5.63%, outperforming all prior entries on Hugging Face’s OpenASR leaderboard. This is particularly notable given its relatively modest size of 2.5 billion parameters, compared to some larger models with inferior performance.

Metric            Value
WER               5.63%
Parameter Count   2.5B
RTFx              418
Training Hours    234,000
License           CC-BY

The 418 RTFx (Real-Time Factor) indicates that the model can process input audio 418× faster than real-time, a critical feature for real-world deployments where latency is a bottleneck (e.g., transcription at scale or live captioning systems).
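RTFx is simply the ratio of audio duration to wall-clock processing time, so the headline number translates directly into throughput, as this small sketch shows:

# RTFx = seconds of audio processed per second of wall-clock compute.
audio_seconds = 3600.0       # one hour of audio
processing_seconds = 8.6     # hypothetical wall-clock time at RTFx ~418
rtfx = audio_seconds / processing_seconds
print(f"RTFx = {rtfx:.0f}")  # about 418: roughly one hour of audio in under nine seconds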

Dataset and Training Regime

The model was trained on an extensive dataset comprising 234,000 hours of diverse English-language speech, far exceeding the scale of prior NeMo models. This dataset includes a wide range of accents, domains, and speaking styles, enabling superior generalization across noisy, conversational, and domain-specific audio.

Training was conducted using NVIDIA’s NeMo framework, with open-source recipes available for community adaptation. The integration of adapters allows for flexible experimentation — researchers can substitute different encoders or LLM decoders without retraining entire stacks.

Deployment and Hardware Compatibility

Canary-Qwen-2.5B is optimized for a wide range of NVIDIA GPUs:

Data Center: A100, H100, and newer Hopper/Blackwell-class GPUs

Workstation: RTX PRO 6000 (Blackwell), RTX A6000

Consumer: GeForce RTX 5090 and below

The model is designed to scale across hardware classes, making it suitable for both cloud inference and on-prem edge workloads.

Use Cases and Enterprise Readiness

Unlike many research models constrained by non-commercial licenses, Canary-Qwen-2.5B is released under a CC-BY license, enabling:

Enterprise transcription services

Audio-based knowledge extraction

Real-time meeting summarization

Voice-commanded AI agents

Regulatory-compliant documentation (healthcare, legal, finance)

The model’s LLM-aware decoding also introduces improvements in punctuation, capitalization, and contextual accuracy, which are often weak spots in ASR outputs. This is especially valuable for sectors like healthcare or legal where misinterpretation can have costly implications.

Open Source: A Recipe for Speech-Language Fusion

By open-sourcing the model and its training recipe, the NVIDIA research team aims to catalyze community-driven advances in speech AI. Developers can mix and match other NeMo-compatible encoders and LLMs, creating task-specific hybrids for new domains or languages.

The release also sets a precedent for LLM-centric ASR, where LLMs are not post-processors but integrated agents in the speech-to-text pipeline. This approach reflects a broader trend toward agentic models — systems capable of full comprehension and decision-making based on real-world multimodal inputs.

Conclusion

NVIDIA’s Canary-Qwen-2.5B is more than an ASR model — it’s a blueprint for integrating speech understanding with general-purpose language models. With SoTA performance, commercial usability, and open innovation pathways, this release is poised to become a foundational tool for enterprises, developers, and researchers aiming to unlock the next generation of voice-first AI applications.

Check out the Leaderboard, Model on Hugging Face and Try it here. All credit for this research goes to the researchers of this project.


Evaluating generative AI models with Amazon Nova LLM-as-a-Judge on Amazon SageMaker AI

Evaluating the performance of large language models (LLMs) goes beyond statistical metrics like perplexity or bilingual evaluation understudy (BLEU) scores. For most real-world generative AI scenarios, it’s crucial to understand whether a model is producing better outputs than a baseline or an earlier iteration. This is especially important for applications such as summarization, content generation, or intelligent agents where subjective judgments and nuanced correctness play a central role.
As organizations deepen their deployment of these models in production, we’re experiencing an increasing demand from customers who want to systematically assess model quality beyond traditional evaluation methods. Current approaches like accuracy measurements and rule-based evaluations, although helpful, can’t fully address these nuanced assessment needs, particularly when tasks require subjective judgments, contextual understanding, or alignment with specific business requirements. To bridge this gap, LLM-as-a-judge has emerged as a promising approach, using the reasoning capabilities of LLMs to evaluate other models more flexibly and at scale.
Today, we’re excited to introduce a comprehensive approach to model evaluation through the Amazon Nova LLM-as-a-Judge capability on Amazon SageMaker AI, a fully managed Amazon Web Services (AWS) service to build, train, and deploy machine learning (ML) models at scale. Amazon Nova LLM-as-a-Judge is designed to deliver robust, unbiased assessments of generative AI outputs across model families. Nova LLM-as-a-Judge is available as optimized workflows on SageMaker AI, and with it, you can start evaluating model performance against your specific use cases in minutes. Unlike many evaluators that exhibit architectural bias, Nova LLM-as-a-Judge has been rigorously validated to remain impartial and has achieved leading performance on key judge benchmarks while closely reflecting human preferences. With its exceptional accuracy and minimal bias, it sets a new standard for credible, production-grade LLM evaluation.
Nova LLM-as-a-Judge capability provides pairwise comparisons between model iterations, so you can make data-driven decisions about model improvements with confidence.
How Nova LLM-as-a-Judge was trained
Nova LLM-as-a-Judge was built through a multistep training process comprising supervised training and reinforcement learning stages that used public datasets annotated with human preferences. For the proprietary component, multiple annotators independently evaluated thousands of examples by comparing pairs of different LLM responses to the same prompt. To verify consistency and fairness, all annotations underwent rigorous quality checks, with final judgments calibrated to reflect broad human consensus rather than an individual viewpoint.
The training data was designed to be both diverse and representative. Prompts spanned a wide range of categories, including real-world knowledge, creativity, coding, mathematics, specialized domains, and toxicity, so the model could evaluate outputs across many real-world scenarios. The training data covered more than 90 languages and is primarily composed of English, Russian, Chinese, German, Japanese, and Italian. Importantly, an internal bias study evaluating over 10,000 human-preference judgments against 75 third-party models confirmed that Amazon Nova LLM-as-a-Judge shows only a 3% aggregate bias relative to human annotations. Although this is a significant achievement in reducing systematic bias, we still recommend occasional spot checks to validate critical comparisons.
In the following figure, you can see how the Nova LLM-as-a-Judge bias compares to human preferences when evaluating Amazon Nova outputs compared to outputs from other models. Here, bias is measured as the difference between the judge’s preference and human preference across thousands of examples. A positive value indicates the judge slightly favors Amazon Nova models, and a negative value indicates the opposite. To quantify the reliability of these estimates, 95% confidence intervals were computed using the standard error for the difference of proportions, assuming independent binomial distributions.
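To make the interval construction concrete, the following sketch computes a 95% confidence interval for the difference between a judge preference rate and a human preference rate using the normal approximation for two independent proportions. The counts and rates are made up for illustration and are not the study data:

import math

def diff_of_proportions_ci(p_judge: float, n_judge: int, p_human: float, n_human: int, z: float = 1.96):
    """95% CI for (judge preference rate - human preference rate), independent binomials."""
    diff = p_judge - p_human
    se = math.sqrt(p_judge * (1 - p_judge) / n_judge + p_human * (1 - p_human) / n_human)
    return diff - z * se, diff + z * se

# Hypothetical example: judge prefers Nova 52% of the time, humans 49%, 10,000 judgments each.
low, high = diff_of_proportions_ci(0.52, 10_000, 0.49, 10_000)
print(f"bias estimate: 0.030 (95% CI {low:.3f} to {high:.3f})")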

Amazon Nova LLM-as-a-Judge achieves advanced performance among evaluation models, demonstrating strong alignment with human judgments across a range of tasks. For example, it scores 45% accuracy on JudgeBench (compared to 42% for Meta J1 8B) and 68% on PPE (versus 60% for Meta J1 8B). The data from Meta’s J1 8B was pulled from Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning.
These results highlight the strength of Amazon Nova LLM-as-a-Judge in chatbot-related evaluations, as shown in the PPE benchmark. Our benchmarking follows current best practices, reporting reconciled results for positionally swapped responses on JudgeBench, CodeUltraFeedback, Eval Bias, and LLMBar, while using single-pass results for PPE.

Model                  Eval Bias   Judge Bench   LLM Bar   PPE    CodeUltraFeedback
Nova LLM-as-a-Judge    0.76        0.45          0.67      0.68   0.64
Meta J1 8B             –           0.42          –         0.60   –
Nova Micro (8B)        0.56        0.37          0.55      0.60   –

In this post, we present a streamlined approach to implementing Amazon Nova LLM-as-a-Judge evaluations using SageMaker AI, interpreting the resulting metrics, and applying this process to improve your generative AI applications.
Overview of the evaluation workflow
The evaluation process starts by preparing a dataset in which each example includes a prompt and two alternative model outputs. The JSONL format looks like this:

{"prompt": "Explain photosynthesis.", "response_A": "Answer A…", "response_B": "Answer B…"}
{"prompt": "Summarize the article.", "response_A": "Answer A…", "response_B": "Answer B…"}

After preparing this dataset, you use the given SageMaker evaluation recipe, which configures the evaluation strategy, specifies which model to use as the judge, and defines the inference settings such as temperature and top_p.
The evaluation runs inside a SageMaker training job using pre-built Amazon Nova containers. SageMaker AI provisions compute resources, orchestrates the evaluation, and writes the output metrics and visualizations to Amazon Simple Storage Service (Amazon S3).
When it’s complete, you can download and analyze the results, which include preference distributions, win rates, and confidence intervals.
Understanding how Amazon Nova LLM-as-a-Judge works
The Amazon Nova LLM-as-a-Judge uses an evaluation method called binary overall preference judge. The binary overall preference judge is a method where a language model compares two outputs side by side and picks the better one or declares a tie. For each example, it produces a clear preference. When you aggregate these judgments over many samples, you get metrics like win rate and confidence intervals. This approach uses the model’s own reasoning to assess qualities like relevance and clarity in a straightforward, consistent way.

This judge model is meant to provide low-latency general overall preferences in situations where granular feedback isn’t necessary
The output of this model is one of [[A>B]] or [[B>A]]
Use cases for this model are primarily those where automated, low-latency, general pairwise preferences are required, such as automated scoring for checkpoint selection in training pipelines

Understanding Amazon Nova LLM-as-a-Judge evaluation metrics
When using the Amazon Nova LLM-as-a-Judge framework to compare outputs from two language models, SageMaker AI produces a comprehensive set of quantitative metrics. You can use these metrics to assess which model performs better and how reliable the evaluation is. The results fall into three main categories: core preference metrics, statistical confidence metrics, and standard error metrics.
The core preference metrics report how often each model’s outputs were preferred by the judge model. The a_scores metric counts the number of examples where Model A was favored, and b_scores counts cases where Model B was chosen as better. The ties metric captures instances in which the judge model rated both responses equally or couldn’t identify a clear preference. The inference_error metric counts cases where the judge couldn’t generate a valid judgment due to malformed data or internal errors.
The statistical confidence metrics quantify how likely it is that the observed preferences reflect true differences in model quality rather than random variation. The winrate reports the proportion of all valid comparisons in which Model B was preferred. The lower_rate and upper_rate define the lower and upper bounds of the 95% confidence interval for this win rate. For example, a winrate of 0.75 with a confidence interval between 0.60 and 0.85 suggests that, even accounting for uncertainty, Model B is consistently favored over Model A. The score field often matches the count of Model B wins but can also be customized for more complex evaluation strategies.
The standard error metrics provide an estimate of the statistical uncertainty in each count. These include a_scores_stderr, b_scores_stderr, ties_stderr, inference_error_stderr, and score_stderr. Smaller standard error values indicate more reliable results. Larger values can point to a need for additional evaluation data or more consistent prompt engineering.
Interpreting these metrics requires attention to both the observed preferences and the confidence intervals:

If the winrate is substantially above 0.5 and the confidence interval doesn’t include 0.5, Model B is statistically favored over Model A.
Conversely, if the winrate is below 0.5 and the confidence interval is fully below 0.5, Model A is preferred.
When the confidence interval overlaps 0.5, the results are inconclusive and further evaluation is recommended.
High values in inference_error or large standard errors suggest there might have been issues in the evaluation process, such as inconsistencies in prompt formatting or insufficient sample size.

The following is an example metrics output from an evaluation run:

{
    "a_scores": 16.0,
    "a_scores_stderr": 0.03,
    "b_scores": 10.0,
    "b_scores_stderr": 0.09,
    "ties": 0.0,
    "ties_stderr": 0.0,
    "inference_error": 0.0,
    "inference_error_stderr": 0.0,
    "score": 10.0,
    "score_stderr": 0.09,
    "winrate": 0.38,
    "lower_rate": 0.23,
    "upper_rate": 0.56
}

In this example, Model A was preferred 16 times, Model B was preferred 10 times, and there were no ties or inference errors. The winrate of 0.38 indicates that Model B was preferred in 38% of cases, with a 95% confidence interval ranging from 23% to 56%. Because the interval includes 0.5, this outcome suggests the evaluation was inconclusive, and additional data might be needed to clarify which model performs better overall.
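To make these decision rules concrete, a small helper (a sketch, not part of the released recipes) can classify an evaluation run directly from the metrics dictionary:

def interpret(metrics: dict) -> str:
    """Apply the win-rate decision rules to a Nova LLM-as-a-Judge metrics dictionary."""
    winrate, low, high = metrics["winrate"], metrics["lower_rate"], metrics["upper_rate"]
    if low > 0.5:
        return f"Model B preferred (win rate {winrate:.2f}, CI [{low:.2f}, {high:.2f}])"
    if high < 0.5:
        return f"Model A preferred (win rate {winrate:.2f}, CI [{low:.2f}, {high:.2f}])"
    return f"Inconclusive: CI [{low:.2f}, {high:.2f}] includes 0.5; collect more samples"

example = {"winrate": 0.38, "lower_rate": 0.23, "upper_rate": 0.56}
print(interpret(example))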
These metrics, automatically generated as part of the evaluation process, provide a rigorous statistical foundation for comparing models and making data-driven decisions about which one to deploy.
Solution overview
This solution demonstrates how to evaluate generative AI models on Amazon SageMaker AI using the Nova LLM-as-a-Judge capability. The provided Python code guides you through the entire workflow.
First, it prepares a dataset by sampling questions from SQuAD and generating candidate responses from Qwen2.5 and Anthropic’s Claude 3.7. These outputs are saved in a JSONL file containing the prompt and both responses.
We accessed Anthropic’s Claude 3.7 Sonnet in Amazon Bedrock using the bedrock-runtime client. We accessed Qwen2.5 1.5B using a SageMaker hosted Hugging Face endpoint.
Next, a PyTorch Estimator launches an evaluation job using an Amazon Nova LLM-as-a-Judge recipe. The job runs on GPU instances such as ml.g5.12xlarge and produces evaluation metrics, including win rates, confidence intervals, and preference counts. Results are saved to Amazon S3 for analysis.
Finally, a visualization function renders charts and tables, summarizing which model was preferred, how strong the preference was, and how reliable the estimates are. Through this end-to-end approach, you can assess improvements, track regressions, and make data-driven decisions about deploying generative models—all without manual annotation.
Prerequisites
You need to complete the following prerequisites before you can run the notebook:

Make the following quota increase request for SageMaker AI. For this use case, you need a minimum of one ml.g5.12xlarge instance. On the Service Quotas console, request one G5 instance (ml.g5.12xlarge) for training job usage.
(Optional) You can create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. (You can use JupyterLab in your local setup, too.)

Create an AWS Identity and Access Management (IAM) role with managed policies AmazonSageMakerFullAccess, AmazonS3FullAccess, and AmazonBedrockFullAccess to give required access to SageMaker AI and Amazon Bedrock to run the examples.
Assign as trust relationship to your IAM role the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "bedrock.amazonaws.com",
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references training assets:

git clone https://github.com/aws-samples/amazon-nova-samples.git
cd customization/SageMakerTrainingJobs/Amazon-Nova-LLM-As-A-Judge/

Next, run the notebook Nova Amazon-Nova-LLM-as-a-Judge-Sagemaker-AI.ipynb to start using the Amazon Nova LLM-as-a-Judge implementation on Amazon SageMaker AI.
Model setup
To conduct an Amazon Nova LLM-as-a-Judge evaluation, you need to generate outputs from the candidate models you want to compare. In this project, we used two different approaches: deploying a Qwen2.5 1.5B model on Amazon SageMaker and invoking Anthropic’s Claude 3.7 Sonnet model in Amazon Bedrock. First, we deployed Qwen2.5 1.5B, an open-weight multilingual language model, on a dedicated SageMaker endpoint. This was achieved by using the HuggingFaceModel deployment interface. To deploy the Qwen2.5 1.5B model, we provided a convenient script for you to invoke:

python3 deploy_sm_model.py
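For reference, a deployment script like deploy_sm_model.py typically wraps the SageMaker HuggingFaceModel interface. The following is a simplified sketch; the model ID, container versions, instance type, and endpoint name are assumptions, so rely on the repository script for the exact values:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# Assumed Hugging Face Hub model ID; the repository script defines the exact one.
hub_env = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-1.5B-Instruct",
    "HF_TASK": "text-generation",
}

model = HuggingFaceModel(
    env=hub_env,
    role=role,
    transformers_version="4.37",   # assumed container versions
    pytorch_version="2.1",
    py_version="py310",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",  # assumed instance type
    endpoint_name="qwen25-eval-endpoint",
)
print(predictor.endpoint_name)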
When it’s deployed, inference can be performed using a helper function wrapping the SageMaker predictor API:

from sagemaker.huggingface import HuggingFacePredictor

# Initialize the predictor once
predictor = HuggingFacePredictor(endpoint_name="qwen25-<endpoint_name_here>")

def generate_with_qwen25(prompt: str, max_tokens: int = 500, temperature: float = 0.9) -> str:
    """
    Sends a prompt to the deployed Qwen2.5 model on SageMaker and returns the generated response.

    Args:
        prompt (str): The input prompt/question to send to the model.
        max_tokens (int): Maximum number of tokens to generate.
        temperature (float): Sampling temperature for generation.

    Returns:
        str: The model-generated text.
    """
    response = predictor.predict({
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_tokens,
            "temperature": temperature
        }
    })
    return response[0]["generated_text"]

answer = generate_with_qwen25("What is the Grotto at Notre Dame?")
print(answer)

In parallel, we integrated Anthropic’s Claude 3.7 Sonnet model in Amazon Bedrock. Amazon Bedrock provides a managed API layer for accessing proprietary foundation models (FMs) without managing infrastructure. The Claude generation function used the bedrock-runtime AWS SDK for Python (Boto3) client, which accepted a user prompt and returned the model’s text completion:

import json

import boto3

# Initialize Bedrock client once
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# (Claude 3.7 Sonnet) model ID via Bedrock
MODEL_ID = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

def generate_with_claude4(prompt: str, max_tokens: int = 512, temperature: float = 0.7, top_p: float = 0.9) -> str:
    """
    Sends a prompt to the Claude 3.7 Sonnet model via Amazon Bedrock and returns the generated response.

    Args:
        prompt (str): The user message or input prompt.
        max_tokens (int): Maximum number of tokens to generate.
        temperature (float): Sampling temperature for generation.
        top_p (float): Top-p nucleus sampling.

    Returns:
        str: The text content generated by Claude.
    """
    payload = {
        "anthropic_version": "bedrock-2023-05-31",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p
    }
    response = bedrock.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps(payload),
        contentType="application/json",
        accept="application/json"
    )
    response_body = json.loads(response["body"].read())
    return response_body["content"][0]["text"]

answer = generate_with_claude4("What is the Grotto at Notre Dame?")
print(answer)

When you have both functions generated and tested, you can move on to creating the evaluation data for the Nova LLM-as-a-Judge.
Prepare the dataset
To create a realistic evaluation dataset for comparing the Qwen and Claude models, we used the Stanford Question Answering Dataset (SQuAD), a widely adopted benchmark in natural language understanding distributed under the CC BY-SA 4.0 license. SQuAD consists of thousands of crowd-sourced question-answer pairs covering a diverse range of Wikipedia articles. By sampling from this dataset, we made sure that our evaluation prompts reflected high-quality, factual question-answering tasks representative of real-world applications.
We began by loading a small subset of examples to keep the workflow fast and reproducible. Specifically, we used the Hugging Face datasets library to download and load the first 20 examples from the SQuAD training split:

from datasets import load_dataset

squad = load_dataset("squad", split="train[:20]")

This command retrieves a slice of the full dataset, containing 20 entries with structured fields including context, question, and answers. To verify the contents and inspect an example, we printed out a sample question and its ground truth answer:

print(squad[3]["question"])
print(squad[3]["answers"]["text"][0])

For the evaluation set, we selected the first six questions from this subset:
questions = [squad[i]["question"] for i in range(6)]
Generate the Amazon Nova LLM-as-a-Judge evaluation dataset
After preparing a set of evaluation questions from SQuAD, we generated outputs from both models and assembled them into a structured dataset to be used by the Amazon Nova LLM-as-a-Judge workflow. This dataset serves as the core input for SageMaker AI evaluation recipes. To do this, we iterated over each question prompt and invoked the two generation functions defined earlier:

generate_with_qwen25() for completions from the Qwen2.5 model deployed on SageMaker
generate_with_claude4() for completions from Anthropic’s Claude 3.7 Sonnet in Amazon Bedrock

For each prompt, the workflow attempted to generate a response from each model. If a generation call failed due to an API error, timeout, or other issue, the system captured the exception and stored a clear error message indicating the failure. This made sure that the evaluation process could proceed gracefully even in the presence of transient errors:

import json

output_path = "llm_judge.jsonl"

with open(output_path, "w") as f:
    for q in questions:
        try:
            response_a = generate_with_qwen25(q)
        except Exception as e:
            response_a = f"[Qwen2.5 generation failed: {e}]"

        try:
            response_b = generate_with_claude4(q)
        except Exception as e:
            response_b = f"[Claude 3.7 generation failed: {e}]"

        row = {
            "prompt": q,
            "response_A": response_a,
            "response_B": response_b
        }
        f.write(json.dumps(row) + "\n")

print(f"JSONL file created at: {output_path}")

This workflow produced a JSON Lines file named llm_judge.jsonl. Each line contains a single evaluation record structured as follows:

{
    "prompt": "What is the capital of France?",
    "response_A": "The capital of France is Paris.",
    "response_B": "Paris is the capital city of France."
}

Then, upload this llm_judge.jsonl to an S3 bucket that you’ve predefined:

upload_to_s3(
    "llm_judge.jsonl",
    "s3://<YOUR_BUCKET_NAME>/datasets/byo-datasets-dev/custom-llm-judge/llm_judge.jsonl"
)
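The upload_to_s3 helper used here comes from the accompanying notebook and isn’t reproduced in this post; it is assumed to be a thin wrapper around the AWS SDK, along these lines:

import boto3

def upload_to_s3(local_path: str, s3_uri: str) -> None:
    """Upload a local file to the given s3:// URI (minimal sketch of the notebook helper)."""
    assert s3_uri.startswith("s3://")
    bucket, _, key = s3_uri[len("s3://"):].partition("/")
    boto3.client("s3").upload_file(local_path, bucket, key)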

Launching the Nova LLM-as-a-Judge evaluation job
After preparing the dataset and creating the evaluation recipe, the final step is to launch the SageMaker training job that performs the Amazon Nova LLM-as-a-Judge evaluation. In this workflow, the training job acts as a fully managed, self-contained process that loads the model, processes the dataset, and generates evaluation metrics in your designated Amazon S3 location.
We use the PyTorch estimator class from the SageMaker Python SDK to encapsulate the configuration for the evaluation run. The estimator defines the compute resources, the container image, the evaluation recipe, and the output paths for storing results:

estimator = PyTorch(
    output_path=output_s3_uri,
    base_job_name=job_name,
    role=role,
    instance_type=instance_type,
    training_recipe=recipe_path,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri,
    disable_profiler=True,
    debugger_hook_config=False,
)

When the estimator is configured, you initiate the evaluation job using the fit() method. This call submits the job to the SageMaker control plane, provisions the compute cluster, and begins processing the evaluation dataset:
estimator.fit(inputs={"train": evalInput})
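The evalInput channel passed to fit() points the job at the JSONL dataset uploaded to Amazon S3 earlier. One way to construct it, assuming the same S3 prefix used above, is with the SageMaker SDK's TrainingInput class:

from sagemaker.inputs import TrainingInput

# The S3 path is a placeholder: use the prefix where llm_judge.jsonl was uploaded.
evalInput = TrainingInput(
    s3_data="s3://<YOUR_BUCKET_NAME>/datasets/byo-datasets-dev/custom-llm-judge/",
    distribution="FullyReplicated",
    s3_data_type="S3Prefix",
)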
Results from the Amazon Nova LLM-as-a-Judge evaluation job
The following graphic illustrates the results of the Amazon Nova LLM-as-a-Judge evaluation job.

To help practitioners quickly interpret the outcome of a Nova LLM-as-a-Judge evaluation, we created a convenience function that produces a single, comprehensive visualization summarizing key metrics. This function, plot_nova_judge_results, uses Matplotlib and Seaborn to render an image with six panels, each highlighting a different perspective of the evaluation outcome.
This function takes the evaluation metrics dictionary—produced when the evaluation job is complete—and generates the following visual components:

Score distribution bar chart – Shows how many times Model A was preferred, how many times Model B was preferred, how many ties occurred, and how often the judge failed to produce a decision (inference errors). This provides an immediate sense of how decisive the evaluation was and whether either model is dominating.
Win rate with 95% confidence interval – Plots Model B’s overall win rate against Model A, including an error bar reflecting the lower and upper bounds of the 95% confidence interval. A vertical reference line at 50% marks the point of no preference. If the confidence interval doesn’t cross this line, you can conclude the result is statistically significant.
Preference pie chart – Visually displays the proportion of times Model A, Model B, or neither was preferred. This helps quickly understand preference distribution among the valid judgments.
A vs. B score comparison bar chart – Compares the raw counts of preferences for each model side by side. A clear label annotates the margin of difference to emphasize which model had more wins.
Win rate gauge – Depicts the win rate as a semicircular gauge with a needle pointing to Model B’s performance relative to the theoretical 0–100% range. This intuitive visualization helps nontechnical stakeholders understand the win rate at a glance.
Summary statistics table – Compiles numerical metrics—including total evaluations, error counts, win rate, and confidence intervals—into a compact, clean table. This makes it straightforward to reference the exact numeric values behind the plots.

Because the function outputs a standard Matplotlib figure, you can quickly save the image, display it in Jupyter notebooks, or embed it in other documentation.
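If you want a lighter-weight view than the full six-panel figure, a few lines of Matplotlib can plot just the win rate and its confidence interval. This is an independent sketch, not the plot_nova_judge_results helper itself:

import matplotlib.pyplot as plt

def plot_winrate(metrics: dict) -> None:
    """Plot Model B's win rate with its 95% confidence interval."""
    winrate = metrics["winrate"]
    err_low = winrate - metrics["lower_rate"]
    err_high = metrics["upper_rate"] - winrate

    fig, ax = plt.subplots(figsize=(4, 3))
    ax.errorbar([0], [winrate], yerr=[[err_low], [err_high]], fmt="o", capsize=6)
    ax.axhline(0.5, linestyle="--", color="gray", label="no preference")
    ax.set_xticks([])
    ax.set_ylim(0, 1)
    ax.set_ylabel("Model B win rate")
    ax.legend()
    plt.tight_layout()
    plt.show()

plot_winrate({"winrate": 0.38, "lower_rate": 0.23, "upper_rate": 0.56})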
Clean up
Complete the following steps to clean up your resources:

Delete your Qwen 2.5 1.5B Endpoint

import boto3

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client("sagemaker", region_name=<region>)

# Delete endpoint
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)

If you’re using a SageMaker Studio JupyterLab notebook, shut down the JupyterLab notebook instance.

How you can use this evaluation framework
The Amazon Nova LLM-as-a-Judge workflow offers a reliable, repeatable way to compare two language models on your own data. You can integrate this into model selection pipelines to decide which version performs best, or you can schedule it as part of continuous evaluation to catch regressions over time.
For teams building agentic or domain-specific systems, this approach provides richer insight than automated metrics alone. Because the entire process runs on SageMaker training jobs, it scales quickly and produces clear visual reports that can be shared with stakeholders.
Conclusion
This post demonstrates how Nova LLM-as-a-Judge—a specialized evaluation model available through Amazon SageMaker AI—can be used to systematically measure the relative performance of generative AI systems. The walkthrough shows how to prepare evaluation datasets, launch SageMaker AI training jobs with Nova LLM-as-a-Judge recipes, and interpret the resulting metrics, including win rates and preference distributions. The fully managed SageMaker AI solution simplifies this process, so you can run scalable, repeatable model evaluations that align with human preferences.
We recommend starting your LLM evaluation journey by exploring the official Amazon Nova documentation and examples. The AWS AI/ML community offers extensive resources, including workshops and technical guidance, to support your implementation journey.
To learn more, visit:

Amazon Nova Documentation
Amazon Bedrock Nova Overview
Fine-tuning Amazon Nova models
Amazon Nova customization guide

About the authors
Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.
Joel Carlson is a Senior Applied Scientist on the Amazon AGI foundation modeling team. He primarily works on developing novel approaches for improving the LLM-as-a-Judge capability of the Nova family of models.
Saurabh Sahu is an applied scientist in the Amazon AGI Foundation modeling team. He obtained his PhD in Electrical Engineering from University of Maryland College Park in 2019. He has a background in multi-modal machine learning working on speech recognition, sentiment analysis and audio/video understanding. Currently, his work focuses on developing recipes to improve the performance of LLM-as-a-judge models for various tasks.
Morteza Ziyadi is an Applied Science Manager at Amazon AGI, where he leads several projects on post-training recipes and (Multimodal) large language models in the Amazon AGI Foundation modeling team. Before joining Amazon AGI, he spent four years at Microsoft Cloud and AI, where he led projects focused on developing natural language-to-code generation models for various products. He has also served as an adjunct faculty at Northeastern University. He earned his PhD from the University of Southern California (USC) in 2017 and has since been actively involved as a workshop organizer, and reviewer for numerous NLP, Computer Vision and machine learning conferences.
Pradeep Natarajan is a Senior Principal Scientist in Amazon AGI Foundation modeling team working on post-training recipes and Multimodal large language models. He has 20+ years of experience in developing and launching multiple large-scale machine learning systems. He has a PhD in Computer Science from University of Southern California.
Michael Cai is a Software Engineer on the Amazon AGI Customization Team supporting the development of evaluation solutions. He obtained his MS in Computer Science from New York University in 2024. In his spare time, he enjoys 3D printing and exploring innovative tech.

Building cost-effective RAG applications with Amazon Bedrock Knowledge …

Vector embeddings have become essential for modern Retrieval Augmented Generation (RAG) applications, but organizations face significant cost challenges as they scale. As knowledge bases grow and require more granular embeddings, many vector databases that rely on high-performance storage such as SSDs or in-memory solutions become prohibitively expensive. This cost barrier often forces organizations to limit the scope of their RAG applications or compromise on the granularity of their vector representations, potentially impacting the quality of results. Additionally, for use cases involving historical or archival data that still needs to remain searchable, storing vectors in specialized vector databases optimized for high throughput workloads represents an unnecessary ongoing expense.
Starting July 15, Amazon Bedrock Knowledge Bases customers can select Amazon S3 Vectors (preview), the first cloud object storage with built-in support to store and query vectors at a low cost, as a vector store. Amazon Bedrock Knowledge Bases users can now reduce vector upload, storage, and query costs by up to 90%. Designed for durable and cost-optimized storage of large vector datasets with subsecond query performance, S3 Vectors is ideal for RAG applications that require long-term storage of massive vector volumes and can tolerate the performance tradeoff compared to high queries per second (QPS), millisecond latency vector databases. The integration with Amazon Bedrock means you can build more economical RAG applications while preserving the semantic search performance needed for quality results.
In this post, we demonstrate how to integrate Amazon S3 Vectors with Amazon Bedrock Knowledge Bases for RAG applications. You'll learn a practical approach to scaling your knowledge bases to handle millions of documents while maintaining retrieval quality and taking advantage of the cost-effective storage of S3 Vectors.
Amazon Bedrock Knowledge Bases and Amazon S3 Vectors integration overview
When creating a knowledge base in Amazon Bedrock, you can select S3 Vectors as your vector storage option. Using this approach, you can build cost-effective, scalable RAG applications without provisioning or managing complex infrastructure. The integration delivers significant cost savings while maintaining subsecond query performance, making it ideal for working with larger vector datasets generated from massive volumes of unstructured data including text, images, audio, and video. Using a pay-as-you-go pricing model at low price points, S3 Vectors offers industry-leading cost optimization that reduces the cost of uploading, storing, and querying vectors by up to 90% compared to alternative solutions. Advanced search capabilities include rich metadata filtering, so you can refine queries by document attributes such as dates, categories, and sources. The combination of S3 Vectors and Amazon Bedrock is ideal for organizations building large-scale knowledge bases that demand both cost efficiency and performant retrieval—from managing extensive document repositories to historical archives and applications requiring granular vector representations. The walkthrough follows these high-level steps:

Create a new knowledge base
Configure the data source
Configure data storage and processing
Sync the data source
Test the knowledge base

Prerequisites
Before you get started, make sure that you have the following prerequisites:

An AWS Account with appropriate service access.
An AWS Identity and Access Management (IAM) role with the appropriate permissions to access Amazon Bedrock and Amazon Simple Storage Service (Amazon S3).
Model access enabled for embedding and inference models such as Amazon Titan Text Embeddings V2 and Amazon Nova Pro.

Amazon Bedrock Knowledge Bases and Amazon S3 Vectors integration walkthrough
In this section, we walk through the step-by-step process of creating a knowledge base with Amazon S3 Vectors using the AWS Management Console. We cover the end-to-end process from configuring your vector store to ingesting documents and testing your retrieval capabilities.
For those who prefer to configure their knowledge base programmatically rather than using the console, the Amazon Bedrock Knowledge Bases with S3 Vectors repository in GitHub provides a guided notebook that you can follow to deploy the setup in your own account.
Create a new knowledge base
To create a new knowledge base, follow these steps:

On the Amazon Bedrock console in the left navigation pane, choose Knowledge Bases. To initiate the creation process, in the Create dropdown list, choose Knowledge Base with vector store.
On the Provide Knowledge Base details page, enter a descriptive name for your knowledge base and an optional description to identify its purpose. Select your IAM permissions approach—either create a new service role or use an existing one—to grant the necessary permissions for accessing AWS services, as shown in the following screenshot.

Choose Amazon S3 as the data source. Optionally, add tags to help organize and categorize your resources and configure log delivery destinations such as an S3 bucket or Amazon CloudWatch for monitoring and troubleshooting.
Choose Next to proceed to the data source configuration.

Configure the data source
To configure the data source, follow these steps:

Assign a descriptive name to your knowledge base data.
In Data source location, select whether the S3 bucket exists in your current AWS account or another account, then specify the location where your documents are stored, as shown in the following screenshot.

In this step, configure your parsing strategy to determine how Amazon Bedrock processes your documents. Select the Amazon Bedrock default parser for text-only documents at no additional cost, or select Amazon Bedrock Data Automation or a foundation model as the parser to process complex documents with visual elements.
The chunking strategy configuration is equally critical because it defines how your content is segmented into meaningful units for vector embedding, directly impacting retrieval quality and context preservation. We have selected Fixed-size chunking for this example due to its predictable token sizing and simplicity. Because both parsing and chunking decisions can’t be modified after creation, select options that best match your content structure and retrieval needs. For sensitive data, you can use advanced settings to implement AWS Key Management Service (AWS KMS) encryption or apply custom transformation functions to optimize your documents before ingestion. By default, S3 Vectors will use server-side encryption (SSE-S3).
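If you later automate this setup, the same parsing and chunking choices can be expressed when creating the data source programmatically. The following is a minimal sketch using the bedrock-agent CreateDataSource API with fixed-size chunking; the knowledge base ID, bucket ARN, and token values are illustrative placeholders rather than recommendations:

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Create an S3 data source with a fixed-size chunking strategy
bedrock_agent.create_data_source(
    knowledgeBaseId="<knowledge-base-id>",
    name="my-s3-data-source",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::<source-bucket-name>"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,         # tokens per chunk
                "overlapPercentage": 20,  # overlap between consecutive chunks
            },
        }
    },
)
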
Configure data storage and processing
To configure data storage and processing, first select the embeddings model, as shown in the following screenshot. The embeddings model will transform your text chunks into numerical vector representations for semantic search capabilities. If connecting to an existing S3 vector index, make sure the embedding model dimensions match those used when creating your vector index because dimensional mismatches will cause ingestion failures. Amazon Bedrock offers several embeddings models to choose from, each with different vector dimensions and performance characteristics optimized for various use cases. Consider both the semantic richness of the model and its cost implications when making your selection.
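Because dimensional mismatches surface as ingestion failures, it can help to confirm the dimension your embedding model actually produces before connecting it to an existing index. The following is a minimal sketch that invokes Amazon Titan Text Embeddings V2 through the Amazon Bedrock runtime and checks the embedding length; the Region and the 1024 value are illustrative:

import boto3
import json

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Request a 1024-dimension embedding and confirm its length matches the vector index dimension
body = json.dumps({"inputText": "dimension check", "dimensions": 1024, "normalize": True})
response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=body,
)
embedding = json.loads(response["body"].read())["embedding"]
print(len(embedding))  # should equal the dimension of your S3 vector index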

Next, configure the vector store. For vector storage selection, choose how Amazon Bedrock Knowledge Bases will store and manage the vector embeddings generated from your documents in Amazon S3 Vectors, using one of the following two options:
Option 1. Quick create a new vector store
This recommended option, shown in the following screenshot, automatically creates an S3 vector bucket and vector index in your account during knowledge base creation, optimizing your vector storage for cost-effective, durable storage of large-scale vector datasets.

Option 2. Use an existing vector store
When creating your S3 vector index for use with Amazon Bedrock Knowledge Bases, you can attach metadata (such as year, author, genre, and location) as key-value pairs to each vector. By default, metadata fields can be used as filters in similarity queries unless specified as nonfilterable metadata at the time of vector index creation. S3 vector indexes support string, number, and Boolean metadata types, with up to 40 KB of total metadata per vector and filterable metadata capped at 2 KB per vector.
To accommodate larger text chunks and richer metadata while still allowing filtering on other important attributes, add "AMAZON_BEDROCK_TEXT" to the nonFilterableMetadataKeys list in your index configuration. This approach optimizes your storage allocation for document content while preserving filtering capabilities for meaningful attributes like categories or dates. Keep in mind that fields added to the nonFilterableMetadataKeys array can't be used for metadata filtering in queries and can't be modified after the index is created.
Here's an example of creating an S3 vector index with this metadata configuration:

import boto3

s3vectors = boto3.client("s3vectors")

# Create a vector index whose AMAZON_BEDROCK_TEXT field is excluded from filterable metadata
s3vectors.create_index(
    vectorBucketName="my-first-vector-bucket",
    indexName="my-first-vector-index",
    dimension=1024,
    distanceMetric="cosine",
    dataType="float32",
    metadataConfiguration={"nonFilterableMetadataKeys": ["AMAZON_BEDROCK_TEXT"]}
)

For details on how to create a vector store, refer to Introducing Amazon S3 Vectors in the AWS News Blog.
After you have an S3 vector bucket and index, you can connect them to your knowledge base. You'll need to provide both the S3 vector bucket Amazon Resource Name (ARN) and the vector index ARN, as shown in the following screenshot, to correctly link your knowledge base to your existing S3 vector index.
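If you created the vector bucket and index programmatically and need to retrieve their ARNs, one option is to look up the index with the S3 Vectors GetIndex API and copy the ARNs from the response. The following is a minimal sketch that reuses the illustrative bucket and index names from the earlier example:

import boto3

s3vectors = boto3.client("s3vectors")

# Look up an existing vector index; the response contains the index details,
# including the ARNs needed to link the knowledge base
response = s3vectors.get_index(
    vectorBucketName="my-first-vector-bucket",
    indexName="my-first-vector-index",
)
print(response)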

Sync the data source
After you've configured your knowledge base with S3 Vectors, you need to synchronize your data source to generate and store vector embeddings. From the Amazon Bedrock Knowledge Bases console, open the knowledge base you created, locate your configured data source, and choose Sync to initiate the process, as shown in the following screenshot. During synchronization, the system processes your documents according to your parsing and chunking configurations, generates embeddings using your selected model, and stores them in your S3 vector index. You can monitor the synchronization progress in real time if you've configured Amazon CloudWatch Logs and verify completion status before testing your knowledge base's retrieval capabilities.
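You can also trigger the same synchronization programmatically. The following is a minimal sketch using the bedrock-agent StartIngestionJob API; the knowledge base and data source IDs are placeholders:

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Start an ingestion job, the programmatic equivalent of choosing Sync in the console
response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="<knowledge-base-id>",
    dataSourceId="<data-source-id>",
)
print(response["ingestionJob"]["status"])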

Test the knowledge base
After successfully configuring your knowledge base with S3 Vectors, you can validate its functionality using the built-in testing interface. You can use this interactive console to experiment with different query types and view both retrieval results and generated responses. Select between Retrieval only (Retrieve API) mode to examine raw source chunks or Retrieval and response generation (RetrieveAndGenerate API) to learn how foundation models (FMs) such as Amazon Nova use your retrieved content. The testing interface provides valuable insights into how your knowledge base processes queries, displaying source chunks, their relevance scores, and associated metadata.
You can also configure query settings for your knowledge base just as you would with other vector storage options, including filters for metadata-based selection, guardrails for appropriate responses, reranking capabilities, and query modification options. These tools help optimize retrieval quality and make sure the most relevant information is presented to your FMs. S3 Vectors currently supports semantic search functionality. Using this hands-on validation, you can refine your configuration before integrating the knowledge base with production applications.
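Outside the console, the same Retrieve and RetrieveAndGenerate operations are exposed through the Bedrock agent runtime. The following is a minimal sketch of a RetrieveAndGenerate call; the knowledge base ID, model ARN, and Region are placeholders you would replace with your own values:

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="<region>")

# Retrieve relevant chunks from the knowledge base and generate a grounded answer in one call
response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "<your-question>"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<knowledge-base-id>",
            "modelArn": "<model-arn-or-inference-profile-arn>",
        },
    },
)
print(response["output"]["text"])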

Creating your Amazon Bedrock knowledge base programmatically
In the previous sections, we walked through creating a knowledge base with Amazon S3 Vectors using the AWS Management Console. For those who prefer to automate this process or integrate it into existing workflows, you can also create your knowledge base programmatically using the AWS SDK.
The following is sample code showing how the API call looks when programmatically creating an Amazon Bedrock knowledge base with an existing S3 vector index:

import boto3

# Knowledge base management APIs are exposed through the bedrock-agent client
bedrock = boto3.client("bedrock-agent", region_name=region)

response = bedrock.create_knowledge_base(
    description="Amazon Bedrock Knowledge Base integrated with Amazon S3 Vectors",
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": f"arn:aws:bedrock:{region}::foundation-model/amazon.titan-embed-text-v2:0",
            "embeddingModelConfiguration": {
                "bedrockEmbeddingModelConfiguration": {
                    "dimensions": vector_dimension,  # verify this matches the S3 vector index configuration
                    "embeddingDataType": "FLOAT32"
                }
            },
        },
    },
    name=knowledge_base_name,
    roleArn=roleArn,
    storageConfiguration={
        "s3VectorsConfiguration": {
            "indexArn": vector_index_arn
        },
        "type": "S3_VECTORS"
    }
)

The role attached to the knowledge base needs several permissions, including access to the S3 Vectors API, the models used for embedding, generation, and reranking (if used), and the S3 bucket used as the data source. If you're using a customer managed key for your S3 vector bucket, you also need an additional policy statement that allows decryption of the data. The following is an example policy for the knowledge base service role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BedrockInvokeModelPermission",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel"
            ],
            "Resource": [
                "arn:aws:bedrock:{REGION}::foundation-model/amazon.titan-embed-text-v2:0"
            ]
        },
        {
            "Sid": "KmsPermission",
            "Effect": "Allow",
            "Action": [
                "kms:GenerateDataKey",
                "kms:Decrypt"
            ],
            "Resource": [
                "arn:aws:kms:{REGION}:{ACCOUNT_ID}:key/{KMS_KEY_ID}"
            ]
        },
        {
            "Sid": "S3ListBucketPermission",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{SOURCE_BUCKET_NAME}"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": [
                        "{ACCOUNT_ID}"
                    ]
                }
            }
        },
        {
            "Sid": "S3GetObjectPermission",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::{SOURCE_BUCKET_NAME}/{PREFIX}/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": [
                        "{ACCOUNT_ID}"
                    ]
                }
            }
        },
        {
            "Sid": "S3VectorsAccessPermission",
            "Effect": "Allow",
            "Action": [
                "s3vectors:GetIndex",
                "s3vectors:QueryVectors",
                "s3vectors:PutVectors",
                "s3vectors:GetVectors",
                "s3vectors:DeleteVectors"
            ],
            "Resource": "arn:aws:s3vectors:{REGION}:{ACCOUNT_ID}:bucket/{VECTOR_BUCKET_NAME}/index/{VECTOR_INDEX_NAME}",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "{ACCOUNT_ID}"
                }
            }
        }
    ]
}

Cleanup
To clean up your resources, complete the following steps. To delete the knowledge base:

On the Amazon Bedrock console, choose Knowledge Bases
Select your Knowledge Base and note both the IAM service role name and S3 Vector index ARN
Choose Delete and confirm

To delete the S3 vector index and vector bucket, use the following AWS Command Line Interface (AWS CLI) commands:

aws s3vectors delete-index --vector-bucket-name YOUR_VECTOR_BUCKET_NAME --index-name YOUR_INDEX_NAME --region YOUR_REGION
aws s3vectors delete-vector-bucket --vector-bucket-name YOUR_VECTOR_BUCKET_NAME --region YOUR_REGION

On the IAM console, find the role noted earlier
Select and delete the role

To delete the sample dataset:

On the Amazon S3 console, find your S3 bucket
Select and delete the files you uploaded for this tutorial

Conclusion
The integration between Amazon Bedrock Knowledge Bases and Amazon S3 Vectors represents a significant advancement in making RAG applications more accessible and economically viable at scale. By using the cost-optimized storage of Amazon S3 Vectors, organizations can now build knowledge bases at scale with improved cost efficiency. This means you can strike an optimal balance between performance and economics and focus on creating value through AI-powered applications rather than managing complex vector storage infrastructure.
To get started on Amazon Bedrock Knowledge Bases and Amazon S3 Vectors integration, refer to Using S3 Vectors with Amazon Bedrock Knowledge Bases in the Amazon S3 User Guide.

About the authors
Vaibhav Sabharwal is a Senior Solutions Architect with Amazon Web Services (AWS) based out of New York. He is passionate about learning new cloud technologies and assisting customers in building cloud adoption strategies, designing innovative solutions, and driving operational excellence. As a member of the Financial Services Technical Field Community at AWS, he actively contributes to the collaborative efforts within the industry.
Dani Mitchell is a Generative AI Specialist Solutions Architect at Amazon Web Services (AWS). He is focused on helping accelerate enterprises across the world on their generative AI journeys with Amazon Bedrock.
Irene Marban is a Generative AI Specialist Solutions Architect at Amazon Web Services (AWS), working with customers across EMEA to design and implement generative AI solutions to accelerate their businesses. With a background in biomedical engineering and AI, her work focuses on helping organizations leverage the latest AI technologies to drive innovation and growth. In her spare time, she loves reading and cooking for her friends.
Ashish Lal is an AI/ML Senior Product Marketing Manager for Amazon Bedrock. He has over 11 years of experience in product marketing and enjoys helping customers accelerate time to value and reduce their AI lifecycle cost.

Implementing on-demand deployment with customized Amazon Nova models o …

Amazon Bedrock offers model customization capabilities for customers to tailor versions of foundation models (FMs) to their specific needs through features such as fine-tuning and distillation. Today, we're announcing the launch of on-demand deployment for customized models on Amazon Bedrock.
On-demand deployment for customized models provides an additional deployment option that scales with your usage patterns. This approach allows for invoking customized models only when needed, with requests processed in real time without requiring pre-provisioned compute resources.
The on-demand deployment option includes a token-based pricing model that charges based on the number of tokens processed during inference. This pay-as-you-go approach complements the existing Provisioned Throughput option, giving users flexibility to choose the deployment method that best aligns with their specific workload requirements and cost objectives.
In this post, we walk through the custom model on-demand deployment workflow for Amazon Bedrock and provide step-by-step implementation guides using both the AWS Management Console and APIs or AWS SDKs. We also discuss best practices and considerations for deploying customized Amazon Nova models on Amazon Bedrock.
Understanding custom model on-demand deployment workflow
The model customization lifecycle represents the end-to-end journey from conceptualization to deployment. This process begins with defining your specific use case, preparing and formatting appropriate data, and then performing model customization through features such as Amazon Bedrock fine-tuning or Amazon Bedrock Model Distillation. Each stage builds upon the previous one, creating a pathway toward deploying production-ready generative AI capabilities that you tailor to your requirements. The following diagram illustrates this workflow.

After customizing your model, the evaluation and deployment phases determine how the model will be made available for inference. This is where custom model on-demand deployment becomes valuable, offering a deployment option that aligns with variable workloads and cost-conscious implementations. When using on-demand deployment, you can invoke your customized model through the AWS console or standard API operations using the model identifier, with compute resources automatically allocated only when needed. The on-demand deployment provides flexibility while maintaining performance expectations, so you can seamlessly integrate customized models into your applications with the same serverless experience offered by Amazon Bedrock—all compute resources are automatically managed for you, based on your actual usage. Because the workflow supports iterative improvements, you can refine your models based on evaluation results and evolving business needs.
Prerequisites
This post assumes you already have a customized Amazon Nova model to deploy. On-demand deployment requires Amazon Nova models customized after this launch; models customized before the launch aren't compatible with this deployment option. For instructions on creating or customizing your Nova model through fine-tuning or distillation, refer to these resources:

Fine-tuning Amazon Nova models
A guide to Amazon Bedrock Model Distillation

After you’ve successfully customized your Amazon Nova model, you can proceed with deploying it using the on-demand deployment option as detailed in the following sections.
Implementation guide for on-demand deployment
There are two main approaches to implementing on-demand deployment for your customized Amazon Nova models on Amazon Bedrock: using the Amazon Bedrock console or using the API or SDK. First, we explore how to deploy your model through the Amazon Bedrock console, which provides a user-friendly interface for setting up and managing your deployments.
Step-by-step implementation using the Amazon Bedrock console
To implement on-demand deployment for your customized Amazon Nova models on Amazon Bedrock using the console, follow these steps:

On the Amazon Bedrock console, select your customized model (fine-tuned or distilled) to deploy. Choose Set up inference and select Deploy for on-demand, as shown in the following screenshot.

Under Deployment details, enter a Name and a Description. You have the option to add Tags, as shown in the following screenshot. Choose Create to start the on-demand deployment of your customized model.

Under Custom model deployments, the status of your deployment should be InProgress, Active, or Failed, as shown in the following screenshot.

You can select a deployment to find Deployment ARN, Creation time, Last updated, and Status for the selected custom model.

The custom model is now deployed and ready to use with on-demand deployment. Try it out in the test playground: go to the Chat/Text playground, choose Custom models under Categories, select your model, choose On demand under Inference, and select the deployment by name, as shown in the following screenshot.

Step-by-step implementation using API or SDK
After you have trained the model successfully, you can deploy it to evaluate the response quality and latency or to use the model as a production model for your use case. You use the CreateCustomModelDeployment API to create a model deployment for the trained model. The following steps show how to use the APIs for deploying and deleting the custom model deployment for on-demand inference.

import boto3
import json

# First, create and configure an Amazon Bedrock client:
bedrock_client = boto3.client(
    service_name="bedrock", region_name="<region-info>")

# create custom model deployment
response = bedrock_client.create_custom_model_deployment(
    modelDeploymentName="<model-deployment-name>",
    modelArn="<trained-model-arn>",
    description="<model-deployment-description>",
    tags=[
        {"key": "<your-key>",
         "value": "<your-value>"},
    ])

After you've successfully created a model deployment, you can check the status of the deployment by using the GetCustomModelDeployment API as follows:

response = bedrock_client.get_custom_model_deployment(
    customModelDeploymentIdentifier="<custom-deployment-arn>")

GetCustomModelDeployment reports three states: Creating, Active, and Failed. When the status in the response is Active, you can use the custom model through on-demand deployment with the InvokeModel or Converse API, as shown in the following examples:

# Define Runtime Client
bedrock_runtime = boto3.client(service_name="bedrock-runtime", region_name="<region-info>")

# invoke a deployed custom model using Converse API
response = bedrock_runtime.converse(
    modelId="<custom-deployment-arn>",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "text": "<your-prompt-for-custom-model>",
                }
            ]
        }
    ]
)

result = response.get("output")
print(result)

# invoke a deployed custom model using InvokeModel API
request_body = {
    "schemaVersion": "messages-v1",
    "messages": [{"role": "user",
                  "content": [{"text": "<your-prompt-for-custom-model>"}]}],
    "system": [{"text": "<system prompt>"}],
    "inferenceConfig": {"maxTokens": 500,
                        "topP": 0.9,
                        "temperature": 0.0}
}
body = json.dumps(request_body)
response = bedrock_runtime.invoke_model(
    modelId="<custom-deployment-arn>",
    body=body
)

# Extract and print the response text
model_response = json.loads(response["body"].read())
response_text = model_response["output"]["message"]["content"][0]["text"]
print(response_text)

By following these steps, you can deploy and use your customized model through Amazon Bedrock API and instantly use your efficient and high-performing model tailored to your use cases through on-demand deployment.
Best practices and considerations
Successful implementation of on-demand deployment with customized models depends on understanding several operational factors. These considerations—including latency, Regional availability, quota limitations, deployment option selections, and cost management strategies—directly impact your ability to deploy effective solutions while optimizing resource utilization. The following guidelines help you make informed decisions when implementing your inference strategy:

Cold start latency – When using on-demand deployment, you might experience initial cold start latencies, typically lasting several seconds, depending on the model size. This occurs when the deployment hasn’t received recent traffic and needs to reinitialize compute resources.
Regional availability – At launch, custom model deployment will be available in US East (N. Virginia) for Amazon Nova models.
Quota management – Each custom model deployment has specific quotas:

Tokens per minute (TPM)
Requests per minute (RPM)
The number of deployments in Creating status
Total on-demand deployments in a single account

Each deployment operates independently within its assigned quota. If a deployment exceeds its TPM or RPM allocation, incoming requests will be throttled. You can request quota increases by submitting a ticket or contacting your AWS account team.

Choosing between custom model deployment and Provisioned Throughput – You can set up inference on a custom model by either creating a custom model deployment (for on-demand usage) or purchasing Provisioned Throughput. The choice depends on the supported Regions and models for each inference option, throughput requirement, and cost considerations. These two options operate independently and can be used simultaneously for the same custom model.
Cost management – On-demand deployment uses a pay-as-you-go pricing model based on the number of tokens processed during inference. You can use cost allocation tags on your on-demand deployments to track and manage inference costs, allowing better budget tracking and cost optimization through AWS Cost Explorer.
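As one example of the cost management guidance above, you can attach cost allocation tags to an existing on-demand deployment with the Bedrock TagResource API. The following is a minimal sketch that reuses the bedrock_client created earlier; the deployment ARN and tag values are placeholders:

# Tag an on-demand custom model deployment so its inference costs can be tracked in AWS Cost Explorer
bedrock_client.tag_resource(
    resourceARN="<custom-deployment-arn>",
    tags=[
        {"key": "project", "value": "<your-project-name>"},
        {"key": "cost-center", "value": "<your-cost-center>"},
    ]
)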

Cleanup
If you've been testing the on-demand deployment feature and don't plan to continue using it, it's important to clean up your resources to avoid incurring unnecessary costs. Here's how to delete the deployment using the Amazon Bedrock console:

Navigate to your custom model deployment
Select the deployment you want to remove
Delete the deployment

Here’s how to delete using the API or SDK:
To delete a custom model deployment, you can use the DeleteCustomModelDeployment API. The following example demonstrates how to delete your custom model deployment:

# delete deployed custom model deployment
response = bedrock_client.delete_custom_model_deployment(
    customModelDeploymentIdentifier="<custom-deployment-arn>"
)

Conclusion
The introduction of on-demand deployment for customized models on Amazon Bedrock represents a significant advancement in making AI model deployment more accessible, cost-effective, and flexible for businesses of all sizes. On-demand deployment offers the following advantages:

Cost optimization – With pay-as-you-go pricing, you pay only for the compute resources you actually use
Operational simplicity – Automatic resource management eliminates the need for manual infrastructure provisioning
Scalability – Seamless handling of variable workloads without upfront capacity planning
Flexibility – Freedom to choose between on-demand and Provisioned Throughput based on your specific needs

Getting started is straightforward. Begin by completing your model customization through fine-tuning or distillation, then choose on-demand deployment using the AWS Management Console or API. Configure your deployment details, validate model performance in a test environment, and seamlessly integrate into your production workflows.
Start exploring on-demand deployment for customized models on Amazon Bedrock today! Visit the Amazon Bedrock documentation to begin your model customization journey and experience the benefits of flexible, cost-effective AI infrastructure. For hands-on implementation examples, check out our GitHub repository which contains detailed code samples for customizing Amazon Nova models and evaluating them using on-demand custom model deployment.

About the Authors
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Sovik Kumar Nath is an AI/ML and Generative AI senior solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has master's degrees from the University of South Florida and the University of Fribourg, Switzerland, and a bachelor's degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.
Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Koushik Mani is an associate solutions architect at AWS. He worked as a software engineer for two years at Telstra, focusing on machine learning and cloud computing use cases. He completed his masters in computer science from University of Southern California. He is passionate about machine learning and generative AI use cases and building solutions.
Rishabh Agrawal is a Senior Software Engineer working on AI services at AWS. In his spare time, he enjoys hiking, traveling and reading.
Shreeya Sharma is a Senior Technical Product Manager at AWS, where she has been working on leveraging the power of generative AI to deliver innovative and customer-centric products. Shreeya holds a master’s degree from Duke University. Outside of work, she loves traveling, dancing, and singing.

Getting Started with Mirascope: Removing Semantic Duplicates using an …

Mirascope is a powerful and user-friendly library that provides a unified interface for working with a wide range of Large Language Model (LLM) providers, including OpenAI, Anthropic, Mistral, Google (Gemini and Vertex AI), Groq, Cohere, LiteLLM, Azure AI, and Amazon Bedrock. It simplifies everything from text generation and structured data extraction to building complex AI-powered workflows and agent systems.

In this guide, we’ll focus on using Mirascope’s OpenAI integration to identify and remove semantic duplicates (entries that may differ in wording but carry the same meaning) from a list of customer reviews. 

Installing the dependencies

pip install "mirascope[openai]"

OpenAI Key

To get an OpenAI API key, visit https://platform.openai.com/settings/organization/api-keys and generate a new key. If you’re a new user, you may need to add billing details and make a minimum payment of $5 to activate API access.

import os
from getpass import getpass

os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

Defining the list of customer reviews

customer_reviews = [
    "Sound quality is amazing!",
    "Audio is crystal clear and very immersive.",
    "Incredible sound, especially the bass response.",
    "Battery doesn't last as advertised.",
    "Needs charging too often.",
    "Battery drains quickly — not ideal for travel.",
    "Setup was super easy and straightforward.",
    "Very user-friendly, even for my parents.",
    "Simple interface and smooth experience.",
    "Feels cheap and plasticky.",
    "Build quality could be better.",
    "Broke within the first week of use.",
    "People say they can't hear me during calls.",
    "Mic quality is terrible on Zoom meetings.",
    "Great product for the price!"
]

These reviews capture key customer sentiments: praise for sound quality and ease of use, complaints about battery life, build quality, and call/mic issues, along with a positive note on value for money. They reflect common themes found in real user feedback.

Defining a Pydantic Schema

This Pydantic model defines the structure for the response of a semantic deduplication task on customer reviews. This schema helps structure and validate the output of a language model tasked with clustering or deduplicating natural language input (e.g., user feedback, bug reports, product reviews).

from pydantic import BaseModel, Field

class DeduplicatedReviews(BaseModel):
    duplicates: list[list[str]] = Field(
        ..., description="A list of semantically equivalent customer review groups"
    )
    reviews: list[str] = Field(
        ..., description="The deduplicated list of core customer feedback themes"
    )

Defining a Mirascope @openai.call for Semantic Deduplication

This code defines a semantic deduplication function using Mirascope’s @openai.call decorator, which enables seamless integration with OpenAI’s gpt-4o model. The deduplicate_customer_reviews function takes a list of customer reviews and uses a structured prompt—defined by the @prompt_template decorator—to guide the LLM in identifying and grouping semantically similar reviews.

The system message instructs the model to analyze the meaning, tone, and intent behind each review, clustering those that convey the same feedback even if worded differently. The function expects a structured response conforming to the DeduplicatedReviews Pydantic model, which includes two outputs: a list of unique, deduplicated review sentiments, and a list of grouped duplicates.

This design ensures that the LLM’s output is both accurate and machine-readable, making it ideal for customer feedback analysis, survey deduplication, or product review clustering.

from mirascope.core import openai, prompt_template

@openai.call(model="gpt-4o", response_model=DeduplicatedReviews)
@prompt_template(
    """
    SYSTEM:
    You are an AI assistant helping to analyze customer reviews.
    Your task is to group semantically similar reviews together - even if they are worded differently.

    - Use your understanding of meaning, tone, and implication to group duplicates.
    - Return two lists:
      1. A deduplicated list of the key distinct review sentiments.
      2. A list of grouped duplicates that share the same underlying feedback.

    USER:
    {reviews}
    """
)
def deduplicate_customer_reviews(reviews: list[str]): ...

The following code executes the deduplicate_customer_reviews function using a list of customer reviews and prints the structured output. First, it calls the function and stores the result in the response variable. To ensure that the model’s output conforms to the expected format, it uses an assert statement to validate that the response is an instance of the DeduplicatedReviews Pydantic model.

Once validated, it prints the deduplicated results in two sections. The first section, labeled "Distinct Customer Feedback," displays the list of unique review sentiments identified by the model. The second section, "Grouped Duplicates," lists clusters of reviews that were recognized as semantically equivalent.

response = deduplicate_customer_reviews(customer_reviews)

# Ensure response format
assert isinstance(response, DeduplicatedReviews)

# Print Output
print("Distinct Customer Feedback:")
for item in response.reviews:
    print("-", item)

print("\nGrouped Duplicates:")
for group in response.duplicates:
    print("-", group)

The output shows a clean summary of customer feedback by grouping semantically similar reviews. The Distinct Customer Feedback section highlights key insights, while the Grouped Duplicates section captures different phrasings of the same sentiment. This helps eliminate redundancy and makes the feedback easier to analyze.

Check out the full code. All credit for this work goes to the researchers of this project.


Apple Introduces DiffuCoder: A 7B Diffusion LLM Tailored for Code Gene …

Diffusion LLMs as a Paradigm Shift in Code Generation

LLMs have revolutionized natural language processing with impressive results across tasks from dialogue to code generation. Masked diffusion models have emerged as an alternative and have been scaled up into diffusion-based LLMs such as LLaDA and Dream. These models iteratively refine the entire sequence in parallel, allowing global planning of content. The diffusion LLM approach is a good fit for code generation because writing code often involves non-sequential back-and-forth refinement. However, it remains unclear how open-source diffusion LLMs perform on coding tasks. This is because existing post-training efforts show marginal gains or depend on semi-autoregressive decoding, which deviates from the global planning nature of diffusion.

Evolution of Text Diffusion Models and Their Impact on Code Synthesis

Early text diffusion models include mask diffusion models, with recent scaling efforts producing diffusion LLMs like DiffuLLaMA, LLaDA, and Dream. Block diffusion proposes a hybrid approach that applies diffusion within each block. Multimodal models such as LaViDa, MMaDA, and Dimple combine text diffusion models with vision models. In code generation, CodeFusion was the first to combine diffusion models with code generation, but it is limited to small-scale models and simple tasks. Recent commercial-scale diffusion LLMs such as Mercury and Gemini show comparable performance to leading autoregressive code models. However, current RL methods for dLLMs, such as d1 and MMaDA using GRPO, depend on block diffusion decoding during rollout and evaluation.

Apple and HKU Introduce DiffuCoder: A Specialized Diffusion Model for Code

Researchers from Apple and the University of Hong Kong proposed DiffuCoder, a 7B-scale masked diffusion model specialized for code generation and trained on 130B effective tokens, making it a valuable testbed for exploring diffusion-based LLM behaviors and advancing post-training methods. The researchers introduce local and global autoregressive-ness metrics to measure how closely generation follows a left-to-right pattern. The analysis reveals that diffusion LLMs exhibit an entropy sink effect, causing strong causal bias during conditional generation. DiffuCoder becomes more flexible in token generation order as sampling temperature increases from 0.2 to 1.2, freeing itself from strict left-to-right constraints and achieving higher pass@10 accuracy.

A Four-Stage Training Pipeline Leveraging RefineCode and Coupled-GRPO

Researchers adapt their model from Qwen-2.5-Coder as the base model and perform continual pre-training using a 400B-token code pre-training corpus from RefineCode and Stackv2. The training consists of four stages: adaptation pre-training, mid-training with 16B tokens of annealing code data, instruction tuning with 436K SFT samples, and post-training using coupled-GRPO with 21K hard samples from Acecoder-87K. Early stopping is applied in Stage 1 after processing 65B tokens. Stage 2 is trained for 4 epochs, resulting in a total of 65B tokens. The evaluation environments are constructed using three code benchmarks—HumanEval, MBPP, and EvalPlus—along with BigCodeBench. They include both full and hard subsets, covering completion and instruction-based query types.

Benchmark Results: DiffuCoder’s Performance and Optimization Insights

DiffuCoder, trained on 130B code tokens, achieves performance on par with Qwen2.5-Coder and OpenCoder. However, all dLLMs show only marginal improvement over their base models after instruction tuning compared to Qwen2.5-Coder+SFT, which achieves significant improvements from instruction tuning on the same data. Moreover, the coupled-GRPO training shows strong effectiveness, whereas baseline variants such as d1, full-mask completion, and decoupled sampling tend to exhibit unstable reward learning behavior. RL fine-tuning increases the optimal sampling temperature during evaluation from 0.2 to higher values, suggesting that training sharpens the per-token distribution. This reduces the model's reliance on strict autoregressive decoding and enhances its ability to generate tokens in parallel.

Coupled-GRPO and the Future of Diffusion-Based Code Models

In this paper, researchers present DiffuCoder, a 7B-scale open-source diffusion model for code with strong performance, along with its complete training recipe and detailed analysis of dLLMs for code generation. They further introduce coupled-GRPO, an RL algorithm that respects the non-autoregressive nature of dLLMs through a coupled-sampling technique for more accurate likelihood estimation. Coupled-GRPO improves DiffuCoder’s performance, showing the effectiveness of RL methods aligned with diffusion principles. This work offers the community a deeper insight into dLLMs and establishes a solid foundation for future research into their applications in complex reasoning and generative tasks.

Check out the Paper and Codes. All credit for this research goes to the researchers of this project.


NVIDIA Just Released Audio Flamingo 3: An Open-Source Model Advancing …

Heard about Artificial General Intelligence (AGI)? Meet its auditory counterpart—Audio General Intelligence. With Audio Flamingo 3 (AF3), NVIDIA introduces a major leap in how machines understand and reason about sound. While past models could transcribe speech or classify audio clips, they lacked the ability to interpret audio in a context-rich, human-like way—across speech, ambient sound, and music, and over extended durations. AF3 changes that.

With Audio Flamingo 3, NVIDIA introduces a fully open-source large audio-language model (LALM) that not only hears but also understands and reasons. Built on a five-stage curriculum and powered by the AF-Whisper encoder, AF3 supports long audio inputs (up to 10 minutes), multi-turn multi-audio chat, on-demand thinking, and even voice-to-voice interactions. This sets a new bar for how AI systems interact with sound, bringing us a step closer to AGI.

The Core Innovations Behind Audio Flamingo 3

AF-Whisper: A Unified Audio Encoder
AF3 uses AF-Whisper, a novel encoder adapted from Whisper-v3. It processes speech, ambient sounds, and music using the same architecture—solving a major limitation of earlier LALMs which used separate encoders, leading to inconsistencies. AF-Whisper leverages audio-caption datasets, synthesized metadata, and a dense 1280-dimension embedding space to align with text representations.

Chain-of-Thought for Audio: On-Demand Reasoning
Unlike static QA systems, AF3 is equipped with 'thinking' capabilities. Using the AF-Think dataset (250k examples), the model can perform chain-of-thought reasoning when prompted, enabling it to explain its inference steps before arriving at an answer—a key step toward transparent audio AI.

Multi-Turn, Multi-Audio Conversations
Through the AF-Chat dataset (75k dialogues), AF3 can hold contextual conversations involving multiple audio inputs across turns. This mimics real-world interactions, where humans refer back to previous audio cues. It also introduces voice-to-voice conversations using a streaming text-to-speech module.

Long Audio Reasoning
AF3 is the first fully open model capable of reasoning over audio inputs up to 10 minutes. Trained with LongAudio-XL (1.25M examples), the model supports tasks like meeting summarization, podcast understanding, sarcasm detection, and temporal grounding.

State-of-the-Art Benchmarks and Real-World Capability

AF3 surpasses both open and closed models on over 20 benchmarks, including:

MMAU (avg): 73.14% (+2.14% over Qwen2.5-O)

LongAudioBench: 68.6 (GPT-4o evaluation), beating Gemini 2.5 Pro

LibriSpeech (ASR): 1.57% WER, outperforming Phi-4-mm

ClothoAQA: 91.1% (vs. 89.2% from Qwen2.5-O)

These improvements aren’t just marginal; they redefine what’s expected from audio-language systems. AF3 also introduces benchmarking in voice chat and speech generation, achieving 5.94s generation latency (vs. 14.62s for Qwen2.5) and better similarity scores.

The Data Pipeline: Datasets That Teach Audio Reasoning

NVIDIA didn’t just scale compute—they rethought the data:

AudioSkills-XL: 8M examples combining ambient, music, and speech reasoning.

LongAudio-XL: Covers long-form speech from audiobooks, podcasts, meetings.

AF-Think: Promotes short CoT-style inference.

AF-Chat: Designed for multi-turn, multi-audio conversations.

Each dataset is fully open-sourced, along with training code and recipes, enabling reproducibility and future research.

Open Source

AF3 is not just a model drop. NVIDIA released:

Model weights

Training recipes

Inference code

Four open datasets

This transparency makes AF3 the most accessible state-of-the-art audio-language model. It opens new research directions in auditory reasoning, low-latency audio agents, music comprehension, and multi-modal interaction.

Conclusion: Toward General Audio Intelligence

Audio Flamingo 3 demonstrates that deep audio understanding is not just possible but reproducible and open. By combining scale, novel training strategies, and diverse data, NVIDIA delivers a model that listens, understands, and reasons in ways previous LALMs could not.

Check out the Paper, Codes and Model on Hugging Face. All credit for this research goes to the researchers of this project.


Accenture scales video analysis with Amazon Nova and Amazon Bedrock Ag …

This post was written with Ilan Geller, Kamal Mannar, Debasmita Ghosh, and Nakul Aggarwal of Accenture.
Video highlights offer a powerful way to boost audience engagement and extend content value for content publishers. These short, high-impact clips capture key moments that drive viewer retention, amplify reach across social media, reinforce brand identity, and open new avenues for monetization. However, traditional highlight creation workflows are slow and labor-intensive. Editors must manually review footage, identify significant moments, cut clips, and add transitions or narration—followed by manual quality checks and formatting for distribution. Although this provides editorial control, it creates bottlenecks that don’t scale efficiently.
This post showcases how Accenture Spotlight delivers a scalable, cost-effective video highlight generation solution using Amazon Nova and Amazon Bedrock Agents. Amazon Nova foundation models (FMs) deliver frontier intelligence and industry-leading price-performance. With Spotlight, content owners can configure AI models and agents to support diverse use cases across the media industry while offering a human-in-the-loop option for quality assurance and collaborative refinement. This maintains accuracy, editorial oversight, and alignment with brand guidelines—without compromising on speed or scalability.
Real-world use cases
Spotlight has been applied across a range of industry scenarios, including:

Personalized short-form video generation – Spotlight’s specialized agents analyze popular short-form content (such as video reels and other social media) to identify patterns of high-performing content. The agents then apply this understanding to long-form video to generate personalized short clips, with built-in checks for brand alignment and content standards.
Sports editing and highlights – Spotlight automates creation of video highlights for sports like soccer, Formula 1, and rugby, tailoring them to specific user preferences and interests. It also validates each highlight’s quality and accuracy, streamlining editorial workflows as a result.
Content matching for stakeholders – Using enriched metadata, Spotlight matches archived or live video content to audience demographics, optimizing distribution strategies and maximizing advertiser value through precise targeting.
Real-time retail offer generation – In retail environments such as gas stations, Spotlight processes live CCTV footage to infer customer profiles using data (such as vehicle type or transaction history), and then dynamically generates personalized product offers. These offers consider contextual factors such as time of day and weather, and they are delivered with custom visuals in near real time.

Spotlight’s architecture
Spotlight’s architecture addresses the challenge of scalable video processing, efficiently analyzing and generating content while maintaining speed and quality. It incorporates both task-specific models and Amazon Nova FMs that are orchestrated by specialized Amazon Bedrock agents. Key architectural highlights include:

Task-driven model selection – Spotlight dynamically selects between traditional AI models and Amazon Nova FMs based on a given task’s complexity and latency requirements. This intelligent orchestration enables fast inference for time-sensitive operations while deploying deeper multimodal reasoning where sophisticated analysis is needed—balancing speed and intelligence across applications from real-time retail offers to complex video processing.
Agent orchestration – Specialized agents, each purpose-built for specific analysis tasks, operate across the end-to-end workflow under the direction of a central orchestrator agent. The orchestrator agent manages task breakdown, data flow, and inter-agent communication.
Scalable and adaptable – By using AWS capabilities, Spotlight’s architecture is configurable to support different workloads—from high-throughput video highlight generation to low-latency offer personalization at the edge.

Spotlight uses a multi-layered agent workflow to automate video processing and generation while maintaining quality control. For example, to generate dynamic video highlights, Spotlight uses three specialized “super agents” that work in coordination under a central orchestrator agent’s supervision. Each super agent is powered by Amazon Nova models, and is supported by a collection of utility agents (see the following diagram). These agents work together to understand video content, generate high-quality highlights, and maintain alignment with user requirements and brand standards.

The workflow consists of the following super agents and utility agents:

Video processing agent – This agent analyzes long-form video and generates detailed metadata to guide short-form video creation. It uses the following utility agents:

Research agent – Analyzes popular short-form videos to identify key components that create video virality, and creates recipes for successful short-form content. For example, in music videos, it can highlight choreographed dance sequences with the lead performer as essential segments and a recipe based on this insight.
Visual analysis agent – Applies the research agent’s findings to new long-form content. It identifies matching segments, tags key individuals, and timestamps relevant moments. It uses traditional AI models (such as person recognition and tracking) to capture fine-grained details for segment identification.
Audio analysis agent – Performs speech diarization and transcription to support both the research and visual analysis agents with deeper context from the video’s audio track.

Short video generation agent – This agent orchestrates the actual creation of the short-form video by integrating relevant segments and refining the sequence. Its utility agents include:

Section of interest (SOI) agent – Identifies potential segments based on video genre, target length, featured performers, and JSON metadata from the visual analysis agent. This agent prioritizes logical flow and viewer engagement.
Video generation agent – Constructs video using segment recommendations and component patterns from the video processing agent. For example, influencer videos might follow a structure of an attention-grabbing hook, key messages, and a call to action. The process will be iteratively improved based on feedback from the reviewer agent.
Video postprocessing agent – Refines the final output for publishing by performing tasks like cropping to mobile-friendly aspect ratios, or adding subtitles, background music, and brand overlays.

Reviewer agent – This agent works iteratively with the generation agent to maintain video quality and relevance. Its utility agents include:

Relevance check agent – Evaluates alignment with user-defined content guidelines, audience expectations, and desired themes.
Abruptness check agent – Provides smooth transitions between segments to avoid jarring cuts, enhancing viewer experience and professionalism.

See Spotlight in action:

Solution overview
To interact with Spotlight, users access a frontend UI where they provide natural language input to specify their objective. Spotlight then employs its agentic workflow powered by Amazon Nova to achieve its given task. The following diagram illustrates the solution architecture for video highlight generation.

The workflow consists of the following key components (as numbered in the preceding diagram):

Frontend UI for user interaction:

Users interact through a web portal secured by Amazon Cognito authentication and delivered using Amazon CloudFront.
Amazon API Gateway provides a RESTful endpoint for video processing and highlight generation services.

Live video stream processing:

AWS Elemental MediaLive processes the incoming video stream and triggers AWS Lambda to initiate workflows. (Spotlight also accepts video archive content as media files for processing and highlight generation.)

Video processing workflow orchestrated with AWS Step Functions:

Open source models hosted on Amazon SageMaker enable speech analysis and computer vision for person and object detection.
The video processing agent powered by Amazon Nova Pro analyzes video and generates fine-grained metadata (for example, identifying patterns from viral videos).
The reviewer agent powered by Amazon Nova Premier maintains alignment with brand standards.
Open source utility tooling is used for pre-analysis tasks.

Highlight generation workflow orchestrated with Step Functions:

Amazon Nova Pro analyzes the user query for clips of interest to understand intent, and reformulates the query for downstream processing.
The short video generation agent powered by Amazon Nova Pro constructs a video highlight using segment recommendations.
The reviewer agent powered by Amazon Nova Premier makes sure the constructed highlight aligns with quality, brand, and contextual expectations.
AWS Elemental MediaConvert and open source tooling enable video highlight construction and postprocessing (such as subtitle overlay, aspect ratio changes, and transitions).

Storage and monitoring:

Amazon Simple Storage Service (Amazon S3) stores metadata extracted from processing workflows, reference content (such as scripts and brand guidelines), and generated outputs.
Amazon CloudWatch monitors end-to-end system health and performance.
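
To make the hand-off from ingestion to orchestration concrete, the following is a minimal, hypothetical sketch of the Lambda handler referenced in the live video stream processing step: it receives a MediaLive-style event and starts a Step Functions execution for the video processing workflow. The state machine ARN, environment variable, and event fields are illustrative assumptions, not details from the Spotlight implementation.

# Hypothetical Lambda handler: start the video processing workflow on a new media event.
import json
import os
import uuid

import boto3

sfn = boto3.client("stepfunctions")
# Assumed environment variable; the actual state machine name is not specified in this post.
STATE_MACHINE_ARN = os.environ["VIDEO_PROCESSING_STATE_MACHINE_ARN"]

def handler(event, context):
    # The event shape below is an assumption for illustration purposes.
    source_uri = event.get("detail", {}).get("media_uri", "s3://example-bucket/input/segment.ts")
    job_id = str(uuid.uuid4())
    response = sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        name=f"video-processing-{job_id}",
        input=json.dumps({"sourceUri": source_uri, "jobId": job_id}),
    )
    return {"executionArn": response["executionArn"]}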

Key benefits
Spotlight’s approach to video processing and generation creates value for content owners. Its technical design, built on Amazon Nova and an integrated agentic workflow, helps them realize efficiency gains in their video processing and editorial operations. Key benefits of Spotlight include:

Cross-industry application – Spotlight’s modular design allows it to be applied seamlessly across industries—from media and entertainment to retail
Real-time processing – It supports both live stream feeds and pre-recorded video, generating custom highlights in minutes rather than hours or days
Cost-efficient deployment – It is entirely serverless and on-demand, minimizing idle infrastructure costs and maximizing utilization
Efficiency – Accenture’s cost review showed that Amazon Nova-powered agents reduce highlight creation costs by more than 10 times compared with traditional methods

The following table provides a comparative analysis of Spotlight’s video processing approach against conventional approaches to video highlight creation.

| Metric | Spotlight Performance | Conventional Approach |
| --- | --- | --- |
| Video Processing Latency | Minutes for 2–3-hour sessions | Hours to days |
| Highlight Review Cost (3–5 minutes) | 10 times lower with Amazon Nova | High cost using conventional approaches |
| Overall Highlight Generation Cost | 10 times lower using serverless and on-demand LLM deployment | Manual workflows with high operational overhead |
| Deployment Architecture | Fully serverless with scalable LLM invocation | Typically resource-heavy and statically provisioned |
| Use Case Flexibility | Sports, media editing, retail personalization, and more | Often tailored to a single use case |

Conclusion
Spotlight represents a cutting-edge agentic solution designed to tackle complex media processing and customer personalization challenges using generative AI. With modular, multi-agent workflows built on Amazon Nova, Spotlight seamlessly enables dynamic short-form video generation. The solution’s core framework is also extensible to diverse industry use cases that require multimodal content analysis at scale.
As an AWS Premier Tier Services Partner and Managed Services Provider (MSP), Accenture brings deep cloud and industry expertise. Accenture and AWS have worked together for more than a decade to help organizations realize value from their applications and data. Accenture brings its industry understanding and generative AI specialists to build and adapt generative AI solutions to client needs. Together with AWS, through the Accenture AWS Business Group (AABG), we help enterprises unlock business value by rapidly scaling generative AI solutions tailored to their needs—driving innovation and transformation in the cloud.
Try out Spotlight for your own use case, and share your feedback in the comments.

About the authors
Ilan Geller is a Managing Director in the Data and AI practice at Accenture. He is the Global AWS Partner Lead for Data and AI and the Center for Advanced AI. His roles at Accenture have primarily been focused on the design, development, and delivery of complex data, AI/ML, and most recently Generative AI solutions.
Dr. Kamal Mannar is a Global Computer Vision Lead at Accenture’s Center for Advanced AI, with over 20 years of experience applying AI across industries like agriculture, healthcare, energy, and telecom. He has led large-scale AI transformations, built scalable GenAI and computer vision solutions, and holds 10+ patents in areas including deep learning, wearable AI, and vision transformers. Previously, he headed AI at Vulcan AI, driving cutting-edge innovation in precision agriculture. Kamal holds a Ph.D. in Industrial & Systems Engineering from the University of Wisconsin–Madison.
Debasmita Ghosh is an Associate Director at Accenture with 21 years of experience in information technology (8 years in AI/generative AI), and among her responsibilities she currently leads the Computer Vision practice in India. She has presented her work on handwritten text recognition at multiple conferences, including MCPR 2020 and GHCI 2020, holds a granted patent on a handwritten text recognition solution, and was recognized as an inventor under the Accenture Inventor Award Program. She has had multiple papers on computer vision solutions, such as table extraction (including non-uniform and borderless tables), accepted and presented at the ComPE 2021 and CCVPR 2021 international conferences. She has managed projects across multiple technologies (Oracle Apps, SAP) and, as a programmer, worked through various phases of the SDLC with experience in Oracle Apps development across CRM, Procurement, Receivables, and SCM, as well as SAP Professional Services and SAP CRM. Debasmita holds an M.Sc. in Statistics from Calcutta University.
Nakul Aggarwal is a Subject Matter Expert in Computer Vision and Generative AI at Accenture, with around 7 years of experience in developing and delivering cutting-edge solutions across computer vision, multimodal AI, and agentic systems. He holds a Master’s degree from the Indian Institute of Technology (IIT) Delhi and has authored several research papers presented at international conferences. He holds two patents in AI and currently leads multiple projects focused on multimodal and agentic AI. Beyond technical delivery, he plays a key role in mentoring teams and driving innovation by bridging advanced research with real-world enterprise applications.
Aramide Kehinde is Global Partner Solutions Architect for Amazon Nova at AWS. She works with high-growth companies to build and deliver forward-thinking technology solutions using AWS generative AI. Her experience spans multiple industries, including Media & Entertainment, Financial Services, and Healthcare. Aramide enjoys building at the intersection of AI and the creative arts and spending time with her family.
Rajdeep Banerjee is a Senior Partner Solutions Architect at AWS, helping strategic partners and clients in their AWS cloud migration and digital transformation journeys. Rajdeep focuses on working with partners to provide technical guidance on AWS, collaborating with them to understand their technical requirements and designing solutions to meet their specific needs. He is a member of the Serverless technical field community. Rajdeep is based out of Richmond, Virginia.

Deploy conversational agents with Vonage and Amazon Nova Sonic

This post is co-written with Mark Berkeland, Oscar Rodriguez and Marina Gerzon from Vonage.
Voice-based technologies are transforming the way businesses engage with customers across customer support, virtual assistants, and intelligent agents. However, creating real-time, expressive, and highly responsive voice interfaces still requires navigating a complex stack of communication protocols, AI models, and media infrastructure. To simplify this process, Vonage has integrated Amazon Nova Sonic, our speech-to-speech foundation model (FM), with the Vonage Voice API, part of their Communications Platform as a Service (CPaaS) offering.
With this integration, developers can deploy AI voice agents to enable more human-like voice conversations over phone calls, SIP connections, WebRTC, and mobile apps. The solution makes it straightforward to bring intelligent, real-time conversations into workflows for a variety of use cases, such as a small auto repair shop using voice AI to book appointments and track down parts, a global retail brand handling a high volume of customer service calls, or a developer building a scalable voice interface.
In this post, we explore how developers can integrate Amazon Nova Sonic with the Vonage communications service to build responsive, natural-sounding voice experiences in real time. By combining the Vonage Voice API with the low-latency and expressive speech capabilities of Amazon Nova Sonic, businesses can deploy AI voice agents that deliver more human-like interactions than traditional voice interfaces. These agents can be used for customer support, as virtual assistants, and more.
Amazon Nova Sonic for real-time conversational AI
Amazon Nova Sonic is a speech-to-speech FM designed to build real-time conversational AI applications in Amazon Bedrock, with industry-leading price-performance and low latency. Its architecture unifies speech understanding and generation into a single model, to enable more human-like voice conversations in AI applications. The model can understand speech in different speaking styles and generate speech in expressive voices, including both masculine-sounding and feminine-sounding voices. Amazon Nova Sonic can adapt the intonation, prosody, and style of the generated speech response to align with the context and content of the speech input and gracefully handle interruptions. Additionally, Amazon Nova Sonic allows for function calling and knowledge grounding with enterprise data using Retrieval Augmented Generation (RAG).
Vonage Voice APIs, powered by AI
Vonage, an AWS partner, provides a developer-friendly platform for building voice, messaging, video, and authentication experiences. With its wide-ranging Voice APIs, Vonage offers WebRTC support, multi-channel communication tools, standard phone call integrations, in-app softphones, front-ending contact centers, and voice-over-browser functionality. The software also offers essential building blocks such as inbound and outbound voice call handling, voicemail support, and programmable logic for call routing and queuing. Vonage’s solution builder and SDKs allow for fast, low-code integration, while its interoperability with business applications and productivity tools enables teams to embed communication directly into their existing workflows.
Solution overview
Vonage worked with the Amazon Nova Sonic team to build low-latency, voice-first applications that can understand and respond like a human agent over standard telephony or WebRTC channels. This new tool can connect inbound and outbound Vonage calls directly to Amazon Nova Sonic for conversational AI processing, using expressive, real-time speech synthesis to deliver fluid, natural interactions. The Amazon Nova Sonic integration in the Vonage Voice API seamlessly manages audio buffering, custom media infrastructure, and protocol translation, so teams can focus on building engaging experiences.
With built-in conversation control logic and noise cancellation, Vonage’s integration with Amazon Nova Sonic makes it straightforward for businesses to rapidly build and deploy responsive AI voice agents. These agents can handle real-time voice conversations and scale voice interactions without relying on traditional contact centers.
Vonage is making this integration available as a GitHub repository for developers to deploy and customize to their needs.
“As an AWS Amazon Partner Network (APN) member, Vonage has a long history of working closely with the AWS innovation team to create new solutions to benefit enterprise customers,” said Christophe Van de Weyer, President and Head of Business Unit API for Vonage. “This latest collaboration with AWS enables organizations to transform how they engage with customers by adopting generative AI solutions that create added value for internal and external communication. By combining Vonage’s communications APIs with AWS’s advanced AI, this new voice AI agent technology enables businesses to streamline the adoption of intelligent agents, accelerate the modernization of legacy voice systems, and provide a robust service to deliver exceptional customer experiences with measurable improvements in satisfaction and operational efficiency.”
The following video showcases a demo of Diana, an AI voice agent built using Vonage’s integration with Amazon Nova Sonic.

The following architecture diagram provides an overview of Amazon Nova Sonic deployed as a voice agent in the Vonage Voice API framework on AWS.

The solution routes different types of incoming calls to Amazon Nova Sonic over a WebSocket connection. The architectural components include (left to right):

Calls – Incoming voice connections that can come from global phone numbers, SIP connections with contact centers or business systems, or WebRTC connections from web browsers and mobile apps.
Vonage Voice API – Provides programmatic control over these types of calls and voice connections, allowing them to be integrated with AI systems, routed elsewhere, or given speech and other treatments. Because Amazon Nova Sonic is a full speech-to-speech AI service, the real-time voice streams are connected directly, unlike other AI integrations that might use text-based integration.
Amazon Nova Sonic connector – A Vonage integration that connects calls to Amazon Nova Sonic over a WebSocket connection, providing low-latency, real-time, bi-directional voice streaming directly with Amazon Nova Sonic. The connector also manages voice isolation to better handle noisy environments, conversational elements like “barge in” where the caller interrupts the conversation, and fallback options if needed.
Amazon Nova Sonic – Part of the Amazon Nova family of FMs available in Amazon Bedrock. Amazon Nova Sonic unifies speech understanding and generation into a single model, streamlining development and reducing complexity when building conversational applications.
Retrieval Augmented Generation (RAG) – Tools within Amazon Bedrock that optimize the output of an underlying large language model (LLM). Amazon Nova Sonic can reference enterprise-authorized knowledge sources. Attribution and source visibility can be configured based on customer requirements.
Customizable prompt – Provided to the AI model and allows the voice agent’s personality and conversational capabilities to be defined and the right knowledge base to be used.
User context – Maintained by Amazon Nova Sonic throughout interaction sequences to allow a natural continuous conversation. Personally identifiable information (PII) is processed in real time and not retained by Amazon Nova Sonic. AWS safeguards your data through comprehensive security controls, encryption at rest and in transit, and compliance certifications, while also giving you the flexibility to configure additional logging, security, and compliance measures through AWS services.

These components work together to create a flexible, intelligent voice agent service that can dynamically adapt to different communication scenarios and business use cases with different knowledge bases and prompts.
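
To illustrate how a call might reach the connector, the following is a minimal sketch of a Vonage Voice API answer webhook that returns an NCCO connecting the inbound call to a WebSocket endpoint, which would in turn bridge audio to Amazon Nova Sonic. The WebSocket URI, audio format, header metadata, and use of Flask are assumptions for illustration only; refer to Vonage’s GitHub repository for the actual integration.

# Minimal sketch (assumptions noted above): answer webhook that connects a call to a WebSocket bridge.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical WebSocket bridge that relays audio to and from Amazon Nova Sonic.
AGENT_WEBSOCKET_URI = "wss://example.com/nova-sonic-bridge"

@app.route("/webhooks/answer", methods=["GET"])
def answer_call():
    caller = request.args.get("from", "unknown")
    ncco = [
        {
            "action": "connect",
            "endpoint": [
                {
                    "type": "websocket",
                    "uri": AGENT_WEBSOCKET_URI,
                    # 16 kHz linear PCM is a common choice for real-time speech models.
                    "content-type": "audio/l16;rate=16000",
                    # Arbitrary metadata passed to the bridge, for example caller context.
                    "headers": {"caller": caller},
                }
            ],
        }
    ]
    return jsonify(ncco)

if __name__ == "__main__":
    app.run(port=3000)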
Example use cases
The following are just a few of the high-impact ways businesses are already using this integration to transform voice interactions:

Customer support automation – Deploy voice agents that answer inbound customer queries, take appointments, and escalate calls only when necessary.
Proactive outbound calling – Generate dynamic, expressive outbound messages like reminders, confirmations, or follow-ups with voicemail fallback.
Multilingual voice assistants – Build voice experiences that seamlessly switch between English and Spanish depending on the caller, enabled by Vonage’s language detection and multilingual synthesis with Amazon Nova Sonic.

Conclusion
By combining Amazon Nova Sonic with Vonage’s flexible communication infrastructure, developers can build intelligent, responsive AI voice agents. With this solution, you can provide proactive voice engagement, create multilingual assistants, handle customer support, and more. This integration makes voice-first AI applications more accessible and scalable than ever.
To start building with Amazon Nova Sonic, visit the Amazon Bedrock console. For Vonage integration, explore the Vonage API Developer Portal or use the Vonage Solution Builder to configure your voice agent in minutes.
To learn more about Amazon Nova Sonic, check out the AWS News Blog, Amazon Nova Sonic product page, or Amazon Bedrock User Guide.

About the authors

Divyesha Malhotra is a Senior Product Manager Technical Intern on the AGI Nova Sonic team. She leads the customer adoption and integrations of cutting-edge speech-to-speech foundation models for next-generation voice-based technologies.
Mark Berkeland is a Senior Solutions Engineer in the API Business Unit at Vonage. He designs and implements technical solutions including demos and proofs of concept to help customers bring voice and messaging applications to life. With a professional programming career that began in 1979, his experience ranges from FORTRAN on punched cards to modern cloud-native stacks like React Native, combining deep technical expertise with a passion for making complex ideas accessible.
Oscar Rodriguez is Senior Director of Global Partner Solutions in the API Business Unit at Vonage, where he leads strategic initiatives to empower partners through scalable communications solutions. He brings deep technical expertise and a practical understanding of real-world application development, with over 20 years of experience in web technologies and the last 10 in CPaaS.
Marina Gerzon is a Partner Solutions Architect at Vonage with over 20 years of experience in real-time communications, specializing in Video and Voice over IP solutions. Known for her ability to bridge technical depth with business impact, her work spans Telecom, Education, Healthcare, Fintech, and Insurance industries, where she has consistently delivered enterprise-grade SaaS and PaaS architectures tailored to complex business needs.

Enabling customers to deliver production-ready AI agents at scale

AI agents will change how we all work and live. Our AWS CEO, Matt Garman, shared a vision of a technological shift as transformative as the advent of the internet. I’m energized by this vision because I’ve witnessed firsthand how these intelligent agent systems are already beginning to solve complex problems, automate workflows, and create new possibilities across industries. With agentic AI, AstraZeneca accelerated healthcare insight discovery, Yahoo Finance transformed financial research for millions of investors, and Syngenta revolutionized agriculture with AI-driven precision farming.
To expand these early successes into widespread adoption, organizations need a practical approach that addresses the inherent complexity of agentic systems. At AWS, we’re committed to being the best place to build the world’s most useful AI agents, empowering organizations to deploy reliable and secure agents at scale.
We’re focused on making our agentic AI vision accessible to every organization by combining rapid innovation with a strong foundation of security, reliability, and operational excellence. Our approach accelerates progress by building on proven principles while embracing new possibilities—creating systems that can adapt as models evolve, new capabilities emerge, and use cases expand across your business.
Today, I’m excited to share how we’re bringing this vision to life with new capabilities that address the fundamental aspects of building and deploying agents at scale. These innovations will help you move beyond experiments to production-ready agent systems that can be trusted with your most critical business processes.

A comprehensive foundation for building and deploying production-ready agentic AI systems at scale.

Guiding principles, evolved for agents
At AWS, our approach to agentic AI is shaped by our experience building agent systems internally and helping hundreds of thousands of customers accelerate their AI journeys. Four core principles guide everything we do in this space:
Principle 1: Embrace agility as a competitive edge
Organizations that thrive won’t be those who perfectly predict the future, but those who adapt quickly as it unfolds. Staying nimble requires an agentic architecture that embraces flexibility and openness rather than rigid frameworks or singular models. It means building systems that can incorporate new models as they emerge, connect to your proprietary data sources, and seamlessly integrate with your existing tools.
The dual need for stability and adaptability led us to create Amazon Bedrock AgentCore, a complete set of services for deploying and operating highly capable agents securely at enterprise scale. AgentCore provides a secure, serverless runtime with complete session isolation and the longest running workload available today, tools and capabilities to help agents execute workflows with the right permissions and context, and controls to operate trustworthy agents. Its capabilities can be used together or independently and work with popular open source frameworks such as CrewAI, LangGraph, LlamaIndex, and Strands Agents and with any model including those in (or outside of) Amazon Bedrock, so developers can stay agile as technology shifts. By reducing the undifferentiated heavy lifting, AgentCore helps organizations move beyond experiments to production-ready agent systems that can be trusted with your most critical business processes.
Customers like Itaú Unibanco, Innovaccer, Boomi, Box, and Epsilon are already experimenting with AgentCore and are excited about how it speeds their deployment of agents to production. These early adopters recognize that AgentCore helps eliminate the trade-off between open source flexibility and enterprise-grade security and reliability, allowing them to focus on creating business value rather than building security and operational foundations from scratch.
Principle 2: Evolve fundamentals for the agentic era
While the core principles of enterprise technology haven’t changed, how we implement them must evolve for the agentic era. These evolved fundamentals create the foundation that makes production-grade agents possible:

Security and Trust. Agents introduce new security considerations as they cross system boundaries, perform actions on behalf of users or act themselves with pre-authorized user consent. Trust requires transparency, guardrails, and verification. AgentCore Runtime helps address these with dedicated compute environments per session and memory isolation that helps prevent data leaks across agents, building on a decade of AWS Lambda serverless innovation in security and scalability.
Reliability and Scalability. Traditional approaches to scaling software fall short with agentic systems as they follow unpredictable execution paths and have variable resource requirements across interactions. AgentCore Runtime is highly reliable with checkpointing and recovery capabilities to help ensure graceful recovery in case of unexpected interruptions and failures, and it can automatically handle scaling from zero to thousands of concurrent sessions, eliminating capacity planning and infrastructure maintenance.
Identity. As agents act on behalf of users and systems, traditional identity models must evolve. Managing permissions of both the agent and the user as agents navigate complex workflows spanning multiple systems becomes critical to securing your data. AgentCore Identity delivers secure agent access across AWS services and third-party applications and tools with temporary, fine-grained permissions, and standards-based authentication. It works with leading identity providers such as Amazon Cognito, Microsoft Entra ID, and Okta, as well as popular OAuth providers such as GitHub, Google, Salesforce, and Slack.
Observability. Understanding agent decisions requires new approaches to monitoring. Observability becomes essential not just for troubleshooting, but for compliance and continuous improvement, representing a shift from periodic auditing to constant supervision. AgentCore Observability provides real-time visibility through built-in dashboards and standardized telemetry that integrates with your monitoring stack.
Data. Your proprietary data is more valuable than ever, enabling agents to understand your specific context. The ability to securely access, process, and learn from this data becomes a critical differentiator for agent performance and relevance. For example, with AgentCore Gateway, you can transform your data sources including Amazon Bedrock Knowledge Bases into agent-compatible tools so agents can access recent and relevant information.
Seamless Integration. Agents must work with everything in your environment: your systems, other clouds, SaaS applications, and other agents. AgentCore Gateway makes it possible by transforming APIs and services into agent-compatible tools with minimal code, eliminating months of integration work while enabling agents to discover and interact with your systems. Our open source Strands Agents SDK complements this with flexible orchestration patterns, and support for MCP and A2A to enable seamless coordination between multiple agents and tools across different environments. AWS API MCP Server gives agents a callable interface to AWS services, enabling foundation models to discover available operations, reason over input and output requirements, and generate plans that invoke AWS APIs to explore, configure, or manage resources with real-time AWS capabilities beyond model training cutoff.
Tooling and Capabilities. Agents need specialized tools to execute complex tasks and maintain context across interactions. AgentCore Memory makes it easy for developers to build context-aware agents by eliminating complex memory infrastructure management while providing full control over what the AI agent remembers. It provides industry-leading accuracy along with support for both short-term memory for multi-turn conversations and long-term memory that persists across sessions, with the ability to share memory stores across collaborating agents. Built-in tools include AgentCore Browser for web interactions, enabling agents to navigate websites and perform actions on your behalf, and AgentCore Code Interpreter for executing code securely, allowing agents to process data, generate visualizations, and solve complex problems programmatically. These capabilities extend what agents can do while maintaining security and reliability.

Together, these evolved fundamentals help organizations build secure, reliable, and scalable agent architectures that deliver consistent results in production environments. With AgentCore, we’re helping customers focus on creating value rather than reinventing infrastructure.
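
Because the post points to the open source Strands Agents SDK as a complement to these capabilities, here is a minimal, hedged sketch of what defining and invoking an agent with one custom tool might look like with that SDK. The import path, tool decorator, and default model behavior are assumptions based on the SDK’s public examples rather than a definitive reference.

# Minimal sketch (assumptions noted above) of a Strands Agents SDK agent with one custom tool.
from strands import Agent, tool

@tool
def lookup_order_status(order_id: str) -> str:
    """Return the status of an order (stubbed for illustration)."""
    # A real tool would call an internal API or database here.
    return f"Order {order_id} is out for delivery."

agent = Agent(
    system_prompt="You are a concise customer support assistant.",
    tools=[lookup_order_status],
)

# Invoking the agent runs the model loop and calls the tool when needed.
print(agent("What is the status of order 12345?"))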
Principle 3: Deliver superior outcomes with model choice and data
At the heart of every effective agent system lies its foundation model, which powers an agent’s ability to understand, reason, and act. For agents to deliver transformative experiences, carefully selected and potentially tailored models need to interact with rich, context-specific knowledge that determines how effectively the model can make decisions on your behalf. This reality extends to all AI applications, which is why AWS gives customers both the freedom to choose the optimal model for each use case and the tools to enhance those models with their unique data. This approach delivers superior outcomes and the best price-performance for all AI implementations.
Model requirements vary widely—some applications demand sophisticated reasoning, others require fast responses, and many prioritize cost efficiency at scale. No single model excels across all dimensions, which is why we pioneered model choice with Amazon Bedrock in 2023. But the true differentiator is how you combine models with your organization’s proprietary data, transforming generic AI into systems with deep domain expertise.
To help you create models with this high level of expertise, today we’re expanding our model customization capabilities with the launch of Amazon Nova customization in Amazon SageMaker AI. Nova models now offer customers the flexibility to customize the model across the model development life cycle. This includes pre-training and post-training, including both fine-tuning and alignment, with support for parameter efficient fine-tuning (PEFT) and full fine-tuning. With these, Nova now offers the most comprehensive suite of model customization capabilities made available for any proprietary model family. Using techniques including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), reinforcement learning from human feedback using Proximal Policy Optimization (PPO), Continued Pre-Training (CPT), and Knowledge Distillation, customers can create Nova models optimized for their use-case. Once customized, these models can be deployed directly to Amazon Bedrock, allowing you to seamlessly integrate your custom models into your agent systems and other AI applications.
We are also training our own models optimized for specific agent use cases. Nova Act is an AI model trained to perform actions within a web browser. Customers can get started with building their own browser automation agents with the Nova Act SDK, purpose-built to enable reliable browser agents powered by the Nova Act model. The Nova Act SDK, available in research preview today, uses AgentCore Browser for scalable, cloud-based browser execution.
Once you have the right model, you need to ensure it can interact with your organization’s proprietary and current data. Vectors have emerged as the dominant and fastest way AI models can access your data. Until now, the cost of storing vector embeddings—the key to enabling this intelligence—has forced organizations to limit their AI systems to recent data only, constraining their potential. Today’s launch of Amazon S3 Vectors, the first cloud object store with native vector support, marks a fundamental change. By reducing vector storage costs by 90% while maintaining sub-second query performance, S3 Vectors enables agents that remember more, reason deeper, and maintain comprehensive context from every customer interaction, document, and business insight. S3 Vectors integrates directly with Amazon Bedrock Knowledge Bases for cost-effective RAG applications and Amazon OpenSearch Service for tiered vector strategies.
Principle 4: Deploy solutions that transform experiences
While models and infrastructure change how organizations build, agentic solutions transform how businesses operate. The true power of agentic AI lies in its ability to reshape workflows and human productivity across entire industries. These solutions free people from routine tasks and handle complex information flows, enabling teams to focus on creative thinking and strategic decisions. We’re making this transformation accessible to more organizations through pre-built agentic solutions. By combining foundational building blocks with pre-built solutions you can move beyond experiments to comprehensive AI strategies that deliver tangible business impact.
Today, we’re announcing that you can now buy AI Agents and Tools in AWS Marketplace, with streamlined procurement and multiple deployment options. In today’s fragmented AI landscape, AWS Marketplace offers a centralized catalog of curated agents, tools, and solutions from AWS Partners, so you can fast-track automation with pre-built agents. Our new API-based deployment method helps you streamline integrations with other agents and tools that support MCP and A2A. And these agents can run on trusted AWS services or in your AWS environment, where you maintain control over security and access. You can deploy select pre-built agents and tools on AgentCore.
We’re also continuing to give customers ready-to-deploy agent solutions that enable this transformation. Kiro is an AI IDE that helps developers go from concept to production with spec-driven development. From simple to complex tasks, Kiro works alongside you to turn prompts into detailed specs—then into working code, docs, and tests. So, what you build is exactly what you want and ready to share with your team. Kiro’s agents help you solve challenging problems and automate tasks like generating documentation and unit tests. With Kiro, you can build beyond prototypes while being in the driver’s seat every step of the way. AWS Transform deploys specialized AI agents to automate complex modernization tasks like code analysis, refactoring, and dependency mapping, dramatically reducing project timelines for enterprise workload migrations. Each solution shows our commitment to flexibility and choice, helping you innovate faster and realize business outcomes sooner. And Amazon Connect, a comprehensive customer experience solution, enables organizations to delight their customers with unlimited AI on every customer interaction across all channels.
These four principles guide our product strategy and are embedded in every innovation we’re announcing today: embracing agility, evolving fundamentals, combining model choice with proprietary data, and deploying transformative solutions. Together, they provide a comprehensive framework for successfully implementing agentic AI in your organization.
The path forward
The significant potential for our customers and our own diverse businesses has inspired us to focus on building the most trustworthy agentic AI capabilities on the planet. But the most important advice I can offer is simple: start now.
Don’t get trapped trying to boil the ocean or waiting for all the answers before you begin. Pick a specific business problem that matters and get building. The organizations seeing the greatest success aren’t those with the most ambitious plans, they’re those who have started the learning cycle, gathering real-world feedback that informs each iteration. To help our customers on their AI journey, we’re investing another 100 million dollars, doubling our investment, in the AWS Generative AI Innovation Center which has helped thousands of customers across industries including NFL, Yahoo Finance, BMW, and AstraZeneca achieve millions of dollars in productivity gains and transform customer experiences.
AWS set the standard for security, reliability, and data privacy for cloud computing, and we’re bringing these same principles to agentic AI. No matter your use case or requirements, AWS provides the right foundation to help you succeed. Together, we can reinvent what’s possible for your business through the power of agentic AI.

About the author
Swami Sivasubramanian is Vice President for Agentic AI at Amazon Web Services (AWS). At AWS, Swami has led the development and growth of leading AI services like Amazon DynamoDB, Amazon SageMaker, Amazon Bedrock, and Amazon Q. His team’s mission is to provide the scale, flexibility, and value that customers and partners require to innovate using agentic AI with confidence and build agents that are not only powerful and efficient, but also trustworthy and responsible. Swami also served from May 2022 through May 2025 as a member of the National Artificial Intelligence Advisory Committee, which was tasked with advising the President of the United States and the National AI Initiative Office on topics related to the National AI Initiative.

A Coding Implementation to Build a Multi-Agent Research and Content Pipeline with CrewAI and Gemini

In this tutorial, we set up an end-to-end AI agent system powered by CrewAI and Google’s Gemini models. We start by installing all required packages, configuring the Gemini key securely, and then building a suite of specialized agents, including research, data analysis, content creation, and quality assurance, each optimized for rapid, sequential collaboration. With clear utility classes and interactive commands, we streamline everything from quick one-off analyses to comprehensive multi-agent research projects right inside the notebook.

import subprocess
import sys
import os

def install_packages():
    """Install required packages in Colab"""
    packages = [
        "crewai",
        "crewai-tools",
        "google-generativeai",
        "python-dotenv",
        "langchain-google-genai"
    ]

    for package in packages:
        try:
            print(f" Installing {package}...")
            subprocess.check_call([sys.executable, "-m", "pip", "install", package, "-q"])
            print(f" {package} installed successfully!")
        except Exception as e:
            print(f" Failed to install {package}: {e}")

print(" Setting up Google Colab environment...")
install_packages()
print(" All packages installed!")


We kick things off by auto-installing CrewAI, Gemini client libraries, and other helpers, ensuring that every dependency is ready within the Colab runtime. As the loop runs, we see each package installed quietly and verify its success before proceeding.

import warnings
warnings.filterwarnings('ignore')

from crewai import Agent, Task, Crew, Process
from crewai_tools import FileReadTool
from langchain_google_genai import ChatGoogleGenerativeAI
import google.generativeai as genai
from google.colab import userdata
import time
import json
from datetime import datetime


We silence warnings for a cleaner log, import the CrewAI core classes and Gemini wrappers, and pull in utility modules such as time and datetime; this provides us with all the building blocks we’ll utilize throughout the notebook.

def setup_api_key():
    """Setup Gemini API key in Colab"""
    try:
        api_key = userdata.get('GEMINI_API_KEY')
        print(" API key loaded from Colab secrets!")
        return api_key
    except:
        print(" Gemini API key not found in Colab secrets.")
        print("Please follow these steps:")
        print("1. Go to https://makersuite.google.com/app/apikey")
        print("2. Create a free API key")
        print("3. In Colab, go to (Secrets) in the left sidebar")
        print("4. Add a new secret named 'GEMINI_API_KEY' with your API key")
        print("5. Enable notebook access for the secret")
        print("6. Re-run this cell")

        from getpass import getpass
        api_key = getpass("Or enter your Gemini API key here (it will be hidden): ")
        return api_key

GEMINI_API_KEY = setup_api_key()


We retrieve our Gemini API key from Colab Secrets, or, if it’s missing, we prompt ourselves to securely paste it. A quick test call confirms the key works, ensuring our LLM is authenticated before any real tasks begin.

class ColabGeminiAgentSystem:
    def __init__(self, api_key):
        """Initialize the Colab-optimized Gemini agent system"""
        self.api_key = api_key
        self.setup_gemini()
        self.setup_tools()
        self.setup_agents()
        self.results_history = []

    def setup_gemini(self):
        """Configure Gemini API for Colab"""
        try:
            genai.configure(api_key=self.api_key)

            model = genai.GenerativeModel('gemini-1.5-flash')
            response = model.generate_content("Hello, this is a test.")
            print(" Gemini API connection successful!")

            self.llm = ChatGoogleGenerativeAI(
                model="gemini-1.5-flash",
                google_api_key=self.api_key,
                temperature=0.7,
                convert_system_message_to_human=True
            )

        except Exception as e:
            print(f" Gemini API setup failed: {str(e)}")
            raise

    def setup_tools(self):
        """Initialize available tools"""
        self.file_tool = FileReadTool()
        print(" Tools initialized successfully!")

    def setup_agents(self):
        """Create specialized agents optimized for Colab"""

        self.researcher = Agent(
            role="Senior Research Analyst",
            goal="Conduct comprehensive research and provide detailed insights",
            backstory="""You are an expert research analyst with extensive experience in
            gathering, analyzing, and synthesizing information. You excel at identifying
            key trends, patterns, and providing actionable insights.""",
            llm=self.llm,
            tools=[self.file_tool],
            verbose=True,
            allow_delegation=False,
            max_iter=2,
            memory=True
        )

        self.data_analyst = Agent(
            role="Data Analysis Expert",
            goal="Analyze information and provide statistical insights",
            backstory="""You are a skilled data analyst who excels at interpreting
            complex information, identifying patterns, and creating actionable
            recommendations based on data-driven insights.""",
            llm=self.llm,
            tools=[self.file_tool],
            verbose=True,
            allow_delegation=False,
            max_iter=2,
            memory=True
        )

        self.content_creator = Agent(
            role="Content Strategy Expert",
            goal="Transform research into engaging, accessible content",
            backstory="""You are a creative content strategist who excels at
            transforming complex research and analysis into clear, engaging
            content that resonates with target audiences.""",
            llm=self.llm,
            tools=[self.file_tool],
            verbose=True,
            allow_delegation=False,
            max_iter=2,
            memory=True
        )

        self.qa_agent = Agent(
            role="Quality Assurance Specialist",
            goal="Ensure high-quality, accurate, and coherent deliverables",
            backstory="""You are a meticulous quality assurance expert who ensures
            all deliverables meet high standards of accuracy, clarity, and coherence.""",
            llm=self.llm,
            tools=[self.file_tool],
            verbose=True,
            allow_delegation=False,
            max_iter=1,
            memory=True
        )

        print(" All agents initialized successfully!")

    def create_colab_tasks(self, topic, task_type="comprehensive"):
        """Create optimized tasks for Colab environment"""

        if task_type == "comprehensive":
            return self._create_comprehensive_tasks(topic)
        elif task_type == "quick":
            return self._create_quick_tasks(topic)
        elif task_type == "analysis":
            return self._create_analysis_tasks(topic)
        else:
            return self._create_comprehensive_tasks(topic)

    def _create_comprehensive_tasks(self, topic):
        """Create comprehensive research tasks"""

        research_task = Task(
            description=f"""
            Research the topic: {topic}

            Provide a comprehensive analysis including:
            1. Key concepts and definitions
            2. Current trends and developments
            3. Main challenges and opportunities
            4. Future outlook and implications

            Format your response in clear sections with bullet points.
            """,
            agent=self.researcher,
            expected_output="Structured research report with clear sections and key insights"
        )

        analysis_task = Task(
            description=f"""
            Analyze the research findings for: {topic}

            Provide:
            1. Key insights and patterns
            2. Statistical observations (if applicable)
            3. Comparative analysis
            4. Actionable recommendations
            5. Risk assessment

            Present findings in a clear, analytical format.
            """,
            agent=self.data_analyst,
            expected_output="Analytical report with insights and recommendations",
            context=[research_task]
        )

        content_task = Task(
            description=f"""
            Create engaging content about: {topic}

            Based on research and analysis, create:
            1. Executive summary (2-3 paragraphs)
            2. Key takeaways (5-7 bullet points)
            3. Actionable recommendations
            4. Future implications

            Make it accessible and engaging for a general audience.
            """,
            agent=self.content_creator,
            expected_output="Engaging, well-structured content for general audience",
            context=[research_task, analysis_task]
        )

        qa_task = Task(
            description=f"""
            Review and improve all content for: {topic}

            Ensure:
            1. Accuracy and consistency
            2. Clear structure and flow
            3. Completeness of information
            4. Readability and engagement

            Provide the final polished version.
            """,
            agent=self.qa_agent,
            expected_output="Final polished content with quality improvements",
            context=[research_task, analysis_task, content_task]
        )

        return [research_task, analysis_task, content_task, qa_task]

    def _create_quick_tasks(self, topic):
        """Create quick analysis tasks for faster execution"""

        quick_research = Task(
            description=f"""
            Provide a quick but thorough analysis of: {topic}

            Include:
            1. Brief overview and key points
            2. Main benefits and challenges
            3. Current status and trends
            4. Quick recommendations

            Keep it concise but informative.
            """,
            agent=self.researcher,
            expected_output="Concise analysis with key insights"
        )

        quick_content = Task(
            description=f"""
            Create a summary report for: {topic}

            Format:
            1. Executive summary
            2. Key findings (3-5 points)
            3. Recommendations (3-5 points)
            4. Next steps

            Make it actionable and clear.
            """,
            agent=self.content_creator,
            expected_output="Clear summary report with actionable insights",
            context=[quick_research]
        )

        return [quick_research, quick_content]

    def _create_analysis_tasks(self, topic):
        """Create analysis-focused tasks"""

        deep_analysis = Task(
            description=f"""
            Perform deep analysis of: {topic}

            Focus on:
            1. Detailed examination of key components
            2. Pros and cons analysis
            3. Comparative evaluation
            4. Strategic implications
            5. Data-driven conclusions

            Provide thorough analytical insights.
            """,
            agent=self.data_analyst,
            expected_output="Deep analytical report with detailed insights"
        )

        return [deep_analysis]

    def execute_colab_project(self, topic, task_type="comprehensive", save_results=True):
        """Execute project optimized for Colab"""

        print(f"\n Starting Colab AI Agent Project")
        print(f" Topic: {topic}")
        print(f" Task Type: {task_type}")
        print("=" * 60)

        start_time = time.time()

        try:
            tasks = self.create_colab_tasks(topic, task_type)

            if task_type == "quick":
                agents = [self.researcher, self.content_creator]
            elif task_type == "analysis":
                agents = [self.data_analyst]
            else:
                agents = [self.researcher, self.data_analyst, self.content_creator, self.qa_agent]

            crew = Crew(
                agents=agents,
                tasks=tasks,
                process=Process.sequential,
                verbose=True,
                memory=True,
                max_rpm=20
            )

            result = crew.kickoff()

            execution_time = time.time() - start_time

            print(f"\n Project completed in {execution_time:.2f} seconds!")
            print("=" * 60)

            if save_results:
                self._save_results(topic, task_type, result, execution_time)

            return result

        except Exception as e:
            print(f"\n Project execution failed: {str(e)}")
            print(" Try using 'quick' task type for faster execution")
            return None

    def _save_results(self, topic, task_type, result, execution_time):
        """Save results to history"""
        result_entry = {
            'timestamp': datetime.now().isoformat(),
            'topic': topic,
            'task_type': task_type,
            'execution_time': execution_time,
            'result': str(result)
        }

        self.results_history.append(result_entry)

        try:
            with open('colab_agent_results.json', 'w') as f:
                json.dump(self.results_history, f, indent=2)
            print(" Results saved to colab_agent_results.json")
        except Exception as e:
            print(f" Could not save results: {e}")

    def show_results_history(self):
        """Display results history"""
        if not self.results_history:
            print(" No results history available")
            return

        print("\n Results History:")
        print("=" * 50)

        for i, entry in enumerate(self.results_history, 1):
            print(f"\n{i}. Topic: {entry['topic']}")
            print(f" Task Type: {entry['task_type']}")
            print(f" Execution Time: {entry['execution_time']:.2f}s")
            print(f" Timestamp: {entry['timestamp']}")
            print("-" * 30)

    def create_custom_agent(self, role, goal, backstory, max_iter=2):
        """Create a custom agent"""
        return Agent(
            role=role,
            goal=goal,
            backstory=backstory,
            llm=self.llm,
            tools=[self.file_tool],
            verbose=True,
            allow_delegation=False,
            max_iter=max_iter,
            memory=True
        )

We architect the heart of the workflow: a ColabGeminiAgentSystem class that wires Gemini into LangChain, defines a file-reading tool, and spawns four specialized agents, research, data, content, and QA, each ready to collaborate on tasks.

print(" Initializing Colab AI Agent System...")
try:
    agent_system = ColabGeminiAgentSystem(GEMINI_API_KEY)
    print(" System ready for use!")
except Exception as e:
    print(f" System initialization failed: {e}")
    print("Please check your API key and try again.")

We instantiate the agent system with our API key, watching for a success message that tells us the model handshake and agent initialization all land smoothly, our framework is officially alive.
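
The class also exposes create_custom_agent, which the walkthrough defines but never exercises. As a small, illustrative extension (the role, goal, and task below are our own examples, not part of the original tutorial), a domain-specific agent can be added and run with a one-off task:

# Illustrative extension: add a custom agent via the helper and run it in its own crew.
finance_agent = agent_system.create_custom_agent(
    role="Financial Research Specialist",
    goal="Summarize the financial implications of a given topic",
    backstory="You are a pragmatic analyst who distills market and cost impacts into plain language."
)

finance_task = Task(
    description="Summarize the cost and revenue implications of: Machine Learning in Business",
    agent=finance_agent,
    expected_output="A short, plain-language financial summary"
)

finance_crew = Crew(agents=[finance_agent], tasks=[finance_task], process=Process.sequential)
print(finance_crew.kickoff())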

def run_quick_examples():
    """Run quick examples to demonstrate the system"""

    print("\n Quick Start Examples")
    print("=" * 40)

    print("\n1. Quick Analysis Example:")
    topic1 = "Machine Learning in Business"
    result1 = agent_system.execute_colab_project(topic1, task_type="quick")

    if result1:
        print(f"\n Quick Analysis Result:")
        print(result1)

    print("\n2. Deep Analysis Example:")
    topic2 = "Sustainable Energy Solutions"
    result2 = agent_system.execute_colab_project(topic2, task_type="analysis")

    if result2:
        print(f"\n Deep Analysis Result:")
        print(result2)


We demonstrate the workflow with two lightning-round projects: a “quick” analysis of machine-learning trends and a deeper dive into sustainable energy, printing each result so we can see the agents in action right away.

def interactive_agent_system():
    """Interactive interface for the agent system"""

    print("\n Interactive AI Agent System")
    print("=" * 40)
    print("Available commands:")
    print("1. 'research [topic]' - Comprehensive research")
    print("2. 'quick [topic]' - Quick analysis")
    print("3. 'analyze [topic]' - Deep analysis")
    print("4. 'history' - Show results history")
    print("5. 'help' - Show this help")
    print("6. 'exit' - Exit the system")
    print("=" * 40)

    while True:
        try:
            command = input("\n Enter command: ").strip().lower()

            if command == 'exit':
                print(" Goodbye!")
                break
            elif command == 'help':
                print("\nAvailable commands:")
                print("- research [topic] - Comprehensive research")
                print("- quick [topic] - Quick analysis")
                print("- analyze [topic] - Deep analysis")
                print("- history - Show results history")
                print("- exit - Exit the system")
            elif command == 'history':
                agent_system.show_results_history()
            elif command.startswith('research '):
                topic = command[9:]
                agent_system.execute_colab_project(topic, task_type="comprehensive")
            elif command.startswith('quick '):
                topic = command[6:]
                agent_system.execute_colab_project(topic, task_type="quick")
            elif command.startswith('analyze '):
                topic = command[8:]
                agent_system.execute_colab_project(topic, task_type="analysis")
            else:
                print(" Unknown command. Type 'help' for available commands.")

        except KeyboardInterrupt:
            print("\n System interrupted. Goodbye!")
            break
        except Exception as e:
            print(f" Error: {e}")


We build a mini command-line loop that lets us type “research,” “quick,” or “analyze” followed by a topic to spin up new projects on demand; this turns the notebook into an interactive sandbox without requiring extra coding.

class ColabUtils:
    """Utility functions for Colab"""

    @staticmethod
    def download_results():
        """Download results file"""
        try:
            from google.colab import files
            files.download('colab_agent_results.json')
            print(" Results file downloaded!")
        except Exception as e:
            print(f" Download failed: {e}")

    @staticmethod
    def display_formatted_result(result):
        """Display result in formatted way"""
        from IPython.display import display, Markdown

        if isinstance(result, str):
            display(Markdown(f"### AI Agent Result\n\n{result}"))
        else:
            display(Markdown(f"### AI Agent Result\n\n{str(result)}"))

    @staticmethod
    def save_to_drive():
        """Save results to Google Drive"""
        try:
            from google.colab import drive
            drive.mount('/content/drive')

            import shutil
            shutil.copy('colab_agent_results.json', '/content/drive/MyDrive/agent_results.json')
            print(" Results saved to Google Drive!")
        except Exception as e:
            print(f" Drive save failed: {e}")


We add finishing touches, helpers to download results, pretty-print Markdown summaries, and push our JSON history to Google Drive, so we can share or archive our findings with a single call.

def run_demo():
    """Run a comprehensive demo"""

    print("\n AI Agent System Demo")
    print("=" * 50)

    demo_topics = [
        ("Artificial Intelligence Ethics", "quick"),
        ("Climate Change Solutions", "analysis"),
        ("Future of Work", "comprehensive")
    ]

    for topic, task_type in demo_topics:
        print(f"\n Demo: {topic} ({task_type})")
        result = agent_system.execute_colab_project(topic, task_type)

        if result:
            ColabUtils.display_formatted_result(result)

        time.sleep(2)

print("""
 Google Colab AI Agent System Ready!

How to use:

1. **Quick Start**:
```python
result = agent_system.execute_colab_project("Your Topic", task_type="quick")
```

2. **Comprehensive Analysis**:
```python
result = agent_system.execute_colab_project("Your Topic", task_type="comprehensive")
```

3. **Deep Analysis**:
```python
result = agent_system.execute_colab_project("Your Topic", task_type="analysis")
```

4. **Interactive Mode**:
```python
interactive_agent_system()
```

5. **Run Demo**:
```python
run_demo()
```

6. **View History**:
```python
agent_system.show_results_history()
```

7. **Download Results**:
```python
ColabUtils.download_results()
```

**Example Usage**:
```python
# Quick analysis
result = agent_system.execute_colab_project("Machine Learning Trends", "quick")
print(result)

# Show formatted result
ColabUtils.display_formatted_result(result)
```

**Tips for Colab**:
- Use "quick" task type for faster execution
- Results are automatically saved
- Use ColabUtils for better formatting
- Download results before closing the session

**Troubleshooting**:
- If you get API errors, check your rate limits
- For memory issues, restart runtime
- Use quick tasks for better performance
""")

We script a showcase that cycles through three topics and task types, displaying formatted outputs between short pauses; this proves the system scales from rapid briefs to full, multi-agent studies.

In conclusion, we have a fully operational, reusable framework that lets us spin up research pipelines, generate polished outputs, and store our results with just a few commands. We can now run quick tests, deep dives, or interactive sessions on any topic, download the findings, and even mount them to Google Drive.

Check out the full Codes. All credit for this research goes to the researchers of this project.
The post A Coding Implementation to Build a Multi-Agent Research and Content Pipeline with CrewAI and Gemini appeared first on MarkTechPost.

This AI Paper Introduces TableRAG: A Hybrid SQL and Text Retrieval Framework for Multi-Hop Question Answering over Heterogeneous Documents

Handling questions that involve both natural language and structured tables has become an essential task in building more intelligent and useful AI systems. These systems are often expected to process content that includes diverse data types, such as text mixed with numerical tables, which are commonly found in business documents, research papers, and public reports. Understanding such documents requires the AI to perform reasoning that spans both textual explanations and table-based details—a process that is inherently more complicated than traditional text-based question answering.

One of the major problems in this area is that current language models often fail to interpret documents accurately when tables are involved. Models tend to lose the relationships between rows and columns when the tables are flattened into plain text. This distorts the underlying structure of the data and reduces the accuracy of answers, especially when the task involves computations, aggregations, or reasoning that connects multiple facts across the document. Such limitations make it challenging to utilize standard systems for practical multi-hop question-answering tasks that require insights from both text and tables.

To solve these problems, previous methods have attempted to apply Retrieval-Augmented Generation (RAG) techniques. These involve retrieving text segments and feeding them into a language model for answer generation. However, these techniques are insufficient for tasks that require compositional or global reasoning across large tabular datasets. Tools like NaiveRAG and TableGPT2 try to simulate this process by converting tables into Markdown format or generating code-based execution in Python. Yet, these methods still struggle with tasks where maintaining the table’s original structure is necessary for correct interpretation.

Researchers from Huawei Cloud BU proposed a method named TableRAG that directly addresses these limitations. TableRAG is a hybrid system that alternates between textual retrieval and structured SQL-based execution, treating table-based queries as a unified reasoning unit. Because queries against tables are executed in a way that respects the relational organization of rows and columns, the original table structure is preserved rather than flattened. The researchers also created a dataset called HeteQA to benchmark the performance of their method across different domains and multi-step reasoning tasks.

TableRAG functions in two main stages. The offline stage involves parsing heterogeneous documents into structured databases by extracting tables and textual content separately. These are stored in parallel corpora—a relational database for tables and a chunked knowledge base for text. The online phase handles user questions through an iterative four-step process: query decomposition, text retrieval, SQL programming and execution, and intermediate answer generation. When a question is received, the system identifies whether it requires tabular or textual reasoning, dynamically chooses the appropriate strategy, and combines the outputs. SQL is used for precise symbolic execution, enabling better performance in numerical and logical computations.
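
To make the iterative loop concrete, the following is a simplified, hypothetical sketch of how such a hybrid router could be wired together in Python. It is not the authors’ implementation; the decomposition, routing, SQL generation, and retrieval functions are stand-ins for the components the paper describes.

# Hypothetical sketch of a TableRAG-style hybrid loop: decompose the question,
# route each sub-query to SQL execution or text retrieval, and collect intermediate answers.
import sqlite3

def decompose(question):
    # Stand-in for LLM-based query decomposition into textual and tabular sub-queries.
    return [
        {"text": question, "needs_table": False},
        {"text": question, "needs_table": True},
    ]

def retrieve_text(sub_query, knowledge_base):
    # Stand-in for chunked text retrieval over the parsed document corpus.
    words = sub_query.lower().split()
    return [chunk for chunk in knowledge_base if any(w in chunk.lower() for w in words)][:3]

def run_sql(sub_query, conn):
    # Stand-in for LLM-generated SQL; a fixed aggregation query is used for illustration.
    return conn.execute("SELECT AVG(revenue) FROM sales").fetchall()

def table_rag(question, conn, knowledge_base, max_steps=4):
    intermediate = []
    for sub in decompose(question)[:max_steps]:
        if sub["needs_table"]:
            intermediate.append(("sql", run_sql(sub["text"], conn)))
        else:
            intermediate.append(("text", retrieve_text(sub["text"], knowledge_base)))
    # Stand-in for LLM answer composition over the intermediate results.
    return intermediate

# Example usage with an in-memory table and a tiny text corpus.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("EU", 120.0), ("US", 200.0)])
kb = ["The sales table reports quarterly revenue by region."]
print(table_rag("What is the average revenue and which regions are covered?", conn, kb))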

During experiments, TableRAG was tested on several benchmarks, including HybridQA, WikiTableQuestions, and the newly constructed HeteQA. HeteQA consists of 304 complex questions across nine diverse domains and includes 136 unique tables, as well as over 5,300 Wikipedia-derived entities. The dataset challenges models with tasks like filtering, aggregation, grouping, calculation, and sorting. TableRAG outperformed all baseline methods, including NaiveRAG, React, and TableGPT2, achieving consistently higher accuracy with document-level reasoning carried out in up to five iterative steps; the results were verified using backbone models such as Claude-3.5-Sonnet and Qwen-2.5-72B.

The work presented a strong and well-structured solution to the challenge of reasoning over mixed-format documents. By maintaining structural integrity and adopting SQL for structured data operations, the researchers demonstrated an effective alternative to existing retrieval-based systems. TableRAG represents a significant step forward in question-answering systems that handle documents containing both tables and text, offering a viable method for more accurate, scalable, and interpretable document understanding.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post This AI Paper Introduces TableRAG: A Hybrid SQL and Text Retrieval Framework for Multi-Hop Question Answering over Heterogeneous Documents appeared first on MarkTechPost.